How does NSPEC verify bugs?

An independent bug-verifier agent re-runs the repro in a fresh browser context, up to three times. Only bugs that reproduce, with a manual-grade confidence score, make it into the report.

Do you need access to my source code?

No. NSPEC tests the running UI. You give it a URL and optional login. It never reads your repo unless you opt in to git-diff based risk prioritization.

Which viewports are covered?

Six viewports at launch: desktop 1440, laptop 1280, tablet portrait and landscape, mobile portrait and landscape.

Can NSPEC be self-hosted?

Yes, on Enterprise. Docker and Helm, with BYO LLM (OpenAI, Anthropic, or a local model). Artifacts never leave your network.

Why the verifier is a separate agent, not a retry loop

The cheapest thing we could have built was a retry loop. When an agent thinks it found a bug, run its repro three times in the same context, check if it happens every time, mark it verified if it does. That’s about fifteen lines of code. We chose not to build it. Here’s why.
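For a sense of scale, the retry loop described above could be sketched roughly like this. It is an illustration of the rejected approach, not NSPEC code; `run_repro` is a hypothetical callable that replays the repro steps in the existing browser context and returns True when the bug shows up.

```python
from typing import Callable


def naive_retry_verify(run_repro: Callable[[], bool], attempts: int = 3) -> bool:
    """Replay the same repro in the SAME context and call it verified
    only if the bug appears on every attempt.

    `run_repro` is a hypothetical stand-in for "replay the steps".
    Note what is missing: nothing resets caches, cookies, scroll
    position, or the agent's own assumptions between attempts.
    """
    return all(run_repro() for _ in range(attempts))
```

The function is short precisely because it shares all state with the run that produced the finding.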

Retries inherit the bias that made the finding

A retry runs in the same browser context, with the same warmed caches, the same session cookies, the same scroll position, and the same agent reasoning about what it just did. If the agent hallucinated a selector that works only because the page is in an unusual partial state, retrying the steps in that same state confirms the hallucination. Retries are a filter for flakiness, not for truth.

“Flaky” and “real bug” are two different axes. A retry loop conflates them.

What an independent verifier actually does

In NSPEC, every bug candidate is handed to a bug-verifier subagent that starts from a fresh browser context with no memory of the original run. It receives a minimal brief: URL, repro steps, expected observable. Then:

It re-executes the steps up to three times, each time in a new page instance.
It records the outcome: reproduced, reproduced intermittently, or could not reproduce.
It scores manual-grade confidence: high, medium, or low.
It writes a short rationale in the first person: why it believes the bug is real, or why it can’t reproduce it.

Anything below medium is rejected. The rejected candidates are logged for learning but never become tickets.
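The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual verifier: `Brief`, `verify`, and `run_in_new_page` are hypothetical names, and `run_in_new_page` stands in for "open a brand-new page instance, replay the steps, report whether the expected observable appeared". The sketch always uses all three attempts so it can tell a consistent reproduction from an intermittent one.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Brief:
    """The minimal brief. Nothing from the original agent's reasoning is included."""
    url: str
    steps: List[str]
    expected_observable: str


def classify_outcome(results: List[bool]) -> str:
    """Map per-attempt results onto the three outcomes named in the text."""
    hits = sum(results)
    if hits == 0:
        return "could not reproduce"
    if hits == len(results):
        return "reproduced"
    return "reproduced intermittently"


def verify(brief: Brief, run_in_new_page: Callable[[Brief], bool], attempts: int = 3) -> str:
    # Assumption: every call to run_in_new_page starts from a new page instance.
    results = [run_in_new_page(brief) for _ in range(attempts)]
    return classify_outcome(results)


CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def survives_verifier(confidence: str) -> bool:
    """Anything below medium is rejected."""
    return CONFIDENCE_RANK[confidence] >= CONFIDENCE_RANK["medium"]
```

The confidence score itself is passed in as a string here; how it is derived is a separate question from how attempts are counted.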

Why confidence scoring is doing the real work

The binary “reproduced vs. not” is not enough. Some real bugs reproduce only on the second try because the first load raced an asset. Some hallucinations reproduce every time because the agent’s repro steps are structurally broken in a way that happens to coincide with a real UI state. The confidence score captures the difference.

The verifier’s rubric for high:

Reproduced at least once out of three attempts.
The observable matches the original finding, not a near-miss.
No third-party origin is on the critical path (font CDNs, telemetry endpoints, ad-tech domains drop the finding to low automatically).
No contradiction with a sibling agent’s capture of the same DOM.

If any one of those fails, the score drops and the server-side gates get involved.
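One way to read the rubric is as four independent checks. The sketch below is an interpretation, not the verifier's real scoring code: `Evidence` and its field names are hypothetical. The text only says that a third-party origin drops the finding to low and that any other failure drops the score, so for those other failures the function returns "medium" strictly as an upper bound on what the candidate could still receive.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Evidence:
    reproductions: int                  # successful reproductions out of three attempts
    observable_matches: bool            # same observable as the original finding, not a near-miss
    third_party_on_critical_path: bool  # font CDN, telemetry endpoint, ad-tech domain, ...
    contradicts_sibling: bool           # conflicts with another agent's capture of the same DOM


def failed_checks(e: Evidence) -> List[str]:
    """Return the names of the rubric checks that failed; empty means 'high'."""
    failed = []
    if e.reproductions < 1:
        failed.append("reproduced")
    if not e.observable_matches:
        failed.append("observable")
    if e.third_party_on_critical_path:
        failed.append("third_party")
    if e.contradicts_sibling:
        failed.append("contradiction")
    return failed


def confidence_ceiling(e: Evidence) -> str:
    """Highest score the candidate can still receive under the rubric."""
    failed = failed_checks(e)
    if not failed:
        return "high"
    if "third_party" in failed:
        return "low"      # stated in the rubric: dropped to low automatically
    return "medium"       # assumption: an upper bound only; the real score may be lower
```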

The server-side gates

Past the verifier, a set of static rules runs on the server side of the run. These are intentionally not left to an agent’s judgment; we want them deterministic:

Empty-bug reject. Missing title, missing repro, missing selector, missing observable: dropped.
Third-party noise filter. Console errors whose origin is on a known third-party list (GA, Sentry, fonts, ad-tech domains) are not your bug.
Duplicate merge. Two agents, same DOM, same observable → one bug with N sources.
Contradiction detect. Two agents making incompatible claims about the same element → back to the verifier for another pass.
Project memory. Known flaky selectors and known false positives from prior runs are suppressed until the underlying surface demonstrably changed.
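Because these gates are plain rules, three of them fit in a few small functions. The sketch below covers empty-bug reject, the third-party noise filter, and duplicate merge; contradiction detection and project memory are left out. Everything here is an assumption made for illustration: the `Candidate` fields, the contents of `THIRD_PARTY_HOSTS`, and the choice to treat "same DOM, same observable" as an identical (selector, observable) pair.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlparse

# Hypothetical blocklist; a real list would be longer and maintained separately.
THIRD_PARTY_HOSTS = {"www.google-analytics.com", "fonts.googleapis.com"}


@dataclass
class Candidate:
    title: str
    repro: str
    selector: str
    observable: str
    error_origin: Optional[str] = None   # URL the console error came from, if any
    sources: List[str] = field(default_factory=list)


def is_empty(c: Candidate) -> bool:
    """Empty-bug reject: any required field missing or blank."""
    return not all(s.strip() for s in (c.title, c.repro, c.selector, c.observable))


def is_third_party_noise(c: Candidate) -> bool:
    """Third-party noise filter: the error originates from a listed host."""
    if not c.error_origin:
        return False
    host = urlparse(c.error_origin).hostname or ""
    return host in THIRD_PARTY_HOSTS


def apply_gates(candidates: List[Candidate]) -> List[Candidate]:
    """Drop rejects, then merge duplicates into one bug with N sources."""
    merged: Dict[Tuple[str, str], Candidate] = {}
    for c in candidates:
        if is_empty(c) or is_third_party_noise(c):
            continue
        key = (c.selector, c.observable)
        if key in merged:
            merged[key].sources.extend(c.sources)
        else:
            # Copy so the caller's objects are not mutated by later merges.
            merged[key] = replace(c, sources=list(c.sources))
    return list(merged.values())
```

The same input always yields the same output, which is the property the gates are meant to have.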

The 94% noise number

On the dozen reference apps we’ve run against during beta, roughly 94% of raw agent findings are rejected before they become tickets. The verifier does most of that work; the server-side gates clean up the remainder. The number you see in your tracker is the 6% that survived every filter.

If the verifier is too strict, we miss real bugs. If it’s too loose, we ship noise. The knob is the confidence threshold, and we tune it per surface: a bug-verifier on a login form behaves differently from one on a chart component. That tuning is a solved problem because we record every rejected candidate and grade our own precision offline.
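The trade-off between missing real bugs and shipping noise can be made concrete with precision and recall at a given threshold. The sketch below assumes each recorded candidate has been graded offline as real or not real; the `(confidence, is_real)` pair format and the function name are hypothetical, chosen only to illustrate the calculation.

```python
from typing import List, Tuple

LEVELS = {"low": 0, "medium": 1, "high": 2}


def precision_recall(graded: List[Tuple[str, bool]], threshold: str) -> Tuple[float, float]:
    """Precision and recall if only candidates at or above `threshold` become tickets.

    `graded` holds (confidence, is_real) pairs from offline grading.
    Precision: share of accepted candidates that were real bugs.
    Recall: share of real bugs that were accepted.
    """
    accepted = [is_real for conf, is_real in graded if LEVELS[conf] >= LEVELS[threshold]]
    true_positives = sum(accepted)
    total_real = sum(is_real for _, is_real in graded)
    precision = true_positives / len(accepted) if accepted else 0.0
    recall = true_positives / total_real if total_real else 0.0
    return precision, recall
```

Running this separately for each surface, at each candidate threshold, shows where raising the bar starts costing real bugs on that surface.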

The one-paragraph version

Retries ask “is this reliable?”. A separate verifier asks “is this real?”. Only the second question belongs in your tracker.

Want to see it run against your app? Join the waitlist.