tags: [nl-testing, ambiguity, confidence, safety, playwright, tfidf, ui-contracts]
related:
- packages/nlp/ui-resolver.js
- packages/testing/nl-runner.js
- packages/testing/tests/nl-runner.test.js
- decisions/nl-testing.md
- responses/RESPONSE_2026-05-16_nl-testing-p2-p3.md
- responses/RESPONSE_2026-05-16_nl-testing-p5.md
status: current
022 — The Confidence Gap as a Safety Gate
When you give an AI the ability to click buttons in a browser, something becomes important that was never important before: which button.
In a deterministic test — a Playwright test written by a human — the selector is exact. [data-action="create"][data-page="order"] matches one element or it matches nothing, and the test fails cleanly. The human wrote the selector knowing what they meant.
In a natural language test, the selector is derived. “Click the new order button” becomes a query against a TF-IDF index of UI contracts. The resolver scores every element in the contract index and returns the highest-scoring one. That match might be correct. It might also be wrong, with the element you actually meant sitting in second place.
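To make "scores every element" concrete, here is a minimal sketch of TF-IDF ranking with cosine similarity. It is not the code in packages/nlp/ui-resolver.js, which this note does not show. The contract shape (`selector`, `tokens`), the sample tokens, and the function names `buildIdf`, `vectorize`, `cosine`, and `rank` are all assumptions made for illustration.

```javascript
// Hypothetical contract index: each entry pairs a selector with descriptive tokens.
const contracts = [
  { selector: "[data-action='create'][data-page='order']", tokens: ["create", "new", "order", "button"] },
  { selector: "[data-action='create'][data-page='invoice']", tokens: ["create", "new", "invoice", "button"] },
  { selector: "[data-action='delete'][data-page='order']", tokens: ["delete", "remove", "order", "button"] },
];

// Smoothed inverse document frequency: tokens that appear in fewer contracts weigh more.
function buildIdf(docs) {
  const df = new Map();
  for (const doc of docs) {
    for (const t of new Set(doc.tokens)) df.set(t, (df.get(t) || 0) + 1);
  }
  const idf = new Map();
  for (const [t, n] of df) idf.set(t, Math.log((1 + docs.length) / (1 + n)) + 1);
  return idf;
}

// Turn a token list into a sparse TF-IDF vector (token -> weight).
function vectorize(tokens, idf) {
  const vec = new Map();
  for (const t of tokens) {
    if (!idf.has(t)) continue; // ignore tokens the index has never seen
    vec.set(t, (vec.get(t) || 0) + idf.get(t));
  }
  return vec;
}

// Cosine similarity between two sparse vectors; 0 if either is empty.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (const [t, w] of a) {
    na += w * w;
    if (b.has(t)) dot += w * b.get(t);
  }
  for (const w of b.values()) nb += w * w;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Score every contract against the query and return them best-first.
function rank(query, docs) {
  const idf = buildIdf(docs);
  const q = vectorize(query.toLowerCase().split(/\s+/), idf);
  return docs
    .map((d) => ({ selector: d.selector, confidence: cosine(q, vectorize(d.tokens, idf)) }))
    .sort((x, y) => y.confidence - x.confidence);
}

const ranked = rank("click the new order button", contracts);
console.log(ranked);
```

Note that `rank` always returns a first element, however thin its lead over the runner-up. Nothing in the ranking itself says whether the top result deserves trust.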
The question is: what happens when the resolver isn’t sure?
The Wrong Answer
The wrong answer is: return the best match regardless and let the test proceed.
This looks fine when the corpus is small and the top match is obvious. It breaks quietly when the corpus grows. You add a second model with similar field names. You add a form that has both a “New Order” button and a “New Invoice” button. The resolver still returns a result — probably the right one, but you don’t know.
A test that might be clicking the wrong button is not a test. It’s a liability. It can pass consistently against the wrong element, give you green builds, and hide a broken flow for weeks.
The Right Answer: Throw on Ambiguity
The resolver in packages/nlp/ui-resolver.js returns an ambiguous flag when the second-best match is within 10% of the best:
// Line ~156 in ui-resolver.js
const ambiguous = !!(second && second.confidence >= best.confidence * 0.90);
That 10% gap is the confidence gap. When the gap is too small, the resolver has found two plausible answers and cannot decide between them on its own.
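The check can be read in isolation as a small function. This is a sketch built around the single comparison quoted above; the wrapper `resolveWithGap`, its `gapRatio` parameter, and the shape of the return value are assumptions, not the resolver's actual API.

```javascript
// Assumed input: an array of candidates, each with a numeric `confidence`, in any order.
function resolveWithGap(candidates, gapRatio = 0.90) {
  // Copy before sorting so the caller's array is left untouched.
  const [best, second] = [...candidates].sort((a, b) => b.confidence - a.confidence);
  if (!best) return null; // nothing matched at all

  // Same comparison as in the note: the runner-up is "too close" when it
  // reaches at least 90% of the best score.
  const ambiguous = !!(second && second.confidence >= best.confidence * gapRatio);
  return { ...best, ambiguous, second: second || null };
}

const close = resolveWithGap([
  { selector: "[data-action='create'][data-page='invoice']", confidence: 0.591 },
  { selector: "[data-action='create'][data-page='order']", confidence: 0.612 },
]);
console.log(close.ambiguous); // true: 0.591 >= 0.612 * 0.90 (0.5508)
```

A single candidate is never ambiguous under this rule, because there is no runner-up to compare against.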
The NL runner’s translateNlStep() treats ambiguity as a hard error:
if (result.ambiguous) {
  throw new NlAmbiguousError(
    noun,
    { selector: result.selector, confidence: result.confidence },
    { selector: second.selector, confidence: second.confidence }
  );
}
The error message names both candidates and their scores:
NlAmbiguousError: "new button" is ambiguous:
"[data-action='create'][data-page='order']" (0.612)
vs "[data-action='create'][data-page='invoice']" (0.591)
Use a more specific description or add a data-page qualifier.
The test stops. It does not guess.
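For readers who want to see how such an error could carry both candidates, here is one possible shape for the class. The real NlAmbiguousError is not shown in this note, so the constructor body, the `candidates` field, and the exact message layout are assumptions modeled on the call site and the sample message above.

```javascript
// Hypothetical definition; only the constructor arguments (noun, best, second)
// are taken from the call site quoted in this note.
class NlAmbiguousError extends Error {
  constructor(noun, best, second) {
    super(
      `"${noun}" is ambiguous:\n` +
      `  "${best.selector}" (${best.confidence.toFixed(3)})\n` +
      `  vs "${second.selector}" (${second.confidence.toFixed(3)})\n` +
      `Use a more specific description or add a data-page qualifier.`
    );
    this.name = "NlAmbiguousError";
    this.noun = noun;
    // Keep the structured data so callers can inspect it without parsing the message.
    this.candidates = [best, second];
  }
}

let caught = null;
try {
  throw new NlAmbiguousError(
    "new button",
    { selector: "[data-action='create'][data-page='order']", confidence: 0.612 },
    { selector: "[data-action='create'][data-page='invoice']", confidence: 0.591 }
  );
} catch (err) {
  caught = err;
  console.log(`${err.name}: ${err.message}`);
}
```

Keeping the candidates on the error object, and not only in the message, lets a test reporter or IDE integration list both selectors without string parsing.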
Why This Is a Safety Gate, Not a Failure
An error that stops a test is useful information. It tells you exactly what the resolver found, which two elements it couldn’t distinguish, and that your test step needs to be more specific.
A test that silently picks the wrong element gives you none of that. It gives you a green build and a hidden bug.
The ambiguity error is a gate in the same way type errors are gates in a typed language. The type system isn’t failing when it rejects a bad call; it’s enforcing a contract. Likewise, the NL runner isn’t failing when it throws NlAmbiguousError; it’s enforcing the rule that a step must be specific enough to be safe to execute.
The gate has exactly one resolution: remove the ambiguity. Either rewrite the step text (“click new order button” instead of “click new button”) or improve the UI contract on the element (data-page="order" instead of nothing). Both fixes improve the system permanently: the test corpus gets richer, or the UI gets more precisely annotated.
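The effect of rewriting the step text can be shown with a toy scorer. The sketch below uses Jaccard token overlap as a stand-in for TF-IDF, purely to keep the arithmetic easy to follow. The contract tokens and the `isAmbiguous` helper are hypothetical.

```javascript
// Stand-in scorer (not TF-IDF): Jaccard overlap between query and contract tokens.
function jaccard(queryTokens, docTokens) {
  const q = new Set(queryTokens);
  const d = new Set(docTokens);
  let shared = 0;
  for (const t of q) if (d.has(t)) shared += 1;
  return shared / (q.size + d.size - shared);
}

// Hypothetical contracts for the two competing buttons.
const contracts = [
  { selector: "[data-action='create'][data-page='order']", tokens: ["create", "new", "order", "button"] },
  { selector: "[data-action='create'][data-page='invoice']", tokens: ["create", "new", "invoice", "button"] },
];

// Applies the 10% confidence-gap rule to the two best scores.
function isAmbiguous(step) {
  const tokens = step.toLowerCase().split(/\s+/).filter((t) => t !== "click");
  const [best, second] = contracts
    .map((c) => jaccard(tokens, c.tokens))
    .sort((a, b) => b - a);
  return second >= best * 0.90;
}

const vague = isAmbiguous("click new button");          // scores 0.5 vs 0.5
const specific = isAmbiguous("click new order button"); // scores 0.75 vs 0.4
console.log({ vague, specific }); // { vague: true, specific: false }
```

One extra word in the step moves the scores from a dead tie to a wide gap, which is the kind of fix the error message asks for.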
The 10% Threshold
Why 10%? The threshold is a tradeoff between two failure modes.
Too narrow (e.g. 1%): you only flag ambiguity when two candidates are nearly identical. Most near-misses pass through as confident, so the resolver guesses and is sometimes wrong.
Too wide (e.g. 50%): you flag ambiguity constantly because most queries have a second result that scores at least half as high as the best. Tests become unusable; every step needs to be a perfect unique query.
10% is empirically reasonable for TF-IDF over UI contracts. At this threshold, the resolver reports a confident match only when the best candidate clearly outscores the runner-up. When the gap is small, the two elements have nearly identical token overlap with the query, and the top result is genuinely not trustworthy.
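The two failure modes are easy to see by running the same pairs of scores through three thresholds. In this sketch the near-tie pair reuses the scores from the sample error message earlier in this note; the clear-win pair and the helper `isAmbiguous` are hypothetical.

```javascript
// gapPercent = 10 means "ambiguous if the runner-up is within 10% of the best".
function isAmbiguous(best, second, gapPercent) {
  return second >= best * (1 - gapPercent / 100);
}

const nearTie = { best: 0.612, second: 0.591 }; // scores from the sample error message
const clearWin = { best: 0.80, second: 0.45 };  // hypothetical scores

const results = [1, 10, 50].map((gap) => ({
  gap,
  nearTieFlagged: isAmbiguous(nearTie.best, nearTie.second, gap),
  clearWinFlagged: isAmbiguous(clearWin.best, clearWin.second, gap),
}));

for (const r of results) console.log(r);
// gap 1:  nearTieFlagged false, clearWinFlagged false  (near tie slips through)
// gap 10: nearTieFlagged true,  clearWinFlagged false  (only the near tie is stopped)
// gap 50: nearTieFlagged true,  clearWinFlagged true   (even a clear winner is rejected)
```

At 1% the order/invoice near tie would have executed as if it were a confident match. At 50% a candidate scoring 0.80 against 0.45 is rejected even though it leads by a wide margin.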
The threshold is configurable via opts.minConfidence. Different corpora at different scales may need different values. But the principle — throw when the gap is too small — doesn’t change.
What This Builds Toward
The NlAmbiguousError is the mechanism that keeps natural language testing from degrading into vague automation. Without it, NL tests would be fast to write and fragile to trust. With it, the confidence gap is the enforcer: every step either names its element precisely enough to act on, or it fails loudly and tells you exactly why.
That’s the bar any AI-driven testing layer needs to meet. Not “usually clicks the right thing” — but “refuses to act when it can’t be certain.”
The selector resolver is TF-IDF today. The ambiguity gate works the same way when you replace TF-IDF with local embeddings, because ambiguity isn’t a property of the algorithm — it’s a property of the query. When two things look similar from any angle, you need more information. The gate enforces that you get it.
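One way to picture that independence is to write the gate so it only sees scored candidates and takes the scorer as a parameter. The sketch below is an illustration, not the project's design. `makeGatedResolver`, both scorers, and the two-dimensional toy "embedding" are hypothetical stand-ins, and it throws a plain Error to stay self-contained.

```javascript
// The gate depends only on (selector, confidence) pairs, not on how scores are produced.
function makeGatedResolver(scoreFn, gapRatio = 0.90) {
  return function resolve(query, elements) {
    const [best, second] = elements
      .map((el) => ({ selector: el.selector, confidence: scoreFn(query, el) }))
      .sort((a, b) => b.confidence - a.confidence);
    if (!best) throw new Error(`no candidates for "${query}"`);
    if (second && second.confidence >= best.confidence * gapRatio) {
      throw new Error(`"${query}" is ambiguous: ${best.selector} vs ${second.selector}`);
    }
    return best;
  };
}

// Hypothetical elements carrying both lexical tokens and a toy 2-d vector.
const elements = [
  { selector: "[data-action='create'][data-page='order']", tokens: ["new", "order", "button"], vector: [1, 0] },
  { selector: "[data-action='create'][data-page='invoice']", tokens: ["new", "invoice", "button"], vector: [0, 1] },
];

// Lexical stand-in: fraction of the element's tokens that appear in the query.
const overlapScore = (query, el) => {
  const q = new Set(query.toLowerCase().split(/\s+/));
  return el.tokens.filter((t) => q.has(t)).length / el.tokens.length;
};

// Embedding stand-in: a fake 2-d query vector dotted with the element vector.
const toyEmbed = (query) => [query.includes("order") ? 1 : 0.5, query.includes("invoice") ? 1 : 0.5];
const vectorScore = (query, el) => {
  const q = toyEmbed(query);
  return q[0] * el.vector[0] + q[1] * el.vector[1];
};

const lexical = makeGatedResolver(overlapScore);
const semantic = makeGatedResolver(vectorScore);

console.log(lexical("new order button", elements).selector);
console.log(semantic("new order button", elements).selector);
// Both resolvers throw for "new button": each scorer sees the two elements as tied.
```

Swapping the scorer changes the numbers but not the rule: a tie under either scorer still means the query does not carry enough information to act on.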