A new Stanford paper gives a name to a legal-AI failure mode the market still understates: factual presumptuousness.
The term comes from Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication, posted on arXiv on April 21, 2026 by Mohamed Afane, Emily Robitschek, Derek Ouyang, and Daniel E. Ho.
The paper names a problem lawyers should recognize immediately. AI systems do not just get answers wrong. They often decide when the right answer is that the record is not ready for a decision.
That is a different failure mode than ordinary inaccuracy. It is not just a bad answer. It is a bad answer delivered in the wrong procedural posture.
And in legal work, that distinction matters.
The useful takeaway is not the benchmark
The Stanford team studied unemployment-insurance adjudication in collaboration with the Colorado Department of Labor and Employment. They built a benchmark that varies how much information the system receives. Some cases are complete. Others are missing outcome-determinative facts. The question is not only whether the model reaches the right result when all facts are present. It is whether the model can recognize when more fact-finding is required before any result should be reached.
That is the useful contribution.
Most legal-AI discussion still treats the problem as answer quality. Is the summary good? Is the draft fluent? Is the citation right? Can the model handle the legal question?
The Stanford paper shifts the question one layer earlier.
Should the system be deciding at all?
That is the question legal AI keeps skipping.
Presumptuousness is adjacent to hallucination
In "LLMs Do Not Reason. Legal AI Has to Account for That," I argued that hallucinations are not a bolt-on bug in an otherwise reasoning system. They are a natural consequence of probabilistic generation when the surrounding context does not constrain the output to something specifically true.
Presumptuousness is a related failure surface.
When a model hallucinates a citation, it acts as though it has authority it does not actually have.
When a model issues an eligibility determination on insufficient facts, it acts as though it has evidence it does not actually have.
The output looks different. The structural problem is similar. In both cases, the system is optimized to produce an answer. It is not naturally optimized to say: the record is incomplete, and more information is required before this should move forward.
That is why legal AI cannot be evaluated only by how impressive the answer looks. It also has to be evaluated by whether the system knows when not to answer.
What the paper finds
The Stanford results are stark.
Using four leading AI platforms with identical Colorado statutory and adjudication materials provided through retrieval-augmented generation, the authors found that baseline systems achieved an average of only 15% accuracy on inconclusive cases. In other words, when the correct response was that more facts were needed, the systems usually did not defer. They decided anyway.
The paper also shows how the failure can break in different directions. Claude Sonnet 4.5, for example, issued an outright denial in 68% of the missing-information cases the authors analyzed. Those cases did not contain disqualifying facts. The model was not just wrong. It was confidently premature.
That is the operational problem.
A legal workflow does not become safe because the system occasionally reaches the right answer on complete records. The question is what it does when the record is not complete, because that is where legal work actually gets dangerous.
Better prompting helps, then creates a new problem
One of the most useful parts of the paper is that it does not stop at baseline failure.
The authors tested enhanced and more elaborate prompting methods. Those approaches improved deferral on incomplete cases. But they often over-corrected. Systems got better at withholding judgment when facts were missing, then became worse at deciding clear cases.
That is a real tradeoff.
The paper describes it as a determination-deferral tradeoff. That language is worth keeping.
A system that always decides is presumptuous. A system that refuses too much is not usable either. Legal work needs something more disciplined than either extreme.
That is why this is not just a prompting problem. It is a workflow problem.
The missing state
Most legal AI systems still operate as if there are only two meaningful states:
- before the model answers
- after the model answers
Everything important gets collapsed into the second state.
That is the design mistake.
Legal work needs a third state:
inconclusive
Not error. Not refusal. Not generic uncertainty language. A real workflow state that means the system has checked the governing requirements against the available facts and concluded that the matter is not ready for determination.
That is the architectural insight running through this paper.
The issue is not whether the model sounds cautious. The issue is whether the system has a disciplined way to represent that the file needs more facts before anyone should rely on the output.
That state is missing from a lot of legal AI products. And when it is missing, the system tends to do what language models are built to do: produce a plausible answer anyway.
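What a first-class inconclusive state looks like in code can be sketched in a few lines. This is an illustrative skeleton, not anything from the paper; the names (`Outcome`, `Result`, `ready_for_reliance`) are assumptions chosen for clarity.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Outcome(Enum):
    """Three workflow states, not two: a determination either way,
    or an explicit signal that the record is not ready."""
    ELIGIBLE = auto()
    INELIGIBLE = auto()
    INCONCLUSIVE = auto()  # not an error, not a refusal: needs more facts


@dataclass
class Result:
    outcome: Outcome
    # Populated only when the outcome is INCONCLUSIVE, so downstream
    # code can route the file back for targeted fact-finding.
    missing_facts: list[str] = field(default_factory=list)

    @property
    def ready_for_reliance(self) -> bool:
        # Nothing built on this result until the record is complete.
        return self.outcome is not Outcome.INCONCLUSIVE


r = Result(Outcome.INCONCLUSIVE, missing_facts=["separation date", "reason for separation"])
assert not r.ready_for_reliance
```

The point of the sketch is that "inconclusive" is a value the rest of the workflow can branch on, not a sentence of hedging language buried in generated text.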
What Stanford built instead
The Stanford team calls its framework SPEC: Structured Prompting for Evidence Checklists.
The important point is not the acronym. It is the structure.
The system first extracts the legal requirements from the governing materials. Then it compares the facts in the case against those requirements. Then it performs an independent supervisory review of that comparison. Only after that process does the system either produce a determination or return the case as inconclusive with specific missing information identified.
That structure matters more than any one prompt trick.
It forces the system to ask a question many legal-AI workflows still skip:
What facts are actually required for this decision, and do we have them?
That is a better question than "what does the model think the answer is?"
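That checklist-first flow can be sketched as a pipeline. Everything below is a toy stand-in, not the paper's actual SPEC prompts or models; the helper names (`extract_requirements`, `check_requirement`, `supervisory_review`) are assumptions, and a real system would implement each step with an LLM call over the governing materials.

```python
def extract_requirements(materials: str) -> list[str]:
    # Stand-in: a real system would derive these from statutes and
    # adjudication guidance, not from a bullet list.
    return [line.strip("- ") for line in materials.splitlines() if line.startswith("-")]


def check_requirement(req: str, facts: dict) -> str:
    # Stand-in fact check: each requirement is "met", "unmet", or "missing".
    if req not in facts:
        return "missing"
    return "met" if facts[req] else "unmet"


def supervisory_review(checklist: dict) -> bool:
    # Stand-in independent review: sanity-check the comparison itself.
    return all(status in {"met", "unmet", "missing"} for status in checklist.values())


def adjudicate(materials: str, facts: dict) -> dict:
    requirements = extract_requirements(materials)
    checklist = {req: check_requirement(req, facts) for req in requirements}
    if not supervisory_review(checklist):
        return {"state": "inconclusive", "missing": list(checklist)}
    missing = [r for r, s in checklist.items() if s == "missing"]
    if missing:
        # The third state: return the file with the gaps named.
        return {"state": "inconclusive", "missing": missing}
    decision = "granted" if all(s == "met" for s in checklist.values()) else "denied"
    return {"state": "determined", "decision": decision}


materials = "- able to work\n- actively seeking work\n- separated through no fault"
print(adjudicate(materials, {"able to work": True}))
# A partial record comes back inconclusive, with the missing facts named.
```

The design choice worth noticing: the determination is the last branch reached, and only after the checklist and the review both survive, which is the opposite of a model that produces an answer first and hedges second.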
SPEC achieved 89% accuracy overall, with 89% on complete cases and 89% on inconclusive cases alike. The significance is not just the score. It is that the system escaped the usual tradeoff between being willing to decide and being willing to defer.
That is the architecture lesson.
The other Stanford paper points in the same direction
This matters even more when read alongside the Stanford group's February paper, Benchmarking Legal RAG: The Promise and Limits of AI Statutory Surveys.
That paper found:
- Standard RAG: 70%
- Westlaw AI: 58%
- Lexis+ AI: 64%
- STARA: 83%
- STARA corrected: 92%
The point is not that commercial legal-research incumbents have bad data. The point is that authoritative data alone is not enough. The system built around the data matters.
That paper showed a custom statutory-research workflow outperforming commercial AI layers marketed for the same kind of work.
This new paper shows a structured adjudication workflow outperforming baseline and advanced prompting approaches on the question of when to decide.
The pattern is consistent.
In legal AI, model access is not the whole story. Authoritative legal content is not the whole story. The architecture that governs how the system retrieves, checks, defers, reviews, and moves work forward is often what determines whether the output is usable.
What buyers should ask now
The due-diligence questions for legal AI should be changing.
Not just:
- what model does it use?
- how large is the context window?
- how polished is the demo?
But also:
- does the system have a real inconclusive or needs-information state?
- can it identify which legal requirements were checked?
- can it show which facts were established and which were missing?
- can it return work for more fact-finding instead of forcing a determination?
- can a supervising attorney see why the system proceeded or why it deferred?
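Those diligence questions can be read as a data contract. A minimal sketch of the kind of auditable record that would answer them, with field names that are illustrative assumptions rather than any vendor's schema:

```python
from dataclasses import dataclass


@dataclass
class DeterminationRecord:
    """Illustrative audit record: one field per diligence question."""
    requirements_checked: list[str]    # which legal requirements were checked
    facts_established: dict[str, str]  # requirement -> supporting evidence
    facts_missing: list[str]           # what still has to be learned
    state: str                         # "determined" or "inconclusive"
    rationale: str                     # why the system proceeded or deferred

    def returned_for_fact_finding(self) -> bool:
        # Route the file back instead of forcing a determination.
        return self.state == "inconclusive" and bool(self.facts_missing)


rec = DeterminationRecord(
    requirements_checked=["able to work", "actively seeking work"],
    facts_established={"able to work": "claimant statement"},
    facts_missing=["actively seeking work"],
    state="inconclusive",
    rationale="work-search evidence not in record",
)
assert rec.returned_for_fact_finding()
```

If a product cannot populate a record like this, a supervising attorney has no way to see why the system proceeded or deferred, which is the last question on the list.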
Vendors optimizing for demos tend to prefer workflows where the system always produces something.
Vendors optimizing for outcomes should be comfortable with a stricter standard: sometimes the right output is not a determination. It is a better specification of what still has to be learned before a determination is possible.
That is a more legal way to think about AI.
Why this matters beyond adjudication
The Stanford paper is about unemployment-insurance adjudication, but the design lesson travels.
Lawyers live inside incomplete records.
A matter arrives without the full correspondence chain. A demand arrives without the attachments. A client explains what happened but leaves out the fact that changes the whole issue. An intake form captures the event but not the date that controls the deadline. A document set suggests one conclusion until the missing exhibit appears.
Legal work is full of moments where the right move is not "answer now."
It is:
- ask for more facts
- separate what is known from what is inferred
- hold the matter in a state that does not allow premature reliance
- route the work to review before anything leaves the system
That is why this paper is useful well beyond public-benefits administration. It puts empirical support behind something legal AI products should already have been building around.
The standard that matters
The paper cites a recent study finding that real unemployment adjudicators sought additional fact-finding in 87% of claims.
That should reset expectations.
In real legal work, deferral is not edge-case behavior. It is ordinary professional behavior. Good lawyers, judges, and adjudicators do not treat every incomplete file as a forced-choice exam. They know that one of the most important judgments in the process is recognizing when the record is not ready.
Legal AI needs that state too.
Not as vague uncertainty language. Not as a disclaimer after the answer. As a first-class workflow state.
That is still architecture.
Legal AI does not just need better answers.
It needs a trustworthy way to say: not enough information to decide.
FlowCounsel builds AI-enabled software for legal teams. FlowLawyers is the consumer-facing legal help platform with attorney discovery, legal aid routing, state-specific legal information, and document tools. Neither provides legal advice. Attorney supervision of legal AI output is required.
Sources
- Mohamed Afane, Emily Robitschek, Derek Ouyang, and Daniel E. Ho, Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication (arXiv:2604.19895, Apr. 21, 2026)
- Mohamed Afane, Emaan Hariri, Derek Ouyang, and Daniel E. Ho, Benchmarking Legal RAG: The Promise and Limits of AI Statutory Surveys (CSLAW 2026) https://dho.stanford.edu/wp-content/uploads/Stat_Surveys.pdf
- Daniel E. Ho research page https://dho.stanford.edu/research/