One of the laziest ideas in legal AI is that better systems will come from larger prompts.
If the model needs more context, just give it more. Add more documents, more history, more background, more examples, more instructions, more prior work. Maybe then it will understand the matter.
That instinct is wrong often enough to be a product smell.
The better direction is bounded memory.
Bigger prompts are not the same thing as better systems
When legal AI systems struggle, the first reaction is usually to increase the amount of material sent into the model:
- more matter documents
- more historical examples
- more style instructions
- more prior conversation
- more extracted facts
That can help in narrow situations. As a system design strategy, it breaks down quickly.
Large prompts create four predictable problems:
- token cost rises fast
- latency gets worse
- irrelevant context dilutes the actual task
- the system becomes less legible because nobody can tell what really mattered
This is not just a technical inefficiency. In legal work, it becomes a trust problem.
If the system answers a question or drafts a document using a giant pile of mixed context, the lawyer cannot easily see what was loaded, what was relevant, and what should have stayed out.
The better model is bounded memory
Bounded memory means the system stores context persistently outside the model and loads only what a specific task needs.
That sounds obvious. It is still not how many AI products are built.
Instead of treating the model prompt as the primary memory surface, the system should have:
- persistent records
- scoped retrieval rules
- typed facts where appropriate
- task-specific context assembly
- visible boundaries around what reaches a given run
The practical goal is simple:
the system should know what not to load
That is what makes memory useful.
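The assembly discipline described above can be sketched as a minimal context assembler. All names here (the `Record` fields, the scoping rules) are illustrative assumptions, not a reference to any particular product:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    """A persistent record stored outside the model."""
    record_id: str
    matter_id: str
    kind: str    # e.g. "fact", "document", "style_note"
    text: str

def assemble_context(store, matter_id, allowed_kinds, limit):
    """Load only what a specific task needs: scope first, then cap.

    Scope filtering happens before any relevance ranking, so records
    from other matters or irrelevant kinds can never reach the prompt.
    """
    in_scope = [r for r in store
                if r.matter_id == matter_id and r.kind in allowed_kinds]
    return in_scope[:limit]  # explicit boundary on what reaches the run

store = [
    Record("r1", "matter-A", "fact", "Closing date is 1 March."),
    Record("r2", "matter-A", "style_note", "Use defined terms."),
    Record("r3", "matter-B", "fact", "Unrelated matter."),
]

# A drafting task asks only for facts on matter-A:
loaded = assemble_context(store, "matter-A", {"fact"}, limit=10)
print([r.record_id for r in loaded])  # -> ['r1']
```

The point of the sketch is the ordering: the scope check is structural and runs before anything resembling search, which is what "knowing what not to load" looks like in code.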
Recent research is moving in the same direction
This is not just a legal-tech opinion. The broader multi-agent systems literature is moving the same way.
Governed Memory
Governed Memory describes a production architecture for multi-agent workflows with entity-scoped isolation, dual memory, and progressive context delivery.
The part that matters here is not the branding. It is the mechanism:
- persistent memory outside the model
- scoped retrieval before semantic search
- selective context delivery per step
The paper reports a 50.3% token reduction across a five-step workflow, with some steps seeing 86–90% savings when the system sends only the delta instead of reloading everything each time.
That is exactly the kind of result that should matter to legal AI buyers. Not because they care about token accounting in the abstract, but because lower context waste usually means lower cost, lower latency, and a clearer execution path.
Source:
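The delta mechanism is easy to illustrate with a toy workflow. This is not the paper's system, just the core idea: each step receives only context it has not already been given, instead of a full reload every time.

```python
def deliver_deltas(steps):
    """Progressive context delivery: each step gets only the records
    it has not already seen earlier in the workflow."""
    delivered = set()
    per_step = []
    for needed in steps:
        delta = [r for r in needed if r not in delivered]
        delivered.update(delta)
        per_step.append(delta)
    return per_step

# Five steps whose nominal context needs overlap heavily:
steps = [
    ["facts", "parties"],
    ["facts", "parties", "draft_v1"],
    ["facts", "draft_v1", "comments"],
    ["facts", "comments", "draft_v2"],
    ["facts", "draft_v2"],
]
full_reload = sum(len(s) for s in steps)                 # 13 items sent
delta_only = sum(len(d) for d in deliver_deltas(steps))  # 5 items sent
print(full_reload, delta_only)  # -> 13 5
```

Even in this toy, the savings compound toward the end of the workflow: the final step receives nothing new at all, because everything it needs was already delivered.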
Facts as First Class Objects
Facts as First Class Objects pushes the same idea in a different direction.
Its argument is that facts should not only survive as embedded text in a large prompt. They should be addressable, persistent objects.
That matters for legal AI because a system should not have to rediscover the same core matter facts from scratch in every run. Some context should be retrieved as facts, not re-explained as prose.
Source:
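A minimal sketch of the idea: a fact with a stable address and a provenance pointer, retrievable by key rather than rediscovered from prose. The schema here is a hypothetical illustration, not the paper's:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """An addressable, persistent fact rather than text buried in a prompt."""
    key: str          # stable address, e.g. "matter-A/closing_date"
    value: str
    source_doc: str   # provenance pointer back to the originating document

facts = {
    f.key: f for f in [
        Fact("matter-A/closing_date", "2025-03-01", "SPA-draft-4.docx"),
        Fact("matter-A/governing_law", "England and Wales", "SPA-draft-4.docx"),
    ]
}

# A later run retrieves the fact by address instead of re-extracting it:
fact = facts["matter-A/closing_date"]
print(fact.value, fact.source_doc)  # -> 2025-03-01 SPA-draft-4.docx
```

Because the fact carries its source, a reviewer can check where it came from without replaying the extraction.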
Multi-Agent Memory from a Computer Architecture Perspective
This paper reframes the problem correctly: multi-agent memory is a systems problem, not a prompt problem.
Its framing of context as an I/O, cache, and memory hierarchy is especially useful.
The practical lesson is simple: context assembly should be treated like a cache layer with scope, freshness, and consistency rules, not as a giant string builder attached to a model call.
Source:
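Treating context assembly as a cache layer can be sketched directly. This is a toy under assumed rules (a per-entry scope and a freshness deadline), not an implementation of the paper's hierarchy:

```python
import time

class ContextCache:
    """Context assembly as a cache layer: entries carry a scope and a
    freshness deadline, and reads check both before returning anything."""

    def __init__(self):
        self._entries = {}  # key -> (scope, value, expires_at)

    def put(self, key, scope, value, ttl_seconds):
        self._entries[key] = (scope, value, time.time() + ttl_seconds)

    def get(self, key, scope):
        entry = self._entries.get(key)
        if entry is None:
            return None
        entry_scope, value, expires_at = entry
        if entry_scope != scope:      # scope rule: wrong matter, no read
            return None
        if time.time() > expires_at:  # freshness rule: stale, no read
            del self._entries[key]    # consistency rule: evict stale entry
            return None
        return value

cache = ContextCache()
cache.put("parties", scope="matter-A", value=["Acme", "Birch"], ttl_seconds=60)
print(cache.get("parties", scope="matter-A"))  # -> ['Acme', 'Birch']
print(cache.get("parties", scope="matter-B"))  # -> None
```

The contrast with a "giant string builder" is that a miss here is a policy decision, not an accident: the caller learns that the data was out of scope or stale, rather than silently receiving it anyway.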
Why this matters more in legal work
In legal AI, bigger prompts are not just inefficient. They are dangerous in a very specific way.
Legal work is sensitive to:
- confidentiality
- privilege
- provenance
- review boundaries
- task specificity
A system that responds to complexity by shoving more documents and more matter history into the next prompt is not just spending more. It is increasing the surface area for confusion and overexposure.
That is why legal AI should care so much about bounded retrieval.
The right system should be able to answer questions like:
- what information was loaded for this run?
- why was it loaded?
- what remained outside the run?
- what persisted after the run?
Those are memory-boundary questions, not model-quality questions.
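The four questions above map naturally onto an auditable per-run record. A minimal sketch, with hypothetical field and record names:

```python
from dataclasses import dataclass, field

@dataclass
class RunManifest:
    """An auditable record of one run's memory boundary."""
    run_id: str
    loaded: dict = field(default_factory=dict)    # record_id -> why it was loaded
    excluded: list = field(default_factory=list)  # considered, but kept out
    persisted: list = field(default_factory=list) # written back after the run

    def load(self, record_id, reason):
        self.loaded[record_id] = reason

manifest = RunManifest(run_id="run-17")
manifest.load("fact:closing_date", reason="needed by drafting step")
manifest.excluded.append("doc:unrelated-advice-memo")
manifest.persisted.append("fact:new_signature_date")

# Each boundary question reads off a field:
print(manifest.loaded)     # what was loaded for this run, and why
print(manifest.excluded)   # what remained outside the run
print(manifest.persisted)  # what persisted after the run
```

A supervising attorney reviewing the manifest sees the boundary itself, not just the output, which is what makes the supervision specific rather than vague.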
Bigger prompts also create weaker supervision
Lawyers and legal ops teams need to supervise the system, not just admire its output.
That gets harder as prompt construction gets broader and less disciplined.
If every run includes:
- the full matter history
- prior documents
- old examples
- style notes
- whatever else seemed plausibly relevant
then review becomes harder, not easier.
The supervising attorney cannot easily tell what drove the output. The system starts to feel smart in a vague way instead of trustworthy in a specific way.
That is the wrong direction for legal AI.
What buyers should be probing instead
Many legal AI buyers still get pulled toward context-window bragging rights.
That is a weak proxy for system quality.
The more revealing issues are:
- how the product stores context outside the model
- how it decides what to load
- how it avoids loading irrelevant or sensitive material unnecessarily
- how visible the boundary is around each run
The real win is not maximum context capacity. It is context discipline.
The direction of travel
The strongest systems are moving toward:
- scoped persistent memory
- typed facts where useful
- explicit context assembly
- progressive delivery instead of repeated full reloads
- clearer separation between persistent records and per-run context
That is good news for legal AI, because legal work needs exactly those properties.
The future of legal AI is not a bigger prompt window full of client documents.
It is a system that knows what belongs in memory, what belongs in the current run, and what should stay out.