An evaluation framework for legal AI platforms grounded in architectural properties (governed execution, firm-scoped retrieval, approval gates, memory model) rather than in feature checklists. It sets out the categories firms can weigh before signing.
Evaluation Methodology · Five architectural criteria · Buyer-side framework
FlowCounsel Research · 2026-05-29
Abstract
Buyer guidance for legal AI is dominated by feature checklists. Those frames reward the longest feature list rather than the strongest architecture, and they do not surface the system properties that determine whether legal work survives review under sustained use. This paper proposes five architectural criteria grounded in published authority: the NIST AI Risk Management Framework (NIST, 2023; NIST, 2024), the ABA Model Rules of Professional Conduct, ABA Formal Opinion 512 (2024), and recent state-bar AI guidance (California State Bar, 2023; Florida Bar, 2024; New York State Bar, 2024). Each criterion is translated into a diagnostic vendor question, a pattern that satisfies it, and a pattern that does not. Three operational signals (migration friction, pricing transparency, contract flexibility) supplement the architectural ones.
Current buyer guidance for legal AI is dominated by feature checklists and head-to-head scoring rubrics: how many integrations, how many practice areas, how many AI features, what the per-seat price is. Those frames reward the vendor with the longest feature list, not the vendor with the best architecture, and they do not surface the system properties that the regulatory frame requires firms to evaluate (NIST, 2023; ABA Model Rule 1.1 Comment 8).
The architectural question is different. Does the system carry the properties that determine whether legal work survives review under sustained use? Most legal AI today is a single model wrapped in a workspace UI. That is a feature surface, not an operating layer (FlowCounsel™, "Compressed Output Is Not Compressed System," April 2026; FlowCounsel, "How to Tell If Your Legal AI Vendor Built a Product or Prompted One," April 2026).
This paper sets out five architectural criteria grounded in the NIST AI Risk Management Framework (NIST, 2023), the ABA Model Rules, ABA Formal Opinion 512 (2024), and the published state-bar guidance from California (2023), Florida (2024), and New York (2024). Each criterion is followed by a diagnostic vendor question, the architectural pattern that satisfies it, and the pattern that does not.
A 16-feature checklist can describe a glorified CRM with a chat box. The architectural questions reveal whether anything underneath has been governed.
The five criteria are derived from the NIST AI RMF Govern-Map-Measure-Manage functions (NIST, 2023), the GenAI-specific risk categories in the NIST GenAI Profile (NIST, 2024), and the supervision and confidentiality framing established in ABA Formal Opinion 512 (2024) and the state-bar opinions cited above. Each criterion answers a question the regulatory frame already obligates firms to answer.
Diagram · The five architectural criteria
The properties that separate a tool from an operating layer. Each criterion comes with a vendor question that resolves it.
A governed execution architecture treats every system action as a workflow_run with structured step_runs underneath. Each run carries the context_manifest used, the inputs, the tools called, the verifier results, and the produced artifact. The runs are auditable: a firm can replay any run and inspect what happened. This property maps directly to NIST AI RMF Manage 4.1 (continuous monitoring with structured records) and to the audit-trail expectations articulated in ABA Formal Op. 512 (2024).
A non-governed alternative looks similar from the outside. There is a chat session, an output is produced, and the firm trusts it. Underneath, the run is a sequence of API calls that left log lines but no structured record. There is no replay, no provenance, no inspectable manifest. The pattern is documented in FlowCounsel's prior critique of orchestration framing ("Orchestration Is Not the Category," May 2026).
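The contrast between a structured run record and loose log lines can be made concrete with a minimal sketch. It assumes a simplified record shape: the class names mirror the workflow_run, step_runs, and context_manifest terms used above, but every field name and the `replay` helper are hypothetical illustrations, not a description of any vendor's schema. Here "replay" means reading back the stored record step by step, not re-executing a model.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class StepRun:
    """One structured step inside a run (field names are illustrative)."""
    name: str
    inputs: dict[str, Any]
    tools_called: tuple[str, ...]
    verifier_results: dict[str, bool]
    output: str


@dataclass
class WorkflowRun:
    """A run that keeps its manifest, its steps, and its final artifact together."""
    run_id: str
    context_manifest: tuple[str, ...]  # ids of the records loaded for this run
    step_runs: list[StepRun] = field(default_factory=list)
    artifact: Optional[str] = None

    def record(self, step: StepRun) -> None:
        # Each step is appended as a structured record, not written as a log line.
        self.step_runs.append(step)
        self.artifact = step.output

    def replay(self) -> list[str]:
        # Read the stored steps back in order so a reviewer can inspect them.
        lines = []
        for i, s in enumerate(self.step_runs, start=1):
            verified = all(s.verifier_results.values())
            lines.append(
                f"{i}: {s.name} tools={list(s.tools_called)} verified={verified}"
            )
        return lines


run = WorkflowRun(run_id="run-001", context_manifest=("rec-17", "rec-42"))
run.record(StepRun("draft_clause", {"matter": "m-1"}, ("retriever",),
                   {"citations_resolve": True}, "draft v1"))
run.record(StepRun("verify_clause", {"draft": "draft v1"}, (),
                   {"citations_resolve": False}, "draft v1 (flagged)"))
for line in run.replay():
    print(line)
```

A useful vendor question follows from the sketch: if a step's verifier failed, can the firm find that failure from the run record alone, without reading application logs?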
Every retrieval the system performs is scoped: firm, client, matter, practice area, jurisdiction, specialist type, procedural stage. The context manifest records which records were eligible, why each was eligible, what was excluded, and which promotion or approval basis supports each loaded record. The discipline is required by ABA Model Rule 1.6 (confidentiality safeguarding obligation under "reasonable efforts") and by the data-integrity risk category in the NIST GenAI Profile (NIST, 2024).
The negative case: a single context window loaded with whatever embeddings happened to match a query, with no record of why each item was loaded. That is search-style retrieval applied to legal work. It fails the audit test under both ABA Formal Op. 512 (2024) and NIST AI RMF Measure 2.7 (test and evaluation must produce traceable evidence).
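A short sketch shows what a context manifest adds over similarity search alone. It is written under stated assumptions: only three of the scope dimensions listed above (firm, matter, jurisdiction) are modeled, a single `approved` flag stands in for the promotion or approval basis, and the names `Record`, `Scope`, and `build_manifest` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str
    firm: str
    matter: str
    jurisdiction: str
    approved: bool  # stand-in for a promotion or approval basis


@dataclass(frozen=True)
class Scope:
    firm: str
    matter: str
    jurisdiction: str


def build_manifest(records: list[Record], scope: Scope) -> dict:
    """Decide eligibility per record and keep the reason for every decision."""
    loaded: list[tuple[str, str]] = []
    excluded: list[tuple[str, str]] = []
    for r in records:
        if r.firm != scope.firm:
            excluded.append((r.record_id, "different firm"))
        elif r.matter != scope.matter:
            excluded.append((r.record_id, "different matter"))
        elif r.jurisdiction != scope.jurisdiction:
            excluded.append((r.record_id, "different jurisdiction"))
        elif not r.approved:
            excluded.append((r.record_id, "no approval basis"))
        else:
            loaded.append((r.record_id, "in scope and approved"))
    return {"scope": scope, "loaded": loaded, "excluded": excluded}


candidates = [
    Record("a", "firm-1", "m-1", "CA", True),
    Record("b", "firm-2", "m-1", "CA", True),
    Record("c", "firm-1", "m-1", "CA", False),
    Record("d", "firm-1", "m-1", "NY", True),
]
manifest = build_manifest(candidates, Scope("firm-1", "m-1", "CA"))
print(manifest["loaded"])
print(manifest["excluded"])
```

The design point is that exclusions are recorded as well as inclusions. A manifest that lists only what was loaded cannot show that another firm's record was considered and rejected.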
Many vendors describe "human in the loop" as a feature. The phrase is too weak to do the work it is asked to do (FlowCounsel, "Why Review Boundaries Matter More Than Model Choice," April 2026). The architectural question is whether legal work can move into the world without a defined approval state transition. If the answer is "the user is expected to review the output," the protection is a policy hope. If the answer is "externally effective work cannot transition to approved without explicit state change and audited reviewer identity," the protection is architectural. That architectural form is what ABA Model Rule 5.3 and ABA Formal Op. 512 (2024) require lawyers to ensure when supervising nonlawyer assistance.
The difference matters because the policy version fails at scale. As a firm onboards more users and more workflows, "the user is expected to review" gets bypassed in practice. The architectural version cannot be bypassed because the state transition is part of the data model. The companion paper, "Approval-Gated Execution as an Architectural Property," develops the boundary in detail.
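What "part of the data model" can mean is easiest to see in a small state machine. The following is a minimal sketch, assuming four hypothetical states and a boolean `is_attorney` flag in place of a real identity and role system; it is not the design developed in the companion paper.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    IN_REVIEW = "in_review"
    APPROVED = "approved"
    RELEASED = "released"


# The only transitions the model accepts. There is no DRAFT -> RELEASED edge.
ALLOWED = {
    State.DRAFT: {State.IN_REVIEW},
    State.IN_REVIEW: {State.APPROVED, State.DRAFT},
    State.APPROVED: {State.RELEASED},
    State.RELEASED: set(),
}


class TransitionError(Exception):
    """Raised when a requested state change is not permitted."""


@dataclass
class WorkProduct:
    title: str
    state: State = State.DRAFT
    audit_log: list[tuple[str, str, str, str]] = field(default_factory=list)

    def transition(self, target: State, actor: str, is_attorney: bool = False) -> None:
        if target not in ALLOWED[self.state]:
            raise TransitionError(f"{self.state.value} -> {target.value} is not allowed")
        if target is State.APPROVED and not is_attorney:
            raise TransitionError("approval requires an attorney reviewer")
        # Every accepted transition records who made it and when (UTC).
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append((self.state.value, target.value, actor, stamp))
        self.state = target


letter = WorkProduct("Demand letter")
try:
    letter.transition(State.RELEASED, actor="paralegal-7")
except TransitionError as exc:
    print("blocked:", exc)

letter.transition(State.IN_REVIEW, actor="paralegal-7")
letter.transition(State.APPROVED, actor="attorney-2", is_attorney=True)
letter.transition(State.RELEASED, actor="attorney-2", is_attorney=True)
print(letter.state.value, len(letter.audit_log))
```

In the policy version, nothing in the system stops the first call. In this sketch the first call fails, and the failure does not depend on anyone remembering to review.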
Memory in legal AI is not "remember earlier in this chat." It is the firm's accumulated judgment: which clauses the partner refuses, which jurisdictional variations apply, which procedural sequences worked. The three-class taxonomy (episodic, semantic, procedural) is grounded in cognitive-science research on long-term memory systems and is treated in detail in the companion paper "Memory Class Architectures for Firm-Scoped Legal Intelligence."
For evaluation purposes the test is whether the vendor can articulate the promotion path from an attorney edit to a firm rule, the required states, and the approving actor. A "we learn from your edits" claim with no articulation of these structures is a marketing claim, not an architectural one.
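The three things the test asks a vendor to articulate (promotion path, required states, approving actor) can be sketched in a few lines. The state names, the single approver field, and the helper functions below are assumptions made for illustration; the three memory classes are taken from the taxonomy named above, and the companion paper may structure promotion differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MemoryClass(Enum):
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    PROCEDURAL = "procedural"


class PromotionState(Enum):
    OBSERVED = "observed"   # an attorney edit was captured
    PROPOSED = "proposed"   # the edit was generalized into a candidate rule
    APPROVED = "approved"   # a named actor accepted it as a firm rule


@dataclass
class MemoryItem:
    text: str
    memory_class: MemoryClass
    source_edit_id: str
    state: PromotionState = PromotionState.OBSERVED
    approved_by: Optional[str] = None


def propose(item: MemoryItem) -> None:
    if item.state is not PromotionState.OBSERVED:
        raise ValueError("only observed edits can be proposed")
    item.state = PromotionState.PROPOSED


def approve(item: MemoryItem, approver: str) -> None:
    if item.state is not PromotionState.PROPOSED:
        raise ValueError("only proposed rules can be approved")
    if not approver:
        raise ValueError("an approving actor must be named")
    item.state = PromotionState.APPROVED
    item.approved_by = approver


def firm_rules(items: list[MemoryItem]) -> list[MemoryItem]:
    # Only approved items are eligible to shape future drafts.
    return [i for i in items if i.state is PromotionState.APPROVED]


edit = MemoryItem("Strike unilateral indemnification clauses",
                  MemoryClass.SEMANTIC, source_edit_id="edit-88")
other = MemoryItem("File the notice before the motion",
                   MemoryClass.PROCEDURAL, source_edit_id="edit-91")
propose(edit)
approve(edit, approver="partner-1")
print([r.text for r in firm_rules([edit, other])])
```

A "we learn from your edits" system, by contrast, has no equivalent of `approved_by`: an edit influences later output without any recorded actor having accepted it as a rule.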
Most legal AI splits between public discovery surfaces (intake forms, directory listings, lead capture) and private case management. Firms re-enter the same context across that split because the systems do not share an operating layer. The architecturally important property is whether the intake structure and source attribution travel with the matter from origin to resolution (FlowCounsel, "The Front Office and Back Office of Legal Are Becoming One System," May 2026).
The intake side of this boundary additionally implicates the consent, disclosure, and unauthorized-practice-of-law (UPL) constraints addressed in Florida Bar Ethics Opinion 24-1 (2024) for client-facing automated systems. A unified operating layer makes those constraints architecturally enforceable; a separate-products handshake leaves them as policy.
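The idea that intake structure and source attribution "travel with the matter" can be illustrated with a small sketch. Everything here is hypothetical: the `Intake` and `Matter` shapes, the `open_matter` helper, and in particular the `ai_disclosure_shown` flag, which is a placeholder for whatever disclosure or consent record a firm's own obligations call for and is not a statement of what any bar opinion requires.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Intake:
    intake_id: str
    source: str                      # e.g. "directory listing", "web form"
    ai_disclosure_shown: bool        # placeholder for a recorded disclosure
    answers: tuple[tuple[str, str], ...]


@dataclass
class Matter:
    matter_id: str
    intake: Intake                   # the original intake stays attached
    events: list[str] = field(default_factory=list)


def open_matter(matter_id: str, intake: Intake) -> Matter:
    # The constraint is enforced where the matter is created, not left to policy.
    if not intake.ai_disclosure_shown:
        raise ValueError("intake has no recorded disclosure; matter not opened")
    matter = Matter(matter_id, intake)
    matter.events.append(f"opened from intake {intake.intake_id} via {intake.source}")
    return matter


intake = Intake("in-301", "directory listing", True,
                (("issue", "lease dispute"), ("jurisdiction", "FL")))
matter = open_matter("m-77", intake)
print(matter.intake.source, dict(matter.intake.answers)["jurisdiction"])
```

With separate intake and case-management products, the equivalent of `matter.intake` is usually a re-typed summary, and the source attribution is lost at the handoff.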
Three operational signals (migration friction, pricing transparency, contract flexibility) supplement the architectural criteria. They sit outside the regulatory frame, but they reveal how the vendor will treat the firm once the firm signs.
A firm running an evaluation can apply the architectural questions before the feature questions. The answers resolve most of the vendor field quickly. The vendors who can answer them coherently are the small set worth feature-evaluating in depth.
Feature breadth is easier to add than architectural discipline. A platform with a strong operating layer and a thinner feature surface closes the feature gap in a year. A platform with a long feature list but no governed execution, no firm-scoped retrieval, and no memory architecture cannot retrofit those properties without rebuilding (FlowCounsel, "The Next Category in Legal AI Is Governed Execution," May 2026; FlowCounsel, "Trusted Sources Are Necessary but Not Sufficient," May 2026).
Feature breadth catches up in a year. Architectural discipline does not retrofit.
References
How to cite this paper
FlowCounsel Research (2026). Evaluating legal AI operating systems. FlowCounsel™. https://flowcounsel.com/research/evaluation-framework
The related papers in the FlowCounsel research program and the architectural thesis they translate into product.
Research · Intelligence
Memory class architectures for firm-scoped legal intelligence — the depth treatment of criterion 4.
Research · Governance
Approval-gated execution as an architectural property — the depth treatment of criterion 3.
Architecture
The architectural thesis this evaluation framework is grounded in.