Evaluating legal AI operating systems

Abstract

Buyer guidance for legal AI is dominated by feature checklists. Those frames reward the longest feature list rather than the strongest architecture, and they do not surface the system properties that determine whether legal work survives review under sustained use. This paper proposes five architectural criteria grounded in published authority: the NIST AI Risk Management Framework (NIST, 2023; NIST, 2024), the ABA Model Rules of Professional Conduct, ABA Formal Opinion 512 (2024), and recent state-bar AI guidance (California State Bar, 2023; Florida Bar, 2024; New York State Bar, 2024). Each criterion is translated into a diagnostic vendor question, a pattern that satisfies it, and a pattern that does not. Three operational signals (migration friction, pricing transparency, contract flexibility) supplement the architectural ones.

§1 Thesis

Feature checklists are the wrong evaluation frame for legal AI.

Current buyer guidance for legal AI is dominated by feature checklists and head-to-head scoring rubrics: how many integrations, how many practice areas, how many AI features, what the per-seat price is. Those frames reward the vendor with the longest feature list, not the vendor with the best architecture, and they do not surface the system properties that the regulatory frame requires firms to evaluate (NIST, 2023; ABA Model Rule 1.1 Comment 8).

The architectural question is different. Does the system carry the properties that determine whether legal work survives review under sustained use? Most legal AI today is a single model wrapped in a workspace UI. That is a feature surface, not an operating layer (FlowCounsel™, "Compressed Output Is Not Compressed System," April 2026; FlowCounsel, "How to Tell If Your Legal AI Vendor Built a Product or Prompted One," April 2026).

This paper sets out five architectural criteria grounded in the NIST AI Risk Management Framework (NIST, 2023), the ABA Model Rules, ABA Formal Opinion 512 (2024), and the published state-bar guidance from California (2023), Florida (2024), and New York (2024). Each criterion is followed by a diagnostic vendor question, the architectural pattern that satisfies it, and the pattern that does not.

A 16-feature checklist can describe a glorified CRM with a chat box. The architectural questions reveal whether anything underneath has been governed.

§2 Five architectural criteria

The properties that separate a tool from an operating layer.

The five criteria are derived from the NIST AI RMF Govern-Map-Measure-Manage functions (NIST, 2023), the GenAI-specific risk categories in the NIST GenAI Profile (NIST, 2024), and the supervision and confidentiality framing established in ABA Formal Opinion 512 (2024) and the state-bar opinions cited above. Each criterion answers a question the regulatory frame already obligates firms to answer.

Governed execution layer. Workflow runs, step runs, and context manifests are first-class records, not logs. Grounded in NIST AI RMF Manage 4.1 (incident response and continuous monitoring require structured records).
Firm-scoped retrieval boundary. Every retrieval reads only what is eligible for this firm, client, matter, jurisdiction, and specialist context. Grounded in ABA Model Rule 1.6 (confidentiality) and the data-integrity risk category in the NIST GenAI Profile (NIST, 2024).
Approval-gated execution as architecture, not policy. Externally effective work cannot leave the system without explicit review state transitions. Grounded in ABA Model Rule 5.3 and ABA Formal Op. 512 (2024) on lawyer supervisory duties.
Memory architecture. Three classes governed independently (episodic, semantic, procedural), with promotion discipline. Treated in depth in the companion paper "Memory Class Architectures for Firm-Scoped Legal Intelligence."
Public-layer integration. The public discovery surface and the private matter execution share one operating layer so intake structure and source attribution travel with the matter (FlowCounsel, "The Front Office and Back Office of Legal Are Becoming One System," May 2026).

Diagram · The five architectural criteria. Each criterion comes with a vendor question that resolves it.

Feature breadth catches up in a year. Architectural discipline does not retrofit. The vendors who can answer all five questions coherently are the small set worth feature-evaluating in depth.

§3 Criterion 1

Governed execution: every run is a record, not a log line.

A governed execution architecture treats every system action as a workflow_run with structured step_runs underneath. Each run carries the context_manifest used, the inputs, the tools called, the verifier results, and the produced artifact. The runs are auditable: a firm can replay any run and inspect what happened. This property maps directly to NIST AI RMF Manage 4.1 (continuous monitoring with structured records) and to the audit-trail expectations articulated in ABA Formal Op. 512 (2024).

A non-governed alternative looks similar from the outside. There is a chat session, an output is produced, and the firm trusts it. Underneath, the run is a sequence of API calls that left log lines but no structured record. There is no replay, no provenance, no inspectable manifest. The pattern is documented in FlowCounsel's prior critique of orchestration framing ("Orchestration Is Not the Category," May 2026).

Diagnostic vendor question: "Show me the structured record of the last 24 hours of agent activity on a test matter. What did each agent load, what did it call, what did it produce, and what verifier results were checked?"
Architectural pattern that satisfies: workflow runs and step runs as records the firm can query, not as opaque trace data sold back as analytics.
Pattern that does not satisfy: "audit trail" sold as a feature where the surface is a list of timestamps without the inputs, context, or verifier results that led to each output.
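
To make the distinction between a record and a log line concrete, here is a minimal sketch of what a queryable run record might look like. The names workflow_run, step_runs, and context_manifest come from the text above; every field name, the in-memory storage, and the helper activity_since are assumptions made for illustration, not a description of any vendor's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class StepRun:
    """One step inside a workflow run, kept as a structured record."""
    step_id: str
    agent: str
    context_manifest_id: str           # which manifest the step loaded
    inputs: Dict[str, Any]
    tools_called: Tuple[str, ...]
    verifier_results: Dict[str, bool]  # verifier name -> passed?
    artifact_ref: Optional[str]        # pointer to the produced artifact


@dataclass
class WorkflowRun:
    """A run on a matter; step runs are appended as they complete."""
    run_id: str
    matter_id: str
    started_at: datetime
    step_runs: List[StepRun] = field(default_factory=list)

    def record(self, step: StepRun) -> None:
        self.step_runs.append(step)

    def failed_verifiers(self) -> List[Tuple[str, str]]:
        """Return (step_id, verifier_name) pairs that did not pass."""
        return [
            (step.step_id, name)
            for step in self.step_runs
            for name, passed in step.verifier_results.items()
            if not passed
        ]


def activity_since(runs: List[WorkflowRun], matter_id: str,
                   since: datetime) -> List[StepRun]:
    """Hypothetical query: every step on a matter for runs started after a cutoff."""
    return [
        step
        for run in runs
        if run.matter_id == matter_id and run.started_at >= since
        for step in run.step_runs
    ]


# Usage: the "last 24 hours on a test matter" question from the text.
now = datetime.now(timezone.utc)
run = WorkflowRun("run-1", "test-matter", started_at=now)
run.record(StepRun("s1", "drafter", "cm-1", {"task": "draft NDA"},
                   ("clause_search",), {"citation_check": True}, "artifact-1"))
recent = activity_since([run], "test-matter", now - timedelta(hours=24))
```

The point of the sketch is that each answer to the diagnostic question (what was loaded, called, produced, and verified) is a field on a record the firm can query, rather than something reconstructed from timestamps.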

§4 Criterion 2

Firm-scoped retrieval: bounded context manifests, not blended context.

Every retrieval the system performs is scoped: firm, client, matter, practice area, jurisdiction, specialist type, procedural stage. The context manifest records which records were eligible, why each was eligible, what was excluded, and which promotion or approval basis supports each loaded record. The discipline is required by ABA Model Rule 1.6 (confidentiality safeguarding obligation under "reasonable efforts") and by the data-integrity risk category in the NIST GenAI Profile (NIST, 2024).

The negative case: a single context window loaded with whatever embeddings happened to match a query, with no record of why each item was loaded. That is search-style retrieval applied to legal work. It fails the audit test under both ABA Op. 512 and NIST AI RMF Measure 2.7 (test and evaluation must produce traceable evidence).

Diagnostic vendor question: "When the system retrieves firm patterns for a new run on Client A in Jurisdiction B, can the firm see exactly which records were loaded and why each was eligible?"
Architectural pattern that satisfies: context manifests as first-class records, retrieval scoped by structured policy, exclusions logged.
Pattern that does not satisfy: vector search across all firm content without scope discipline, "personalization" that cannot be inspected.
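
As an illustration of a bounded context manifest, the sketch below filters candidate records against a scope and writes down why each one was loaded or excluded. It covers only four of the scope dimensions named above, and the Scope, Record, and ContextManifest types, the "*" wildcard convention, and the reason strings are all assumptions for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Scope:
    firm: str
    client: str
    matter: str
    jurisdiction: str


@dataclass(frozen=True)
class Record:
    record_id: str
    firm: str
    client: str        # "*" marks a firm-wide record (hypothetical convention)
    jurisdiction: str  # "*" marks a jurisdiction-neutral record
    approved: bool     # whether a promotion or approval basis exists


@dataclass
class ContextManifest:
    scope: Scope
    loaded: Dict[str, str] = field(default_factory=dict)    # record_id -> why eligible
    excluded: Dict[str, str] = field(default_factory=dict)  # record_id -> why excluded


def build_manifest(scope: Scope, candidates: List[Record]) -> ContextManifest:
    """Apply the scope policy and log every decision, including exclusions."""
    manifest = ContextManifest(scope)
    for rec in candidates:
        if rec.firm != scope.firm:
            manifest.excluded[rec.record_id] = "different firm"
        elif rec.client not in ("*", scope.client):
            manifest.excluded[rec.record_id] = "different client"
        elif rec.jurisdiction not in ("*", scope.jurisdiction):
            manifest.excluded[rec.record_id] = "different jurisdiction"
        elif not rec.approved:
            manifest.excluded[rec.record_id] = "no approval basis"
        else:
            manifest.loaded[rec.record_id] = "in scope with approval basis"
    return manifest


# Usage: a run for Client A in Jurisdiction B.
scope = Scope(firm="firm-1", client="client-a", matter="m-1", jurisdiction="B")
manifest = build_manifest(scope, [
    Record("r1", "firm-1", "client-a", "B", approved=True),
    Record("r2", "firm-1", "client-z", "B", approved=True),
    Record("r3", "firm-1", "*", "*", approved=False),
])
```

A similarity score can still rank what survives the filter. The design choice shown here is that eligibility is decided by structured policy first and that the decision itself is stored.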

§5 Criterion 3

Approval-gated execution as architecture, not as policy hope.

Many vendors describe "human in the loop" as a feature. The phrase is too weak to do the work it is asked to do (FlowCounsel, "Why Review Boundaries Matter More Than Model Choice," April 2026). The architectural question is whether legal work can move into the world without a defined approval state transition. If the answer is "the user is expected to review the output," the protection is a policy hope. If the answer is "externally effective work cannot transition to approved without explicit state change and audited reviewer identity," the protection is architectural. That architectural form is what ABA Model Rule 5.3 and ABA Formal Op. 512 (2024) require lawyers to ensure when supervising nonlawyer assistance.

The difference matters because the policy version fails at scale. As a firm onboards more users and more workflows, "the user is expected to review" gets bypassed in practice. The architectural version cannot be bypassed because the state transition is part of the data model. The companion paper, "Approval-Gated Execution as an Architectural Property," develops the boundary in detail.

Diagnostic vendor question: "Is there a class of system action that produces externally effective output without an explicit approve state transition? Show me the data model."
Architectural pattern that satisfies: explicit approval state machine, externally-effective work gated behind reviewer-identity-bound transitions, audit trail of every approval (consistent with ABA Op. 512, 2024).
Pattern that does not satisfy: approve/reject UI without state-machine enforcement, "you can review before sending" framed as a feature rather than a system property.
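
The difference between an approve button and an enforced state machine can be sketched in a few lines. In the example below, work product cannot reach the externally effective state without passing through an approval transition that carries a reviewer identity. The state names, the transition table, and the WorkProduct class are assumptions chosen for illustration; a real system would enforce the same rule in its persistence layer.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class State(Enum):
    DRAFT = "draft"
    IN_REVIEW = "in_review"
    APPROVED = "approved"
    REJECTED = "rejected"
    SENT = "sent"  # externally effective


# The only transitions the data model permits (assumed for this sketch).
ALLOWED = {
    (State.DRAFT, State.IN_REVIEW),
    (State.IN_REVIEW, State.APPROVED),
    (State.IN_REVIEW, State.REJECTED),
    (State.REJECTED, State.DRAFT),
    (State.APPROVED, State.SENT),
}

# Review decisions must be bound to a reviewer identity.
REVIEWER_REQUIRED = {State.APPROVED, State.REJECTED}


class TransitionError(Exception):
    """Raised when a transition is not permitted by the state machine."""


@dataclass
class WorkProduct:
    doc_id: str
    state: State = State.DRAFT
    audit: List[Tuple[State, State, Optional[str]]] = field(default_factory=list)

    def transition(self, target: State, reviewer: Optional[str] = None) -> None:
        if (self.state, target) not in ALLOWED:
            raise TransitionError(f"{self.state.value} -> {target.value} is not allowed")
        if target in REVIEWER_REQUIRED and not reviewer:
            raise TransitionError(f"{target.value} requires a reviewer identity")
        self.audit.append((self.state, target, reviewer))
        self.state = target


# Usage: sending is only reachable through an audited approval.
letter = WorkProduct("demand-letter-1")
letter.transition(State.IN_REVIEW)
letter.transition(State.APPROVED, reviewer="attorney-42")
letter.transition(State.SENT)
```

Because there is no (DRAFT, SENT) entry in the table, skipping review is not a discouraged path but an impossible one, which is the property the diagnostic question asks the vendor to show in the data model.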

§6 Criterion 4

Memory architecture: three classes governed independently.

Memory in legal AI is not "remember earlier in this chat." It is the firm's accumulated judgment: which clauses the partner refuses, which jurisdictional variations apply, which procedural sequences worked. The three-class taxonomy (episodic, semantic, procedural) is grounded in cognitive-science research on long-term memory systems and is treated in detail in the companion paper "Memory Class Architectures for Firm-Scoped Legal Intelligence."

For evaluation purposes the test is whether the vendor can articulate the promotion path from an attorney edit to a firm rule, the required states, and the approving actor. A "we learn from your edits" claim with no articulation of these structures is a marketing claim, not an architectural one.

Diagnostic vendor question: "When the system learns from an attorney redline, what is the promotion path from the redline to a firm rule? What are the required states? Who approves promotion?"
Architectural pattern that satisfies: explicit memory classes, promotion discipline with defined states (candidate, under review, approved, rejected, superseded, recalled), revocable patterns, provenance from rule back to source diffs.
Pattern that does not satisfy: "we learn from your edits" with no articulation of which class of memory, what governance, or how to revoke a learned pattern that turns out to be wrong.
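
One way to picture a promotion path is as a small state machine over a learned pattern. The three memory classes and the six state names below are taken from the text; which transitions are permitted between those states is an assumption made for this sketch, as are the Pattern type and its field names. The companion paper should be treated as the authority on the actual design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

MEMORY_CLASSES = {"episodic", "semantic", "procedural"}

# Assumed transition table over the states named in the text.
TRANSITIONS: Dict[str, Set[str]] = {
    "candidate": {"under_review"},
    "under_review": {"approved", "rejected"},
    "approved": {"superseded", "recalled"},  # approved patterns stay revocable
    "rejected": set(),
    "superseded": set(),
    "recalled": set(),
}


@dataclass
class Pattern:
    pattern_id: str
    memory_class: str
    source_diffs: Tuple[str, ...]  # provenance back to the attorney edits
    state: str = "candidate"
    history: List[Tuple[str, str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.memory_class not in MEMORY_CLASSES:
            raise ValueError(f"unknown memory class: {self.memory_class}")
        if not self.source_diffs:
            raise ValueError("a pattern needs at least one source diff")

    def move(self, target: str, actor: str) -> None:
        """Change state, recording who approved the change."""
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state} -> {target} is not allowed")
        self.history.append((self.state, target, actor))
        self.state = target

    def usable(self) -> bool:
        """Only approved patterns may influence future runs."""
        return self.state == "approved"


# Usage: a redline becomes a firm rule, then is recalled when it proves wrong.
rule = Pattern("no-unilateral-indemnity", "semantic", ("redline-881",))
rule.move("under_review", actor="associate-7")
rule.move("approved", actor="partner-2")
was_usable = rule.usable()
rule.move("recalled", actor="partner-2")
```

A vendor answering the diagnostic question well should be able to point to something with this shape: a named class, a list of states, an approving actor per transition, and a link from the rule back to the edits that produced it.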

§7 Criterion 5

Public-layer integration: the front door and the matter file on one operating layer.

Most legal AI splits between public discovery surfaces (intake forms, directory listings, lead capture) and private case management. Firms re-enter the same context across that split because the systems do not share an operating layer. The architecturally important property is whether the intake structure and source attribution travel with the matter from origin to resolution (FlowCounsel, "The Front Office and Back Office of Legal Are Becoming One System," May 2026).

The intake side of this boundary additionally implicates the consent, disclosure, and unauthorized-practice-of-law (UPL) constraints addressed in Florida Bar Ethics Opinion 24-1 (2024) for client-facing automated systems. A unified operating layer makes those constraints architecturally enforceable; a separate-products handshake leaves them as policy.

Diagnostic vendor question: "When a prospect comes in through the public surface, what data structure crosses into the matter file? Does the source attribution persist through retention and settlement?"
Architectural pattern that satisfies: one operating layer underneath the public discovery layer and the private matter execution layer, intake schema identical across the boundary, attribution as an architectural property.
Pattern that does not satisfy: separate "intake CRM" and "case management" products that handshake through a vendor integration, attribution as after-the-fact analytics.
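
The idea that intake structure and source attribution travel with the matter can be sketched as a matter record that embeds the intake record unchanged, so that attribution is read from the same structure at every stage. The Attribution, Intake, and Matter types, their fields, and the stage names are hypothetical and serve only to illustrate the property.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Attribution:
    source: str              # e.g. "directory_listing" (illustrative value)
    campaign: Optional[str]
    first_seen: str          # ISO 8601 date


@dataclass(frozen=True)
class Intake:
    intake_id: str
    practice_area: str
    jurisdiction: str
    summary: str
    attribution: Attribution


@dataclass(frozen=True)
class Matter:
    matter_id: str
    intake: Intake           # the same structure, not a re-keyed copy
    stage: str = "open"


def open_matter(matter_id: str, intake: Intake) -> Matter:
    """Cross the public/private boundary without changing the schema."""
    return Matter(matter_id, intake)


def advance(matter: Matter, stage: str) -> Matter:
    """Move the matter forward; the embedded intake is carried along untouched."""
    return replace(matter, stage=stage)


# Usage: attribution is still attached at settlement.
intake = Intake("in-1", "personal_injury", "FL", "rear-end collision",
                Attribution("directory_listing", None, "2026-01-05"))
matter = open_matter("m-1", intake)
settled = advance(advance(matter, "retained"), "settled")
```

In the separate-products pattern described above, the equivalent of the intake field is usually a foreign key into another vendor's system, which is why attribution tends to be reconstructed afterwards rather than carried.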

§8 Buying signals beyond architecture

Migration friction, pricing transparency, contract flexibility.

Three operational signals supplement the architectural criteria. They sit outside the regulatory frame but they reveal how the vendor will treat the firm once the firm signs.

Migration friction: who moves the firm's data, who pays for it, and how long it takes. A vendor that requires $10–25k in third-party consultants to migrate is signaling that adoption is not their problem. A vendor with an in-house migration team at no cost is signaling that they earn the relationship.
Pricing transparency: a single per-user rate with all features included reveals confidence. Unbundled per-feature pricing with seven add-ons may reward firms that consolidate into fewer tools, but it exposes growing firms to surprise per-seat bills.
Contract flexibility: month-to-month with cancel-anytime is the posture of a vendor that earns retention through product quality. Annual contracts required before day one are the posture of a vendor that knows the product cannot defend its own renewal.

§9 How to apply this

The architectural questions resolve most of the field quickly.

A firm running an evaluation can apply the architectural questions before the feature questions. The answers resolve most of the vendor field quickly. The vendors who can answer them coherently are the small set worth feature-evaluating in depth.
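
The ordering described above, architecture first and features second, amounts to a two-stage filter. The sketch below shows the first stage only. The criterion keys mirror the five criteria in this paper, while the vendor names, the boolean scoring, and the shortlist function are assumptions for illustration; in practice an evaluator's judgment about whether an answer is coherent replaces the booleans.

```python
from typing import Dict, List

CRITERIA = (
    "governed_execution",
    "firm_scoped_retrieval",
    "approval_gated_execution",
    "memory_architecture",
    "public_layer_integration",
)


def shortlist(answers: Dict[str, Dict[str, bool]]) -> List[str]:
    """Keep only vendors that answered all five architectural questions.

    A missing answer counts as a failure, so silence does not pass.
    """
    return sorted(
        vendor
        for vendor, result in answers.items()
        if all(result.get(criterion, False) for criterion in CRITERIA)
    )


# Usage with hypothetical vendors.
answers = {
    "vendor_a": {c: True for c in CRITERIA},
    "vendor_b": {**{c: True for c in CRITERIA}, "memory_architecture": False},
    "vendor_c": {"governed_execution": True},
}
worth_feature_evaluating = shortlist(answers)
```

Only the vendors returned by the first stage go on to the feature comparison, which keeps the longest feature list from deciding the evaluation.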

Feature breadth is easier to add than architectural discipline. A platform with a strong operating layer and a thinner feature surface closes the feature gap in a year. A platform with a long feature list but no governed execution, no firm-scoped retrieval, and no memory architecture cannot retrofit those properties without rebuilding (FlowCounsel, "The Next Category in Legal AI Is Governed Execution," May 2026; FlowCounsel, "Trusted Sources Are Necessary but Not Sufficient," May 2026).


References

American Bar Association (2024). Formal Opinion 512: Generative Artificial Intelligence Tools. Standing Committee on Ethics and Professional Responsibility, July 29, 2024. https://www.americanbar.org/news/abanews/aba-news-archives/2024/07/aba-issues-first-ethics-guidance-ai-tools/
American Bar Association. Model Rule 1.1: Competence, Comment [8] (technology competence, added 2012). https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/rule_1_1_competence/comment_on_rule_1_1/
American Bar Association. Model Rule 1.6: Confidentiality of Information. https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/rule_1_6_confidentiality_of_information/
American Bar Association. Model Rule 5.3: Responsibilities Regarding Nonlawyer Assistance. https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/rule_5_3_responsibilities_regarding_nonlawyer_assistant/
California State Bar, Standing Committee on Professional Responsibility and Conduct (2023). Practical Guidance for the Use of Generative Artificial Intelligence in the Practice of Law. Approved by Board of Trustees, November 16, 2023. https://www.calbar.ca.gov/
FlowCounsel (2026, April 2). What ABA 512 and Heppner Together Require From Legal AI Systems. FlowCounsel Blog. https://flowcounsel.com/blog/what-aba-512-and-heppner-together-require-from-legal-ai-systems
FlowCounsel (2026, April 2). Why Review Boundaries Matter More Than Model Choice. FlowCounsel Blog. https://flowcounsel.com/blog/why-review-boundaries-matter-more-than-model-choice
FlowCounsel (2026, April 9). How to Tell If Your Legal AI Vendor Built a Product or Prompted One. FlowCounsel Blog. https://flowcounsel.com/blog/how-to-tell-if-your-legal-ai-vendor-built-a-product-or-prompted-one
FlowCounsel (2026, April 18). Compressed Output Is Not Compressed System. FlowCounsel Blog. https://flowcounsel.com/blog/compressed-output-is-not-compressed-system
FlowCounsel (2026, May 21). The Next Category in Legal AI Is Governed Execution. FlowCounsel Blog. https://flowcounsel.com/blog/governed-execution-is-the-category
FlowCounsel (2026, May 26). Orchestration Is Not the Category. FlowCounsel Blog. https://flowcounsel.com/blog/orchestration-is-not-the-category
FlowCounsel (2026, May 26). Trusted Sources Are Necessary but Not Sufficient. FlowCounsel Blog. https://flowcounsel.com/blog/trusted-sources-are-necessary-but-not-sufficient
Florida Bar (2024). Ethics Opinion 24-1: Lawyers' Use of Generative Artificial Intelligence. Florida Bar Board of Governors, January 19, 2024. https://www-media.floridabar.org/uploads/2024/01/FL-Bar-Ethics-Op-24-1.pdf
National Institute of Standards and Technology (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1, January 26, 2023. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf
National Institute of Standards and Technology (2024). Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. NIST AI 600-1, July 26, 2024. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
New York State Bar Association (2024). Report and Recommendations of the Task Force on Artificial Intelligence. House of Delegates approval, April 6, 2024. https://nysba.org/app/uploads/2022/03/2024-April-Report-and-Recommendations-of-the-Task-Force-on-Artificial-Intelligence.pdf

How to cite this paper

FlowCounsel Research (2026). Evaluating legal AI operating systems. FlowCounsel™. https://flowcounsel.com/research/evaluation-framework
