Courts Begin Testing AI Language Models as Evidence of Contractual Meaning

Large language models have slipped into the interpretive toolkit for contract disputes. Judges now ask AI systems to explain policy clauses and sentencing language, then cite the answers alongside dictionaries and corpora. Each time they do, model outputs move a little closer to the evidence courts rely on to decide what the parties meant.

The Judicial Experiments Begin

The most visible experiments in judicial use of large language models come from the United States Court of Appeals for the Eleventh Circuit. In Snell v. United Specialty Insurance Co., an insurance coverage case decided in May 2024, Judge Kevin Newsom wrote a separate concurrence that walked through his use of ChatGPT and other models to probe the meaning of “landscaping” in a commercial policy. He described running prompts about related scenarios and comparing the outputs with dictionary definitions and corpus data as one more way to “triangulate” ordinary meaning.

A few months later, in United States v. Deleon, Newsom again concurred to describe what he called a sequel to his Snell opinion, this time using multiple large language models to interpret the phrase “physically restrained” in the federal sentencing guidelines. He reported the prompts and outputs in the opinion and emphasized his view that models might serve a limited auxiliary role in resolving multiword phrases that dictionaries do not cover well. Because different models, whether OpenAI’s ChatGPT, Anthropic’s Claude, or Google’s Gemini, can return different answers to the same prompt, that variation has become a central focus of the scrutiny his methodology has received.

These opinions have drawn significant academic and professional commentary. Commentators tend to agree on two points. First, the concurrences treat model outputs as nonbinding, supplemental evidence of ordinary meaning, not as a replacement for traditional analysis. Second, they make the methodology unusually transparent, which has invited scrutiny from both proponents and critics of generative interpretation.

The Academic Debate Takes Shape

The most developed argument for using artificial intelligence in contract interpretation comes from Yonathan Arbel and David Hoffman’s “Generative Interpretation”, published in 2024 in the New York University Law Review. They propose using large language models to estimate how ordinary readers would understand disputed contract terms, to quantify ambiguity, and to explore the space of plausible interpretations.

On their account, models can complement, rather than replace, traditional tools. The models can generate candidate readings, assign relative likelihoods, and even simulate how different types of parties might understand specific language. Contract law would still insist on objective intent, but courts could treat the model’s distribution of responses as a rough proxy for how language is used in the wild. In that sense, generative interpretation looks less like expert testimony and more like a high-powered corpus query.
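To make the idea of a “distribution of responses” concrete, the sketch below, which is purely illustrative and not drawn from Arbel and Hoffman’s article, asks a model the same interpretive question many times and tallies the one-word answers. The model name, prompt wording, and sample size are assumptions, and any real use would require documented prompts, fixed parameters, and validation before the tally could support an argument about ordinary meaning.

```python
# Illustrative sketch only: repeatedly ask a model whether a disputed activity
# falls within a contract term, then tally the answers as a rough "distribution
# of responses." Model name, prompt, and sample size are arbitrary assumptions.
from collections import Counter

from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "In ordinary English, is installing an in-ground trampoline "
    "'landscaping'? Answer with exactly one word: yes, no, or unclear."
)

def sample_answers(n: int = 50) -> Counter:
    """Query the model n times and count the one-word answers."""
    counts: Counter = Counter()
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",          # assumed model; record whatever is actually used
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,         # nonzero temperature so answers can vary
        )
        answer = resp.choices[0].message.content.strip().lower().rstrip(".")
        counts[answer if answer in {"yes", "no", "unclear"} else "other"] += 1
    return counts

if __name__ == "__main__":
    tally = sample_answers()
    total = sum(tally.values())
    for label, count in tally.most_common():
        print(f"{label}: {count}/{total} ({count / total:.0%})")
```

Even at this toy scale, the point of the exercise is the relative frequencies rather than any single answer, which is why critics focus on whether those frequencies reflect anything beyond the model’s training data and the prompt’s wording.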

That optimism has prompted a detailed response. In a 2025 draft titled “Generative Misinterpretation”, James Grimmelmann, Ben Sobel, and David Stein argue that large language models are pattern completers that reproduce training data idiosyncrasies, not windows into stable community meaning. They warn that using model outputs as evidence of ordinary meaning risks importing opaque biases and artifacts into judicial reasoning, especially when the training data do not match the legal audience whose understanding matters.

From Dictionaries to Data-Driven Meaning

Before anyone asked ChatGPT what a clause means, judges leaned heavily on dictionaries. That practice has drawn repeated criticism for enabling dictionary shopping, since different volumes offer different senses and do not always reflect contemporary usage. Corpus linguistics arrived as a response, inviting courts to look instead at large databases of real-world language to see how specific words and phrases are actually used.

In their influential article “Judging Ordinary Meaning”, Thomas Lee and Stephen Mouritsen argued that corpus methods can give courts a more empirical handle on ordinary meaning, while still leaving room for legal judgment. Kevin Tobia’s “Testing Ordinary Meaning” added experimental evidence, showing that judges, law students, and laypeople often converge on similar intuitions about contested terms, yet do not always track dictionary entries closely.

By the time large language models appeared, courts were already using frequency data and linguistic corpora in cases about vehicles, firearms, and other contested terms. Artificial intelligence tools did not invent the idea that meaning can be measured. They arrived in a legal culture that was already tempted to treat language as a dataset.

How Doctrinal Frameworks Shape AI’s Role

Modern common law contract interpretation starts from the objective theory of intent. Courts ask how a reasonable person in the position of the parties would understand the words, not what any individual secretly hoped they meant. In Canada, the Supreme Court’s decision in Sattva Capital Corp. v. Creston Moly Corp. reframed this inquiry as a mixed question of fact and law and made the surrounding circumstances, or factual matrix, central to the exercise.

Corner Brook (City) v. Bailey confirmed that releases are interpreted under the same general principles that govern other contracts. The Court emphasized that courts can look at objective background facts that were or reasonably ought to have been within the parties’ knowledge, while excluding evidence of subjective intention and ensuring that the written text remains primary.


This orthodoxy matters for artificial intelligence because it defines the categories where new tools can plausibly fit. A judge can use dictionaries, trade usage, course of dealing, and the parties’ conduct to infer intent. Any resort to language models has to be slotted somewhere within that framework, either as another way of sampling ordinary meaning or as something more ambitious.

Are Model Outputs Evidence Of Intent Or Just Wordplay?

Despite the headline appeal of judges using ChatGPT, reported decisions to date do not treat model outputs as direct evidence of what the actual parties to a contract intended. In Snell, the policy language had been drafted long before large language models existed, and the opinion uses model responses to illuminate how an ordinary English speaker might categorize activities like installing an in-ground trampoline, not to reconstruct the insurer’s subjective state of mind.

Doctrinally, that keeps models in the same family as dictionaries and corpora. They are tools for exploring ordinary meaning, which in turn informs the objective intent analysis. Under cases such as Sattva and Corner Brook, a court that looks at model outputs is still supposed to anchor its reasoning in the text, the factual matrix, and admissible contextual evidence, while excluding evidence of private intention.

The more difficult question is whether repeated reliance on large language models will gradually shift the center of gravity of contract law toward what models say about language, rather than what the parties did and knew in context. Grimmelmann and his coauthors argue that this would be a category error, because training data rarely mirror the specific subcommunities and institutional settings that matter in a contract dispute. Arbel and Hoffman, by contrast, suggest that with careful design and documentation, generative interpretation can remain tightly connected to the kind of reasonable reader inquiry that contract doctrine already endorses.

When AI Helps Write Or Execute The Contract

Even if courts currently treat model outputs as evidence about language rather than intent, artificial intelligence is also moving into the drafting and execution of contracts themselves. Document automation tools, some powered by generative models, already produce first drafts of commercial agreements, supply boilerplate provisions, and suggest revisions. Guidance from law societies and bar associations now routinely warns lawyers that they remain responsible for vetting AI-assisted drafting for coherence, conflicts, and alignment with client instructions.

Contract law has been grappling with automated agents for some time. The U.S. Electronic Signatures in Global and National Commerce Act and the Uniform Electronic Transactions Act both recognize that contracts can be formed through the action of electronic agents, even where no human reviews each step. Similar provisions appear in statutes such as British Columbia’s Electronic Transactions Act, which explicitly contemplates agreements formed by interacting software systems.

The Singapore Court of Appeal’s decision in Quoine Pte Ltd v. B2C2 Ltd confronted questions at the edge of this terrain. Trading algorithms executed cryptocurrency transactions at extreme prices after a systems failure. The court held that there was no unilateral mistake sufficient to void the trades and treated the programmer’s knowledge and intentions as the relevant mental state for doctrine that still turns on human actors.

As generative systems move from assisting in drafting to autonomously proposing clause language, courts will increasingly face disputes where ambiguous wording originates in a model. At that point, the line between using artificial intelligence to interpret contracts and using it to embody contractual choices will blur. The doctrinal tools for dealing with electronic agents and automated mistakes are already on the books, but they have not yet been tested in a context where both drafting and interpretation rely heavily on large language models.

Evidentiary Pathways And Professional Guardrails

If a party wants to rely on a model’s output in litigation, the first hurdle is evidentiary. A printed screenshot or prompt log would need to be authenticated and its relevance explained. If counsel offers the output as a demonstrative aid that illustrates a linguistic argument, it may be treated more like counsel’s own charts and diagrams. If an expert relies on model queries to support an opinion on ordinary meaning, then traditional reliability standards apply, along with questions about the training data, prompt design, and validation.

Professional regulators are also tightening expectations. The American Bar Association’s Formal Opinion 512, issued in July 2024, tells lawyers who use generative artificial intelligence that they must understand the technology well enough to assess its risks, protect confidentiality, verify outputs, and communicate material limitations to clients. Law societies in Canada and state bars in the United States, including Florida and Pennsylvania, have published similar guidance, with an emphasis on supervision and recordkeeping rather than prohibition.

Courts have already sanctioned lawyers for relying on hallucinated authorities. The widely reported decision in Mata v. Avianca, Inc. imposed monetary penalties on counsel who submitted briefing that cited nonexistent cases generated by ChatGPT. Subsequent orders in other jurisdictions have signaled that inaccurate or unverified AI-generated content in pleadings can trigger discipline or cost consequences. The risk of hallucination remains a central concern in professional guidance, with regulators emphasizing that lawyers cannot delegate verification responsibility to the technology itself.

Judicial policy documents are beginning to address artificial intelligence use from the bench. Guidance from the Canadian Judicial Council on the use of AI in Canadian courts, from the National Center for State Courts on AI and the courts, and from the judiciary of England and Wales on cautious judicial use of chatbots adopts a similar line, and the European Union’s AI Act, which treats AI systems used to assist in the administration of justice as high risk, points in the same direction. The emerging consensus is that AI tools may assist with drafting and administrative tasks, but judges remain fully responsible for analysis and must not rely on opaque systems for determinative legal reasoning.

Practical Takeaways For Lawyers

For practitioners, the emerging picture suggests a cautious but usable role for large language models in contract disputes. Models can help brainstorm possible readings of contested language, surface analogies to case law, and flag ambiguities in draft agreements. They can also assist in building more traditional interpretive evidence, for example by generating candidate search terms for corpus queries or helping structure surveys that test how particular audiences understand a clause.

What they cannot safely do is substitute for evidence. If counsel wants to present a model output in court, it should be as one data point among many. Every key proposition about ordinary meaning or intent still needs support from sources that a judge can evaluate, such as text, factual matrix, trade usage, course of dealing, and empirical studies of language. Arbel and Hoffman’s generative interpretation proposal can be read as an invitation to use models as an internal research accelerator, then translate their results into more conventional evidence before stepping into the courtroom.

The same discipline should apply when artificial intelligence has helped draft or negotiate the contract. Lawyers who use models in front-end deal work should document their prompts and outputs, treat them as preliminary suggestions, and ensure that final language reflects human judgment grounded in the client’s objectives. If something later goes wrong, they should anticipate that courts will focus on the signed text and admissible context, not on what a model once proposed in a private chat window.
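Documentation of that front-end use does not require elaborate tooling. As a minimal illustration rather than a recommended standard, the sketch below appends each prompt and output, tagged with the model and a timestamp, to a simple log file; the file name, record fields, and helper function are hypothetical.

```python
# Minimal sketch of one way to document AI-assisted drafting: append each
# prompt/output pair, with the model and a timestamp, to a JSON Lines file.
# File name, fields, and workflow are illustrative assumptions, not a standard.
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_FILE = Path("ai_drafting_log.jsonl")  # hypothetical log location

def log_interaction(matter: str, model: str, prompt: str, output: str) -> None:
    """Record a single model interaction for later human review."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "matter": matter,
        "model": model,
        "prompt": prompt,
        "output": output,
        "status": "preliminary suggestion",  # final language still needs human judgment
    }
    with LOG_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```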

Not Yet Intent, No Longer Invisible

As of late 2025, there is no reported decision in which a court has treated a language model’s output as direct evidence of the parties’ subjective contractual intent. The most visible judicial uses, such as the concurrences in Snell and Deleon, treat artificial intelligence as a new kind of dictionary or corpus, useful for checking the ordinary meaning of disputed phrases and for illustrating reasoning in a transparent way.

That does not make generative tools trivial. Once courts become accustomed to citing models for linguistic support and once lawyers rely on them for initial contract drafting, it will be harder to insist that artificial intelligence remains just a tool with no evidentiary weight. The pressure to treat model outputs as part of the objective intent inquiry will grow, especially in disputes with little contextual evidence beyond the text. The immediate task for judges and practitioners is to learn from the early experiments, apply rigorous evidentiary and ethical standards, and avoid letting predictive systems stand in for the human intentions that contract law still claims to honor.


This article was prepared for educational and informational purposes only. It does not constitute legal advice and should not be relied upon as such. All cases and sources cited are publicly available through court filings and reputable media outlets. Readers should consult professional counsel for specific legal or compliance questions related to AI use.

See also: Building AI Governance from the Ground Up in Small Law Firms
