A luxury handbag is not primarily a functional object. Any bag will carry your things. The expensive one is bought to be seen with, admired, shown off. That is not a criticism. It is the point.
I look at most legal search engines and I see the same thing. Bought to be shown off. Rarely used.
For about a dozen years now I have been watching tech platforms demonstrate how they are going to revolutionise legal search. The demo is always the same. Clean question, clean answer. Look how fast that was. Thirty seconds. The pitch deck claims that previously this would have taken a paralegal seven years of full-time research, or some other implausible figure. The product saves seven years per query. It only costs two million dollars a month. Seven years of your time is worth twenty million dollars. The product pays for itself many times over.
The pitch works on venture capitalists. It seems to work on large firms too. I do not think it really does, because what large firms are buying when they buy one of these products is a handbag. Something to show off. Announcements get made on LinkedIn, probably with pictures. And then the product gets quietly shelved, because no one is using it. At least a handbag might have a resale market.
So why is no one using these amazing search engines?
The answer is simple. No one in tech understands how lawyers actually search.
The benchmarks have the same problem. Every few months a new one is published showing that some legal AI product has cleared another bar. LegalBench, CaseHOLD, the Vals reports, the Stanford and Yale hallucination work, the vendor accuracy studies from Thomson Reuters, LexisNexis and Harvey. The numbers improve. The press releases get longer. Firms point to them when they justify procurement decisions to their boards.
And then practitioners use the products and find them disappointing in ways the benchmarks did not predict.
That is a category problem, not a measurement problem.
The Four Stages
Legal research is at least four different things. Each has different success criteria. Each has different failure modes. No current benchmark covers more than one and a half of them.
These are not phases of a workflow. They are cognitive operations. A tool can be excellent at one and dangerous at another.
Take a single fact pattern through the four stages: a client wants to alter the terms of their discretionary trust and asks whether there will be tax issues.
**Stage 1, Issue Formulation.** The user has facts but not a framed question. The work is recognising that an amendment to a trust might raise a resettlement question, a CGT event E1 or E2 question, a Division 7A question, a state duty question, or several together. The output is a framed legal issue. Not an answer. Just the right question to ask, and the next useful research step.
Experienced lawyers do this brilliantly, but largely unconsciously. Domain knowledge does the work. The value of AI at this stage is at the edges of practice, or for junior lawyers who do not yet have the pattern library to recognise what they are looking at.
**Stage 2, Governing Authority.** The issue is framed. The user needs the highest authority on point. For a resettlement question, that is *Commissioner of Taxation v Clark*, TD 2012/21, and Division 104 of the ITAA 1997. The work is locating the governing source, including legislation, cases, and rulings, and distinguishing it from supporting commentary. The output is a small set of authorities, presented in a hierarchy that makes their relative weight obvious.
**Stage 3, Application and Analogy.** The authority is known. The user needs to see how it has been applied to fact patterns like the one in front of them. For the trust amendment matter, that means retrieving similar amendments, similar deeds, ATO rulings applying *Clark* or TD 2012/21 to comparable changes, AAT decisions, and practitioner commentary on edge cases. The output is a body of applied reasoning that helps predict how the principle will work on these specific facts.
**Stage 4, Comprehensive Collection.** The matter requires a file-ready or advice-ready body of material. This is not exploration. It is assembly. The principal body of relevant materials, organised so that nothing major has been missed. The standard is recall, not relevance.
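In retrieval terms, this is the difference between precision (how much of what came back is relevant) and recall (how much of what is relevant came back). A minimal Python sketch, using made-up document IDs and two hypothetical tools rather than real products or real results, shows how a result set can score perfectly on one and fail the Stage 4 standard on the other:

```python
def precision(retrieved, relevant):
    """Share of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical matter: ten documents that belong on the file.
relevant = {f"doc_{i:02d}" for i in range(1, 11)}

# Hypothetical tool A: three results, all of them on point.
tidy_results = {"doc_01", "doc_02", "doc_03"}

# Hypothetical tool B: twenty results, nine on point and eleven noise.
broad_results = {f"doc_{i:02d}" for i in range(1, 10)} | {
    f"noise_{i:02d}" for i in range(1, 12)
}

for name, results in [("tidy", tidy_results), ("broad", broad_results)]:
    p = precision(results, relevant)
    r = recall(results, relevant)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

The tidy result set is the one that looks good in a demo. The broad one is closer to what a Stage 4 task needs, even though most of what it returns has to be discarded.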
These mix in practice. Stage 3 often reveals that Stage 1 was wrong. Stage 4 sometimes surfaces an authority that displaces Stage 2. But the operations are conceptually distinct, and the distinction is the whole point.
What the Benchmarks Actually Measure
The standard legal AI benchmark presents a question and evaluates the answer. That is Stage 2. Clean question, governing authority, check.
Stage 2 is also the stage where current LLMs perform most reliably. A well-trained model with good retrieval can locate the leading authority on a settled question with reasonable accuracy. The benchmark measures this, the product is built around this, and the demo shows exactly this.
Stages 1, 3, and 4 are not measured, or are measured poorly.
Stage 1 is not question answering. It is question generation. No benchmark currently evaluates whether a model correctly identifies all the legal issues latent in a novel fact pattern, because constructing a ground-truth answer set for that task requires expert lawyers to agree on what the issues are, which they often do not.
Stage 3 requires semantic retrieval of fact-similar authorities, which is genuinely hard to evaluate at scale. The analogical quality of a retrieved case is a judgment call, not a binary score.
Stage 4 requires recall measurement across a complete corpus, which requires knowing what the complete corpus contains. Most benchmark providers do not have that.
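The difficulty can be made concrete. A minimal sketch, with invented authority IDs and an invented benchmark, of what happens when recall is scored against only the authorities the benchmark builder happened to label:

```python
def recall(retrieved, relevant):
    """Share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical: everything in the corpus that actually bears on the matter.
truly_relevant = {f"auth_{i:02d}" for i in range(1, 11)}

# Hypothetical: the subset the benchmark builder labelled,
# assumed here to be the four best-known authorities.
labelled_relevant = {"auth_01", "auth_02", "auth_03", "auth_04"}

# Hypothetical tool output: the well-known authorities plus one obscure one.
retrieved = labelled_relevant | {"auth_07"}

measured_recall = recall(retrieved, labelled_relevant)  # what the benchmark reports
actual_recall = recall(retrieved, truly_relevant)       # what the practitioner experiences

print(f"measured recall: {measured_recall:.2f}")
print(f"actual recall:   {actual_recall:.2f}")
```

On these invented numbers the benchmark reports perfect recall while half the relevant material is missing. The gap is invisible unless someone knows what the complete corpus contains.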
The result is that almost every published benchmark is, structurally, a Stage 2 benchmark. The products optimised against those benchmarks are Stage 2 products. And the practitioners who needed Stage 3 or Stage 4 are the ones who quietly shelve the product after the first month.
What the Tech Companies Get Wrong
They are not being dishonest. The better explanation is that they do not understand what legal research is.
The question-and-answer paradigm is native to how AI products are built and evaluated. It maps cleanly onto consumer search, onto enterprise knowledge retrieval, onto customer service automation. It maps onto Stage 2 of legal research. It does not map onto Stages 1, 3, or 4.
A product manager who has never done legal research, evaluating a product against a benchmark constructed by people who have also never done legal research, will build a very good Stage 2 product and not know that three other stages exist.
That is the design failure. Not deception. Genuine incomprehension of the task.
What You Are Actually Buying
When you are procuring a legal research tool, the right question is not “how does it score on the benchmark.” It is “which stage am I actually using this for.”
A general-purpose LLM is a reasonable Stage 1 tool for practitioners working at the edges of their practice area, or for juniors building their issue-spotting instincts. It is a dangerous Stage 2 tool because it hallucinates citations with confidence.
A specialist commentary database, a textbook, or a practitioner guide is the right Stage 2 tool. It is not glamorous. It does not have a demo. It has been the right Stage 2 tool for decades.
A semantic search engine, built on a legal corpus, is the right Stage 3 tool. This is where the AI investment has the most legitimate upside, and also where most products oversell.
A traditional Boolean search database with a complete and well-maintained corpus is the right Stage 4 tool. AustLII. The major commercial providers. Nothing in the new wave of legal AI replaces them for comprehensive recall, and the products that claim otherwise have not been tested against that standard.
You probably need more than one of these. You almost certainly do not need the expensive one for every stage.
The Procurement Question
The firms that buy the handbag and then look for one product to do all four stages are going to be disappointed, because the product does not exist. The benchmarks have been measuring the work in a way that suggests it might, which is part of how the handbag gets sold in the first place.
When you are trying to present a proposition accurately before a court, or find that elusive authority that matches your facts, or make sure that every contradicting case that could be used against you has been found and addressed, you do not need a handbag. You need a rugged backpack for hiking in the wilderness, the kind that carries all your gear and does not break. Nobody is going to take photos of it. Nobody is going to invest in it. But it will still work, even if it is raining.
At least a handbag might be worth something second-hand. The legal AI product you just shelved is worth nothing.