How to Implement AI in Your Firm

On 20 August last year, a headline ran that the founder of Google’s generative AI team said don’t bother getting a law or medical degree, because AI will destroy both careers before you can graduate. You have heard the genre. AI will replace the accountant, replace the lawyer, and replace the tax adviser. I am not replaced yet.

On the same day, MIT reported that 95% of corporate generative AI pilots delivered no return. That matches my experience.

Two reputable sources, opposite claims, the same day. They are not contradicting each other. They describe different economies.

Three Economies

There are three.

The hype economy is where the first headline lives. It is built by people selling something: shares in an AI company, a book on the end of the professions (which always ends, unless you buy the next book), a newspaper column, a change-management engagement. The pitch needs AI to be magic and the professions to be doomed. Look how big the accounting industry is; therefore, my AI must be worth that much. I spent the better part of a decade trying to sell legal tech and never met the buyer who falls for this. Perhaps they are all in America.

The software economy is real, and it is where most of the genuine returns sit. Large language models are built by coders and used by coders. Software is the one domain where output can be objectively tested: write hello world, a web page, a chess game, and you can check whether it works. Easy to test means easy to train, and easy to train means genuinely useful. Developers are early adopters because they build the thing. They are running at five and ten times their old output. New companies that once needed a million dollars and twenty staff now ship a viable product in a weekend. This is not twenty people put out of work; it is twenty new companies each building something. Software ate the world, and now AI is eating the software.

The third economy is the one you and I live in. The real world. This is where pilots get implemented, fail, and everyone feels sold a pup. The reason is simple. In the real world, there is no objective test of good tax advice the way there is for code. Our systems are complex, heavily regulated, full of humans and government and competing rules. And the tolerance for error is different. AI that gets the tax answer right 90% of the time is not 90% good. Brakes that work 90% of the time are failed brakes. A parachute has to open every time. Advice that is wrong one time in ten is negligent. This is not chess, and it is not a website.

On this sits the Dunning-Kruger effect. The people doing the hyping do not understand the law, or the technology, or both. Some very smart people have raised very large sums and believe they are domain experts in everything. Engineers are clever, and they are also the worst clients, because every problem looks simple to the cleverest person in the room. They run the big software companies, they make the predictions, and they apply that same confidence to a world they do not work in. The louder the hype, the thinner the understanding.

A short illustration. A mathematician, a physicist and an engineer are each told to cross a room to a beer in steps that halve the remaining distance. The mathematician and the physicist refuse: it is an asymptote, you never arrive. The engineer takes a few steps, reaches out, and says close enough. AI is engineering. It works because it works. On an old build of mine there was a core search variable set to 7. Nobody could tell me what it was called or what it did. Set it to 8 and the thing broke; 6 and it broke; at 7 it worked. That is the register we are in.

Why Your AI Sucks, Part One: You Set It Up Wrong

You bought the subscription, you were promised the world, you implemented it, and nobody uses it. There are two reasons. Here is the first.

You cannot hand an LLM a bucket of documents and say use this. Ask an off-the-shelf model a tax question and it will take a decent stab from training. But retaining your material and searching it is a different task to generating plausible text, and generation runs on a great deal of randomness. Search is not generation.

So you decide to upload the tax law into the chatbot yourself. First, you have to scrape it. Then you cannot dump in a thousand files all called document. You have to label them. Then you have to tag them with metadata so they can actually be searched: section headings, case citations, the structure a human reads without thinking. Toss in an undifferentiated pile and expect useful retrieval, and you will be disappointed.

You also have to teach the model the structure you already carry in your head. Say you have a beautiful semantic search, and you have scraped the entire ATO corpus and all the private rulings. Someone at a conference proudly told me he had done exactly that, vibe-coded the lot, tossed it in. Lovely. But how does the system know a public ruling outweighs a private one? On a pure semantic search it weighs them equally and returns whatever matches the words. It does not know that the High Court sits above the AAT. You know that. You have to put it in the structure.

None of this is hard. It just has to be done. The same setup applies to your own private data. A bucket is not a system.
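A minimal sketch of what making that structure explicit could look like, assuming each document has already been labelled with a source type and a semantic search has already returned a similarity score between 0 and 1. The `AUTHORITY` weights, the `Document` fields, the document titles and the blending formula are all illustrative assumptions, not a description of any particular product.

```python
from dataclasses import dataclass

# Hypothetical authority weights: higher means more authoritative.
# The text only says the High Court sits above the AAT and a public
# ruling outweighs a private one; the numbers, and where tribunals sit
# relative to rulings, are placeholders to adjust.
AUTHORITY = {
    "high_court": 1.0,
    "aat": 0.6,
    "public_ruling": 0.5,
    "private_ruling": 0.2,
}

@dataclass
class Document:
    title: str
    source_type: str   # must be a key of AUTHORITY
    similarity: float  # semantic match score from the search step, 0..1

def rank(docs, authority_share=0.5):
    """Order documents by a blend of semantic similarity and authority."""
    def score(doc):
        return ((1 - authority_share) * doc.similarity
                + authority_share * AUTHORITY[doc.source_type])
    return sorted(docs, key=score, reverse=True)

# Hypothetical search results, best word match first.
results = [
    Document("Private ruling on beneficiary removal", "private_ruling", 0.92),
    Document("Public ruling on trust resettlement", "public_ruling", 0.88),
    Document("High Court decision on trust variation", "high_court", 0.80),
]

for doc in rank(results):
    print(doc.source_type, "-", doc.title)
```

With `authority_share=0` the list comes back in pure word-match order, private ruling first. With the weights applied, the High Court decision moves to the top even though it was the weakest word match.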

Set it up properly and the output changes character. Here is a properly configured retrieval-augmented model answering a question I will come back to. It opens on Clark in the Federal Court, hits TD 2012/21, then works through trust splitting, the new-trust and no-new-trust views, the Commissioner’s view and where it is internally inconsistent, and the substratum criticism. Five pages of it. That is senior-associate work, off an off-the-shelf model that has simply been configured correctly. Twenty dollars a month, not five hundred.

The Test I Have Used for a Dozen Years

How do you tell whether an AI is actually working? Use a question you know cold. Mine, which I have put to every law clerk who wanted to work for me for fifteen years, is this: if I remove a potential beneficiary of a discretionary trust, does that trigger a resettlement for capital gains tax purposes? Then variations, for trust law, for stamp duty.

It is a good test because it is real, it carries consequences clients pay me to sign off on, and it moves across legislation, public rulings, private rulings, and case law whose interpretation has shifted over time; the 2001 Statement of Principles was withdrawn after Clark. I score it out of five. A good clerk gets three to five: continuity in Clark, TD 2012/21 applied correctly, the right answer, a valid exercise of power, Commercial Nominees.
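A small sketch of a scoresheet for that test, assuming each of the five listed items is worth exactly one mark (that reading is an assumption). The marking itself stays with a human who knows the answer cold; the code only records and totals it. The example marks are hypothetical.

```python
# The five marks, taken from the list in the text. Treating each item
# as exactly one mark is an assumption about how the score is built.
CRITERIA = [
    "continuity in Clark",
    "TD 2012/21 applied correctly",
    "the right answer",
    "a valid exercise of power",
    "Commercial Nominees",
]

def score(marks):
    """Return (total out of five, list of missed criteria).

    `marks` maps each criterion to True or False as judged by a human
    reviewer. Nothing here is automated marking.
    """
    unknown = set(marks) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    missed = [c for c in CRITERIA if not marks.get(c, False)]
    return len(CRITERIA) - len(missed), missed

# Hypothetical marking of one answer, for illustration only.
example = {
    "continuity in Clark": True,
    "TD 2012/21 applied correctly": True,
    "the right answer": True,
    "a valid exercise of power": False,
    "Commercial Nominees": False,
}

total, missed = score(example)
print(f"{total}/5, missed: {missed}")
```

Keeping the missed items alongside the total makes week-to-week comparisons more useful than a bare number, because it shows which element a model or a clerk dropped.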

Run it across the public models and you get a spread. Grok finds the legislation, says you need a power, gestures at TD 2012/21. Gemini answers the elements more precisely and is the better of the two. GPT-5, in ChatGPT and Copilot, was better still on this question. Claude was an abysmal failure. None of them, as base models, matched my better clerks, though at times they have come close. These results change week to week, so this is illustrative, not a ranking. The point is that the configured model beats all of them, because it has the structure.

Why Your AI Sucks, Part Two: Nobody Trained Anyone

This week’s Daily Cartland, my parody paper, ran the headline: IT training session derailed when senior partner asks how to double-click. Some of you groaned. Some of you wondered whether it is one left click or two. The joke makes a real point. If one person in the room does not understand the fundamentals while everyone else is bored and waiting for the free lunch, your training is not designed.

Do not use off-the-shelf training for AI. I have not seen one that is fit for purpose. A proper program starts by watching what people actually do. What a senior partner wants from AI is nothing like what the research associate wants, which is nothing like what the secretary wants. Skip that and sit everyone through a generic session, and they stare at the wall and wait for lunch. If you think I am wrong, and your IT sessions are a delight, tell me. In my experience they are awful, because nobody asked what each person needed.

Skip the training and you get shadow technology. The same MIT report found only 40% of companies had bought an official LLM subscription, while staff at over 90% reported regular use of personal AI tools for work. Almost everyone is using something. Your people know the upside is real, so they reach for their own consumer tools, which are not enterprise grade, have none of the security or confidentiality, and may be quietly costing you legal professional privilege. Your twenty-one-year-old is photographing documents and asking a consumer chatbot to explain them. That is the exposure.

Shadow tech is not new. Have you ever saved to your desktop? Open it now and count the files named final, final2, final-final-v2. You were given a filing system and you went around it, because the system was not set up for the way you work.

There is a park that explains this. The designers lay turf across the whole space and run a paved path around the edge, because the grass is beautiful. People want to get from one side to the other, so they cut across and wear a track in the lawn. Then a sign goes up: keep off the grass. The fault is not with the walkers. It is with the council, who paved the path in the wrong place. Build the park, watch where people walk, then pave that line.

So shadow tech is a signal, and a positive one. People are working around a problem to get their work done. Give that behaviour an outlet. Find them an enterprise-grade tool they are allowed to play with. Because learning to use AI is learning to ride a bike. You can explain the physics of a bicycle perfectly, and it will be perfectly useless. You learn by getting on, wobbling, finding your balance, pushing forward. It is a system of play. It needs a supportive environment and coaching tailored to the actual workflows.

How I Did It in My Own Firm

I am the wrong example. I have been building AI for twelve years and I do not need a seminar; give me a login and I will work it out. The right example is my admin. She is conservative, diligent, reliable, trustworthy, and sceptical of new tech. She does not want a new tool. If that describes someone in your office, that is the person to win over, not the twenty-one-year-old already playing with the latest toy.

Start with something fun and personal, off the clock. I began with diet and calorie tracking, photographing what I ate. Travel planning. Baby names, which beat the baby-name apps. Twenty-five years of hand-typed Dungeons and Dragons notes uploaded so it could write the next campaign, where a hallucination is the entire point. Low stakes, real play.

Then find one work use case that helps. We run a tax training session every Friday morning, free and open, and have for over a decade. Each week someone had to write the case description: presenter, title, intro, summary, discussion points, sign-off. Two or three hours. We built a template, the AI drafts it from the case, she checks it with me before it goes out, I change things. The time collapsed. And then she came back asking what else she could do. Can it generate images for the site instead of stock photos? Can it check this email? That is the moment. One real use case and the person starts finding the rest themselves, bottom up, which is the only way it sticks. The cost of all this is about twenty dollars a month a head on a properly set up enterprise system.

One of the small wins was the famous tax quote we run each week: take a real quote, amend it in square brackets to make it about tax, attribute it. The first was Conan: what is best in life is to crush your taxes, see them driven before you, and hear the lamentation of the tax office. I have the model always return the original quote and source, so I can check it. Even at that scale we are building in the habit of catching hallucinations.

Hallucinations Are a Model T Crash

People say they cannot use AI because it hallucinates. Hallucination is almost entirely misuse. When the Model T first hit the roads people drove drunk into things at thirty kilometres an hour and marvelled at the speed. Then we worked out we needed road rules, indicators, and brakes that work more than 90% of the time. AI is new. People are crashing it constantly, because nobody taught them to drive.

One rule of thumb keeps you out of the wall. Do not use AI for anything you could not do yourself given time. Could I do tax research myself, given time? Yes. Will AI speed it up? Yes. Good. I can ask it to lay out how it searched and what it reviewed, the way I would ask a junior to come back with an audit trail, and I can see where it went wrong and where I can rely on it.

Here is where I did not follow my own rule, and it mattered less because I knew it. I was hiring staff on an award I did not know, so I asked a consumer chatbot for a vibe check on the rate. I was never going to rely on it. Award rates matter, so I paid a real employment lawyer. He came back: I was close, but not right. Could I, a tax specialist, have worked through the Fair Work legislation myself? Probably not well. I would not cut myself open to take out my own appendix and be satisfied that I had produced some red bits. It matters that they are the right red bits.

Fake citations should never happen. There is a database of these cases maintained by Damien Charlotin; as I write it lists 892 instances of lawyers or self-represented advocates penalised for putting invented citations to a court, and there are surely more. A US case from a few days ago helpfully sets out the prompt. The advocate wrote, in substance: taking the role of a judge, write an order denying the motion to strike, with case law support for the proposition that where an expert report is criticised for an inadvertent, immaterial, incomplete claim construction, the remedy is not to strike the report. He articulated the legal question well. Then he told the model to take the role of a judge, which changes its tone and not its accuracy. You can ask it to write like Lord Denning; it will not make the law true. The prompt was built to fail, and it produced hallucinated citations. It was avoidable.

A Prompt That Does Not Hallucinate

The fix is a multi-phase prompt built on instructional verbs, not generative ones.

Search. Tell it to search, which requires a model connected to the internet or to the right database: search for the cases and provide a link, search the ATO database, search the legislation, find the links.

Extract. Tell it to extract the relevant quotes and the exact text. Extract makes it copy and paste, and copy and paste is not a hallucination. The limit varies: Copilot extracts freely; ChatGPT caps at about 25 words for copyright, others less.

Read and summarise. Now have it read the case carefully and summarise it. This is not for reliance. It is to speed up your search, so you can see what each source is actually about.

Read everything you cite. This is the step that is not optional and not delegable. Even a citation from a reputable textbook can be wrong; I do not lift one and drop it in. You read the source.
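The first three phases can be sketched as a reusable prompt template. This is an illustration under stated assumptions: the wording of each instruction and the `build_prompt` helper are hypothetical, not the exact prompt used in practice. The fourth step is deliberately absent, because reading what you cite is done by a person.

```python
# Hypothetical phase wording built on instructional verbs
# (search, extract, read, summarise) rather than role play.
PHASES = [
    ("Search",
     "Search {sources} for authorities on the question below. "
     "Provide a link to every item you find. Do not answer from memory."),
    ("Extract",
     "For each item found, extract the relevant passages as exact "
     "quotations, with the section or paragraph reference and the link."),
    ("Read and summarise",
     "Read each item carefully and summarise what it is about in two or "
     "three sentences, kept separate from the quotations."),
]

def build_prompt(question, sources):
    """Assemble a multi-phase research prompt for a search-connected model."""
    lines = [f"Question: {question}", ""]
    for number, (verb, instruction) in enumerate(PHASES, start=1):
        lines.append(f"Phase {number}. {verb}: {instruction.format(sources=sources)}")
    lines.append("")
    lines.append("If you cannot find a source, say so. "
                 "Do not give a citation without a link.")
    return "\n".join(lines)

prompt = build_prompt(
    question=("If I remove a potential beneficiary of a discretionary trust, "
              "does that trigger a resettlement for capital gains tax purposes?"),
    sources="the legislation, the ATO database and the case law",
)
print(prompt)
```

The template asks for links and exact quotations and never asks the model to take a role, so everything it returns can be checked against a source.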

I run the same workflow over my own writing. Checking a draft of Federal Tax Disputes, the model picked up a passage on the ATO’s power to issue estimate notices for unpaid PAYG withholding, SGC and GST, found the primary legislation, linked to it on the ATO site, extracted the exact text so I could eyeball it against section 268-15, then summarised it and tested it against what I had written. Its verdict: partially correct, but unlimited scope was overstated. Not wrong, just too strong. A perfect audit trail and a faster path to being right. Had I read section 268-15? Not that day. Yes, when I wrote the book.
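The eyeballing step can be backed by a mechanical check that an extracted quotation really appears in the source text. A minimal sketch, assuming you have the source as plain text; the helper names are hypothetical and the sample strings are placeholders, not legislation.

```python
import re

def normalise(text):
    """Unify curly quotes and collapse whitespace so that layout
    differences alone do not cause a false alarm."""
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip()

def is_verbatim(quote, source_text):
    """True if the quotation appears word for word in the source."""
    return normalise(quote) in normalise(source_text)

# Placeholder source text standing in for a downloaded provision.
source = "The quick brown fox\njumps over   the lazy dog."

print(is_verbatim("brown fox jumps over the lazy dog", source))  # True
print(is_verbatim("brown fox leaps over the lazy dog", source))  # False
```

A pass only shows the words exist in the source. It says nothing about whether the passage supports the proposition, which is why reading the source remains the step that cannot be delegated.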

The same move turns a firm’s knowledge into a tool. I have written papers on the defects in trust deeds, the ones that cause critical failures, and turned that into a prompt. Run a deed through it and it flags a formation risk, a sub-trust and unpaid-present-entitlement drafting problem of the kind now before the High Court in Bendel, the issues I think matter. It is not an answer. It is a fast first review, the summary I would have asked a junior to bring me. I will not give it to a client. It is a good first draft.

Doing the Accounts from a Shoebox

Last thing, and the one I am least qualified for. I am a tax lawyer, not an accountant, and I have been doing accounting for six months.

I act on tax disputes. I had clients, cafe owners, who, to nobody’s surprise, had not paid all their tax, for reasons that turned out to be complicated. The accounts were a candidate for the Pulitzer for fiction. (Details changed; this is a composite, no client information disclosed.) No usable accounts, a short deadline, aggressive officers, and if we did not respond with correct numbers the tax office would issue default assessments and pursue the clients. I had to build thirty entities of accounts across four years from nothing: three to six bank accounts per entity, and a pile of receipts. The honest assessment a year ago would have been that this was impossible.

I have produced a year of accounts from that nothing in three days of law-clerk time. Their accountant would have charged ten to fifteen thousand dollars for it. My clerks had no accounting experience; I hired them for this and taught them the workflow in about an hour. Three days of clerk time for fully completed accounts with a perfect audit trail, at roughly 10 to 15% of the usual cost. Not a trim. A saving of close to ninety percent. To be clear, I am not competing with my clients; I want to be in practice for another forty years, and you do not stay in practice by competing with the people who feed you. We are a law firm and do not want this work. We did it because there was no other option, and when we finished we taught the workflow to the client’s accountant, who is now doing it for everyone else.

The workflow runs in stages.

Primary documents. Start with bank statements or invoices. If you have a clean bank feed, skip ahead. If not, you OCR and extract, and that step needs a human audit for hallucination. You cannot pour a hundred thousand documents in; it falls over. Chunk it, a month at a time, one task per context window, several windows running in parallel.

Extract and clean. Pull the data off the statement into ordered form. Plenty of tools do this into Xero; I do not mind which. The same extraction works on invoices and everything else. Order is the whole point.

Mapping logic. List the rules and let the model learn from them. An EFTPOS settlement maps to sales; a Shopify payout to income; the e-bike importer to cost of goods sold; bank fees to charges. You can train this on prior work or write it out. This inverts ordinary practice. Normally you code each entry one at a time. Here you categorise everything, write the rules once, then apply them across the whole month.

Apply and reconcile. Applying the rules produces the trial balance, the general ledger, the journals, and a suspense account for anything the rules did not catch. Review the suspense account. If it is small, clear it by hand; if it is large, fix the mapping and rerun. Nothing gets miscoded, because what is not understood goes to suspense rather than to a wrong account.
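The stages above can be sketched in a few lines. This is an illustration, not the actual workflow: plain keyword matching stands in for rules the model would apply, and the `RULES` keywords, account names and transactions are hypothetical. It shows the shape of the idea, which is to chunk by month, write the rules once, apply them across the chunk, and send anything unmatched to suspense.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

# Hypothetical mapping rules: (keyword in the bank description, account).
RULES = [
    ("EFTPOS SETTLEMENT", "Sales"),
    ("SHOPIFY", "Income"),
    ("E-BIKE IMPORT", "Cost of goods sold"),
    ("BANK FEE", "Bank charges"),
]

def classify(description):
    """Return the first matching account, or Suspense if no rule matches."""
    upper = description.upper()
    for keyword, account in RULES:
        if keyword in upper:
            return account
    return "Suspense"

def chunk_by_month(transactions):
    """Group transactions into (year, month) chunks, one task per chunk."""
    months = defaultdict(list)
    for txn in transactions:
        months[(txn["date"].year, txn["date"].month)].append(txn)
    return dict(months)

def apply_rules(transactions):
    """Total each account for one chunk; unmatched items land in Suspense."""
    totals = defaultdict(Decimal)
    for txn in transactions:
        totals[classify(txn["description"])] += txn["amount"]
    return dict(totals)

# Hypothetical extracted bank lines.
transactions = [
    {"date": date(2024, 7, 1), "description": "EFTPOS settlement 0107", "amount": Decimal("1250.00")},
    {"date": date(2024, 7, 3), "description": "Shopify payout", "amount": Decimal("480.50")},
    {"date": date(2024, 7, 9), "description": "Monthly bank fee", "amount": Decimal("-10.00")},
    {"date": date(2024, 7, 15), "description": "Transfer J Smith", "amount": Decimal("-300.00")},
    {"date": date(2024, 8, 2), "description": "E-Bike Imports Pty Ltd", "amount": Decimal("-2200.00")},
]

for month, txns in sorted(chunk_by_month(transactions).items()):
    print(month, apply_rules(txns))
```

In July the unexplained transfer goes to Suspense instead of being guessed into an account. If that balance grows large, the fix is a new rule and a rerun, not hand-coding each line.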

The reason there are no hallucinations here is that every verb is instructional. The model is taking data, copying it, moving it, transforming it. None of that requires generation, so none of it hallucinates. Push it too far and it will try to please you and start inventing, so you watch it, and accounting is unusually good at this because it reconciles. The same approach runs GST reconciliations for the BAS: extract the invoice text, apply the input-taxed and GST rules, a box of peaches is food, a crate of motorbikes carries GST, out comes a reconciliation with supplier, invoice, GST and method, in about fifteen minutes.
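One way to watch the extraction is the check a bank statement already offers: opening balance plus the transactions must equal the closing balance. A minimal sketch with hypothetical figures; the `reconcile` helper is an assumption, not part of any tool named here.

```python
from decimal import Decimal

def reconcile(opening, closing, amounts):
    """Return the stated closing balance minus the closing balance
    implied by the extracted transactions. Zero means it reconciles."""
    return closing - (opening + sum(amounts, Decimal("0")))

# Hypothetical figures read off one monthly statement.
opening = Decimal("1000.00")
closing = Decimal("2420.50")
amounts = [Decimal("1250.00"), Decimal("480.50"),
           Decimal("-10.00"), Decimal("-300.00")]

print(reconcile(opening, closing, amounts))  # 0.00

# The same statement with one amount misread during extraction
# (480.50 read as 430.50): the check exposes the gap.
misread = [Decimal("1250.00"), Decimal("430.50"),
           Decimal("-10.00"), Decimal("-300.00")]
print(reconcile(opening, closing, misread))  # 50.00
```

A non-zero difference does not say which line is wrong, only that the month needs a human look before it goes any further.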

The output is what matters to the tax office. Every transaction links back to a base record. I can hand it over and say here is what we claim and here is the audit trail, not here are someone’s calculations.

Will AI Take Your Jobs?

You have just watched me cut the cost of core accounting work by 80 or 90%, so the natural fear is that you fire 80% of your staff. The economics say the opposite.

Demand for accounting services has not moved. The demand curve sits where it sat, sloping down: price on one axis, quantity on the other. What AI does is shift the supply curve to the right, because production is now cheaper. You slide along the demand curve to a new intersection: lower price, greater quantity. You do more work in the same time. And the extra work is work that was previously impossible. Thirty entities of accounts from a shoebox in a couple of months was impossible a year ago; I would have told the client to take the beating. Now it gets done.
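The slide along the demand curve can be put in numbers. A worked sketch assuming straight-line curves, demand Q = a - b*P and supply Q = c + d*P; every coefficient is made up for illustration and is not an estimate of any real market.

```python
def equilibrium(a, b, c, d):
    """Solve demand Q = a - b*P against supply Q = c + d*P.

    Setting the two equal gives P = (a - c) / (b + d).
    """
    price = (a - c) / (b + d)
    quantity = a - b * price
    return price, quantity

# Hypothetical coefficients. Demand is identical in both cases;
# only the supply intercept c rises, which shifts supply to the right.
before = equilibrium(a=100, b=2, c=10, d=1)
after = equilibrium(a=100, b=2, c=40, d=1)

print(before)  # (30.0, 40.0)
print(after)   # (20.0, 60.0)
```

The demand line is the same in both runs. The new intersection sits at a lower price and a larger quantity, which is the movement described above.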

Here is the part worth acting on. Prices adjust slowly. The market does not yet know how cheaply this can be produced. You know, because I have just told you, and the person who is not in the room does not. While the supply curve moves and prices lag, the early mover produces at the new low cost and still sells at the old high price. There is a window to make super profits, two to four years by my guess, while the market catches up.

Picking a Model

Which AI is right for you is like asking which is the best ice cream. It is personal. The right model varies by user, by task, by workspace, and mine changes week to week as the models update. So keep several current tools on hand and switch between them: a cutting-edge tax research tool for research, a general subscription for everything else, and permission to play. Do not let a salesman tell you there is one answer for you. If he does, he is probably not a professional.

Be wary of the blow-ins. The industry is full of people who did marketing for ten years and now intend to revolutionise law, and who will be out of business when the base model ships their product as a feature. Claude released a legal product this week and put a swathe of legal-tech startups out of business. Build on someone who actually works in this space and can say a thing works because they use it.

If something is not working, the answer is rarely that the tool is junk. It is almost always that it was not set up or trained properly, and that there is something genuinely useful underneath once it is.

In Short

Play. Give your people a safe space to play, because that is how anyone learns to ride a bike. Set the thing up properly, with the structure you carry in your head made explicit. Train to the work each person actually does. Read everything you cite. Use AI only for what you could do yourself given time, and slot it in at the bottom of a workflow you already trust, never in place of the review that makes your work non-negligent.

This Article Was Created By

Adrian Cartland

Principal Solicitor at Cartland Law

Adrian Cartland, the 2017 Young Lawyer of the Year, has worked as a tax lawyer in top tier law firms as well as boutique tax practices. He has helped people overcome harsh tax laws, advised on and designed tax efficient transactions and structures, and has successfully resolved a number of difficult tax disputes against the ATO and against State Revenue departments. Adrian is known for his innovative advice and ideas and also for his entertaining and insightful professional speeches.