If No One Pays for Proof, Everyone Will Pay for the Loss
This post was originally written in French, as _Si personne ne paie pour la preuve, tout le monde paiera pour le sinistre_.
Let’s start with a truism. In ordinary life, just as in economic life, we have to make decisions without ever knowing everything. Every decision involves some uncertainty, and therefore some risk. Some risks are small, manageable, and we barely notice them anymore. Others can have financial consequences large enough that we would rather transfer them to a third party, by paying a premium so that an insurer will bear them for us. That is, at bottom, one of the most concrete functions of insurance. But an equally interesting question arises when that transfer becomes impossible, or at least impossible at a reasonable price. That is what we call _uninsurability_. We already encounter it with certain natural risks, when losses become too correlated, too massive, too difficult to mutualize, as I discussed in _Insurers and AI, a systemic risk_ and in _Insuring AI. New risks? New models?_. And apologies for using this umbrella term, “AI,” which I do not like very much, but I need to simplify a little or I would never finish this post…
The day before yesterday, Thomas Claburn wrote in _AI still doesn’t work very well in business, businesses are faking it, and a reckoning is coming_:
> Another looming problem is that large insurers have become wary of underwriting policies that cover companies against AI risk.
(Thanks to @flomaraninchi and @ugo for pointing it out to me.) I have the feeling that it is important to understand exactly what this means. If major insurers are becoming reluctant to cover AI-related uses, this is probably not just one more market anecdote, nor simply another legal precaution. It is a signal. An important signal for anyone who builds predictive models, or who is interested in uncertainty and ambiguity, because insurers do not need to be prophets to become cautious. Perhaps that is what risk culture is. It is enough for them to conclude that they do not understand the risk well enough, that they cannot observe it properly, that they cannot reconstruct the chain of responsibility behind it, or that they doubt they can carry it at a sustainable price. In other words, if insurers are stepping back, that should force us to ask whether AI is really under control. What we are dealing with here are very classical questions in the economics of information, imperfect measurement, misaligned incentives, and insufficient proof, much more than a simple dispute about the current level of the technology.
## 1. The productivity narrative often begins with a measurement error
The first problem with AI in business may not be the technical error itself. More often, I think, it is something deeper. We measure poorly what we claim to improve. As soon as an organization adopts a new tool, it looks for quick indicators to quantify its effectiveness. We count the number of lines of code, the number of slide decks produced, the number of tickets closed, the time saved on a task, the volume of generated text, and so on. These are metrics that are easy to feed into a dashboard, and just as easy to present to senior management. But they are often very poor measures of the real quality of the work delivered. In _The Consequences of Goodhart’s Law_, I pointed out that a measure ceases to be a good measure once it becomes a target, because it feeds back into behavior and ends up distorting the very process it was supposed to describe. Once the target has been set, people learn to improve the indicator rather than the underlying reality it only imperfectly summarized.
That is exactly the intuition one finds in Thomas Claburn’s article. In _AI still doesn’t work very well in business, …_, the key point, in my view, is not that AI produces a lot of code or a lot of content. The key point is that we do not really know which indicators to look at in order to decide whether this extra output actually improves the outcomes that matter. The people cited in the article remind us that lines of code, or the number of pull requests, are not measures of engineering excellence. What matters instead are variables such as deployment frequency, lead time to production, change failure rate, recovery time, and incident severity. So part of today’s optimism may stem from a very simple confusion: we confuse what is easy to count with what is actually important. The two may be correlated, of course, but that is precisely the problem with proxies that fail to capture causal relationships, admittedly a subtler issue, see _Correlation and Causality_. Here, insurance is the canary in the coal mine, as I already argued in “_Our house is burning and…_” in the context of natural disasters. An insurer does not merely observe output volume. It wants to understand the likely frequency of errors, their cost, their correlation, their traceability, and the possibility of identifying after the fact who did what, under which conditions. In that sense, insurers often look at a system using better metrics than those who celebrate it in a kind of naïve techno-optimism.
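To make that contrast a bit more concrete, here is a minimal Python sketch, with entirely invented figures (none of them come from the article), that puts an easy-to-count proxy next to DORA-style delivery metrics computed from a toy deployment log. The point is simply that the proxy can improve sharply while lead time, change failure rate, and recovery time all deteriorate.

```python
# Illustrative sketch with hypothetical numbers: an easy-to-count proxy
# (lines changed) versus DORA-style delivery metrics from a toy deployment log.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Deployment:
    lines_changed: int      # easy to count, easy to put on a dashboard
    lead_time_hours: float  # time from commit to production
    failed: bool            # did the change cause an incident?
    recovery_hours: float   # time to restore service (0 if no incident)

# Two fictitious quarters for the same team: much more output in Q2,
# but worse delivery outcomes.
q1 = [Deployment(120, 18, False, 0), Deployment(90, 24, False, 0),
      Deployment(150, 20, True, 2), Deployment(110, 16, False, 0)]
q2 = [Deployment(400, 30, False, 0), Deployment(520, 45, True, 9),
      Deployment(610, 38, True, 14), Deployment(480, 36, False, 0)]

def summary(deploys):
    failures = [d for d in deploys if d.failed]
    return {
        "lines_changed_total": sum(d.lines_changed for d in deploys),   # the proxy
        "lead_time_hours_avg": mean(d.lead_time_hours for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "recovery_hours_avg": mean(d.recovery_hours for d in failures) if failures else 0.0,
    }

print("Q1", summary(q1))  # less output, healthier delivery metrics
print("Q2", summary(q2))  # the proxy more than quadruples while every delivery metric worsens
```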
## 2. What looks locally correct may degrade the system as a whole
The second problem extends the first. Even when a result looks correct on a small scale, that tells us almost nothing about its effects on the system as a whole. Here again, the example discussed in _AI still doesn’t work very well in business, …_ is interesting. A piece of code may look right, pass tests, and seem clean at first glance, yet still be deeply flawed once it is placed back into a real environment. The article mentions a rewrite of SQLite in Rust using AI that produced far more lines of code while performing dramatically worse than the original. The precise details of the example do not matter much here. What matters is the general idea. Local validation does not guarantee global performance. A merely plausible output is not proof of robustness. A convincing deliverable is not necessarily a good deliverable. Once we forget that, evaluation is replaced by a mere aesthetics of output.
Something along these lines also appears in _Announcing the 2024 DORA report_. The report suggests that AI adoption may improve some intermediate dimensions, such as documentation or perceived speed on a few tasks, while still coinciding with a decline in delivery throughput and overall system stability. In other words, local signals may improve while global performance deteriorates. This tension is essential if we want to understand what is happening today. AI can give teams the feeling that they are moving faster, producing more, and streamlining certain tasks, while at the same time degrading the properties of the system that actually matter when we reason in terms of production, liability, or insurance. That is also why the issue cannot be reduced to the technical quality of a model alone. It concerns the way an organization observes its own results, the way it sometimes confuses the apparent conditions of performance with performance itself, and its inability to see that it has simply moved the cost of work downstream, into maintenance, verification, incidents, and ultimately the loss itself.
## 3. Many AI-produced services look like credence goods
Once overall quality becomes difficult to observe, we enter a type of market that is well known in the economics of information. Rudolf Kerschbamer and Matthias Sutter remind us in _Economics of Credence Goods_ that some goods and services have a distinctive property. The buyer does not always know which quality would suit them best, and may even remain unable, after the fact, to verify the quality actually received. This is the typical case for healthcare, certain repair services, legal advice, or financial advice. This category, _credence goods_, is fairly close to George Akerlof’s “lemons”, except that in Akerlof’s framework the central problem is adverse selection between good and bad quality goods, whereas with _credence goods_ the issue lies more in the expert relationship itself: the seller not only knows more about the quality of the good, but also diagnoses what the buyer needs, recommends a service, and then charges for it. This category is useful because it describes quite explicitly what many professional uses of generative AI are becoming. A strategy memo, a briefing note, a compliance recommendation, a sales presentation, a preliminary opinion, or even a chunk of code buried inside a larger system may all look convincing without their real quality being immediately observable. So we are not merely talking about a tool that produces faster. We may instead be talking about a device that multiplies deliverables whose appearance is easy to judge, while their effective value becomes harder to establish. That is exactly the definition of a credence good, in the sense of Darby and Karni in _Free Competition and the Optimal Amount of Fraud_, when quality remains costly to evaluate even after use.
And to return to Thomas Claburn’s article, _AI still doesn’t work very well in business, …_, one can read it as saying something more general than a simple critique of the current moment. When he explains that major consulting firms are already using AI at scale to produce PowerPoints and other deliverables, and that errors in office work may be harder to detect than in code because there are no comparable benchmark tests, he is describing something more than a technical weakness. He is describing a service economy in which quality becomes opaque to the client, and sometimes even to the organization selling the service. That is precisely where insurance becomes interesting. Insurers do not cover a polished appearance or a promise of smoothness. They cover the consequences of an error once it ends up causing harm. The harder it is to observe quality at the moment of delivery, the greater the risk of a late-emerging loss, and the more central the question of proof becomes. AI therefore does not merely accelerate production. It also increases the probability that hard-to-evaluate services will be circulated at scale, only revealing their flaws once they have already been integrated into a decision, a contract, or a workflow.
## 4. The main problem is not just the error, but the incentive not to see it
When quality is difficult to observe, the economic question is no longer simply the technical reliability of the tool. It becomes a question about the incentives of those who use it. The good news is that we already have some answers. In particular, in that same article, _AI still doesn’t work very well in business, …_, Dorian Smiley, co-founder and CTO of Codestrap, describes a rather simple chain of incentives: the partner wants more margin and more revenue, the director wants to move faster with fewer interactions and less junior work, and the associate wants to finish earlier. In such a system, no one is naturally paid to reread the outputs of a model carefully. The point is not that some individual is behaving badly. It is simply that the entire organization can become rationally blind to declining quality, because the immediate cost of vigilance falls on those who must deliver quickly, while the cost of the damage is pushed into the future, sometimes onto other teams, sometimes onto the client, and sometimes onto the insurer. So the problem is not just technological. It is above all organizational, accounting-related, and economic.
The literature on credence goods helps give this intuition a more general form. Kerschbamer and Sutter show that information asymmetries create incentives toward overprovision, underprovision, and overcharging. They also remind us that, in theory, there are two main institutional correctives: liability and verifiability. But they immediately add that verifiability is often difficult to achieve in practice, which matters a great deal if we want to think seriously about AI in business. Many organizations now believe they have solved the problem by adding a few _checklists_, some documentation, and the reassuring phrase _human in the loop_. But this kind of documentary patchwork does not always produce genuine responsibility. It sometimes allows people to claim that a control exists without ensuring that anyone will actually be accountable, in the strong sense of having to answer for the final quality of the service delivered. As long as no credible chain of responsibility exists, error is not only possible, it becomes structurally tempting to ignore. And that obviously creates problems if one wants to insure the risk: a risk becomes very difficult to cover when neither the client, nor the provider, nor the insurer knows clearly who was supposed to catch the error, when, and according to which procedure.
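To see why vigilance loses out, here is a back-of-the-envelope calculation, with numbers I have made up purely for illustration, of the private and collective trade-offs faced by the person under pressure to deliver.

```python
# Back-of-the-envelope sketch (all numbers assumed): why skipping careful review
# can be individually rational when the cost of a late error falls on someone else.
review_cost_hours      = 3.0    # private cost, paid now, by the person delivering
p_error_without_review = 0.10   # probability an unreviewed flaw slips through
damage_hours           = 200.0  # cost of the eventual loss (fixes, litigation, claim)
share_borne_by_author  = 0.05   # fraction of that damage the author ever internalizes

# Private trade-off, as seen by the person under delivery pressure
private_cost_of_reviewing = review_cost_hours
private_cost_of_skipping  = p_error_without_review * damage_hours * share_borne_by_author
print(private_cost_of_reviewing, private_cost_of_skipping)   # 3.0 vs 1.0 -> skip the review

# Collective trade-off, once the externalized damage is counted
social_cost_of_skipping = p_error_without_review * damage_hours
print(social_cost_of_skipping)                                # 20.0 -> reviewing was worth it
```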
## 5. The hidden cost of AI is not generation, but verification
AI is often presented as a machine for producing faster. But that way of framing things rests on a systematic omission. We willingly count the time saved in producing the first draft, but much less the time required to turn that first draft into something genuinely reliable. And that is probably where the real cost lies. In _AI still doesn’t work very well in business, …_, Thomas Claburn reminds us that a model can produce a plausible output without having the capacity to verify its own work, something I already mentioned earlier. But he also insists that a model does not know whether its answer is correct, and that millions of lines of AI-generated code will never actually be reviewed by humans. Once that is clear, every apparent gain in speed is really a shift of burden downstream. Someone still has to reread, compare, test, contextualize, and sometimes rewrite. And if no one seriously takes on that work, the cost does not disappear. It reappears later in the form of errors, urgent fixes, loss of trust, and eventually litigation. What is presented as a productivity gain is often just an accounting displacement. We save at the beginning on production, only to spend later on control, or on the consequences of not having exercised it.
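A small, deliberately stylized accounting sketch, again with assumed figures, may help. The apparent gain is the difference in drafting time; the full cost includes the verification work and the expected cost of a flaw that ships because nobody funded that work.

```python
# Illustrative accounting sketch (assumed figures): an apparent speed-up on the
# first draft versus the full cost once verification is actually funded.
draft_hours_manual    = 10.0   # writing the deliverable without the tool
draft_hours_with_ai   = 2.0    # producing a plausible first draft with the tool
verify_hours          = 6.0    # reading, testing, contextualizing, partly rewriting
p_residual_error      = 0.03   # flaw that survives even a serious review
p_error_if_unverified = 0.20   # flaw shipped when nobody funds the review
incident_cost_hours   = 150.0  # downstream cost of a shipped flaw

apparent_gain   = draft_hours_manual - draft_hours_with_ai   # 8.0, the number on the slide
cost_manual     = draft_hours_manual + p_residual_error * incident_cost_hours
cost_verified   = draft_hours_with_ai + verify_hours + p_residual_error * incident_cost_hours
cost_unverified = draft_hours_with_ai + p_error_if_unverified * incident_cost_hours

print(apparent_gain)                                 # what gets celebrated
print(cost_manual, cost_verified, cost_unverified)   # 14.5, 12.5, 32.0
# The gain survives only if the review is paid for; otherwise the cost reappears downstream.
```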
But again, none of this is entirely new. Several people had already warned us. In _Closing the AI accountability gap…_, Inioluwa Deborah Raji, Andrew Smart, and their coauthors remind us that external audits often arrive only after deployment, once the system has already produced negative effects. Their conclusion is that if we truly want to prevent rather than merely observe, then we have to accept an internal audit process that is, in their own words, “_boring, slow, meticulous and methodical_.” And, quietly, that is the exact opposite of the dominant narrative of rapid iteration. Proof, in the strong sense, cannot be reduced to a few checklists or a vague _human in the loop_. It requires a documentary infrastructure, the collection of artifacts, testing procedures, moments of reflection, in short an entire organizational apparatus that slows things down. This slowing down is not a flaw in the system. It is the condition of its credibility. As long as an organization celebrates gains in speed without seriously funding this verification infrastructure, it necessarily underestimates the full cost of the system. And without quite realizing it, it creates the conditions for a risk that becomes less and less insurable.
## 6. Technical debt quickly becomes governance debt
D. Sculley et al.’s paper, _Hidden Technical Debt in Machine Learning Systems_, shows that the debt of learning systems does not reside only in the code, but in the entire web of relationships surrounding the model. The authors remind us that developing and deploying a machine learning system is often relatively fast and apparently inexpensive, while maintaining it over time becomes difficult and costly. And this debt is dangerous precisely because it accumulates silently, at the scale of the whole system. It appears in the CACE principle, _Changing Anything Changes Everything_, in _undeclared consumers_ that quietly reuse a model’s outputs, in _pipeline jungles_ that turn data preparation into an opaque tangle, and even in process debt or cultural debt when organizations continue to reward speed and local accuracy more than reproducibility and stability. I think that point matters because it reminds us that the problem is not simply an imperfect model. It is a system that gradually becomes harder to understand, and therefore harder to correct cleanly.
And in that case, technical debt becomes governance debt. As dependencies multiply, as outputs are reused elsewhere without explicit declaration, as fixes pile up on top of other fixes, and as processing chains become more opaque, the question is no longer merely whether the system works today. The question becomes who will be able, tomorrow, to explain why an error occurred, what change triggered it, which other components were affected, and who will have to answer for it. _Closing the AI accountability gap…_ complements D. Sculley et al.’s _Hidden Technical Debt…_ remarkably well. Inioluwa Deborah Raji and her coauthors define _accountability_ as being responsible, or _answerable for a system, its behavior and its potential impacts_. They also remind us that algorithms themselves cannot be held accountable. Only organizations can, through their governance structures. Once we understand that, we understand the question of insurability. The thicker the technical debt becomes, the more fragile the explanatory chain becomes. The more fragile that explanatory chain becomes, the blurrier the attribution of responsibility becomes. And the blurrier responsibility becomes, the harder the risk is to bear, to price, and sometimes even simply to understand.
## 7. Automation does not eliminate error, it displaces it
Another important point is that we often say automation reduces human error. That is sometimes true, but the formula is misleading because it suggests that error disappears once the machine steps in. Raja Parasuraman and Victor Riley suggest almost the exact opposite in _Humans and Automation: Use, Misuse, Disuse, Abuse_. Their point is not that automation is inherently good or bad. The matter is necessarily subtler than that. Automation reconfigures the human role, shifts attention, transforms the conditions of vigilance, and gives rise to new forms of overuse, underuse, and organizational abuse. They define _misuse_ as overreliance on automation, _disuse_ as underuse, and _abuse_ as implementation decisions taken by designers or managers without sufficient regard for their consequences for human performance. More generally, they remind us that errors associated with automation may come from the operator, the designer, or even management.
That is worth keeping in mind when thinking about the deployment of generative AI in business. When a tool produces quickly, with an appearance of coherence and a form that is often highly persuasive, it does not eliminate the need for judgment. It shifts that need into more diffuse and more thankless tasks: monitoring, doubting, rereading, stepping in only when something seems strange. Parasuraman and Riley also remind us in _Humans and Automation_ that human monitoring works poorly in high-workload environments, in highly autonomous systems, or when operators have too few opportunities to practice the manual task themselves. They add that automation often replaces the operator with the designer, and sometimes even with the manager. From that point on, an error in presentation, recommendation, prioritization, or reasoning is not necessarily the fault of an inattentive user. It may be the product of a system that has deprived humans of the very conditions for effective control, while still expecting them to bear final responsibility. For the insurer, this is a difficult situation because it blurs both damage prevention and the attribution of fault.
## 8. Without internal auditing, proof always comes too late
And what about auditing? Not auditing as an ethical add-on or a cosmetic gesture, but auditing as the minimum condition for credible proof. In _Closing the AI accountability gap…_, Inioluwa Deborah Raji, Andrew Smart, and their coauthors explain that external audits often occur only after deployment, once the system has already produced negative effects, a point I already mentioned, and that they are further limited by their lack of access to internal processes, intermediate models, or training data, often protected as trade secrets. Their proposal is therefore to think about an internal audit conducted throughout the development cycle, in such a way as to produce documents, traces, evaluation criteria, and more broadly a genuine audit report capable of identifying earlier the gaps between what a system was supposed to do and what it may actually end up doing.
I have the feeling that in their article they reverse the usual hierarchy of values. They remind us that a system may be technically reliable in the narrow sense of classical quality assurance, while still failing to satisfy the broader expectations the organization claims to uphold. That is why internal auditing is not only meant to verify a system’s performance before launch. It is also meant to make the development process itself auditable, to create a genuine _transparency trail_, and to make visible the stakeholders involved, the decisions taken, the deviations observed, and the trade-offs made. Without this slow and documentary work, proof always remains ex post. It appears at the moment of the incident, the dispute, or the loss, which is precisely when it is most expensive and least helpful for prevention. And that is why, from the standpoint of insurance, an organization without robust internal auditing very quickly starts to look like an organization that is difficult to insure at any reasonable price.
## 9. AI-related losses will often be delayed, diffuse, and therefore hard to attribute
I touched on this a little in last week’s post, _Fukushima: 15 years later…_, but the image matters when we think about catastrophe. We often imagine technological risk as a spectacular accident, immediate and visible to everyone. I said there that, when it comes to nuclear risk, things are more complicated, and in fact a large share of the damage linked to AI in business may well take a much more discreet form. This is one of the most interesting points in _AI still doesn’t work very well in business, …_. Thomas Claburn explains that the problems will probably be harder to spot in office work than in code, because there are no comparable benchmark tests for evaluating hallucinated advice, plausible but false presentations, or deliverables whose quality no one is really tracking. The article also mentions quality failures that may only become visible after eight or nine months among heavy users, followed by litigation once bad advice has actually produced its effects. The damage therefore does not necessarily appear at the moment of generation. It appears later, when a report has been used to make a decision, when a recommendation has been followed, when a document has been circulated, or when a piece of code has been integrated into a production chain. The risk is not only the risk of error. It is the risk created by the delay between production, circulation of the deliverable, its appropriation by the organization, and the concrete manifestation of the damage.
When I wrote _Insurers and AI, a systemic risk_, I was mostly emphasizing accumulation, correlation, and the possibility of thousands of simultaneous losses when the same technological dependency spreads everywhere. Thomas Claburn reminds us that this is not only about a massive and synchronized shock. It is also about a long tail, about flaws that remain invisible for a long time because they are embedded in ordinary work, banal documents, and apparently acceptable professional recommendations. And that, it seems to me, is precisely what makes the issue so uncomfortable for insurance. A delayed, diffuse risk, difficult to detect and even harder to attribute, complicates prevention, pricing, the construction of proof, and ultimately the settlement of losses. We encounter, in a different form, the same insurance discomfort: AI adds to scenarios of correlated losses an entire family of slow-moving, opaque losses that are costly to reconstruct after the fact.
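To give a rough sense of that delay, here is a toy simulation, with assumed parameters (a 2% latent flaw rate and an average nine-month manifestation delay, neither of which comes from the article), of how little of the exposure already sitting in a portfolio is visible at an early valuation date.

```python
# Toy simulation (assumed parameters): flaws embedded in deliverables today only
# surface as losses months later, so what is visible at any valuation date
# understates the exposure already incurred (an IBNR-like effect).
import random
random.seed(1)

n_deliverables    = 10_000
p_flaw            = 0.02   # probability a deliverable carries a latent flaw
mean_delay_months = 9.0    # average time before the flaw produces a visible loss
horizon_months    = 6      # valuation date, counted from delivery

incurred, reported = 0, 0
for _ in range(n_deliverables):
    if random.random() < p_flaw:
        incurred += 1
        delay = random.expovariate(1.0 / mean_delay_months)  # exponential manifestation delay
        if delay <= horizon_months:
            reported += 1

print(incurred, reported, reported / max(incurred, 1))
# roughly 200 latent losses, of which only about half are visible after six months
```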
## 10. The market is already sending a signal of doubt that public discourse still refuses to acknowledge
Finally, we can return to the quotation I used in the introduction, because if AI really were a clear, robust, measurable, and controllable productivity gain, then the market should reward it unambiguously. But that is not what we are seeing. To be sure, there is the stock market’s extraordinary overvaluation of these companies. But what should we make of downward price pressure when clients learn that a provider is using AI to produce its deliverables, as mentioned in _AI still doesn’t work very well in business, …_? The article also discusses the possible multiplication of lawsuits when bad advice actually causes damage. And above all, it discusses insurers trying to withdraw or limit coverage when AI is involved without a clear chain of responsibility. The same technology can be presented internally as a productivity gain, interpreted by the client as a reduction in value, and analyzed by the insurer as an increase in risk. We can see, and that was part of the point of this post, that this is only an apparent paradox. The more a firm explains that it has automated its production, the more it may simultaneously undermine the perceived value of its work, increase its professional liability exposure, and reduce the insurability of the very service it is selling.
In _Insuring AI. New risks? New models?_, I pointed out that many AI-related risks are already present in insurance portfolios in the form of silent coverages, and that without real financial liability, auditing, certification, and compliance may become little more than box-ticking exercises. I also stressed a very classical insurance point, namely that an insurer can observe prudence only at the cost of verification, and that verification costs which are too high can make a risk practically uninsurable. And _AI still doesn’t work very well in business, …_ allows me to close the loop, at least temporarily. Because if clients want to pay less, if providers are exposing themselves to more litigation, and if insurers are trying to exit the risk, then the market is already sending a very clear signal of doubt. Behind the public narrative of efficiency, there is a doubt that is rarely expressed openly, but is translated in monetary, contractual, and insurance terms. And that is perhaps the doubt we should take seriously, because prices, exclusions, deductibles, and refusals to cover generally say more honestly what people really believe than triumphant speeches about the revolution underway.
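That verification argument can be written down in one line. In this stylized sketch, with entirely assumed figures, the premium is the expected loss plus a risk loading plus the insurer’s verification cost; when that sum exceeds what the insured is willing to pay, no contract is written and the risk is, in practice, uninsurable.

```python
# Stylized pricing sketch (all figures assumed): verification costs can push the
# premium above what any client would accept, one simple route to uninsurability.
expected_loss      = 40_000.0   # expected annual AI-related loss for one insured
risk_loading       = 0.35       # loading for ambiguity, correlation, cost of capital
verification_cost  = 25_000.0   # what the insurer must spend to observe prudence
willingness_to_pay = 60_000.0   # maximum premium the insured would accept

premium = expected_loss * (1 + risk_loading) + verification_cost
print(premium, premium <= willingness_to_pay)   # 79000.0 False -> no contract is written
```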
In short, we may have talked a great deal about “new risks” in connection with artificial intelligence, but as is so often the case, and this was also the conclusion of _Insuring AI. New risks? New models?_, once you dig a little deeper, you realize that this is above all an old story about the economics of information, bad metrics, imperfect control, and diffuse responsibility…
* * *