The Handover of AI Standard-Setting

The public bodies that are supposed to set the standards for AI regulation have, for the most part, not done it yet. AI regulations on both sides of the Atlantic require providers to certify or document that their systems meet general requirements (such as accuracy, fairness, robustness, and human oversight). But they leave much of the work of specifying what those requirements mean to bodies that have not yet produced standards that match the systems being regulated.
The delay in implementing the European Union’s AI Act is a visible example. Under the Act, providers of high-risk AI systems are supposed to certify their systems against harmonized technical standards written by independent standardization bodies in Brussels (CEN and CENELEC). Those bodies missed their August 2025 deadline to issue the standards, and the European Commission proposed postponing the application of parts of the Act to 2027 and 2028 as a result. In the meantime, providers are working out their own definitions of what compliance requires with, at most, sectoral guidance from non-AI regulators and their own interpretations of general legal requirements. The standard-setting work that the AI Act assumed public bodies and regulators would do, in other words, is being done by the companies whose systems are being regulated. This pattern, as detailed below, is not specific to the AI Act.
Why Setting the Standards is Difficult
The pattern is the short-term consequence of a long-term problem: the standard-setting that AI regulation requires sits between two communities. The technical and legal vocabularies on which AI regulation depends often do not line up closely enough to let regulators push back substantively on providers’ self-assessments.
Two knowledge communities write the rules for AI, but they operate at substantial distance from each other. The AI safety community, whose members are trained in computer science and engineering, thinks in terms of how systems fail under different conditions, what counts as effective testing, and how to measure risk before deployment. The AI regulation community, whose members are trained in law and rights-based governance, thinks in terms of who is responsible when something goes wrong, what process people are owed before a decision affects them, and what rights they have to challenge it after the fact. Standards bodies, sectoral regulators, and domain experts need to be familiar with both disciplinary poles. But people who operate natively at both poles are rare, and the questions that fall between the two overlap with those that AI regulations have struggled most to specify.
This distance shows up in the texts. For example, the AI Act requires that high-risk systems be designed and developed so that they can be effectively overseen by natural persons (Article 14), a requirement drafted in language the legal community recognizes. But deciding what to do when the system that best achieves the accuracy, robustness, and cybersecurity mandated elsewhere in the Act (Article 15(1)) is one that humans cannot effectively oversee is specification work of the kind the Act delegates to standards that do not yet exist. A similar pattern appears in U.S. executive orders. The most detailed provisions of Executive Order 14110 of 2023 (since revoked) focused on testing AI systems for dangerous capabilities, a concept native to the safety community, without specifying how the results of those tests bear on who is liable when the systems cause harm. Each of these provisions treats one community’s framing as primary and the other’s as friction. What falls between the framings is the content that standards are supposed to supply.
Who Sets the Standards
When applying a regulatory text depends on a synthesized technical-legal vocabulary that has not yet been built, the work of specifying what compliance means falls on whichever actor is closest to the systems being regulated. In current AI regulation, that actor is the provider. Many legal frameworks, the AI Act among them, use general language that gestures at the technical domain. When the technical specifications that would let regulators evaluate compliance are delayed or never arrive, providers in practice certify their own systems against legislative texts and partial standards whose language does not always match how they actually build and test their systems.
What providers are doing in the gap is standard-setting. When a provider certifies that its system meets a general regulatory requirement without a published specification of what counts as adequate, the provider is effectively producing the substantive content of what the regulation requires. In the absence of published standards, that documentation tends to become the operational meaning of the regulation, and the first cases that test the meaning are likely to refer back to what providers wrote.
The AI Act’s conformity assessment is one instance of this. Annex III of the Act lists categories of high-risk AI, including systems used in employment, education, public benefits, essential services, law enforcement, and border control. The Act delegates the work of specifying how a provider should evaluate whether its system is performing acceptably within each category to harmonized technical standards that, for most categories, do not yet exist.
Picture a government agency deploying AI to screen for fraud in a regulatory environment with no standard for what counts as an acceptable false-positive rate, adequate human review of flagged cases, or sufficient accuracy across demographic groups. The agency or its vendor will produce internal documentation describing how the system handles each of these. When the system wrongly flags people and the harm becomes visible, that documentation will be central to what courts and regulators examine to determine compliance. In effect, it will begin to fill in the answer to what the law requires.
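To make the gap concrete, here is a minimal Python sketch of the kind of check an agency or vendor might write into such documentation. Everything in it is hypothetical: the `Case` record layout, the function names, and above all the two thresholds `max_fpr` and `max_gap`, which stand in for numbers no published standard supplies. The arithmetic is routine; what the code cannot supply is where those two numbers come from.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Case:
    """One screened claim (hypothetical record layout)."""
    group: str        # demographic group label
    flagged: bool     # whether the system flagged the claim as fraud
    fraudulent: bool  # ground truth established after human review


def false_positive_rates(cases):
    """Per group, the share of legitimate claims that were wrongly flagged."""
    legitimate = defaultdict(int)
    wrongly_flagged = defaultdict(int)
    for c in cases:
        if not c.fraudulent:
            legitimate[c.group] += 1
            if c.flagged:
                wrongly_flagged[c.group] += 1
    return {g: wrongly_flagged[g] / legitimate[g] for g in legitimate}


def passes_internal_check(cases, max_fpr, max_gap):
    """Hypothetical provider-defined test: no group's false-positive rate
    exceeds max_fpr, and the spread between the highest and lowest group
    rates is at most max_gap. Both thresholds are chosen by the provider."""
    rates = false_positive_rates(cases)
    if not rates:
        raise ValueError("no legitimate claims to evaluate")
    worst, best = max(rates.values()), min(rates.values())
    return worst <= max_fpr and worst - best <= max_gap


# Illustrative toy data, not drawn from any real system.
cases = (
    [Case("A", True, False)] + [Case("A", False, False)] * 3
    + [Case("B", True, False)] * 2 + [Case("B", False, False)] * 2
    + [Case("A", True, True)]
)

print(false_positive_rates(cases))  # {'A': 0.25, 'B': 0.5}
print(passes_internal_check(cases, max_fpr=0.5, max_gap=0.3))  # True
print(passes_internal_check(cases, max_fpr=0.2, max_gap=0.3))  # False
```

The same toy data passes or fails depending solely on thresholds the provider chose, which is the sense in which the documentation, rather than the regulatory text, ends up defining compliance.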
The Dutch childcare benefits scandal showed what happens when an agency deploys an algorithmic system without external benchmarks to check its performance. The Tax and Customs Administration used a risk-scoring algorithm to flag benefit claims for fraud, wrongly accused about 26,000 families, and disproportionately targeted parents with immigrant backgrounds. The absence of a published standard for what counted as an acceptable error rate or adequate human review meant there was no external tripwire that forced a correction before the harm accumulated.
U.S. employment law shows the same pattern. Title VII forbids hiring practices that produce unjustified disparities across race, sex, and other protected categories. It does not say what counts as a disparity, what counts as a fair test, or how an employer should prove its hiring tool is not discriminatory. Those substantive specifications were left to the 1978 Uniform Guidelines on Employee Selection Procedures, a set of federal guidelines written for paper-and-pencil aptitude tests that defined the technical content of Title VII’s requirements: how to measure pass rates by group, what threshold counts as a problem, and what kind of study shows the test actually predicts job performance. The guidelines were the kind of two-community synthesis AI regulation needs. Employers running AI hiring tools today operate inside that handoff even though the 1978 specifications do not squarely fit AI systems. Neither the U.S. Equal Employment Opportunity Commission (EEOC) nor Congress has produced specifications that do for machine learning hiring tools what the 1978 guidelines did for paper-and-pencil tests. Instead, the companies running AI hiring tools are filling in the blanks themselves, and their internal documentation is becoming the working answer to what Title VII requires of an AI system.
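One piece of technical content the 1978 guidelines did supply is the "four-fifths rule" (29 C.F.R. § 1607.4(D)): a selection rate for any race, sex, or ethnic group that is less than four-fifths of the rate for the group with the highest rate will generally be regarded by federal enforcement agencies as evidence of adverse impact. The Python sketch below computes that ratio. The function names and applicant counts are hypothetical, and the sketch omits the Guidelines' qualifications about small samples and statistical and practical significance.

```python
from fractions import Fraction

# Threshold from the 1978 Uniform Guidelines (29 C.F.R. 1607.4(D)).
FOUR_FIFTHS = Fraction(4, 5)


def impact_ratios(applicants, selected):
    """Each group's selection rate divided by the highest group's rate.

    applicants and selected map group labels to counts. Fractions keep the
    arithmetic exact, so a ratio of exactly 4/5 is not misclassified."""
    rates = {g: Fraction(selected[g], applicants[g]) for g in applicants}
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no group had anyone selected")
    return {g: rate / highest for g, rate in rates.items()}


def flags_adverse_impact(applicants, selected):
    """True for groups whose impact ratio falls below four-fifths."""
    return {g: ratio < FOUR_FIFTHS
            for g, ratio in impact_ratios(applicants, selected).items()}


# Hypothetical counts for illustration only.
applicants = {"group_a": 100, "group_b": 80}
selected = {"group_a": 60, "group_b": 30}

print(impact_ratios(applicants, selected))
# {'group_a': Fraction(1, 1), 'group_b': Fraction(5, 8)}
print(flags_adverse_impact(applicants, selected))
# {'group_a': False, 'group_b': True}
```

Even this simple rule may leave open choices that an AI hiring tool forces on the employer, for example which stage of a ranked or scored pipeline counts as "selection"; a text written for pass/fail tests does not settle them.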
The problem is not that the technology is moving faster than the law. The problem is rather that the law delegated its substantive content to technical specifications that depend on the state of technology, and neither the agencies that issued the original guidelines (the EEOC, Department of Labor, Department of Justice, and what is now the Office of Personnel Management) nor Congress nor the courts filled the gap with specifications that match the systems being regulated.
The pattern is also visible in legislation that did not pass; in fact, it killed Canada’s Artificial Intelligence and Data Act (AIDA). AIDA, introduced in 2022 as part of Bill C-27, died in 2025 after sustained criticism from industry, civil society, labour organisations, and academics. The shared criticism concerned the Act’s structure. AIDA defined “high-impact” AI systems, harm, and compliance obligations in general terms and delegated the substantive specification of those terms to regulations that Innovation, Science and Economic Development Canada would write later, without further parliamentary deliberation. Critics argued that they could not evaluate the Act because the rules it would actually impose did not exist yet. Technical substance had to come from outside the Act because the Act itself did not contain meaningful operative content. AIDA’s failure is a case in which the delegation of technical content was visible enough at the legislative stage to attract sustained pushback, aimed not at the delegation as such but at the absence of any specification that would have let people evaluate what the Act’s language meant in practice. The same delegation, less visible because it sits inside enacted texts and unfilled regulatory mandates, is operating in the AI Act and in U.S. sectoral regulation.
These cases share a failure mechanism. A regulatory text uses technical-legal language that the relevant institutions have not operationalized for AI. The provider is closest to the system and has the strongest incentive to produce a workable specification, so the specification gets produced there. Whatever the provider produces may become the standard courts evaluate against because nothing else is available. Providers are setting the standards that their own systems are then measured against.
What This Asks of Regulators
The test for whether an AI rule or standard is doing its work is whether a regulator can use it to evaluate a provider’s compliance against something other than the provider’s own choices. If the rule does not specify enough for that, the provider is filling in the standard by default. Meeting that test requires someone who can determine, from the rule’s text, what it requires of a real system in a real deployment. Otherwise, the work of determining how the rule operates falls on private actors with strong incentives to resolve it in their own favor. A rule that reads coherently only from inside one community is a rule that delegates its hardest standard-setting choices to whoever has the most resources to make them.
The gap is partly a capacity problem. Standards bodies tasked with the specification work are often underfunded relative to the task. And it is in part a political problem: industry participants in standards working groups have incentives to keep specifications loose. The vocabulary gap, which is the part of the problem institutional processes are least equipped to address, compounds the other two.
Two changes follow from the diagnosis. First, regulatory drafting processes should bring technical and legal staff together at the level of provision design, since sequential drafting (where one community produces a text and the other reviews it) risks locking in whichever vocabulary went first. Co-drafting does not solve the capacity or incentive problems, but it produces regulatory texts that give standards bodies clearer instructions about what their standards need to do. Second, research funders could encourage grantees to develop, as part of their deliverables, frameworks usable by both communities. That forces conceptual mapping to happen during research, while commitments are still being formed, and it addresses the capacity problem by producing shared categories that standards bodies can draw on. Both changes aim at the same output: shared technical-legal categories that regulators can use to push back substantively on provider self-assessment. Neither would close the standards gap on its own.
The gap will narrow with or without these changes. Shared categories will develop either through deliberate work or through a first generation of standards whose failures are repaired after major cases force the issue. The current trajectory points toward the second. Providers do bear a cost in writing standards without settled benchmarks, because what they produce may not survive judicial or regulatory review. But the same work hands them the advantage of defining what the regulation operationally requires. The most significant cost, ultimately, falls on the people on the receiving end of those decisions. They have no way of knowing that the standard against which their case will be measured is one the provider wrote, and they need an institutional check to test whether that standard adequately serves interests beyond the provider’s own.

Sentinel — Human