CDT Submits Comments on NIST’s Draft Guidance for Automated Benchmark Evaluations of Language Models

AI Policy & Governance, CDT AI Governance Lab
The Center for Democracy & Technology (CDT) submitted comments in response to a request for comment from the Center for AI Standards and Innovation (CAISI) at the National Institute of Standards and Technology (NIST) on its draft guidance, Practices for Automated Benchmark Evaluations of Language Models (AI 800-2). The CAISI draft provides guidance on how developers can conduct robust automated evaluations of large language models (LLMs), covering critical aspects of the measurement process, such as defining measurement constructs, implementing evaluations, and debugging common issues.
The draft guidance strikes a thoughtful balance between conceptual foundations and practical implementation. By emphasizing the importance of clearly defining evaluation goals and aligning methods with those goals, it encourages practitioners to design assessments that fit their specific use cases, rather than defaulting to popular or convenient approaches that may not meaningfully serve their objectives. The document also highlights several “emerging practices” in evaluation, thereby providing a source not only of established best practices but also of forward-looking methods that are likely to become increasingly important as AI systems evolve. For example, the draft guidance suggests ways to manage “evaluation awareness,” a phenomenon whereby models “recognize” that they are being evaluated, potentially undermining the validity of evaluation results.
Our comments identify several areas where CAISI could further strengthen this guidance in future iterations:
Evaluation as an iterative process. The current document presents evaluation implementation as a four-step linear process: (1) designing the evaluation, (2) writing the evaluation code, (3) running the evaluation and tracking results, and (4) debugging the evaluation. This framing implies that the steps are taken strictly in succession. We recommend that NIST more explicitly frame evaluation development as iterative, with guidance on prototyping evaluations, how recommended practices may differ across early and later development phases, and what indicators suggest an evaluation is (or is not) mature enough for full-scale deployment.
Integration with existing documentation. We also encourage CAISI to provide guidance on how practitioners can incorporate documentation on evaluations into existing artifacts like model cards and system cards. Future versions could offer more direction on integrating updated evaluation results into these artifacts while maintaining version control and traceability, as well as human-centered guidance to help non-technical stakeholders interpret evaluation results and their associated uncertainty.
Clarity on subjective evaluations. The document initially defines automated evaluations as assessments with known or automatically verifiable solutions, but also discusses methods that use LLMs to approximate human preferences, which is a fundamentally different kind of measurement. We urge CAISI to clarify whether and under what conditions these inherently subjective evaluations fall within the scope of this guidance, and if so, to more explicitly address their distinctive limitations and debugging requirements.
LLM-as-a-judge limitations. In some cases, practitioners use LLMs to evaluate another model’s outputs, a practice commonly referred to as “LLM-as-a-judge.” This approach can improve scalability when large-scale human annotation is impractical. While the guidance discusses the use of LLMs as judges and acknowledges some of the method’s weaknesses, it addresses the associated risks and limitations only briefly. Prior research has documented systematic biases in LLM-as-a-judge methods, including egocentric bias (models favoring their own outputs), length bias (preferring longer responses regardless of quality), and sensitivity to implementation details. In our own work, we have found that commonly proposed mitigation strategies do not reliably improve performance. We therefore recommend that future versions include more detailed guidance on identifying, diagnosing, and managing these challenges.
