The Evaluation Design Lifecycle: From Business Need to Valid Metrics

When teams deploy LLMs that fail in production, the root cause is rarely the metrics they chose—it’s that they skipped the process of determining which metrics matter in the first place. You can have ROUGE scores, BERTScore, and even sophisticated LLM-as-a-judge evaluations, yet still build the wrong thing if you haven’t connected measurement to actual business requirements.
This blog post introduces the evaluation design lifecycle: a systematic process for translating stakeholder needs into valid, actionable metrics. We will not discuss specific evaluation metrics here. Instead, we will establish the foundational process that makes those metrics meaningful: the meta of evaluation.
The Missing Layer in Evaluation
Most AI evaluation discussions jump straight to frameworks, metrics, or tools: “Should we use RAGAS or DeepEval?” “Should we use LangSmith or Braintrust?” But this skips a critical question: what are you trying to evaluate, and why are you evaluating at all?
The evaluation design lifecycle fills this gap. It’s the process that helps you design and select any specific metric, ensuring that when you do measure, you’re measuring what actually matters. Whether you’re comparing competing systems, validating a single candidate, or assessing a component within a compound AI system, the fundamental question remains: does this system meet stakeholder needs?
Notice the word “stakeholder,” not “end user.” The person who needs the evaluation—the customer of your evaluation—is rarely the person who will use the system daily. A hospital administrator evaluating a medical chatbot has different concerns than the emergency room nurses who will use it. This misalignment between evaluation customer and end user creates complexity that a disciplined lifecycle must navigate.
The Seven Phases of Evaluation Design
The evaluation design lifecycle consists of seven phases that progress from high-level purpose to concrete measurement. Each phase builds on the previous, creating a traceable path from business need to validated metric. Understanding this lifecycle is essential because it reveals where evaluation efforts typically fail: not in the metrics themselves, but in the foundational decisions that determine which metrics are appropriate.
The lifecycle applies regardless of which specific metrics you ultimately choose. Whether you’ll use statistical metrics like ROUGE, semantic metrics like BERTScore, or LLM-as-a-judge approaches, you must first complete this design process. Think of the lifecycle as the scaffolding that ensures your chosen metrics actually measure what matters.
Here are the seven phases:
Phase 1: Clarify Evaluation Purpose and Scope
The lifecycle begins with purpose. What decision will this evaluation inform? What exactly are you evaluating: the full compound AI system, an agent, an MCP server, an MCP tool, a workflow, a component of a workflow, or something in between?
Consider a machine translation component embedded in a multilingual information retrieval system. Are you evaluating just the translation engine? The entire retrieval system? The system within the specific context of hospital administrators searching medical records? Each scope demands different evaluation criteria and, ultimately, different metrics.
Define your boundaries explicitly. Does “the system” include the user interface? The training required to use it effectively? The humans who will interpret its outputs? These questions seem obvious, yet teams routinely stumble by leaving them unanswered.
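One way to keep boundary questions from going unanswered is to write the scope down as data and check it for gaps. The sketch below is illustrative only: the EvaluationScope class, its field names, and the example values are assumptions made for this post, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvaluationScope:
    decision: str         # the decision this evaluation will inform
    unit_under_test: str  # e.g. a component, a workflow, the full system
    # Each boundary question maps to True (in scope), False (out of scope),
    # or None (nobody has decided yet, which is the state to eliminate).
    boundaries: dict[str, Optional[bool]] = field(default_factory=dict)

    def unanswered(self):
        """Return the boundary questions that still have no explicit answer."""
        return [name for name, included in self.boundaries.items() if included is None]


# Hypothetical scope for the multilingual retrieval example above.
scope = EvaluationScope(
    decision="Adopt the translation component for medical record search?",
    unit_under_test="machine translation component",
    boundaries={
        "user interface": False,
        "user training": None,
        "human interpretation of outputs": True,
    },
)

print(scope.unanswered())
```

A non-empty result is a signal to go back to the stakeholders before moving on to Phase 2.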
Phase 2: Build a Task Model
With scope defined, the next phase identifies who will use your system and what they’re trying to accomplish. AI systems don’t exist in a vacuum—they serve specific purposes for specific people under specific constraints.
Returning to our information retrieval example: Will trained librarians use the system? Students conducting research? Emergency room staff under time pressure? Each user type brings different needs, skills, and tolerances for error. Your evaluation must account for these differences, as they directly influence which quality characteristics matter most.
Phase 3: Identify Quality Characteristics
With a clear task model, you can now identify which system attributes matter for your use case. Start with a framework of quality characteristics: functionality, reliability, efficiency, portability, and similar attributes that define system quality.
Treat these as a checklist, not a mandate. Not every characteristic carries equal weight. In a time-critical environment like an operating room, response time might trump everything else. In a legal context, reliability and auditability could be paramount. Your task model from Phase 2 should inform which characteristics matter most.
This phase produces a prioritized list of quality characteristics. The next phase decomposes these into measurable requirements.
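The prioritized list can be as simple as a set of weighted entries, each carrying the reason it matters. The following is a minimal sketch under stated assumptions: the 1 to 5 weight scale, the QualityCharacteristic class, and the example weights are all hypothetical and would come from your own task model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityCharacteristic:
    name: str
    weight: int     # hypothetical 1-5 importance agreed with stakeholders
    rationale: str  # why the task model makes this characteristic matter


def prioritize(characteristics, min_weight=1):
    """Return characteristics at or above min_weight, most important first."""
    kept = [c for c in characteristics if c.weight >= min_weight]
    # sorted() is stable, so characteristics with equal weight keep their input order.
    return sorted(kept, key=lambda c: c.weight, reverse=True)


# Illustrative weights for a time-critical clinical setting.
candidates = [
    QualityCharacteristic("functionality", 4, "answers must be usable as given"),
    QualityCharacteristic("efficiency", 5, "staff work under time pressure"),
    QualityCharacteristic("portability", 1, "single deployment environment"),
    QualityCharacteristic("reliability", 4, "errors are costly"),
]

for c in prioritize(candidates, min_weight=2):
    print(f"{c.weight}  {c.name}: {c.rationale}")
```

Recording the rationale next to each weight keeps the link back to the task model visible when priorities are questioned later.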
Phase 4: Decompose into Measurable Requirements
This is where evaluation design becomes concrete. You can’t just say “the system should be accurate”—you need to define what accuracy means in your context and how you’ll measure it. Phase 4 transforms high-level quality characteristics into specific, measurable attributes.
This decomposition often requires building a hierarchy of attributes and sub-attributes. “Translation quality,” for instance, has historically fragmented into accuracy, fluency, intelligibility, fidelity, and information preservation—each an attempt to find something objectively measurable.
The key insight: rarely does a single attribute determine system success. You’re almost always balancing multiple requirements, which means you need multiple measurements. This multiplicity is why the evaluation design lifecycle matters—without systematic decomposition, teams pick metrics that measure something, but not necessarily what matters.
Phase 4 outputs a list of specific, measurable requirements. Each requirement must be concrete enough that Phase 5 can define a valid metric for it.
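The attribute hierarchy described above maps naturally onto a small tree whose leaves are the requirements that need a metric. This is a sketch for illustration: the Attribute class is hypothetical, and the tree simply reuses the translation quality sub-attributes named earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    children: list["Attribute"] = field(default_factory=list)

    def leaves(self, path=()):
        """Yield the full path to every leaf attribute under this node."""
        here = path + (self.name,)
        if not self.children:
            yield here
        for child in self.children:
            yield from child.leaves(here)


# Illustrative decomposition using the sub-attributes mentioned in the text.
translation_quality = Attribute(
    "translation quality",
    [
        Attribute("accuracy"),
        Attribute("fluency"),
        Attribute("intelligibility"),
        Attribute("fidelity"),
        Attribute("information preservation"),
    ],
)

# Each leaf path is a candidate measurable requirement for Phase 5.
for path in translation_quality.leaves():
    print(" > ".join(path))
```

If a leaf still cannot be measured directly, it needs children of its own; decomposition stops only when every leaf is concrete enough to attach a metric to.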
Phase 5: Define Valid Metrics and Methods
For each requirement from Phase 4, you must now define both what you’ll measure and how you’ll measure it. This is where specific metrics—BLEU, ROUGE, BERTScore, human evaluation, LLM-as-a-judge—enter the picture. The evaluation design lifecycle ensures you select these metrics deliberately, not arbitrarily.
This phase is where many evaluations fail, not because teams lack metrics, but because they lack valid metrics. Consider the cautionary tale of ALPAC’s intelligibility metric from early machine translation evaluation. Evaluators asked humans to rate translations using scales with descriptions like “perfectly clear and intelligible” or “hopelessly unintelligible.” The problem? Without agreement on what these terms mean, the metric couldn’t be valid. Circular definitions don’t produce reliable measurements.
Metric validity requires two components:
- The measure itself: What specific value or score will you compute?
- The measurement method: What procedure will produce that measure reliably?
For reference-based metrics like ROUGE (Chapter 2) or BERTScore (Chapter 3), the measure is a numerical score and the method involves comparing system output to reference texts. For LLM-as-a-judge approaches (Chapter 7), the measure might be a categorical rating and the method involves prompt design and model selection.
Once you have valid metrics, establish threshold scores. What constitutes success? What’s acceptable? What fails? These cut-offs flow directly from your task model (Phase 2) and business requirements (Phase 1).
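The pairing of measure, method, and thresholds can be made explicit in code. The sketch below is an assumption-laden illustration, not a recommended metric: the Metric class and its threshold values are hypothetical, and exact_match_rate is a deliberately simple stand-in for whichever method you actually select.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Metric:
    requirement: str      # the Phase 4 requirement this metric serves
    measure: str          # what value is reported
    method: Callable[[Sequence[str], Sequence[str]], float]  # how it is produced
    pass_at: float        # hypothetical threshold: meets the need
    acceptable_at: float  # hypothetical threshold: usable with caveats

    def judge(self, outputs, references):
        """Compute the score and classify it against the thresholds."""
        score = self.method(outputs, references)
        if score >= self.pass_at:
            verdict = "pass"
        elif score >= self.acceptable_at:
            verdict = "acceptable"
        else:
            verdict = "fail"
        return score, verdict


def exact_match_rate(outputs, references):
    """Fraction of outputs identical to their reference after trimming whitespace."""
    if len(outputs) != len(references) or not outputs:
        raise ValueError("outputs and references must be non-empty and equal length")
    hits = sum(o.strip() == r.strip() for o, r in zip(outputs, references))
    return hits / len(outputs)


metric = Metric(
    requirement="answers match the reference answer",
    measure="exact match rate (0.0 to 1.0)",
    method=exact_match_rate,
    pass_at=0.9,
    acceptable_at=0.75,
)

score, verdict = metric.judge(
    ["Paris", "Berlin ", "Rome", "Madrid"],
    ["Paris", "Berlin", "Rome", "Lisbon"],
)
print(score, verdict)
```

Keeping the requirement string on the metric itself means a score can never be reported without the reason it was measured.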
Note that not all metrics require complex evaluation protocols. If your budget caps at $500 and the cheapest system costs $20,000, you can skip subsequent phases entirely. Price is a perfectly valid—and easily measured—attribute. The lifecycle doesn’t demand complexity; it demands appropriateness.
Phase 6: Design Evaluation Execution
With metrics defined, Phase 6 plans the actual evaluation logistics. Who will conduct measurements? When and where? What test materials do you need? How will you present results to support decision-making?
This phase transforms your evaluation design from concept to actionable plan. It’s also your final opportunity to catch design flaws before investing time and resources in execution. Questions to address include:
- What test data or scenarios will you use?
- How many samples do you need for statistical significance?
- Who performs the evaluation (automated systems, human raters, domain experts)?
- What format will results take (quantitative scores, qualitative reports, comparative rankings)?
The outputs from Phase 6 are an execution plan and any necessary test materials.
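For the sample-size question, one common starting point applies when the metric is a pass rate (a proportion): the normal-approximation formula n = z^2 * p * (1 - p) / margin^2. The sketch below assumes that setting only; the function name and defaults are illustrative, and comparing two systems or using non-binary scores calls for a different calculation.

```python
import math
from statistics import NormalDist


def samples_for_proportion(margin, confidence=0.95, expected_rate=0.5):
    """Samples needed to estimate a pass rate within +/- margin.

    Uses the normal approximation n = z^2 * p * (1 - p) / margin^2.
    expected_rate=0.5 is the most conservative choice (largest n).
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n = z ** 2 * expected_rate * (1 - expected_rate) / margin ** 2
    return math.ceil(n)


print(samples_for_proportion(0.05))  # +/- 5 points at 95% confidence
print(samples_for_proportion(0.10))  # +/- 10 points at 95% confidence
```

Halving the margin roughly quadruples the required sample, which is worth knowing before committing to expensive human ratings.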
Phase 7: Execute and Report
The final phase executes the evaluation plan. Collect measurements, compare results against predetermined thresholds from Phase 5, and synthesize findings into a clear report that supports the decision identified in Phase 1.
This phase is what most people think of as “the evaluation.” But as the lifecycle reveals, execution is the culmination of six prior phases of deliberate design. Skip that foundation, and your measurements—however precise—may answer the wrong questions entirely.
The output of Phase 7 is an evaluation report that traces from measurements back through the lifecycle: these metrics were chosen because of these requirements, which decomposed from these quality characteristics, which mattered because of this task model, which served this business purpose. This traceability is what separates rigorous evaluation from measurement theater.
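The traceability chain can be carried alongside each reported number. The following is a minimal sketch, assuming a hypothetical TraceLink record and made-up example values; a real report would pull these from the artifacts of each phase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceLink:
    phase: int
    label: str
    value: str


def trace_report(measurement, links):
    """Render one measurement with the chain of decisions that justify it."""
    lines = [f"Measurement: {measurement}"]
    # Walk from the metric (Phase 5) back to the purpose (Phase 1).
    for link in sorted(links, key=lambda link: link.phase, reverse=True):
        lines.append(f"  Phase {link.phase} {link.label}: {link.value}")
    return "\n".join(lines)


# Hypothetical chain for a single result.
links = [
    TraceLink(1, "purpose", "decide whether to adopt the translation component"),
    TraceLink(2, "task model", "hospital administrators searching medical records"),
    TraceLink(3, "quality characteristic", "functionality"),
    TraceLink(4, "requirement", "translations preserve the information in the source"),
    TraceLink(5, "metric", "exact match rate against reference translations"),
]

report = trace_report("exact match rate = 0.75 (acceptable)", links)
print(report)
```

If any link in the chain cannot be filled in, that gap is itself a finding worth reporting.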
The Lifecycle as Living Process
A final reality: evaluation requirements aren’t static. As you execute your evaluation, you may discover that no available system meets all requirements, or that a system offers capabilities you hadn’t considered. Requirements evolve as understanding deepens and circumstances change.
This doesn’t diminish the lifecycle’s value—it makes it essential. The seven phases provide a structured foundation for principled adaptation. When requirements shift, you can trace implications systematically rather than making ad-hoc adjustments. Should a new requirement emerge in Phase 6, you can walk back through Phases 4 and 5 to ensure your metrics still align. The lifecycle creates traceability even as conditions evolve.
From Process to Metrics
The evaluation design lifecycle establishes what to measure before addressing how to measure it. This ordering is deliberate. Without Phase 1 through Phase 4, even the most sophisticated metric measures something arbitrary. With them, metrics become instruments of purpose rather than exercises in measurement.
The subsequent chapters of this book dive deep into specific evaluation methods:
- Classical reference-based metrics (BLEU, ROUGE) that compare outputs to gold standards
- Semantic similarity metrics (BERTScore, COMET) that capture meaning beyond surface form
- Human evaluation protocols that ground metrics in actual user judgments
- LLM-as-a-judge approaches that scale evaluation using language models themselves
- Alignment techniques (RLHF, Constitutional AI) that connect evaluation to system improvement
Each method has strengths and limitations. Each makes different assumptions about what “quality” means. The evaluation design lifecycle ensures you select methods that align with your specific business needs rather than defaulting to whatever seems most sophisticated or most commonly used.
When you encounter ROUGE-L in Chapter 2, you’ll understand not just the formula for computing longest-common-subsequence overlap, but why you might choose ROUGE-L over other metrics based on your task model and quality requirements. When you learn about BERTScore in Chapter 3, you’ll recognize when semantic similarity matters more than lexical overlap. When you implement LLM-as-a-judge in Chapter 7, you’ll know which quality characteristics it measures well and which it doesn’t.
The lifecycle transforms metrics from black boxes into deliberate choices. That transformation is what makes evaluation meaningful rather than merely measurable.
