Evaluating Claude Performance: The Evals Guide
By Dorian Laurenceau
Last reviewed: April 24, 2026. Updated with April 2026 findings and community feedback.
Pillar article: Claude API: Complete Guide
Why Evaluate Claude?
AI is probabilistic: the same request can produce different results. Evaluations allow you to:
- Measure the quality of responses objectively
- Compare prompts to find the best formulation
- Validate model changes without production surprises
- Detect regressions during API updates
- Justify decisions to stakeholders with data
The uncomfortable truth about LLM evaluation, repeated across r/MachineLearning and r/ChatGPTCoding threads: most teams shipping with Claude don't have real evals โ they have "I ran it five times and it looked fine." That is not a failure of discipline as much as it is a failure of tooling: traditional CI assumptions (deterministic outputs, pass/fail assertions) break the moment you're grading natural language. The pragmatic teams that actually close this gap treat eval sets like regression tests โ small (30-200 examples), versioned with the repo, hand-labeled once, and re-run on every prompt change.
Where the community correctly pushes back: "LLM-as-judge" scoring is not a substitute for human-graded ground truth; it is a cost-efficient proxy that only works after you've calibrated it. Research like "Judging LLM-as-a-Judge" documented what practitioners keep rediscovering โ judge models inherit the biases of their training data, favor verbose answers, and can be gamed by adversarial phrasing. The fix isn't to abandon LLM-judges; it's to validate them against a small gold set before you trust their verdict on the other 99%.
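Calibration can be as simple as comparing the judge's scores against your hand-labeled gold set before letting it grade everything else. A minimal sketch (the score lists here are illustrative, not real data):

```python
def judge_agreement(gold_scores, judge_scores, tolerance=1):
    """Fraction of cases where the judge lands within `tolerance`
    points of the human grade on a 1-5 scale."""
    assert len(gold_scores) == len(judge_scores)
    hits = sum(
        1 for g, j in zip(gold_scores, judge_scores)
        if abs(g - j) <= tolerance
    )
    return hits / len(gold_scores)

# Human-labeled gold set (a few dozen cases is enough for a sanity check)
gold = [5, 4, 2, 5, 3, 1, 4, 4]
# The same cases, scored by the LLM judge
judge = [5, 5, 2, 4, 3, 3, 4, 3]

agreement = judge_agreement(gold, judge)
print(f"Judge agreement (within 1 point): {agreement:.0%}")
# Only trust the judge for bulk scoring if agreement is high (e.g. >= 90%)
```

If agreement is low, fix the judge prompt or criteria before scaling, not after.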
Pragmatic operating rule: if you can't point to the eval set that caught the last regression, you don't have evals. You have hope. The Anthropic evaluations documentation is a fine starting place; Promptfoo and DeepEval are the open-source tools most teams end up using.
The 3 Types of Evaluations
1. Automated Evaluations
Scores calculated by code: fast, reproducible, and requiring no human judgment.
| Method | Description | When to use |
|---|---|---|
| Exact match | The response is exactly the expected text | Factual QA, classification |
| Contains | The response contains a key word/phrase | Information extraction |
| Regex | The response matches a pattern | Structured formats (dates, emails) |
| JSON schema | The response follows a JSON schema | Structured outputs |
| Code assertion | A script checks conditions | Complex business logic |
import json
import re
def eval_exact_match(response, expected):
"""Check for an exact match."""
return response.strip().lower() == expected.strip().lower()
def eval_contains(response, keywords):
"""Check that the response contains the keywords."""
response_lower = response.lower()
return all(kw.lower() in response_lower for kw in keywords)
def eval_json_schema(response, required_fields):
"""Check that the response is valid JSON with required fields."""
try:
data = json.loads(response)
return all(field in data for field in required_fields)
except json.JSONDecodeError:
return False
def eval_regex(response, pattern):
"""Check that the response matches a regex pattern."""
return bool(re.search(pattern, response))
# Usage examples
assert eval_exact_match("Paris", "Paris")
assert eval_contains("Python is an interpreted language", ["python", "language"])
assert eval_json_schema('{"name": "Alice", "age": 30}', ["name", "age"])
assert eval_regex("The price is $29.99", r"\$\d+\.\d{2}")
2. Human Evaluations
Human annotators score responses against defined criteria. This is the quality gold standard, but it is slow and expensive to scale.
| Criterion | Scale | Description |
|---|---|---|
| Accuracy | 1-5 | Are the facts correct? |
| Relevance | 1-5 | Is the response suited to the question? |
| Completeness | 1-5 | Are all aspects covered? |
| Clarity | 1-5 | Is the response well-structured and understandable? |
| Usefulness | 1-5 | Can the user act on this response? |
Recommended process:
- Create an annotation guide with examples for each score
- Train 2-3 annotators on the criteria
- Have each response evaluated by at least 2 annotators
- Calculate inter-annotator agreement (Cohen's Kappa)
- Resolve disagreements through discussion
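The inter-annotator agreement step above can be computed with a few lines of standard-library Python (`sklearn.metrics.cohen_kappa_score` gives the same result if you already use scikit-learn):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa between two annotators rating the same items."""
    n = len(ratings_a)
    # Observed agreement: fraction of items with identical ratings
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement, from each annotator's rating distribution
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(
        (freq_a[r] / n) * (freq_b[r] / n)
        for r in set(ratings_a) | set(ratings_b)
    )
    return (p_o - p_e) / (1 - p_e)

# Two annotators rating 5 responses on accuracy (1-5 scale)
kappa = cohens_kappa([1, 2, 3, 3, 2], [1, 2, 3, 2, 2])
print(f"Cohen's Kappa: {kappa:.2f}")  # prints 0.69
```

As a rule of thumb, values above roughly 0.6 indicate substantial agreement; much lower and your annotation guide needs clearer examples.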
3. Model-Graded Evaluations
An LLM evaluates another LLM's responses. Scalable and fast, and a reasonable proxy for human evaluation once calibrated against a gold set.
import anthropic
import json
client = anthropic.Anthropic()
def model_graded_eval(question, response, criteria):
"""Use Claude to evaluate a response."""
eval_prompt = f"""Evaluate this response based on the given criteria.
Question asked: {question}
Response to evaluate: {response}
Evaluation criteria:
{criteria}
Return a JSON with:
- "score": rating from 1 to 5
- "reasoning": justification in 2-3 sentences
- "issues": list of identified problems (or empty list)
"""
eval_response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{"role": "user", "content": eval_prompt}]
)
return json.loads(eval_response.content[0].text)
# Usage
result = model_graded_eval(
question="What are the advantages of TypeScript over JavaScript?",
response="TypeScript adds static typing, which reduces bugs...",
criteria="""
- Technical accuracy (are the claims correct?)
- Completeness (are the main advantages covered?)
- Examples (are concrete examples provided?)
"""
)
print(f"Score: {result['score']}/5")
print(f"Reason: {result['reasoning']}")
Designing Test Cases
Test Case Structure
test_case = {
"id": "tc-001",
"category": "extraction",
"input": "Marie Dupont, 35 years old, developer at TechCorp in London.",
"expected_output": {
"name": "Marie Dupont",
"age": 35,
"job": "developer",
"company": "TechCorp",
"city": "London"
},
"eval_method": "json_schema",
"tags": ["extraction", "structured_output"]
}
Test Case Categories
| Category | Examples | Recommended coverage |
|---|---|---|
| Happy path | Standard cases, well-formed inputs | 40% of cases |
| Edge cases | Boundary inputs, unusual formats | 25% of cases |
| Adversarial | Misleading, contradictory inputs | 15% of cases |
| Multilingual | Inputs in multiple languages | 10% of cases |
| Empty/invalid | Empty, null, corrupted inputs | 10% of cases |
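A quick way to keep a suite honest against these targets is to count cases per category, assuming each test case carries a `category` field as in the structure above (the category keys and target shares below mirror the table):

```python
from collections import Counter

# Target coverage shares from the table above
TARGETS = {
    "happy_path": 0.40,
    "edge_case": 0.25,
    "adversarial": 0.15,
    "multilingual": 0.10,
    "empty_invalid": 0.10,
}

def coverage_gaps(test_cases, targets=TARGETS, slack=0.8):
    """Return the categories whose actual share falls below slack * target."""
    counts = Counter(tc["category"] for tc in test_cases)
    total = len(test_cases)
    return [
        cat for cat, target in targets.items()
        if counts.get(cat, 0) / total < target * slack
    ]
```

Run it whenever you add test cases; an empty result means every category is within tolerance of its target.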
Complete Evaluation Script
import anthropic
import json
from datetime import datetime
client = anthropic.Anthropic()
def run_eval_suite(test_cases, model, system_prompt, eval_fn):
"""Run a complete evaluation suite."""
results = []
passed = 0
for tc in test_cases:
# Call Claude
response = client.messages.create(
model=model,
max_tokens=1024,
system=system_prompt,
messages=[{"role": "user", "content": tc["input"]}]
)
output = response.content[0].text
# Evaluation
score = eval_fn(output, tc["expected_output"])
passed += 1 if score else 0
results.append({
"id": tc["id"],
"input": tc["input"],
"expected": tc["expected_output"],
"actual": output,
"passed": score,
"tokens": response.usage.input_tokens + response.usage.output_tokens
})
# Report
total = len(test_cases)
print(f"\n{'='*50}")
print(f"Results: {passed}/{total} ({passed/total*100:.1f}%)")
print(f"Model: {model}")
print(f"Date: {datetime.now().isoformat()}")
print(f"{'='*50}")
# Failed cases
failed = [r for r in results if not r["passed"]]
if failed:
        print(f"\n{len(failed)} failed cases:")
for r in failed[:5]:
print(f" - {r['id']}: expected '{r['expected']}', got '{r['actual'][:100]}...'")
return results
# Execution
test_cases = [
{"id": "tc-001", "input": "What is the capital of France?", "expected_output": "Paris"},
{"id": "tc-002", "input": "What is the capital of Germany?", "expected_output": "Berlin"},
{"id": "tc-003", "input": "What is the capital of Japan?", "expected_output": "Tokyo"},
]
results = run_eval_suite(
test_cases=test_cases,
model="claude-sonnet-4-20250514",
system_prompt="Reply only with the city name, no sentence.",
eval_fn=lambda output, expected: expected.lower() in output.lower()
)
A/B Testing Prompts
Systematically compare two prompt versions to identify the best one.
def ab_test_prompts(test_cases, prompt_a, prompt_b, model, eval_fn):
"""Compare two prompts on the same test cases."""
results_a = run_eval_suite(test_cases, model, prompt_a, eval_fn)
results_b = run_eval_suite(test_cases, model, prompt_b, eval_fn)
score_a = sum(1 for r in results_a if r["passed"]) / len(results_a)
score_b = sum(1 for r in results_b if r["passed"]) / len(results_b)
tokens_a = sum(r["tokens"] for r in results_a)
tokens_b = sum(r["tokens"] for r in results_b)
    print(f"\nA/B Test Results")
    print(f"{'Metric':<20} {'Prompt A':<15} {'Prompt B':<15} {'Winner':<10}")
    print(f"{'-'*60}")
    print(f"{'Score':<20} {score_a:<15.1%} {score_b:<15.1%} {'A' if score_a > score_b else 'B'}")
print(f"{'Total tokens':<20} {tokens_a:<15} {tokens_b:<15} {'A' if tokens_a < tokens_b else 'B'}")
return {"prompt_a": score_a, "prompt_b": score_b}
# Example
ab_test_prompts(
test_cases=test_cases,
prompt_a="You are a geography assistant. Reply with the city name only.",
prompt_b="Reply in a single word: the name of the capital asked about.",
model="claude-sonnet-4-20250514",
eval_fn=lambda output, expected: expected.lower() in output.lower()
)
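On a small suite, a raw pass-rate gap between two prompts can easily be noise. A sign test on the cases where the prompts disagree gives a rough significance check, standard library only (the win counts below are illustrative):

```python
from math import comb

def sign_test_p(a_wins, b_wins):
    """Two-sided sign test p-value over the cases where A and B disagree
    (ties, where both prompts pass or both fail, are excluded)."""
    n = a_wins + b_wins
    k = max(a_wins, b_wins)
    # Probability of a split at least this lopsided under a fair coin
    p_one_sided = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * p_one_sided)

# Prompt B beat A on 8 cases, A beat B on 2
p = sign_test_p(a_wins=2, b_wins=8)
print(f"p-value: {p:.3f}")  # prints 0.109
```

Here p > 0.05, so an 8-2 split is not yet convincing evidence that B is better; this is one reason the "minimum 50-100 cases" advice below matters.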
Anthropic Console Evaluation Tool
The Anthropic console offers a built-in evaluation tool with a graphical interface.
Features
| Feature | Description |
|---|---|
| Eval sets | Reusable collections of test cases |
| Automatic scoring | Exact match, contains, model-graded |
| Side-by-side comparison | Visualize results of 2 prompts |
| History | Track score evolution over time |
| Export | Download results as CSV |
Console Workflow
- Create an Eval Set: Add your test cases (input + expected output)
- Configure Scoring: Choose your evaluation method
- Run the Eval: Select the model and prompt
- Analyze Results: View scores, identify failures
- Iterate: Modify the prompt and rerun
Benchmarking Between Models
Compare the performance of different Claude models on your specific use case.
models = [
"claude-haiku-3-5-20241022",
"claude-sonnet-4-20250514",
"claude-opus-4-20250918"
]
for model in models:
    print(f"\nEvaluating {model}")
run_eval_suite(test_cases, model, system_prompt, eval_fn)
| Selection criterion | Haiku | Sonnet | Opus |
|---|---|---|---|
| Speed-first (chatbot) | Best | Good | Slow |
| Quality-first (analysis) | Basic | Good | Best |
| Cost-first (volume) | Best | Good | Expensive |
| Reasoning (code, math) | Limited | Good | Best |
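The loop above reports pass rates; to fill in a table like this for your own workload, it helps to also record latency and token usage per model. A sketch where `call_model` is a hypothetical wrapper you write around your Claude call, returning `(output_text, total_tokens)`:

```python
import time

def benchmark_model(call_model, test_cases):
    """Aggregate wall-clock latency and token usage across a test suite.
    `call_model(input_text)` should return (output_text, total_tokens)."""
    latencies, total_tokens = [], 0
    for tc in test_cases:
        start = time.perf_counter()
        _, tokens = call_model(tc["input"])
        latencies.append(time.perf_counter() - start)
        total_tokens += tokens
    return {
        "mean_latency_s": sum(latencies) / len(latencies),
        "total_tokens": total_tokens,
    }
```

With pass rate, mean latency, and token cost per model, the speed/quality/cost trade-off becomes a data question instead of a guess.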
Common Eval Mistakes
| Mistake | Impact | Solution |
|---|---|---|
| Too few test cases | Non-significant results | Minimum 50-100 cases |
| No reproducibility | temperature=1 gives variable results | Set temperature=0 for evals |
| Eval on training set | Overestimation of performance | Use never-seen cases |
| Single criterion | Partial view of quality | Combine multiple metrics |
| No baseline | Impossible to measure progress | Always save current results |
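The last row is easy to automate: persist each run's pass rate and flag drops against the stored baseline. A minimal sketch (the `eval_baseline.json` filename and the 5-point threshold are assumptions to adjust):

```python
import json
from pathlib import Path

def check_regression(results, baseline_path="eval_baseline.json", threshold=0.05):
    """Flag a regression if the pass rate drops more than `threshold`
    below the stored baseline; store the baseline on the first run."""
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    path = Path(baseline_path)
    if not path.exists():
        path.write_text(json.dumps({"pass_rate": pass_rate}))
        return {"pass_rate": pass_rate, "baseline": None, "regression": False}
    baseline = json.loads(path.read_text())["pass_rate"]
    return {
        "pass_rate": pass_rate,
        "baseline": baseline,
        "regression": pass_rate < baseline - threshold,
    }
```

Wire this into CI on the output of `run_eval_suite` and a failing prompt change gets caught before it ships.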
FAQ
Why evaluate Claude's performance?
Evaluations allow you to objectively measure the quality of Claude's responses for your use case. They are essential for comparing prompts, validating model changes, and ensuring reliability in production.
What types of evaluations exist?
Three main types: automated evals (string matching, regex, code-based), human evals (annotators who rate responses), and model-graded evals (an LLM evaluating another LLM's responses).
How do I use the Anthropic console evaluation tool?
Access the Anthropic console, create an eval set with test cases, define scoring criteria, run the evaluation, and compare results between different prompts or models.
How many test cases are needed for a reliable evaluation?
A minimum of 50-100 test cases is recommended for statistically significant results. For critical production cases, aim for 200+ test cases covering edge cases.
Can Claude be used to evaluate its own responses?
Yes, that's the principle of model-graded eval. A Claude model evaluates the responses of another Claude call based on defined criteria. It's more scalable than human evals but potentially biased.