
Evaluating Claude Performance: The Evals Guide

By Learnia Team


📅 Last updated: March 10, 2026. Covers automated, human, and model-graded evals, plus the Anthropic console tool.

🔗 Pillar article: Claude API: Complete Guide


Why Evaluate Claude?

AI is probabilistic: the same request can produce different results. Evaluations allow you to:

  1. Measure response quality objectively
  2. Compare prompts to find the best formulation
  3. Validate model changes without production surprises
  4. Detect regressions during API updates
  5. Justify decisions to stakeholders with data

The 3 Types of Evaluations

1. Automated Evaluations

Scores calculated by code: fast, reproducible, no human judgment needed.

| Method | Description | When to use |
|---|---|---|
| Exact match | The response is exactly the expected text | Factual QA, classification |
| Contains | The response contains a key word/phrase | Information extraction |
| Regex | The response matches a pattern | Structured formats (dates, emails) |
| JSON schema | The response follows a JSON schema | Structured outputs |
| Code assertion | A script checks conditions | Complex business logic |

import json
import re

def eval_exact_match(response, expected):
    """Check for an exact match."""
    return response.strip().lower() == expected.strip().lower()

def eval_contains(response, keywords):
    """Check that the response contains the keywords."""
    response_lower = response.lower()
    return all(kw.lower() in response_lower for kw in keywords)

def eval_json_schema(response, required_fields):
    """Check that the response is valid JSON with required fields."""
    try:
        data = json.loads(response)
        return all(field in data for field in required_fields)
    except json.JSONDecodeError:
        return False

def eval_regex(response, pattern):
    """Check that the response matches a regex pattern."""
    return bool(re.search(pattern, response))

# Usage examples
assert eval_exact_match("Paris", "Paris")
assert eval_contains("Python is an interpreted language", ["python", "language"])
assert eval_json_schema('{"name": "Alice", "age": 30}', ["name", "age"])
assert eval_regex("The price is $29.99", r"\$\d+\.\d{2}")

2. Human Evaluations

Human annotators score responses against defined criteria. This is the gold standard for quality, though slower and more costly than automated checks.

| Criterion | Scale | Description |
|---|---|---|
| Accuracy | 1-5 | Are the facts correct? |
| Relevance | 1-5 | Is the response suited to the question? |
| Completeness | 1-5 | Are all aspects covered? |
| Clarity | 1-5 | Is the response well-structured and understandable? |
| Usefulness | 1-5 | Can the user act on this response? |

Recommended process:

  1. Create an annotation guide with examples for each score
  2. Train 2-3 annotators on the criteria
  3. Have each response evaluated by at least 2 annotators
  4. Calculate inter-annotator agreement (Cohen's Kappa)
  5. Resolve disagreements through discussion
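Step 4, inter-annotator agreement, can be computed directly without a statistics library. A minimal sketch of Cohen's kappa for two annotators rating the same items:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa between two annotators' ratings of the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement).
    1.0 = perfect agreement, 0.0 = no better than chance.
    """
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: fraction of items where both annotators agree
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement, from each annotator's marginal label distribution
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in counts_a.keys() | counts_b.keys()
    )
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)
```

As a rule of thumb, a kappa above ~0.6 indicates usable agreement; below that, revisit the annotation guide before trusting the scores.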

3. Model-Graded Evaluations

An LLM evaluates the responses of another LLM. Scalable and fast, and a good proxy for human evaluation.

import anthropic
import json

client = anthropic.Anthropic()

def model_graded_eval(question, response, criteria):
    """Use Claude to evaluate a response."""
    eval_prompt = f"""Evaluate this response based on the given criteria.

Question asked: {question}
Response to evaluate: {response}

Evaluation criteria:
{criteria}

Return a JSON with:
- "score": rating from 1 to 5
- "reasoning": justification in 2-3 sentences
- "issues": list of identified problems (or empty list)
"""
    
    eval_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": eval_prompt}]
    )
    
    return json.loads(eval_response.content[0].text)

# Usage
result = model_graded_eval(
    question="What are the advantages of TypeScript over JavaScript?",
    response="TypeScript adds static typing, which reduces bugs...",
    criteria="""
    - Technical accuracy (are the claims correct?)
    - Completeness (are the main advantages covered?)
    - Examples (are concrete examples provided?)
    """
)
print(f"Score: {result['score']}/5")
print(f"Reason: {result['reasoning']}")

Designing Test Cases

Test Case Structure

test_case = {
    "id": "tc-001",
    "category": "extraction",
    "input": "Marie Dupont, 35 years old, developer at TechCorp in London.",
    "expected_output": {
        "name": "Marie Dupont",
        "age": 35,
        "job": "developer",
        "company": "TechCorp",
        "city": "London"
    },
    "eval_method": "json_schema",
    "tags": ["extraction", "structured_output"]
}

Test Case Categories

| Category | Examples | Recommended coverage |
|---|---|---|
| Happy path | Standard cases, well-formed inputs | 40% of cases |
| Edge cases | Boundary inputs, unusual formats | 25% of cases |
| Adversarial | Misleading, contradictory inputs | 15% of cases |
| Multilingual | Inputs in multiple languages | 10% of cases |
| Empty/invalid | Empty, null, corrupted inputs | 10% of cases |
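The coverage targets above can be enforced automatically before a suite runs. A minimal sketch, assuming each test case dict carries a `category` key; the category names and 5% tolerance here are illustrative:

```python
from collections import Counter

# Recommended shares per category (from the coverage table)
TARGET_COVERAGE = {
    "happy_path": 0.40,
    "edge_case": 0.25,
    "adversarial": 0.15,
    "multilingual": 0.10,
    "empty_invalid": 0.10,
}

def coverage_report(test_cases, tolerance=0.05):
    """Compare actual category shares to the recommended targets.

    Returns {category: (actual_share, target_share, within_tolerance)}.
    """
    counts = Counter(tc["category"] for tc in test_cases)
    total = len(test_cases)
    return {
        category: (counts.get(category, 0) / total, target,
                   abs(counts.get(category, 0) / total - target) <= tolerance)
        for category, target in TARGET_COVERAGE.items()
    }
```

Running this in CI keeps a growing suite from silently drifting toward happy-path-only cases.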

Complete Evaluation Script

import anthropic
import json
from datetime import datetime

client = anthropic.Anthropic()

def run_eval_suite(test_cases, model, system_prompt, eval_fn):
    """Run a complete evaluation suite."""
    results = []
    passed = 0
    
    for tc in test_cases:
        # Call Claude
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": tc["input"]}]
        )
        
        output = response.content[0].text
        
        # Evaluation
        score = eval_fn(output, tc["expected_output"])
        passed += 1 if score else 0
        
        results.append({
            "id": tc["id"],
            "input": tc["input"],
            "expected": tc["expected_output"],
            "actual": output,
            "passed": score,
            "tokens": response.usage.input_tokens + response.usage.output_tokens
        })
    
    # Report
    total = len(test_cases)
    print(f"\n{'='*50}")
    print(f"Results: {passed}/{total} ({passed/total*100:.1f}%)")
    print(f"Model: {model}")
    print(f"Date: {datetime.now().isoformat()}")
    print(f"{'='*50}")
    
    # Failed cases
    failed = [r for r in results if not r["passed"]]
    if failed:
        print(f"\nโŒ {len(failed)} failed cases:")
        for r in failed[:5]:
            print(f"  - {r['id']}: expected '{r['expected']}', got '{r['actual'][:100]}...'")
    
    return results

# Execution
test_cases = [
    {"id": "tc-001", "input": "What is the capital of France?", "expected_output": "Paris"},
    {"id": "tc-002", "input": "What is the capital of Germany?", "expected_output": "Berlin"},
    {"id": "tc-003", "input": "What is the capital of Japan?", "expected_output": "Tokyo"},
]

results = run_eval_suite(
    test_cases=test_cases,
    model="claude-sonnet-4-20250514",
    system_prompt="Reply only with the city name, no sentence.",
    eval_fn=lambda output, expected: expected.lower() in output.lower()
)

A/B Testing Prompts

Systematically compare two prompt versions to identify the best one.

def ab_test_prompts(test_cases, prompt_a, prompt_b, model, eval_fn):
    """Compare two prompts on the same test cases."""
    results_a = run_eval_suite(test_cases, model, prompt_a, eval_fn)
    results_b = run_eval_suite(test_cases, model, prompt_b, eval_fn)
    
    score_a = sum(1 for r in results_a if r["passed"]) / len(results_a)
    score_b = sum(1 for r in results_b if r["passed"]) / len(results_b)
    
    tokens_a = sum(r["tokens"] for r in results_a)
    tokens_b = sum(r["tokens"] for r in results_b)
    
    print(f"\n📊 A/B Test Results")
    print(f"{'Metric':<20} {'Prompt A':<15} {'Prompt B':<15} {'Winner':<10}")
    print(f"{'─'*60}")
    print(f"{'Score':<20} {score_a:.1%}{'':>9} {score_b:.1%}{'':>9} {'A' if score_a > score_b else 'B'}")
    print(f"{'Total tokens':<20} {tokens_a:<15} {tokens_b:<15} {'A' if tokens_a < tokens_b else 'B'}")
    
    return {"prompt_a": score_a, "prompt_b": score_b}

# Example
ab_test_prompts(
    test_cases=test_cases,
    prompt_a="You are a geography assistant. Reply with the city name only.",
    prompt_b="Reply in a single word: the name of the capital asked about.",
    model="claude-sonnet-4-20250514",
    eval_fn=lambda output, expected: expected.lower() in output.lower()
)
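With a small suite, a raw pass-rate gap between prompts can be noise. As a rough screen, a two-proportion z-test under a normal approximation can flag whether the gap is likely real; a paired test such as McNemar's is stricter when both prompts run on identical cases. A minimal sketch:

```python
import math

def ab_significance(passed_a, passed_b, n):
    """Two-proportion z-test for pass counts over the same n test cases.

    Returns (z, significant_at_95) using a pooled standard error.
    Treat this as a rough screen, not a substitute for a paired test.
    """
    p_a, p_b = passed_a / n, passed_b / n
    pooled = (passed_a + passed_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 0.0, False
    z = (p_a - p_b) / se
    return z, abs(z) > 1.96  # |z| > 1.96 ≈ p < 0.05 (two-sided)
```

On 100 cases, 90% vs 60% is clearly significant, while 52% vs 50% is indistinguishable from noise, which is exactly why the "minimum 50-100 cases" rule below matters.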

Anthropic Console Evaluation Tool

The Anthropic console offers a built-in evaluation tool with a graphical interface.

Features

| Feature | Description |
|---|---|
| Eval sets | Reusable collections of test cases |
| Automatic scoring | Exact match, contains, model-graded |
| Side-by-side comparison | Visualize results of 2 prompts |
| History | Track score evolution over time |
| Export | Download results as CSV |

Console Workflow

  1. Create an Eval Set: Add your test cases (input + expected output)
  2. Configure Scoring: Choose your evaluation method
  3. Run the Eval: Select the model and prompt
  4. Analyze Results: View scores, identify failures
  5. Iterate: Modify the prompt and rerun

Benchmarking Between Models

Compare the performance of different Claude models on your specific use case.

models = [
    "claude-3-5-haiku-20241022",
    "claude-sonnet-4-20250514",
    "claude-opus-4-20250514"
]

for model in models:
    print(f"\n📊 Evaluating {model}")
    run_eval_suite(test_cases, model, system_prompt, eval_fn)

| Selection criterion | Haiku | Sonnet | Opus |
|---|---|---|---|
| Speed-first (chatbot) | ✅ Best | 👍 Good | ⚠️ Slow |
| Quality-first (analysis) | ⚠️ Basic | 👍 Good | ✅ Best |
| Cost-first (volume) | ✅ Best | 👍 Good | ⚠️ Expensive |
| Reasoning (code, math) | ⚠️ Limited | 👍 Good | ✅ Best |

Common Eval Mistakes

| Mistake | Impact | Solution |
|---|---|---|
| Too few test cases | Non-significant results | Minimum 50-100 cases |
| No reproducibility | temperature=1 gives variable results | Set temperature=0 for evals |
| Eval on training set | Overestimates performance | Use never-seen cases |
| Single criterion | Partial view of quality | Combine multiple metrics |
| No baseline | Impossible to measure progress | Always save current results |
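The "no baseline" mistake is easy to fix by persisting each run's per-case results. A minimal sketch that works with the `run_eval_suite` results above; the file name and JSON layout are illustrative choices, not part of any API:

```python
import json
from pathlib import Path

def save_baseline(results, path="eval_baseline.json"):
    """Persist per-case pass/fail so future runs can be compared."""
    baseline = {r["id"]: bool(r["passed"]) for r in results}
    Path(path).write_text(json.dumps(baseline, indent=2))

def check_regressions(results, path="eval_baseline.json"):
    """Return ids of cases that passed in the baseline but fail now."""
    baseline = json.loads(Path(path).read_text())
    return [
        r["id"] for r in results
        if baseline.get(r["id"]) is True and not r["passed"]
    ]
```

Run `check_regressions` after every prompt or model change: a non-empty list is a regression alert even if the overall pass rate went up.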


FAQ

Why evaluate Claude's performance?

Evaluations allow you to objectively measure the quality of Claude's responses for your use case. They are essential for comparing prompts, validating model changes, and ensuring reliability in production.

What types of evaluations exist?

Three main types: automated evals (string matching, regex, code-based), human evals (annotators who rate responses), and model-graded evals (an LLM evaluating another LLM's responses).

How do I use the Anthropic console evaluation tool?

Access the Anthropic console, create an eval set with test cases, define scoring criteria, run the evaluation, and compare results between different prompts or models.

How many test cases are needed for a reliable evaluation?

A minimum of 50-100 test cases is recommended for statistically significant results. For critical production cases, aim for 200+ test cases covering edge cases.

Can Claude be used to evaluate its own responses?

Yes, that's the principle of model-graded eval. A Claude model evaluates the responses of another Claude call based on defined criteria. It's more scalable than human evals but potentially biased.