Evaluating Claude Performance: The Evals Guide
By Dorian Laurenceau
Last reviewed: April 24, 2026. Updated with April 2026 findings and community feedback.
Pillar article: Claude API: Complete Guide
Why Evaluate Claude?
AI is probabilistic: the same request can produce different results. Evaluations allow you to:
- Measure the quality of responses objectively
- Compare prompts to find the best formulation
- Validate model changes without production surprises
- Detect regressions during API updates
- Justify decisions to stakeholders with data
The uncomfortable truth about LLM evaluation, repeated across r/MachineLearning and r/ChatGPTCoding threads: most teams shipping with Claude don't have real evals โ they have "I ran it five times and it looked fine." That is not a failure of discipline as much as it is a failure of tooling: traditional CI assumptions (deterministic outputs, pass/fail assertions) break the moment you're grading natural language. The pragmatic teams that actually close this gap treat eval sets like regression tests โ small (30-200 examples), versioned with the repo, hand-labeled once, and re-run on every prompt change.
Where the community correctly pushes back: "LLM-as-judge" scoring is not a substitute for human-graded ground truth; it is a cost-efficient proxy that only works after you've calibrated it. Research like "Judging LLM-as-a-Judge" documented what practitioners keep rediscovering โ judge models inherit the biases of their training data, favor verbose answers, and can be gamed by adversarial phrasing. The fix isn't to abandon LLM-judges; it's to validate them against a small gold set before you trust their verdict on the other 99%.
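Calibration can be as simple as comparing the judge's scores against your hand-labeled gold set before letting it grade everything else. A minimal sketch (the score lists here are illustrative, not real data):

```python
def judge_agreement(gold_scores, judge_scores, tolerance=1):
    """Fraction of cases where the judge lands within `tolerance`
    points of the human grade on a 1-5 scale."""
    assert len(gold_scores) == len(judge_scores)
    hits = sum(
        1 for g, j in zip(gold_scores, judge_scores)
        if abs(g - j) <= tolerance
    )
    return hits / len(gold_scores)

# Human-labeled gold set (a few dozen cases is enough for a sanity check)
gold = [5, 4, 2, 5, 3, 1, 4, 4]
# The same cases, scored by the LLM judge
judge = [5, 5, 2, 4, 3, 3, 4, 3]

agreement = judge_agreement(gold, judge)
print(f"Judge agreement (within 1 point): {agreement:.0%}")
# Only trust the judge for bulk scoring if agreement is high (e.g. >= 90%)
```

If agreement is low, fix the judge prompt or criteria before scaling, not after.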
Pragmatic operating rule: if you can't point to the eval set that caught the last regression, you don't have evals. You have hope. The Anthropic evaluations documentation is a fine starting place; Promptfoo and DeepEval are the open-source tools most teams end up using.
The 3 Types of Evaluations
1. Automated Evaluations
Scores calculated by code: fast, reproducible, and requiring no human judgment.
| Method | Description | When to use |
|---|---|---|
| Exact match | The response is exactly the expected text | Factual QA, classification |
| Contains | The response contains a key word/phrase | Information extraction |
| Regex | The response matches a pattern | Structured formats (dates, emails) |
| JSON schema | The response follows a JSON schema | Structured outputs |
| Code assertion | A script checks conditions | Complex business logic |
import json
import re
def eval_exact_match(response, expected):
"""Check for an exact match."""
return response.strip().lower() == expected.strip().lower()
def eval_contains(response, keywords):
"""Check that the response contains the keywords."""
response_lower = response.lower()
return all(kw.lower() in response_lower for kw in keywords)
def eval_json_schema(response, required_fields):
"""Check that the response is valid JSON with required fields."""
try:
data = json.loads(response)
return all(field in data for field in required_fields)
except json.JSONDecodeError:
return False
def eval_regex(response, pattern):
"""Check that the response matches a regex pattern."""
return bool(re.search(pattern, response))
# Usage examples
assert eval_exact_match("Paris", "Paris")
assert eval_contains("Python is an interpreted language", ["python", "language"])
assert eval_json_schema('{"name": "Alice", "age": 30}', ["name", "age"])
assert eval_regex("The price is $29.99", r"\$\d+\.\d{2}")
2. Human Evaluations
Human annotators score responses against defined criteria. This is the quality gold standard, but it is slow and expensive to scale.
| Criterion | Scale | Description |
|---|---|---|
| Accuracy | 1-5 | Are the facts correct? |
| Relevance | 1-5 | Is the response suited to the question? |
| Completeness | 1-5 | Are all aspects covered? |
| Clarity | 1-5 | Is the response well-structured and understandable? |
| Usefulness | 1-5 | Can the user act on this response? |
Recommended process:
- Create an annotation guide with examples for each score
- Train 2-3 annotators on the criteria
- Have each response evaluated by at least 2 annotators
- Calculate inter-annotator agreement (Cohen's Kappa)
- Resolve disagreements through discussion
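The inter-annotator agreement step above can be computed with a few lines of standard-library Python (`sklearn.metrics.cohen_kappa_score` gives the same result if you already use scikit-learn):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa between two annotators rating the same items."""
    n = len(ratings_a)
    # Observed agreement: fraction of items with identical ratings
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement, from each annotator's rating distribution
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(
        (freq_a[r] / n) * (freq_b[r] / n)
        for r in set(ratings_a) | set(ratings_b)
    )
    return (p_o - p_e) / (1 - p_e)

# Two annotators rating 5 responses on accuracy (1-5 scale)
kappa = cohens_kappa([1, 2, 3, 3, 2], [1, 2, 3, 2, 2])
print(f"Cohen's Kappa: {kappa:.2f}")  # prints 0.69
```

As a rule of thumb, values above roughly 0.6 indicate substantial agreement; much lower and your annotation guide needs clearer examples.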
3. Model-Graded Evaluations
An LLM evaluates another LLM's responses. Scalable and fast, and a reasonable proxy for human evaluation once calibrated against a gold set.
import anthropic
import json
client = anthropic.Anthropic()
def model_graded_eval(question, response, criteria):
"""Use Claude to evaluate a response."""
eval_prompt = f"""Evaluate this response based on the given criteria.
Question asked: {question}
Response to evaluate: {response}
Evaluation criteria:
{criteria}
Return a JSON with:
- "score": rating from 1 to 5
- "reasoning": justification in 2-3 sentences
- "issues": list of identified problems (or empty list)
"""
eval_response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{"role": "user", "content": eval_prompt}]
)
return json.loads(eval_response.content[0].text)
# Usage
result = model_graded_eval(
question="What are the advantages of TypeScript over JavaScript?",
response="TypeScript adds static typing, which reduces bugs...",
criteria="""
- Technical accuracy (are the claims correct?)
- Completeness (are the main advantages covered?)
- Examples (are concrete examples provided?)
"""
)
print(f"Score: {result['score']}/5")
print(f"Reason: {result['reasoning']}")
Designing Test Cases
Test Case Structure
test_case = {
"id": "tc-001",
"category": "extraction",
"input": "Marie Dupont, 35 years old, developer at TechCorp in London.",
"expected_output": {
"name": "Marie Dupont",
"age": 35,
"job": "developer",
"company": "TechCorp",
"city": "London"
},
"eval_method": "json_schema",
"tags": ["extraction", "structured_output"]
}
Test Case Categories
| Category | Examples | Recommended coverage |
|---|---|---|
| Happy path | Standard cases, well-formed inputs | 40% of cases |
| Edge cases | Boundary inputs, unusual formats | 25% of cases |
| Adversarial | Misleading, contradictory inputs | 15% of cases |
| Multilingual | Inputs in multiple languages | 10% of cases |
| Empty/invalid | Empty, null, corrupted inputs | 10% of cases |
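A quick way to keep a suite honest against these targets is to count cases per category, assuming each test case carries a `category` field as in the structure above (the category keys and target shares below mirror the table):

```python
from collections import Counter

# Target coverage shares from the table above
TARGETS = {
    "happy_path": 0.40,
    "edge_case": 0.25,
    "adversarial": 0.15,
    "multilingual": 0.10,
    "empty_invalid": 0.10,
}

def coverage_gaps(test_cases, targets=TARGETS, slack=0.8):
    """Return the categories whose actual share falls below slack * target."""
    counts = Counter(tc["category"] for tc in test_cases)
    total = len(test_cases)
    return [
        cat for cat, target in targets.items()
        if counts.get(cat, 0) / total < target * slack
    ]
```

Run it whenever you add test cases; an empty result means every category is within tolerance of its target.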
Complete Evaluation Script
import anthropic
import json
from datetime import datetime
client = anthropic.Anthropic()
def run_eval_suite(test_cases, model, system_prompt, eval_fn):
"""Run a complete evaluation suite."""
results = []
passed = 0
for tc in test_cases:
# Call Claude
response = client.messages.create(
model=model,
max_tokens=1024,
system=system_prompt,
messages=[{"role": "user", "content": tc["input"]}]
)
output = response.content[0].text
# Evaluation
score = eval_fn(output, tc["expected_output"])
passed += 1 if score else 0
results.append({
"id": tc["id"],
"input": tc["input"],
"expected": tc["expected_output"],
"actual": output,
"passed": score,
"tokens": response.usage.input_tokens + response.usage.output_tokens
})
# Report
total = len(test_cases)
print(f"\n{'='*50}")
print(f"Results: {passed}/{total} ({passed/total*100:.1f}%)")
print(f"Model: {model}")
print(f"Date: {datetime.now().isoformat()}")
print(f"{'='*50}")
# Failed cases
failed = [r for r in results if not r["passed"]]
if failed:
        print(f"\n{len(failed)} failed cases:")
for r in failed[:5]:
print(f" - {r['id']}: expected '{r['expected']}', got '{r['actual'][:100]}...'")
return results
# Execution
test_cases = [
{"id": "tc-001", "input": "What is the capital of France?", "expected_output": "Paris"},
{"id": "tc-002", "input": "What is the capital of Germany?", "expected_output": "Berlin"},
{"id": "tc-003", "input": "What is the capital of Japan?", "expected_output": "Tokyo"},
]
results = run_eval_suite(
test_cases=test_cases,
model="claude-sonnet-4-20250514",
system_prompt="Reply only with the city name, no sentence.",
eval_fn=lambda output, expected: expected.lower() in output.lower()
)
A/B Testing Prompts
Systematically compare two prompt versions to identify the best one.
def ab_test_prompts(test_cases, prompt_a, prompt_b, model, eval_fn):
"""Compare two prompts on the same test cases."""
results_a = run_eval_suite(test_cases, model, prompt_a, eval_fn)
results_b = run_eval_suite(test_cases, model, prompt_b, eval_fn)
score_a = sum(1 for r in results_a if r["passed"]) / len(results_a)
score_b = sum(1 for r in results_b if r["passed"]) / len(results_b)
tokens_a = sum(r["tokens"] for r in results_a)
tokens_b = sum(r["tokens"] for r in results_b)
    print(f"\nA/B Test Results")
    print(f"{'Metric':<20} {'Prompt A':<15} {'Prompt B':<15} {'Winner':<10}")
    print(f"{'-'*60}")
    print(f"{'Score':<20} {score_a:<15.1%} {score_b:<15.1%} {'A' if score_a > score_b else 'B'}")
print(f"{'Total tokens':<20} {tokens_a:<15} {tokens_b:<15} {'A' if tokens_a < tokens_b else 'B'}")
return {"prompt_a": score_a, "prompt_b": score_b}
# Example
ab_test_prompts(
test_cases=test_cases,
prompt_a="You are a geography assistant. Reply with the city name only.",
prompt_b="Reply in a single word: the name of the capital asked about.",
model="claude-sonnet-4-20250514",
eval_fn=lambda output, expected: expected.lower() in output.lower()
)
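On a small suite, a raw pass-rate gap between two prompts can easily be noise. A sign test on the cases where the prompts disagree gives a rough significance check, standard library only (the win counts below are illustrative):

```python
from math import comb

def sign_test_p(a_wins, b_wins):
    """Two-sided sign test p-value over the cases where A and B disagree
    (ties, where both prompts pass or both fail, are excluded)."""
    n = a_wins + b_wins
    k = max(a_wins, b_wins)
    # Probability of a split at least this lopsided under a fair coin
    p_one_sided = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * p_one_sided)

# Prompt B beat A on 8 cases, A beat B on 2
p = sign_test_p(a_wins=2, b_wins=8)
print(f"p-value: {p:.3f}")  # prints 0.109
```

Here p > 0.05, so an 8-2 split is not yet convincing evidence that B is better; this is one reason the "minimum 50-100 cases" advice below matters.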
Anthropic Console Evaluation Tool
The Anthropic console offers a built-in evaluation tool with a graphical interface.
Features
| Feature | Description |
|---|---|
| Eval sets | Reusable collections of test cases |
| Automatic scoring | Exact match, contains, model-graded |
| Side-by-side comparison | Visualize results of 2 prompts |
| History | Track score evolution over time |
| Export | Download results as CSV |
Console Workflow
- Create an Eval Set: Add your test cases (input + expected output)
- Configure Scoring: Choose your evaluation method
- Run the Eval: Select the model and prompt
- Analyze Results: View scores, identify failures
- Iterate: Modify the prompt and rerun
Benchmarking Between Models
Compare the performance of different Claude models on your specific use case.
models = [
"claude-haiku-3-5-20241022",
"claude-sonnet-4-20250514",
"claude-opus-4-20250918"
]
for model in models:
    print(f"\nEvaluating {model}")
run_eval_suite(test_cases, model, system_prompt, eval_fn)
| Selection criterion | Haiku | Sonnet | Opus |
|---|---|---|---|
| Speed-first (chatbot) | Best | Good | Slow |
| Quality-first (analysis) | Basic | Good | Best |
| Cost-first (volume) | Best | Good | Expensive |
| Reasoning (code, math) | Limited | Good | Best |
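The loop above reports pass rates; to fill in a table like this for your own workload, it helps to also record latency and token usage per model. A sketch where `call_model` is a hypothetical wrapper you write around your Claude call, returning `(output_text, total_tokens)`:

```python
import time

def benchmark_model(call_model, test_cases):
    """Aggregate wall-clock latency and token usage across a test suite.
    `call_model(input_text)` should return (output_text, total_tokens)."""
    latencies, total_tokens = [], 0
    for tc in test_cases:
        start = time.perf_counter()
        _, tokens = call_model(tc["input"])
        latencies.append(time.perf_counter() - start)
        total_tokens += tokens
    return {
        "mean_latency_s": sum(latencies) / len(latencies),
        "total_tokens": total_tokens,
    }
```

With pass rate, mean latency, and token cost per model, the speed/quality/cost trade-off becomes a data question instead of a guess.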
Common Eval Mistakes
| Mistake | Impact | Solution |
|---|---|---|
| Too few test cases | Non-significant results | Minimum 50-100 cases |
| No reproducibility | temperature=1 gives variable results | Set temperature=0 for evals |
| Eval on training set | Overestimation of performance | Use never-seen cases |
| Single criterion | Partial view of quality | Combine multiple metrics |
| No baseline | Impossible to measure progress | Always save current results |
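The last row is easy to automate: persist each run's pass rate and flag drops against the stored baseline. A minimal sketch (the `eval_baseline.json` filename and the 5-point threshold are assumptions to adjust):

```python
import json
from pathlib import Path

def check_regression(results, baseline_path="eval_baseline.json", threshold=0.05):
    """Flag a regression if the pass rate drops more than `threshold`
    below the stored baseline; store the baseline on the first run."""
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    path = Path(baseline_path)
    if not path.exists():
        path.write_text(json.dumps({"pass_rate": pass_rate}))
        return {"pass_rate": pass_rate, "baseline": None, "regression": False}
    baseline = json.loads(path.read_text())["pass_rate"]
    return {
        "pass_rate": pass_rate,
        "baseline": baseline,
        "regression": pass_rate < baseline - threshold,
    }
```

Wire this into CI on the output of `run_eval_suite` and a failing prompt change gets caught before it ships.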
FAQ
Why evaluate Claude's performance?
Evaluations allow you to objectively measure the quality of Claude's responses for your use case. They are essential for comparing prompts, validating model changes, and ensuring reliability in production.
What types of evaluations exist?
Three main types: automated evals (string matching, regex, code-based), human evals (annotators who rate responses), and model-graded evals (an LLM evaluating another LLM's responses).
How do I use the Anthropic console evaluation tool?
Access the Anthropic console, create an eval set with test cases, define scoring criteria, run the evaluation, and compare results between different prompts or models.
How many test cases are needed for a reliable evaluation?
A minimum of 50-100 test cases is recommended for statistically significant results. For critical production cases, aim for 200+ test cases covering edge cases.
Can Claude be used to evaluate its own responses?
Yes, that's the principle of model-graded eval. A Claude model evaluates the responses of another Claude call based on defined criteria. It's more scalable than human evals but potentially biased.