Evaluating Claude Performance: The Evals Guide
By Learnia Team
Last updated: March 10, 2026. Covers automated, human, and model-graded evals, and the Anthropic console tool.
Pillar article: Claude API: Complete Guide
Why Evaluate Claude?
AI is probabilistic: the same request can produce different results. Evaluations allow you to:
- Measure quality of responses objectively
- Compare prompts to find the best formulation
- Validate changes to models without production surprises
- Detect regressions during API updates
- Justify decisions to stakeholders with data
The 3 Types of Evaluations
1. Automated Evaluations
Scores calculated by code: fast, reproducible, no human judgment needed.
| Method | Description | When to use |
|---|---|---|
| Exact match | The response is exactly the expected text | Factual QA, classification |
| Contains | The response contains a key word/phrase | Information extraction |
| Regex | The response matches a pattern | Structured formats (dates, emails) |
| JSON schema | The response follows a JSON schema | Structured outputs |
| Code assertion | A script checks conditions | Complex business logic |
```python
import json
import re

def eval_exact_match(response, expected):
    """Check for an exact match."""
    return response.strip().lower() == expected.strip().lower()

def eval_contains(response, keywords):
    """Check that the response contains the keywords."""
    response_lower = response.lower()
    return all(kw.lower() in response_lower for kw in keywords)

def eval_json_schema(response, required_fields):
    """Check that the response is valid JSON with required fields."""
    try:
        data = json.loads(response)
        return all(field in data for field in required_fields)
    except json.JSONDecodeError:
        return False

def eval_regex(response, pattern):
    """Check that the response matches a regex pattern."""
    return bool(re.search(pattern, response))

# Usage examples
assert eval_exact_match("Paris", "Paris")
assert eval_contains("Python is an interpreted language", ["python", "language"])
assert eval_json_schema('{"name": "Alice", "age": 30}', ["name", "age"])
assert eval_regex("The price is $29.99", r"\$\d+\.\d{2}")
```
2. Human Evaluations
Human annotators evaluate responses against defined criteria. This is the gold standard for quality.
| Criterion | Scale | Description |
|---|---|---|
| Accuracy | 1-5 | Are the facts correct? |
| Relevance | 1-5 | Is the response suited to the question? |
| Completeness | 1-5 | Are all aspects covered? |
| Clarity | 1-5 | Is the response well-structured and understandable? |
| Usefulness | 1-5 | Can the user act on this response? |
Recommended process:
- Create an annotation guide with examples for each score
- Train 2-3 annotators on the criteria
- Have each response evaluated by at least 2 annotators
- Calculate inter-annotator agreement (Cohen's Kappa)
- Resolve disagreements through discussion
3. Model-Graded Evaluations
An LLM evaluates another LLM's responses. Scalable and fast, and a reasonable proxy for human evaluation.
```python
import json

import anthropic

client = anthropic.Anthropic()

def model_graded_eval(question, response, criteria):
    """Use Claude to evaluate a response."""
    eval_prompt = f"""Evaluate this response based on the given criteria.

Question asked: {question}

Response to evaluate: {response}

Evaluation criteria:
{criteria}

Return a JSON object with:
- "score": rating from 1 to 5
- "reasoning": justification in 2-3 sentences
- "issues": list of identified problems (or empty list)
"""
    eval_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": eval_prompt}]
    )
    return json.loads(eval_response.content[0].text)

# Usage
result = model_graded_eval(
    question="What are the advantages of TypeScript over JavaScript?",
    response="TypeScript adds static typing, which reduces bugs...",
    criteria="""
- Technical accuracy (are the claims correct?)
- Completeness (are the main advantages covered?)
- Examples (are concrete examples provided?)
"""
)
print(f"Score: {result['score']}/5")
print(f"Reason: {result['reasoning']}")
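Parsing the grader's reply with `json.loads` alone can fail when the model wraps its JSON in a markdown code fence, which happens in practice. A tolerant parsing sketch (the fence-stripping regex is our own assumption, not part of the Anthropic API):

```python
import json
import re

def extract_json(text):
    """Parse JSON from a model reply, tolerating an optional markdown fence."""
    # Strip a surrounding ```json ... ``` (or bare ```) fence if present
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)

print(extract_json('```json\n{"score": 4, "reasoning": "Solid answer."}\n```'))
```

Asking for structured output explicitly in the prompt (as `model_graded_eval` does) reduces, but does not eliminate, the need for this kind of defensive parsing.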
Designing Test Cases
Test Case Structure
```python
test_case = {
    "id": "tc-001",
    "category": "extraction",
    "input": "Marie Dupont, 35 years old, developer at TechCorp in London.",
    "expected_output": {
        "name": "Marie Dupont",
        "age": 35,
        "job": "developer",
        "company": "TechCorp",
        "city": "London"
    },
    "eval_method": "json_schema",
    "tags": ["extraction", "structured_output"]
}
```
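For structured outputs like this, comparing field by field is more informative than an exact string match: it tolerates key ordering and whitespace while still catching a wrong value. A minimal sketch (`eval_structured` is a hypothetical helper name):

```python
import json

def eval_structured(response_text, expected):
    """Compare a JSON response to the expected fields, value by value."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return all(data.get(field) == value for field, value in expected.items())

expected = {"name": "Marie Dupont", "age": 35, "job": "developer",
            "company": "TechCorp", "city": "London"}
model_output = ('{"city": "London", "name": "Marie Dupont", "age": 35, '
                '"job": "developer", "company": "TechCorp"}')
print(eval_structured(model_output, expected))  # True despite different key order
```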
Test Case Categories
| Category | Examples | Recommended coverage |
|---|---|---|
| Happy path | Standard cases, well-formed inputs | 40% of cases |
| Edge cases | Boundary inputs, unusual formats | 25% of cases |
| Adversarial | Misleading, contradictory inputs | 15% of cases |
| Multilingual | Inputs in multiple languages | 10% of cases |
| Empty/invalid | Empty, null, corrupted inputs | 10% of cases |
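A quick way to keep a suite honest against these targets is to measure its category mix. A sketch assuming each test case carries a `category` key as in the structure above (the category identifiers here are our own mapping of the table's row names):

```python
from collections import Counter

# Recommended coverage targets from the table above (fraction of the suite)
TARGETS = {"happy_path": 0.40, "edge_case": 0.25, "adversarial": 0.15,
           "multilingual": 0.10, "invalid": 0.10}

def coverage_report(test_cases):
    """Compare a suite's actual category mix against the recommended targets."""
    counts = Counter(tc["category"] for tc in test_cases)
    total = len(test_cases)
    return {category: {"actual": round(counts.get(category, 0) / total, 2),
                       "target": target}
            for category, target in TARGETS.items()}

# A 20-case suite matching the recommended proportions
suite = (
    [{"category": "happy_path"}] * 8
    + [{"category": "edge_case"}] * 5
    + [{"category": "adversarial"}] * 3
    + [{"category": "multilingual"}] * 2
    + [{"category": "invalid"}] * 2
)
print(coverage_report(suite))
```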
Complete Evaluation Script
```python
import anthropic
import json
from datetime import datetime

client = anthropic.Anthropic()

def run_eval_suite(test_cases, model, system_prompt, eval_fn):
    """Run a complete evaluation suite."""
    results = []
    passed = 0
    for tc in test_cases:
        # Call Claude
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": tc["input"]}]
        )
        output = response.content[0].text

        # Evaluation
        score = eval_fn(output, tc["expected_output"])
        passed += 1 if score else 0
        results.append({
            "id": tc["id"],
            "input": tc["input"],
            "expected": tc["expected_output"],
            "actual": output,
            "passed": score,
            "tokens": response.usage.input_tokens + response.usage.output_tokens
        })

    # Report
    total = len(test_cases)
    print(f"\n{'=' * 50}")
    print(f"Results: {passed}/{total} ({passed / total * 100:.1f}%)")
    print(f"Model: {model}")
    print(f"Date: {datetime.now().isoformat()}")
    print(f"{'=' * 50}")

    # Failed cases
    failed = [r for r in results if not r["passed"]]
    if failed:
        print(f"\n{len(failed)} failed cases:")
        for r in failed[:5]:
            print(f"  - {r['id']}: expected '{r['expected']}', got '{r['actual'][:100]}...'")
    return results

# Execution
test_cases = [
    {"id": "tc-001", "input": "What is the capital of France?", "expected_output": "Paris"},
    {"id": "tc-002", "input": "What is the capital of Germany?", "expected_output": "Berlin"},
    {"id": "tc-003", "input": "What is the capital of Japan?", "expected_output": "Tokyo"},
]

results = run_eval_suite(
    test_cases=test_cases,
    model="claude-sonnet-4-20250514",
    system_prompt="Reply only with the city name, no sentence.",
    eval_fn=lambda output, expected: expected.lower() in output.lower()
)
```
A/B Testing Prompts
Systematically compare two prompt versions to identify the best one.
```python
def ab_test_prompts(test_cases, prompt_a, prompt_b, model, eval_fn):
    """Compare two system prompts on the same test cases."""
    results_a = run_eval_suite(test_cases, model, prompt_a, eval_fn)
    results_b = run_eval_suite(test_cases, model, prompt_b, eval_fn)

    score_a = sum(1 for r in results_a if r["passed"]) / len(results_a)
    score_b = sum(1 for r in results_b if r["passed"]) / len(results_b)
    tokens_a = sum(r["tokens"] for r in results_a)
    tokens_b = sum(r["tokens"] for r in results_b)

    print("\nA/B Test Results")
    print(f"{'Metric':<20} {'Prompt A':<15} {'Prompt B':<15} {'Winner':<10}")
    print("-" * 60)
    print(f"{'Score':<20} {score_a:<15.1%} {score_b:<15.1%} {'A' if score_a > score_b else 'B'}")
    print(f"{'Total tokens':<20} {tokens_a:<15} {tokens_b:<15} {'A' if tokens_a < tokens_b else 'B'}")
    return {"prompt_a": score_a, "prompt_b": score_b}

# Example
ab_test_prompts(
    test_cases=test_cases,
    prompt_a="You are a geography assistant. Reply with the city name only.",
    prompt_b="Reply in a single word: the name of the capital asked about.",
    model="claude-sonnet-4-20250514",
    eval_fn=lambda output, expected: expected.lower() in output.lower()
)
```
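Pass-rate gaps on small suites are often noise. A simple exact sign test on the per-case pass/fail pairs gives a rough significance check before declaring a winner (a sketch, not a substitute for proper statistics):

```python
from math import comb

def sign_test_p_value(passed_a, passed_b):
    """Exact two-sided sign test on paired pass/fail results for two prompts.

    Only discordant cases (one prompt passed, the other failed) carry
    information; under the null hypothesis each one is a fair coin flip.
    """
    wins_a = sum(1 for ra, rb in zip(passed_a, passed_b) if ra and not rb)
    wins_b = sum(1 for ra, rb in zip(passed_a, passed_b) if rb and not ra)
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = max(wins_a, wins_b)
    # One-tailed P(at least k wins in n fair flips), doubled for two sides
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Prompt A wins 9 discordant cases, prompt B wins 1, 5 cases tie
a = [True] * 9 + [False] * 1 + [True] * 5
b = [False] * 9 + [True] * 1 + [True] * 5
print(f"p-value: {sign_test_p_value(a, b):.3f}")  # 0.021: likely a real difference
```

With only a handful of discordant cases, the p-value stays large no matter who "wins", which is another argument for the 50-100 case minimum discussed below.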
Anthropic Console Evaluation Tool
The Anthropic console offers a built-in evaluation tool with a graphical interface.
Features
| Feature | Description |
|---|---|
| Eval sets | Reusable collections of test cases |
| Automatic scoring | Exact match, contains, model-graded |
| Side-by-side comparison | Visualize results of 2 prompts |
| History | Track score evolution over time |
| Export | Download results as CSV |
Console Workflow
- Create an Eval Set: Add your test cases (input + expected output)
- Configure Scoring: Choose your evaluation method
- Run the Eval: Select the model and prompt
- Analyze Results: View scores, identify failures
- Iterate: Modify the prompt and rerun
Benchmarking Between Models
Compare the performance of different Claude models on your specific use case.
```python
models = [
    "claude-haiku-3-5-20241022",
    "claude-sonnet-4-20250514",
    "claude-opus-4-20250918"
]

for model in models:
    print(f"\nEvaluating {model}")
    run_eval_suite(
        test_cases=test_cases,
        model=model,
        system_prompt="Reply only with the city name, no sentence.",
        eval_fn=lambda output, expected: expected.lower() in output.lower()
    )
```
| Selection criterion | Haiku | Sonnet | Opus |
|---|---|---|---|
| Speed-first (chatbot) | ✅ Best | 👍 Good | ⚠️ Slow |
| Quality-first (analysis) | ⚠️ Basic | 👍 Good | ✅ Best |
| Cost-first (volume) | ✅ Best | 👍 Good | ⚠️ Expensive |
| Reasoning (code, math) | ⚠️ Limited | 👍 Good | ✅ Best |
Common Eval Mistakes
| Mistake | Impact | Solution |
|---|---|---|
| Too few test cases | Non-significant results | Minimum 50-100 cases |
| No reproducibility | temperature=1 gives variable results | Set temperature=0 for evals |
| Eval on training set | Overestimation of performance | Use never-seen cases |
| Single criterion | Partial view of quality | Combine multiple metrics |
| No baseline | Impossible to measure progress | Always save current results |
Read Next
- Evaluations with Promptfoo: Practical Guide - implement a complete evaluation framework with Promptfoo
- AI Hallucination and Bias Detection - techniques for identifying and preventing model errors
FAQ
Why evaluate Claude's performance?
Evaluations allow you to objectively measure the quality of Claude's responses for your use case. They are essential for comparing prompts, validating model changes, and ensuring reliability in production.
What types of evaluations exist?
Three main types: automated evals (string matching, regex, code-based), human evals (annotators who rate responses), and model-graded evals (an LLM evaluating another LLM's responses).
How do I use the Anthropic console evaluation tool?
Access the Anthropic console, create an eval set with test cases, define scoring criteria, run the evaluation, and compare results between different prompts or models.
How many test cases are needed for a reliable evaluation?
A minimum of 50-100 test cases is recommended for statistically significant results. For critical production cases, aim for 200+ test cases covering edge cases.
Can Claude be used to evaluate its own responses?
Yes, that's the principle of model-graded eval. A Claude model evaluates the responses of another Claude call based on defined criteria. It's more scalable than human evals but potentially biased.