Claude Computer Use: Controlling a Computer with AI
By Dorian Laurenceau
Last reviewed: April 24, 2026. Updated with April 2026 findings and community feedback.
Pillar article: Claude API: Complete Guide
What is Computer Use?
Computer Use is a groundbreaking feature that allows Claude to interact directly with a computer via screenshots. Unlike classic Tool Use (which calls functions), Computer Use allows Claude to:
- See the computer screen via screenshots
- Understand graphical interfaces (buttons, menus, forms)
- Act on the computer (click, type, scroll)
- Verify the result of its actions via new screenshots
The Interaction Loop
1. Your code → Screenshot → Claude
2. Claude analyzes the screen → Decides on an action
3. Claude → Action (click, type...) → Your code
4. Your code executes the action → New screenshot → Back to step 1
This loop continues until the task is completed or Claude determines it cannot continue.
Computer Use is the feature that most clearly illustrates the gap between what models can do in a sandbox and what they should do in production. Anthropic's Computer Use announcement was honest about its own beta status, and practitioners on r/ClaudeAI and r/LocalLLaMA have been even more honest: the demo of Claude booking a hotel works; the production workflow of Claude navigating your actual SaaS stack does not, and the reasons are not purely technical.
What the community correctly pushes back on: the marketing framing of "digital worker" implies reliability that the underlying loop does not yet provide. Screen-based agents are constrained by brittle visual grounding (a minor layout change and the agent clicks the wrong coordinates), by long latencies (each screenshot-plus-decision cycle takes seconds), and by the absence of structured output (the agent has to re-parse the page from pixels on every turn). Research like WebArena quantifies this: state-of-the-art agents solve only a fraction of real web tasks reliably, and that number has been moving slowly.
The pragmatic framing: Computer Use shines for exploratory or one-off automations where brittleness is acceptable (QA walkthroughs, UI regression screenshots, scraping a site that refuses to expose an API). It fails for critical paths that already have an API, an MCP server, or a scripting surface; using Computer Use there is a form of technical cosplay. Start with the API, fall back to the browser, and treat "agent sees screen" as the last resort, not the first.
Quickstart: First Example
import anthropic

client = anthropic.Anthropic()

# Computer Use tool configuration
tools = [{
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
    "display_number": 0  # Main screen
}]

# First request with a screenshot.
# Computer Use is a beta feature: use the beta client and pass the matching beta flag.
response = client.beta.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    betas=["computer-use-2025-01-24"],
    tools=tools,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": take_screenshot()  # Your screenshot function
                }
            },
            {
                "type": "text",
                "text": "Open Chrome browser and go to google.com"
            }
        ]
    }]
)

# Process the actions returned by Claude
for block in response.content:
    if block.type == "tool_use" and block.name == "computer":
        action = block.input
        execute_action(action)  # Execute the action on the computer
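The quickstart executes Claude's first actions but never reports back, so the conversation stops there. To continue, answer each tool_use block with a tool_result containing a fresh screenshot, then call the API again with the extended history. A minimal sketch of that second turn, assuming the quickstart's message list is kept in a variable named messages (the complete loop further down does this on every step):

# Hand the outcome of each action back to Claude as a tool_result with a new screenshot
messages.append({"role": "assistant", "content": response.content})

tool_results = []
for block in response.content:
    if block.type == "tool_use" and block.name == "computer":
        execute_action(block.input)
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": [{
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": take_screenshot()
                }
            }]
        })

messages.append({"role": "user", "content": tool_results})
# Then call client.beta.messages.create(...) again with the extended messages list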
Available Actions
Claude can perform the following actions on the computer:
| Action | Description | Parameters |
|---|---|---|
| click | Mouse click | coordinate [x, y], button (left/right/middle) |
| double_click | Double click | coordinate [x, y] |
| type | Type text | text |
| key | Press a key | key (e.g., "Enter", "Tab", "ctrl+c") |
| scroll | Scroll | coordinate [x, y], direction (up/down), amount |
| move | Move mouse | coordinate [x, y] |
| screenshot | Request a new screenshot | (none) |
| drag | Drag and drop | start_x, start_y, end_x, end_y |
Action Examples
# Click on a button
{"action": "click", "coordinate": [960, 540]}
# Double click to open a file
{"action": "double_click", "coordinate": [200, 300]}
# Type text
{"action": "type", "text": "Hello World"}
# Keyboard shortcut
{"action": "key", "key": "ctrl+s"}
# Scroll down
{"action": "scroll", "coordinate": [960, 540], "direction": "down", "amount": 3}
# Request a new screenshot
{"action": "screenshot"}
Complete Implementation
Computer Use Agent Loop
import anthropic
import base64
import subprocess

client = anthropic.Anthropic()

def take_screenshot():
    """Capture the screen and return base64."""
    # Linux with scrot
    subprocess.run(["scrot", "/tmp/screenshot.png"])
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def execute_action(action):
    """Execute a Computer Use action on the computer."""
    action_type = action.get("action")
    if action_type == "click":
        x, y = action["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y), "click", "1"])
    elif action_type == "type":
        subprocess.run(["xdotool", "type", "--clearmodifiers", action["text"]])
    elif action_type == "key":
        subprocess.run(["xdotool", "key", action["key"]])
    elif action_type == "scroll":
        x, y = action["coordinate"]
        direction = action["direction"]
        button = "4" if direction == "up" else "5"
        subprocess.run(["xdotool", "mousemove", str(x), str(y)])
        for _ in range(action.get("amount", 3)):
            subprocess.run(["xdotool", "click", button])
    elif action_type == "screenshot":
        pass  # Will be captured on the next loop iteration

def computer_use_loop(task, max_steps=20):
    """Main Computer Use loop."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64", "media_type": "image/png",
                "data": take_screenshot()
            }},
            {"type": "text", "text": task}
        ]
    }]
    tools = [{
        "type": "computer_20250124",
        "name": "computer",
        "display_width_px": 1920,
        "display_height_px": 1080,
        "display_number": 0
    }]
    for step in range(max_steps):
        # Beta feature: use the beta client and the matching beta flag
        response = client.beta.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            betas=["computer-use-2025-01-24"],
            tools=tools,
            messages=messages
        )
        if response.stop_reason == "end_turn":
            # Task completed
            for block in response.content:
                if block.type == "text":
                    print(f"✅ Done: {block.text}")
            return
        # Execute actions
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                execute_action(block.input)
                # New screenshot after the action
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": take_screenshot()
                        }
                    }]
                })
        messages.append({"role": "user", "content": tool_results})
    print("⚠️ Maximum number of steps reached")

# Usage
computer_use_loop("Open Firefox, go to wikipedia.org and search for 'artificial intelligence'")
Use Cases
1. UI Testing
Automate UI tests without writing fragile selectors.
computer_use_loop("""
Test the registration form:
1. Fill the "Name" field with "John Smith"
2. Fill the "Email" field with "john@test.com"
3. Fill the "Password" field with "SecurePass123!"
4. Check the "I accept the terms" checkbox
5. Click "Sign Up"
6. Verify that a success message appears
""")
2. Data Entry in Legacy Systems
| Scenario | Traditional approach | With Computer Use |
|---|---|---|
| ERP without API | Custom connector development (weeks) | Script in a few hours |
| Legacy Windows app | RPA with fragile rules | Adaptive vision |
| Complex web forms | Selenium with selectors that break | Claude adapts to UI changes |
3. Workflow Automation
computer_use_loop("""
Monthly reporting workflow:
1. Open the accounting application
2. Export last month's sales report as CSV
3. Open Excel and import the CSV
4. Create a pivot table by region
5. Save the file to the Desktop
""")
Security: Critical Points
Identified Risks
| Risk | Level | Mitigation |
|---|---|---|
| Unintended actions | ⚠️ High | Sandbox environment, supervision |
| Injection via screen content | ⚠️ Medium | Don't navigate untrusted websites |
| Access to sensitive data | ⚠️ High | Limit the user account's permissions |
| Infinite action loop | ⚠️ Medium | Limit max_steps, global timeout |
Security Checklist
- Isolated environment: dedicated VM or Docker container
- Non-privileged account: the user must not be admin
- Limited network: restrict network access to the strict minimum
- Human supervision: always have a way to interrupt the session
- Limited duration: set a maximum timeout for each task
- Complete logs: record all screenshots and actions (see the sketch after this list)
- No credentials: never ask Claude to enter real passwords
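For the logging and timeout items, here is a minimal sketch of an audit wrapper around the take_screenshot() and execute_action() functions defined earlier; the log directory and the 10-minute budget are arbitrary choices:

import base64
import json
import time
from pathlib import Path

LOG_DIR = Path("/tmp/computer_use_logs")   # arbitrary location
LOG_DIR.mkdir(exist_ok=True)
DEADLINE = time.monotonic() + 600          # hard 10-minute budget for the whole session

def audited_execute_action(action, step):
    """Enforce the time budget, then log the action and the screen it acted on."""
    if time.monotonic() > DEADLINE:
        raise TimeoutError("Session exceeded its time budget, aborting")
    with open(LOG_DIR / "actions.jsonl", "a") as f:
        f.write(json.dumps({"step": step, "time": time.time(), "action": action}) + "\n")
    (LOG_DIR / f"step_{step:03d}.png").write_bytes(base64.b64decode(take_screenshot()))
    execute_action(action)

Inside the loop, call audited_execute_action(block.input, step) in place of execute_action(block.input).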
Docker Quickstart (Recommended)
FROM ubuntu:22.04

# Minimal graphical environment.
# Note: on Ubuntu 22.04 the "firefox" apt package is a snap transition stub; in a
# container you may need the Mozilla Team PPA or another browser instead.
RUN apt-get update && apt-get install -y \
    xvfb x11vnc fluxbox \
    firefox scrot xdotool \
    python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install the Anthropic SDK
RUN pip3 install anthropic

# Agent code (referenced by start.sh) and startup script
COPY computer_use_agent.py /app/computer_use_agent.py
COPY start.sh /start.sh
RUN chmod +x /start.sh

EXPOSE 5900
CMD ["/start.sh"]
#!/bin/bash
# start.sh
Xvfb :99 -screen 0 1920x1080x24 &
export DISPLAY=:99
fluxbox &
x11vnc -display :99 -forever -nopw &
python3 /app/computer_use_agent.py
Current Limitations
| Limitation | Impact | Workaround |
|---|---|---|
| Latency (~2-5s per action) | Long workflows are slow | Group instructions |
| Fixed resolution | Must match actual screen | Configure display_width_px correctly |
| No complex drag-and-drop | Some interactions impossible | Combine with keyboard shortcuts |
| Coordinate errors | Clicks sometimes imprecise | Request a verification screenshot |
| High cost | Each screenshot = ~2500 tokens | Limit the number of steps; downscale screenshots (see the sketch below) |
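On the cost point, the simplest lever is to shrink the screenshot before sending it. A minimal sketch using Pillow; the max_width of 1280 is an arbitrary starting point. Note that if you downscale, you must either set display_width_px/display_height_px to the downscaled size or scale Claude's coordinates back up before passing them to xdotool:

import base64
import io
import subprocess

from PIL import Image  # pip install pillow

def take_screenshot_downscaled(path="/tmp/screenshot.png", max_width=1280):
    """Capture with scrot as above, then downscale to cut image tokens."""
    subprocess.run(["scrot", path])
    img = Image.open(path)
    if img.width > max_width:
        ratio = max_width / img.width
        img = img.resize((max_width, int(img.height * ratio)))
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.standard_b64encode(buf.getvalue()).decode("utf-8")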
Comparison with Alternatives
| Aspect | Computer Use (Claude) | Selenium/Playwright | RPA (UiPath, etc.) |
|---|---|---|---|
| Setup | Minimal (just the API) | Drivers, selectors | Heavy installation |
| Fragility | Low (adaptive vision) | High (selectors break) | Medium |
| Speed | Slow (~2-5s/action) | Fast | Fast |
| Flexibility | Very high | Medium (web only) | High |
| Cost per execution | High (API tokens) | Near zero | Software license |
| Supported apps | All (via screen) | Web only | Desktop + Web |
Dorian Laurenceau
Full-Stack Developer & Learning Designer. I spent 4 years as a freelance full-stack developer and 4 years teaching React, JavaScript, HTML/CSS and WordPress to adult learners. Today I design learning paths in web development and AI, grounded in learning science. I founded learn-prompting.fr to make AI practical and accessible, and built the Bluff app to gamify political transparency.
FAQ
What is Claude Computer Use?
Computer Use is a feature that allows Claude to interact with a computer via screenshots. Claude sees the screen, decides what actions to perform (click, type, scroll), and receives new screenshots to continue.
How does Computer Use work technically?
It's a loop: your code takes a screenshot, sends it to Claude, Claude returns an action (click at x,y or type text), your code executes the action, takes a new screenshot, and the cycle repeats.
Is Computer Use safe to use?
Computer Use should be used with caution. Anthropic recommends running it in an isolated environment (VM, container), never using it with privileged admin accounts, and supervising sessions.
What are the main use cases for Computer Use?
Automated UI testing, data entry in legacy systems, product demos, workflow automation on applications without APIs, and complex data scraping.
How do I get Claude to control my PC?
Claude Computer Use works via the API. You need to set up a Docker container or VM with Anthropic's tools installed. Claude then takes screenshots, analyzes them, and sends commands (click, keyboard, scroll). A quickstart guide is available in Anthropic's official documentation.
Is Claude easy to use for controlling a computer?
Using Computer Use requires technical knowledge (API, Docker, Python). It's not a simple click in an interface; you need to configure a dedicated environment. That said, with Anthropic's Python SDK the basic code fits in about 20 lines, and many tutorials are available.