How to Automate GUIs with Computer Use#
Use gptme to control desktop applications, automate web forms, and capture screenshots with the computer and browser tools.
Prerequisites#
Install the required system tools before using computer control:
# Check what's missing and see fix suggestions
gptme-doctor
# Linux/X11
sudo apt install xdotool scrot
# or: sudo pacman -S xdotool scrot
# macOS
brew install cliclick
# Then grant your terminal Screen Recording + Accessibility permissions in System Settings (macOS Ventura+) or System Preferences (older macOS)
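For scripted setups, the same prerequisites can be checked before a session starts. The sketch below is an illustration only and does not replace gptme-doctor: the function names (missing_tools, required_tools, check_computer_use_prereqs) are hypothetical, and the tool lists are simply the packages named above.

```shell
# Print the names of any commands that are not on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in this guide, keyed by the output of `uname -s`.
# (Hypothetical helper; the lists mirror the install commands above.)
required_tools() {
    case "$1" in
        Linux)  echo "xdotool scrot" ;;
        Darwin) echo "cliclick" ;;
        *)      echo "" ;;
    esac
}

# Report what still needs installing on this machine; returns 1 if anything is missing.
check_computer_use_prereqs() {
    # Word splitting of the tool list is intended here.
    missing=$(missing_tools $(required_tools "$(uname -s)"))
    if [ -n "$missing" ]; then
        echo "Missing: $missing" >&2
        return 1
    fi
}
```

Calling check_computer_use_prereqs at the top of a setup script lets it stop early with a readable message instead of failing midway through a session.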
For web automation (structured ARIA snapshots), install Playwright:
# Install gptme with browser support (provides the 'playwright' CLI)
pip install "gptme[browser]"
# Install Playwright system browsers (required for snapshot_url, open_page, etc.)
playwright install chromium
For headless Linux environments, start an Xvfb display first:
Xvfb :1 -screen 0 1024x768x24 &
export DISPLAY=:1
fluxbox & # or any window manager
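If display :1 may already be taken (for example on a shared CI host), a free display number can be picked first. This is a minimal sketch under assumptions: free_display is a hypothetical helper, it only looks for X sockets in /tmp/.X11-unix (the conventional location) and ignores lock files.

```shell
# Print the first display (":N", starting at :1) that has no X socket
# in the given directory. The directory defaults to /tmp/.X11-unix.
free_display() {
    socket_dir=${1:-/tmp/.X11-unix}
    n=1
    while [ -e "$socket_dir/X$n" ]; do
        n=$((n + 1))
    done
    printf ':%s\n' "$n"
}

# Usage (not executed here):
#   DISPLAY=$(free_display)
#   Xvfb "$DISPLAY" -screen 0 1024x768x24 &
#   export DISPLAY
```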
Start a computer-use session#
The computer-use profile sets the right tool access and backend selection policy:
gptme --agent-profile computer-use 'take a screenshot and describe what you see'
Or enable the computer tool for a one-off task:
gptme --tools +computer 'open Firefox, navigate to github.com, and take a screenshot'
Web automation (structured-first)#
For web targets, prefer structured ARIA snapshots over raw screenshots. When the page has a DOM you can address, they’re faster, cheaper, and more reliable:
gptme --agent-profile computer-use 'go to https://news.ycombinator.com, find the top story, and summarize it'
gptme will automatically:
- Use snapshot_url() to read the page’s ARIA/accessibility tree
- Use open_page() + click_element() / fill_element() when it needs to interact
- Fall back to screenshots only for canvas, layout verification, or image-heavy content
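The structured-first policy above can be pictured as a small lookup. This is only a sketch of the decision order, not gptme's actual selection code: pick_backend and its labels (web, native, read, interact, visual) are hypothetical, while the tool names come from the list above.

```shell
# pick_backend TARGET NEED
#   TARGET: web | native            (hypothetical labels)
#   NEED:   read | interact | visual
# Prints the tool this guide suggests for that combination.
pick_backend() {
    case "$1:$2" in
        web:read)     echo "snapshot_url" ;;
        web:interact) echo "open_page + click_element/fill_element" ;;
        web:visual)   echo "screenshot" ;;
        native:*)     echo "computer" ;;
        *)            echo "unknown target or need: $1 $2" >&2; return 1 ;;
    esac
}
```

For example, `pick_backend web read` prints snapshot_url, and any native target falls through to the computer tool.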
Fill a form without screenshots:
gptme --agent-profile computer-use \
'go to the login form at http://localhost:3000/login, fill username "alice" and password "hunter2", click submit'
Desktop / native app control#
For native apps or anything not reachable via a URL, the computer tool takes over:
gptme --tools +computer 'open the calculator app, compute 137 * 42, and tell me the result'
The observe-act-verify loop:
gptme --tools +computer \
'take a screenshot to see the current state' \
- 'click the "New File" button at the top-left' \
- 'type "hello world" then press Control+S' \
- 'take a screenshot and confirm the file was saved'
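When the steps come from a script rather than being typed by hand, the chained command can be assembled programmatically. The sketch below assumes the `-` separator between prompts shown above; run_steps and the GPTME_BIN override are hypothetical names introduced for illustration (the override makes a dry run possible).

```shell
# run_steps STEP...
# Joins the steps with "-" and passes them to gptme with the computer tool:
#   gptme --tools +computer 'step 1' - 'step 2' - ...
# Set GPTME_BIN=echo for a dry run that only prints the arguments.
run_steps() (
    n=$#
    i=0
    # Rotate through the positional parameters, re-appending each step
    # with a "-" separator in front of every step except the first.
    while [ "$i" -lt "$n" ]; do
        step=$1
        shift
        if [ "$i" -gt 0 ]; then
            set -- "$@" -
        fi
        set -- "$@" "$step"
        i=$((i + 1))
    done
    "${GPTME_BIN:-gptme}" --tools +computer "$@"
)
```

A dry run such as `GPTME_BIN=echo run_steps 'take a screenshot' 'click Save'` shows the exact argument list before anything touches the desktop.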
Screenshot and visual verification#
Take a screenshot and analyse what’s on screen:
gptme --tools +computer,+vision 'screenshot the screen and describe any UI errors you see'
Verify a web page renders correctly:
gptme --agent-profile computer-use \
'open http://localhost:5173, take a screenshot, and check that the login button is visible'
Efficient UI loops#
Use wait_for_change after triggering actions so the agent waits for the UI to settle
instead of polling with repeated screenshots:
gptme --tools +computer \
'click the submit button, then wait for the screen to change and describe the result'
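The idea of blocking until something changes, rather than re-inspecting the screen after a fixed delay, can be illustrated with a stand-alone sketch. This is not how gptme implements wait_for_change: wait_until_changed is a hypothetical helper, and a file stands in for the screen contents.

```shell
# wait_until_changed FILE [TIMEOUT_SECONDS]
# Returns 0 as soon as FILE's checksum differs from its starting value,
# or 1 if TIMEOUT_SECONDS (default 10) pass without a change.
wait_until_changed() (
    file=$1
    timeout=${2:-10}
    before=$(cksum < "$file")
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        sleep 1
        elapsed=$((elapsed + 1))
        if [ "$(cksum < "$file")" != "$before" ]; then
            exit 0
        fi
    done
    exit 1
)
```

The comparison is cheap and local, so the expensive step (describing the new state) happens once, after the change, instead of on every poll.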
Use window_focus when opening new windows so input goes to the right app:
gptme --tools +computer \
'open a new terminal window, wait for it to appear, then run "ls -la"'
Run inside Docker (isolated headless desktop)#
For a fully isolated environment with VNC access:
make build-docker-computer # build once
make run-docker-computer # start container (noVNC on :6080, gptme server on :8080)
Then connect a browser to http://localhost:6080 to watch the agent work.
Backend selection cheat sheet#
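| Situation | Tool to use |
|---|---|
| Read a web page | snapshot_url |
| Fill a form or click a link | open_page + click_element / fill_element |
| Visual layout check / canvas | screenshot |
| Wait for UI to settle | wait_for_change |
| Click a native app | computer tool |
| Type text in native app | computer tool |
| Focus a window by name | window_focus |
| Scroll in native UI | computer tool |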
Tips#
- Use the computer-use profile: it sets the backend selection policy so the agent picks the right tool automatically without extra prompting.
- Prefer snapshot_url for web: structured ARIA trees are faster and use no vision tokens.
- Combine with --non-interactive: add -n for scripted or CI use where you don’t want prompts (but ensure the task is well-scoped first).
- Describe visual outcomes: “confirm the dialog closed” works better than “click OK and move on”.