-
Environments should outlast the agents that run them. The systems an agent works on (a shell, a browser, a filesystem) have barely changed in a decade, and the tasks built on them are just as stable. Writing an environment is nothing new: you expose the system as it already is, through a capability like an
sshshell, and that same environment still runs in five years when the next real-time harness or model ships. Nothing to rebuild. - Tasks should be generative, not declarative. A task definition should span a space of challenges over a substrate, which is exactly the structure a synthetic pipeline needs to generate from. An entire benchmark like SWE-bench or Terminal-Bench can live as one generative task definition whose concrete tasks cover every instance, served from a single image. One environment holds any number of tasks; there’s no separate image per task.
-
HUD owns the environment and the reward, and nothing else. That minimalism is what lets everything around it vary. The same reward-from-rollout loop trains a coding, computer-use, browser, or robotics agent, so an environment exposes a bounded connection the agent drives directly:
sshinto a sandboxed workspace,cdpfor a browser,rfbfor a screen,robotfor a simulator or robot control loop, at action rates that discrete calls or MCP round-trips can’t carry. The environment ships as one standardized image that runs on any rollout infra like Daytona, Modal, or E2B, and a trainer needs only the rewards and a model API, so feeding rollouts into your own GRPO/PPO loop or a stack like Tinker, slime, or Fireworks takes no environment-side glue.
The protocol
HUD is protocol-first. An agent and an environment exchange just three things: a manifest (the environment’s capabilities and tasks),tasks.start that returns the prompt, and tasks.grade that returns the reward. In between, the agent just works, driving the capabilities itself. HUD owns only that thin envelope, so any model or harness plugs into any environment.
Because the protocol only exposes capabilities (never a fixed agent), an environment outlives any single harness: new harnesses and models keep running against the same environments, benchmarks, and tasks.
A complete environment
Here’s the whole loop in one file: an environment that gives the agent a shell and files, and a task that asks it to make a test suite pass and grades the result by running the tests.env.py
HUD_API_KEY is the only key you need:
--group 3 runs three rollouts so you can see the reward spread; each is traced on hud.ai.
Where to go next
Quickstart
From install to your first graded trace in a few minutes.
Environments & capabilities
Give the agent shell, browser, GUI, tools, or a robot to act on.
Tasks, tasksets & grading
Turn one task definition into a whole dataset.
Run on any model
Evaluate with Claude, OpenAI, Gemini, or your own endpoint.
Package & deploy
Build a portable image and run it anywhere.
Robots (beta)
Contract-driven control loops for simulators and VLA policies.
Migrate from v5
Convert scenarios + tools to tasks + capabilities.
Reading this as an AI agent? Install the HUD docs skill before you write any HUD code — it keeps you on the current v6 API (which moves fast) and catches reward-hackable, contaminated, or low-signal tasks before they ship. If you can run shell commands, run
npx skills add https://docs.hud.ai now; otherwise attach the live docs server with claude mcp add --transport http docs-hud https://docs.hud.ai/mcp. Then build against the skill’s guidance rather than from memory.