
LMs Can Zero-Shot on Robotics Tasks —
with CaP-Agent0

Today's off-the-shelf LMs have remarkable generalization, reasoning, and planning capabilities. The agentic harnesses in CaP-Agent0 unleash that potential in the physical world.

Click on each task to see the agent in action.

CaP-Bench: Evaluating LM Agents
on Embodied Intelligence

CaP-Bench provides the first comprehensive benchmark for evaluating how well large language model agents can write code to control robots. Integrated with hundreds of manipulation tasks across multiple robot learning benchmarks (LIBERO-PRO, Robosuite, BEHAVIOR), CaP-Bench tests both LLMs and VLMs on their ability to generate executable robot control policies from natural language instructions.

100+ Manipulation Tasks · 12+ Frontier Models · Sim-to-Real Transfer · Multi-Turn Evaluation · Code Generation · Open Source
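To make the "LMs write code to control robots" setup concrete, here is a minimal sketch of what such a harness might look like. All names below (StubEnv, move_to, grasp, release, run_policy) are illustrative assumptions, not CaP-Bench's actual API; in a real harness the policy string would come from the model rather than being hard-coded.

```python
class StubEnv:
    """Minimal stand-in for a simulator: tracks object poses and gripper state."""

    def __init__(self):
        self.poses = {"red_block": (0.3, 0.1, 0.0), "bin": (0.5, -0.2, 0.0)}
        self.holding = None
        self.log = []

    # --- primitives exposed to the generated code ---
    def get_pose(self, name):
        return self.poses[name]

    def move_to(self, xyz):
        self.log.append(("move_to", xyz))

    def grasp(self, name):
        self.holding = name
        self.log.append(("grasp", name))

    def release(self):
        placed = self.holding
        self.holding = None
        self.log.append(("release", placed))


# In a real harness this string would be generated zero-shot by the LM from a
# natural language instruction; it is hard-coded here to show the shape.
GENERATED_POLICY = """
env.move_to(env.get_pose("red_block"))
env.grasp("red_block")
env.move_to(env.get_pose("bin"))
env.release()
"""


def run_policy(env, code):
    # Execute the generated code with the environment as its only API surface.
    exec(code, {"env": env})
    return env.log


if __name__ == "__main__":
    env = StubEnv()
    run_policy(env, GENERATED_POLICY)
    print(env.log[-1])  # -> ('release', 'red_block')
```

The key design point this illustrates: the model never outputs joint torques, only calls against a small primitive API, which is what lets general-purpose LMs act in the physical world without robot-specific training.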

Simulation Results

Click on each task to see the agent in action.

Key Findings

1. Frontier models achieve non-zero success on today's robotics benchmarks

Large language and vision-language models can directly generate executable robot manipulation code with meaningful success rates across multiple tasks — without any task-specific training or fine-tuning. The best model achieves over 30% average success rate across all tasks.

2. VLM agents solve manipulation tasks zero-shot where specialized VLAs cannot

Vision-language model agents, using only natural language prompts and visual feedback, can solve manipulation tasks that state-of-the-art vision-language-action models (e.g., Pi 0.5) fail on entirely. This highlights the surprising generality of code-generation approaches over end-to-end learned policies.

Figure coming soon — VLM vs VLA comparison
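The multi-turn, visual-feedback loop described above can be sketched as follows. This is a hedged illustration under stated assumptions: generate_code is a hypothetical stand-in for the actual VLM call, and the environment is a plain dict rather than a simulator; only the retry-with-feedback structure is the point.

```python
def generate_code(instruction, observation, history):
    # Stub for a VLM query. It fails on the first turn and corrects itself on
    # the second, to exercise the feedback/retry path of the harness.
    if not history:
        return "env['gripper'] = 'closed'"  # forgot to move to the block first
    return "env['at_block'] = True; env['gripper'] = 'closed'"


def task_success(env):
    return env.get("at_block") and env.get("gripper") == "closed"


def run_episode(instruction, max_turns=3):
    env, history = {}, []
    for turn in range(max_turns):
        observation = dict(env)  # stand-in for a rendered camera image
        code = generate_code(instruction, observation, history)
        exec(code, {"env": env})
        if task_success(env):
            return turn + 1  # number of turns used
        history.append((code, "task not complete"))  # feedback for next turn
    return None  # episode failed within the turn budget


print(run_episode("pick up the red block"))  # -> 2
```

Unlike an end-to-end VLA, which gets one forward pass per timestep, this agent can observe the outcome of its own code and revise it, which is one plausible source of the generality gap noted above.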
3. Post-training makes VLM agents better at robotics tasks

Models that undergo additional post-training (e.g., RLHF, instruction tuning) show consistent improvements in their ability to generate correct robot manipulation code. This suggests that general reasoning improvements directly transfer to embodied task performance.

Figure coming soon — Post-training comparison
4. Smaller models catch up when only abstract reasoning is needed

While larger models dominate on complex manipulation tasks, the performance gap narrows significantly on tasks that primarily require abstract reasoning rather than low-level control. This indicates that model scale is most critical for fine-grained physical reasoning.
