Today's off-the-shelf LMs have strong generalization, reasoning, and planning capabilities. The agentic harnesses in CaP-Agent0 unleash that potential in the physical world.
Click on each task to see the agent in action.
CaP-Bench provides the first comprehensive benchmark for evaluating how well large language model agents can write code to control robots. It integrates hundreds of manipulation tasks from multiple robot learning benchmarks (LIBERO-PRO, Robosuite, BEHAVIOR) and tests both LLMs and VLMs on their ability to generate executable robot control policies from natural language instructions.
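To make the setup concrete, here is a minimal sketch of what such a code-as-policy evaluation loop could look like. Everything in it is illustrative: `query_model` is a hypothetical stub standing in for a real LLM call, and `ToyEnv` is a toy stand-in for a simulator, not CaP-Bench's actual API.

```python
def query_model(instruction: str) -> str:
    """Stand-in for an LLM call; returns policy code as a string."""
    return (
        "def policy(env):\n"
        "    env.move_to(env.object_pos)\n"
        "    env.grasp()\n"
        "    env.move_to(env.target_pos)\n"
        "    env.release()\n"
    )

class ToyEnv:
    """Toy environment that tracks whether the object reaches the target."""
    def __init__(self):
        self.object_pos, self.target_pos = (0, 0), (1, 1)
        self.gripper, self.holding = (0, 0), False
    def move_to(self, pos):
        self.gripper = pos
        if self.holding:
            self.object_pos = pos
    def grasp(self):
        self.holding = self.gripper == self.object_pos
    def release(self):
        self.holding = False
    def success(self):
        return self.object_pos == self.target_pos

def evaluate(instruction: str) -> bool:
    """Generate code, exec it in a fresh namespace, run it, score success."""
    namespace = {}
    exec(query_model(instruction), namespace)
    env = ToyEnv()
    try:
        namespace["policy"](env)
    except Exception:
        return False  # runtime errors count as task failure
    return env.success()

print(evaluate("put the block on the target"))  # prints True
```

The key design point this illustrates is that the model's output is executable source code scored only by task outcome, so syntax errors, runtime exceptions, and physically wrong plans all register simply as failed episodes.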
Large language and vision-language models can directly generate executable robot manipulation code with meaningful success rates across multiple tasks, without any task-specific training or fine-tuning. The best model achieves an average success rate above 30% across all tasks.
Vision-language model agents, using only natural language prompts and visual feedback, can solve manipulation tasks on which state-of-the-art vision-language-action models (e.g., Pi 0.5) fail entirely. This highlights the surprising generality of code-generation approaches over end-to-end learned policies.
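A closed-loop agent of this kind might be sketched as below. This is a hedged, simplified illustration: `call_vlm` is a hypothetical stub (here it deliberately emits buggy code first, then a fix), the image observation is a placeholder string, and feedback is reduced to the runtime error message rather than a real rendered frame.

```python
def call_vlm(instruction, image, error=None):
    """Stub VLM: first attempt has a bug, the retry fixes it after feedback."""
    if error is None:
        return "result = undefined_helper()"   # fails at runtime
    return "result = f'done: {instruction}'"   # corrected retry

def run_agent(instruction, max_rounds=3):
    image, error = "<rendered observation>", None
    for _ in range(max_rounds):
        code = call_vlm(instruction, image, error)
        scope = {"instruction": instruction}
        try:
            exec(code, scope)
            return scope["result"]
        except Exception as exc:
            error = repr(exc)  # feed the failure back to the model
    return None

print(run_agent("stack the red cube"))  # prints "done: stack the red cube"
```

Unlike an end-to-end learned policy, the agent can observe its own failures and revise the generated program across rounds, which is where much of the generality in the code-generation approach comes from.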
Models that undergo additional post-training (e.g., RLHF, instruction tuning) show consistent improvements in their ability to generate correct robot manipulation code. This suggests that general reasoning improvements directly transfer to embodied task performance.
While larger models dominate on complex manipulation tasks, the performance gap narrows significantly on tasks that primarily require abstract reasoning rather than low-level control. This indicates that model scale is most critical for fine-grained physical reasoning.