ESI-Bench:

Towards Embodied Spatial Intelligence that Closes the Perception–Action Loop

Yining Hong1, Jiageng Liu2, Han Yin1, Manling Li3, Leonidas Guibas1, Fei-Fei Li1, Jiajun Wu1, Yejin Choi1
1Stanford University   2UCLA   3Northwestern University
Figure 1. ESI-Bench is a comprehensive benchmark for embodied spatial intelligence, spanning 10 task categories and 29 subcategories organized around Spelke's four core knowledge systems [Spelke & Kinzler, 2007]: object persistence, layout and geometry, number representation, and agents and goal-directed actions.

Abstract

Spatial intelligence unfolds through a perception–action loop: agents act to acquire observations and reason about how those observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen — occlusion, dynamics, containment, and functionality — beyond the reach of passive sensing.

We take a step beyond prior formulations of spatial intelligence, which often emphasize passive perception or assume access to oracle observations, by recasting the observer as an actor. We introduce ESI-Bench, a comprehensive benchmark for embodied spatial intelligence built on OmniGibson, spanning 10 task categories and 29 subcategories grounded in Spelke's core knowledge systems. Agents must decide which abilities to deploy — perception, locomotion, and manipulation — and how to act to answer questions that cannot be resolved from passive observation alone.

We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive observation: agents spontaneously discover emergent spatial strategies without explicit instruction, while passive multi-view input adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness, and the coupling of the two drives cascading failures in which bad actions produce bad views, which in turn produce worse actions. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect reconstruction proves more harmful than 2D baselines because it actively distorts spatial relations.

Human studies further reveal that, unlike humans, who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.

10 Task Categories
29 Subcategories
51 Interactive Scenes
1,829 Object Categories

Three Departures from Prior Spatial Benchmarks

Contribution 01

From sensing to competence

Agents are evaluated not only on what they can perceive, but on whether they know how to act to perceive it — closing the loop between observation and action.

Contribution 02

Selective sensing

Agents must determine which observations are worth acquiring, prioritizing task-relevant information over redundant or uninformative inputs.

Contribution 03

Resolving perceptual mirages

Agents must reason through incomplete or misleading observations to infer hidden spatial structures and physical constraints beyond what is directly observed.

Key Findings

Finding 1

Action blindness dominates perceptual blindness — and their coupling drives failure cascades.

  • Without explicit instruction, active agents spontaneously discover emergent spatial strategies (e.g., move-behind, top-down repositioning, pick-up, pour-out) — driving large gains over passive baselines.
  • For most tasks, perception is not the bottleneck: given the right viewpoint, models improve dramatically (e.g., Gemini 3.1 jumps from 14.6% to 95.1% on Partial Occlusion under oracle views).
  • Passive multi-view adds noise, not signal: GPT-5 even drops from 53.9% to 49.1% on Spatial Distance despite consuming far more images.
  • Suboptimal actions produce uninformative views, which trigger worse subsequent actions — a compounding chain unrecoverable within the step budget (active-to-oracle gap reaches 49.7% on Structural Enclosure).

Finding 2

3D helps when geometry is perfect — imperfect reconstruction actively misleads.

  • Ground-truth 3D + Gemini reaches 60.4% on Material Transparency vs. 44.0% for 2D Gemini — a +16.4 pt improvement on tasks where 2D projections fundamentally lose depth.
  • VGGT-reconstructed scene graphs degrade performance below 2D baselines: 9.9% vs. 27.5% on Geometric Configuration, as geometric artifacts distort fine-grained spatial relations.
  • Imperfect 3D grounding is not a neutral failure — it amplifies errors by feeding the reasoner a corrupted scene graph.

Finding 3

Models can see — but do not know when they have seen enough.

  • Humans seek viewpoints that falsify their hypotheses; models seek confirmation and tend to repeat motions in the same direction.
  • Models commit prematurely with uniformly high confidence, anchoring to first impressions and ignoring contradictory observations.
  • This is a metacognitive failure, not a perceptual one: neither better perception nor more embodied interaction alone closes the gap.

Paper Figures

Dataset Example & Action Space

A worked task instance ("put away the comic book, green onion, trash bag") alongside ESI-Bench's unified action vocabulary spanning locomotion, perception, and manipulation.
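
To make the loop concrete, here is a minimal sketch of how an active-exploration episode might be structured. Every identifier below (Action, run_episode, env.step, agent.select_action) is a hypothetical placeholder for illustration, not the actual ESI-Bench or OmniGibson API; the action names echo the locomotion/perception/manipulation vocabulary and the emergent strategies reported in Finding 1.

```python
# Minimal sketch of an active-exploration episode.
# All names here (Action, run_episode, env, agent) are illustrative
# placeholders, not the actual ESI-Bench / OmniGibson API.
from enum import Enum, auto

class Action(Enum):
    # Locomotion
    MOVE_FORWARD = auto()
    MOVE_BEHIND = auto()
    TURN_LEFT = auto()
    TURN_RIGHT = auto()
    # Perception
    LOOK_UP = auto()
    LOOK_DOWN = auto()
    # Manipulation
    PICK_UP = auto()
    PLACE = auto()
    OPEN_CONTAINER = auto()
    POUR_OUT = auto()

def run_episode(env, agent, question, max_steps=20):
    """Alternate acting and observing until the agent commits to an answer."""
    obs = env.reset()
    history = [obs]
    for _ in range(max_steps):
        # The agent decides which ability to deploy given the question
        # and everything it has observed so far.
        action = agent.select_action(question, history)
        if action is None:          # the agent judges it has seen enough
            break
        obs = env.step(action)      # a new observation induced by the action
        history.append(obs)
    return agent.answer(question, history)  # final answer, scored for accuracy
```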

Average Active Exploration Steps

Average number of active steps to reach a correct answer for GPT-5 (solid) vs. Gemini 3.1 (outlined), grouped by task category.

Qualitative Results

Emergent strategies, cascading failures, hard perceptual ceilings, and premature commitment with spatial hallucination.

Task Distribution

Distribution of task instances across the 10 categories, organized around Spelke's four core knowledge systems.

Results — Active vs. Passive vs. Oracle

Subcategory (GPT-5)      Passive Single   Passive Multi   Active   GT Passive
Partial Occlusion             30.5             32.9         62.4       91.5
View Hallucination            11.7             20.2         60.1       87.8
Material Transparency         30.3             36.7         66.1       96.3
Rigid Containment             45.0             42.5         42.5       95.0
Stacking & Stability          34.8             37.1         62.9       86.5
Counting w/ Occlusion          3.3              3.3         13.3       56.7
Structural Enclosure           5.0             10.0         22.5       67.5
Physical Contact              40.0             41.7         64.2       90.0
Dimensional Size              42.5             44.9         67.7       80.3
Unobserved Change             40.5             41.2         51.4       77.0

Table 1. Accuracy (%) of GPT-5 across four paradigms on representative subcategories. Active exploration consistently outperforms passive multi-view; the large remaining gap to GT Passive isolates failures of action selection from failures of perception. Full results across all 29 subcategories, covering every evaluated system (GPT-5, Gemini 3.1, VGGT+Gemini, GT 3D+Gemini) and human participants, appear in the paper.
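
Conceptually, the four paradigms differ only in how the observation set is assembled before the model answers. The schematic below illustrates this distinction; the helper names (default_view, surrounding_views, oracle_views, explore) are hypothetical, not the benchmark's API.

```python
# Schematic of the four evaluation paradigms; helper names are hypothetical.
def gather_observations(paradigm, env, agent, question, max_steps=20):
    if paradigm == "passive_single":
        return [env.default_view()]             # one fixed egocentric image
    if paradigm == "passive_multi":
        return env.surrounding_views()          # many images, but no agency
    if paradigm == "active":
        return agent.explore(env, question, max_steps)  # agent-chosen views
    if paradigm == "gt_passive":
        return env.oracle_views(question)       # views known to suffice
    raise ValueError(f"unknown paradigm: {paradigm}")
```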

Task Taxonomy — 10 Categories · 29 Subcategories

Each category targets a distinct spatial faculty structurally inaccessible to passive sensing. Across all categories, the correct answer emerges not from any single image but from the agent's capacity to act selectively and reason over the result.

01

Physical Capacity

Manipulation reveals containment capacity hidden from view.

Rigid Containment

Plan placement of multiple objects across multiple containers.

Liquid Volume

Compare liquid-holding capacity across containers.

Deformable Fitting

Decide whether a deformable container conforms to an object.

02

Physical Dynamics

Predict motion and stability given shape, mass, and geometry.

Inclined Plane

Predict object motion and stability on slopes.

Stacking & Stability

Whether objects stack or balance given shape, mass, and geometry.

03

Specular Reflection

Active repositioning to disambiguate mirror vs. real-world content.

Reflection Authoring

Distinguish real objects from mirror reflections.

Spatial Relations

Infer relations across mirror and real-world views.

Scene Correspondence

Identify which objects appear in the mirror given the real scene.

04

Perceptual Grounding

Repositioning to resolve viewpoint-dependent phenomena.

Partial Occlusion

Reason about objects hidden behind other scene elements.

View Hallucination

Detect objects whose visibility changes critically with viewing angle.

Material Transparency

Reason about objects seen through transparent surfaces.

05

Metric Comparison

Locomotion to overcome forced-perspective distortions.

Dimensional Size

Compare relative sizes of objects across vantage points.

Spatial Distance

Compare relative distances with respect to a reference object.

06

Enumerative Perception

Counting under occlusion, segmentation, and ambiguity.

Counting w/ Occlusion

Count objects partially obscured by other scene elements.

Spatial Segmentation

Count objects separated across distinct spatial regions.

Category Ambiguity

Count visually similar objects requiring fine-grained distinction.

Merged Observation

Count groups that appear visually merged from a single view.

Illumination Variability

Count objects under challenging or non-uniform lighting.

Structural Enclosure

Count objects hidden within enclosed or covered spaces.

07

Spatial Relations

Navigation to vantage points that break projective symmetry.

Linear Alignment

Whether objects are arranged along a common axis.

Geometric Configuration

Identify the shape formed by a set of objects (e.g., equilateral triangle).

Physical Contact

Physical Contact

Detect whether two or more objects are in direct contact.

08

Cognitive Mapping

Multi-step locomotion to construct topological representations.

Topology & Connectivity

Whether two locations or regions are mutually reachable.

Traversable Passage

Identify navigable corridors or passageways between regions.

Regional Boundary

Identify and delineate distinct functional spatial regions.

Long-Term Navigation

Plan multi-step navigation toward a distant goal.

09

Temporal Scene Understanding

Manipulation and interaction to trigger or observe state changes.

Unobserved State Change

Infer scene changes that occurred during an unobserved interval.

Multi-Agent Interaction

Reason about scene dynamics induced by other agents.

10

Action Sequencing

Reasoning over ordered actions to determine causal dependencies.

Action Order Inference

Determine the correct procedural ordering of an action sequence.