Dataset Example & Action Space
A worked task instance ("put away the comic book, green onion, trash bag") alongside ESI-Bench's unified action vocabulary spanning locomotion, perception, and manipulation.
Spatial intelligence unfolds through a perception–action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen — occlusion, dynamics, containment, and functionality — beyond the reach of passive sensing.
We take a step beyond prior formulations of spatial intelligence, which often emphasize passive perception or assume access to oracle observations, by recasting the observer as an actor. We introduce ESI-Bench, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy — perception, locomotion, and manipulation — and how to act to answer questions that cannot be resolved from passive observation alone.
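To make the agent-environment contract concrete, here is a minimal Python sketch of an active evaluation episode. The `Action` record, the `env.reset`/`env.step` interface, and the `agent.decide` call are illustrative assumptions rather than the actual ESI-Bench or OmniGibson API; they only show how a single action vocabulary can span perception, locomotion, and manipulation, and how acting and reasoning interleave.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical action record: every action, whether it moves the base,
# the camera, or an object, is expressed in one shared vocabulary.
@dataclass
class Action:
    kind: str                      # "locomotion" | "perception" | "manipulation"
    name: str                      # e.g. "navigate_to", "look_around", "open"
    target: Optional[str] = None   # object or region the action refers to


def run_episode(agent, env, question: str, max_steps: int = 10) -> str:
    """Illustrative active-exploration loop: the agent alternates between
    acting (to change its viewpoint or the scene) and reasoning over the
    resulting observations until it commits to an answer."""
    observations = [env.reset()]            # initial egocentric view
    for _ in range(max_steps):
        # Decide whether the evidence gathered so far already answers the question.
        decision = agent.decide(question, observations)
        if decision.answer is not None:     # commit only when confident
            return decision.answer
        # Otherwise select the next action (walk behind the sofa, open the
        # cabinet, or simply turn the camera) and observe the result.
        action: Action = decision.action
        observations.append(env.step(action))
    # Forced to answer once the interaction budget is exhausted.
    return agent.decide(question, observations, force_answer=True).answer
```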
We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms its passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instruction, whereas passive multi-view observation adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness, and the coupling of the two drives cascading failures in which poor actions produce poor views, which in turn produce worse actions. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect reconstruction proves more harmful than 2D baselines by actively distorting spatial relations.
Human studies further reveal that, unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.
Agents are evaluated not only on what they can perceive, but on whether they know how to act to perceive it — closing the loop between observation and action.
Agents must determine which observations are worth acquiring, prioritizing task-relevant information over redundant or uninformative inputs.
Agents must reason through incomplete or misleading observations to infer hidden spatial structures and physical constraints beyond what is directly observed.
Average number of active steps to reach a correct answer for GPT-5 (solid) vs. Gemini 3.1 (outlined), grouped by task category.
Emergent strategies, cascading failures, hard perceptual ceilings, and premature commitment with spatial hallucination.
Distribution of task instances across the 10 categories, organized around Spelke's four core knowledge systems.
| Subcategory | Passive Single | Passive Multi | Active | GT Passive |
|---|---|---|---|---|
| Partial Occlusion | 30.5 | 32.9 | 62.4 | 91.5 |
| View Hallucination | 11.7 | 20.2 | 60.1 | 87.8 |
| Material Transparency | 30.3 | 36.7 | 66.1 | 96.3 |
| Rigid Containment | 45.0 | 42.5 | 42.5 | 95.0 |
| Stacking & Stability | 34.8 | 37.1 | 62.9 | 86.5 |
| Counting w/ Occlusion | 3.3 | 3.3 | 13.3 | 56.7 |
| Structural Enclosure | 5.0 | 10.0 | 22.5 | 67.5 |
| Physical Contact | 40.0 | 41.7 | 64.2 | 90.0 |
| Dimensional Size | 42.5 | 44.9 | 67.7 | 80.3 |
| Unobserved Change | 40.5 | 41.2 | 51.4 | 77.0 |
Table 1. Accuracy (%) of GPT-5 across four paradigms on representative subcategories. Active exploration outperforms passive multi-view on nearly every subcategory; the large remaining gap to GT Passive isolates failures of action selection from failures of perception. Full results across all 29 subcategories, all four paradigms, and all evaluated settings (GPT-5, Gemini 3.1, VGGT+Gemini, GT 3D+Gemini, Human) appear in the paper.
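The caption's decomposition can be read off the table directly: the gain of Active over Passive Multi measures what exploration buys, while the residual gap to GT Passive measures what is lost to poor action selection rather than weak perception. A small sketch using the Table 1 numbers (the variable names are ours, not the paper's):

```python
# Accuracy (%) of GPT-5 per paradigm, copied from Table 1:
# (Passive Single, Passive Multi, Active, GT Passive)
rows = {
    "Partial Occlusion":      (30.5, 32.9, 62.4, 91.5),
    "View Hallucination":     (11.7, 20.2, 60.1, 87.8),
    "Material Transparency":  (30.3, 36.7, 66.1, 96.3),
    "Rigid Containment":      (45.0, 42.5, 42.5, 95.0),
    "Stacking & Stability":   (34.8, 37.1, 62.9, 86.5),
    "Counting w/ Occlusion":  ( 3.3,  3.3, 13.3, 56.7),
    "Structural Enclosure":   ( 5.0, 10.0, 22.5, 67.5),
    "Physical Contact":       (40.0, 41.7, 64.2, 90.0),
    "Dimensional Size":       (42.5, 44.9, 67.7, 80.3),
    "Unobserved Change":      (40.5, 41.2, 51.4, 77.0),
}

for name, (single, multi, active, gt) in rows.items():
    exploration_gain = active - multi   # what acting buys over passive multi-view
    action_gap = gt - active            # headroom attributable to action selection
    print(f"{name:22s}  +{exploration_gain:5.1f} from exploration, "
          f"{action_gap:5.1f} still lost to action selection")
```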
Each category targets a distinct spatial faculty structurally inaccessible to passive sensing. Across all categories, the correct answer emerges not from any single image but from the agent's capacity to act selectively and reason over the result.

The 29 subcategories probe the following abilities:

- Plan placement of multiple objects across multiple containers.
- Compare liquid-holding capacity across containers.
- Decide whether a deformable container conforms to an object.
- Predict object motion and stability on slopes.
- Determine whether objects stack or balance given shape, mass, and geometry.
- Distinguish real objects from mirror reflections.
- Infer relations across mirror and real-world views.
- Identify which objects appear in the mirror given the real scene.
- Reason about objects hidden behind other scene elements.
- Detect objects whose visibility changes critically with viewing angle.
- Reason about objects seen through transparent surfaces.
- Compare relative sizes of objects across vantage points.
- Compare relative distances with respect to a reference object.
- Count objects partially obscured by other scene elements.
- Count objects separated across distinct spatial regions.
- Count visually similar objects requiring fine-grained distinction.
- Count groups that appear visually merged from a single view.
- Count objects under challenging or non-uniform lighting.
- Count objects hidden within enclosed or covered spaces.
- Determine whether objects are arranged along a common axis.
- Identify the shape formed by a set of objects (e.g., an equilateral triangle).
- Detect whether two or more objects are in direct contact.
- Determine whether two locations or regions are mutually reachable.
- Identify navigable corridors or passageways between regions.
- Identify and delineate distinct functional spatial regions.
- Plan multi-step navigation toward a distant goal.
- Infer scene changes that occurred during an unobserved interval.
- Reason about scene dynamics induced by other agents.
- Determine the correct procedural ordering of an action sequence.
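To give a concrete sense of what one benchmark item contains, below is a hypothetical sketch of a task instance, loosely modeled on the worked example from the figure above ("put away the comic book, green onion, trash bag"). Every field name and placeholder value is our illustration and assumption; it does not reflect the actual ESI-Bench data format.

```python
# Hypothetical task-instance record; field names and values are illustrative
# only and are not taken from the released ESI-Bench data.
task_instance = {
    "instruction": "Put away the comic book, green onion, trash bag.",
    "question": "Which container should each object be placed in?",   # assumed query form
    "scene": "OmniGibson household scene",                            # simulated environment the agent starts in
    "category": "<one of the 10 task categories>",
    "subcategory": "<one of the 29 subcategories>",
    "abilities": ["perception", "locomotion", "manipulation"],        # abilities the agent may deploy
    "answer": "<ground-truth label, used for scoring>",
}
```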