Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce LEGO-Puzzles, a scalable benchmark designed to evaluate both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. In addition to VQA tasks, we evaluate MLLMs' abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.
To comprehensively assess multi-step spatial reasoning in MLLMs, we design LEGO-Puzzles, a new benchmark built upon LEGO assembly tasks. Inspired by how humans develop spatial skills through construction processes, we categorize our tasks into three levels: spatial understanding, single-step sequential reasoning, and multi-step sequential reasoning.
Task examples of LEGO-Puzzles. From left to right, the columns represent tasks in Spatial Understanding, Single-Step Sequential Reasoning, and Multi-Step Sequential Reasoning. Note: The questions above are slightly simplified for clarity and brevity.
LEGO-Puzzles contains 1,100 visual question-answering samples spanning 11 tasks, grouped into three categories: spatial understanding (36.4%), single-step sequential reasoning (36.4%), and multi-step sequential reasoning (27.3%).
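To make the benchmark layout concrete, the following sketch shows one way the samples could be organized and how the category shares above follow from a 4/4/3 split of the 11 tasks at 100 samples each. The record fields and the example question wording are illustrative assumptions, not the released data format.

from collections import Counter

# Hypothetical sample record; the released data may use a different schema.
samples = [
    {"task": "Rotation", "category": "spatial_understanding",
     "question": "Which option shows the piece after the described rotation?",
     "options": ["A", "B", "C", "D"], "answer": "B"},
    # ... 1,099 more samples: 100 per task across the 11 tasks
]

def category_distribution(samples):
    """Return the percentage of samples in each reasoning category."""
    counts = Counter(s["category"] for s in samples)
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}

# With 4 spatial-understanding tasks, 4 single-step tasks, and 3 multi-step
# tasks at 100 samples each, this yields roughly 36.4% / 36.4% / 27.3%.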
We evaluate 18 state-of-the-art MLLMs, including both proprietary and open-source models, across all LEGO-Puzzles tasks. The table below summarizes their performance on spatial understanding, single-step reasoning, and multi-step reasoning tasks. Despite recent advances, most models exhibit significant limitations in reasoning performance, especially when handling spatial rotations, 3D adjacency, or multi-step assembly sequences.
Table 1. Full Evaluation Results of 18 MLLMs on LEGO-Puzzles. Dark Gray indicates the best performance for each task among all models, and Light Gray indicates the best result among open-source models. We also highlight the top three models based on their overall performance, using Dark Green, Medium Green, and Light Green, respectively.
To further explore the performance gap between humans and MLLMs, we introduce LEGO-Puzzles-Lite, a compact subset of the full benchmark with 220 carefully selected samples (20 per task). The following table compares top-performing MLLMs with human annotators. While humans achieve near-perfect accuracy, even the best models lag far behind, highlighting the challenges MLLMs face in visual and spatial reasoning.
Table 2. Comparing Top-Performing MLLMs with Human Proficiency on LEGO-Puzzles-Lite. The best results are marked in bold. The top three overall performances are highlighted in Dark Green, Medium Green, and Light Green, respectively.
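As a rough illustration of how a 20-per-task subset like LEGO-Puzzles-Lite could be drawn from the full benchmark, here is a minimal sketch using simple seeded random sampling; the actual 220 samples were carefully selected by the authors, so treat this only as an example of the 11 x 20 layout.

import random
from collections import defaultdict

def build_lite_subset(samples, per_task=20, seed=0):
    """Draw a fixed number of samples from each task to form a compact subset."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for s in samples:
        by_task[s["task"]].append(s)
    subset = []
    for task, items in sorted(by_task.items()):
        subset.extend(rng.sample(items, min(per_task, len(items))))
    return subset  # 11 tasks x 20 samples = 220 items on the full benchmark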
In addition to question-answering, LEGO-Puzzles also incorporates image generation tasks to evaluate whether MLLMs can visually interpret and simulate spatial transformations. We design 5 generation tasks across two main categories: spatial understanding (Rotation*, Multiview*) and single-step sequential reasoning (Position*, Dependency*, Next-Step*). Each task consists of 20 questions, resulting in a total of 100 generation questions. Models are expected to generate an image of the intermediate LEGO configuration that reflects the given instruction. Because standard automated metrics fail to assess spatial fidelity and reasoning consistency, we rely on human evaluation to rate two aspects: appearance similarity and instruction following.
Only Gemini-2.0-Flash and GPT-4o show limited success, with most open-source models failing entirely to follow instructions or preserve structural identity. This reveals critical limitations in spatially grounded image synthesis within current MLLMs.
Table 3. Evaluation on Generation. We conduct human-based evaluation to assess the “Appearance” (App) and “Instruction Following” (IF) scores of Gemini-2.0-Flash, GPT-4o, Emu2, GILL, and Anole, using a scoring scale from 0 to 3 for both dimensions.
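To clarify how the two human-rated dimensions could be turned into the per-model scores reported above, here is a minimal aggregation sketch; the rating-record format and the example scores are assumptions for illustration only.

from statistics import mean

# Hypothetical annotator ratings: each generated image receives an
# Appearance ("app") and an Instruction-Following ("if_") score from 0 to 3.
ratings = [
    {"model": "GPT-4o", "app": 2, "if_": 1},
    {"model": "GPT-4o", "app": 1, "if_": 2},
    {"model": "Emu2", "app": 1, "if_": 0},
]

def aggregate_scores(ratings):
    """Average the 0-3 Appearance and Instruction-Following ratings per model."""
    models = {r["model"] for r in ratings}
    return {
        m: {
            "App": round(mean(r["app"] for r in ratings if r["model"] == m), 2),
            "IF": round(mean(r["if_"] for r in ratings if r["model"] == m), 2),
        }
        for m in models
    }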
To analyze model behavior under increased reasoning depth, we introduce Next-k-Step, a fine-grained extension of our sequential reasoning tasks. It requires predicting the correct assembly result after k consecutive steps. We evaluate performance as k increases (1–5), with and without Chain-of-Thought (CoT) prompting. The results below reveal that most models suffer from performance degradation with more steps, and CoT prompting does not consistently help, especially in long-range spatial reasoning.
Table 4. Evaluation on Next-k-Step. k denotes the number of assembly steps, and CoT refers to adding a “Think step by step before answering” instruction to each QA pair, analogous to standard CoT prompting for LLMs.
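The sketch below shows how a Next-k-Step query with the optional CoT instruction quoted in Table 4 might be assembled; apart from that quoted sentence, the question wording and option template are assumptions rather than the benchmark's exact prompts.

def build_next_k_step_prompt(k, option_labels, use_cot=False):
    """Compose a Next-k-Step question for a given number of assembly steps.

    The model is shown the current LEGO state plus the next k instruction
    steps as images and must pick the option showing the resulting assembly.
    """
    question = (
        f"Given the current assembly state and the next {k} step(s) of the "
        f"instruction manual, which option shows the correct result after "
        f"executing all {k} step(s)? Answer with the option letter."
    )
    if use_cot:
        # CoT variant: prepend the instruction quoted in Table 4.
        question = "Think step by step before answering. " + question
    option_text = "\n".join(f"{label}. <image>" for label in option_labels)
    return f"{question}\n{option_text}"

# Example: a 3-step query with four candidate images and CoT enabled.
print(build_next_k_step_prompt(3, ["A", "B", "C", "D"], use_cot=True))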
@article{tang2025lego,
  title={LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?},
  author={Tang, Kexian and Gao, Junyao and Zeng, Yanhong and Duan, Haodong and Sun, Yanan and Xing, Zhening and Liu, Wenran and Lyu, Kaifeng and Chen, Kai},
  journal={arXiv preprint arXiv:2503.19990},
  year={2025}
}