Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce LEGO-Puzzles, a scalable benchmark designed to evaluate both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. In addition to VQA tasks, we evaluate MLLMs' abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.
To comprehensively assess multi-step spatial reasoning in MLLMs, we design LEGO-Puzzles, a new benchmark built upon LEGO assembly tasks. Inspired by how humans develop spatial skills through construction processes, we categorize our tasks into three levels: spatial understanding, single-step sequential reasoning, and multi-step sequential reasoning.
Task examples of LEGO-Puzzles. From left to right, the columns represent tasks in Spatial Understanding, Single-Step Sequential Reasoning, and Multi-Step Sequential Reasoning. Note: The questions above are slightly simplified for clarity and brevity.
LEGO-Puzzles contains 1,100 visual question-answering samples spanning 11 tasks, grouped into three categories: spatial understanding (36.4%), single-step sequential reasoning (36.4%), and multi-step sequential reasoning (27.3%).
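To make the benchmark layout concrete, the following sketch shows one way the samples could be organized and how the category shares above follow from a 4/4/3 split of the 11 tasks at 100 samples each. The record fields and the example question wording are illustrative assumptions, not the released data format.

from collections import Counter

# Hypothetical sample record; the released data may use a different schema.
samples = [
    {"task": "Rotation", "category": "spatial_understanding",
     "question": "Which option shows the piece after the described rotation?",
     "options": ["A", "B", "C", "D"], "answer": "B"},
    # ... 1,099 more samples: 100 per task across the 11 tasks
]

def category_distribution(samples):
    """Return the percentage of samples in each reasoning category."""
    counts = Counter(s["category"] for s in samples)
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}

# With 4 spatial-understanding tasks, 4 single-step tasks, and 3 multi-step
# tasks at 100 samples each, this yields roughly 36.4% / 36.4% / 27.3%.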
We evaluate 18 state-of-the-art MLLMs, including both proprietary and open-source models, across all LEGO-Puzzles tasks. The table below summarizes their performance on spatial understanding, single-step reasoning, and multi-step reasoning tasks. Despite recent advances, most models exhibit significant limitations in reasoning performance, especially when handling spatial rotations, 3D adjacency, or multi-step assembly sequences.
Table 1. Full Evaluation Results of 18 MLLMs on LEGO-Puzzles. Dark Gray indicates the best performance for each task among all models, and Light Gray indicates the best result among open-source models. We also highlight the top three models based on their overall performance, using Dark Green, Medium Green, and Light Green, respectively.
To further explore the performance gap between humans and MLLMs, we introduce LEGO-Puzzles-Lite, a compact subset of the full benchmark with 220 carefully selected samples (20 per task). The following table compares top-performing MLLMs with human annotators. While humans achieve near-perfect accuracy, even the best models lag far behind, highlighting the challenges MLLMs face in visual and spatial reasoning.
Table 2. Comparing Top-Performing MLLMs with Human Proficiency on LEGO-Puzzles-Lite. The best results are marked in bold. The top three overall performances are highlighted in Dark Green, Medium Green, and Light Green, respectively.
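As a rough illustration of how a 20-per-task subset like LEGO-Puzzles-Lite could be drawn from the full benchmark, here is a minimal sketch using simple seeded random sampling; the actual 220 samples were carefully selected by the authors, so treat this only as an example of the 11 x 20 layout.

import random
from collections import defaultdict

def build_lite_subset(samples, per_task=20, seed=0):
    """Draw a fixed number of samples from each task to form a compact subset."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for s in samples:
        by_task[s["task"]].append(s)
    subset = []
    for task, items in sorted(by_task.items()):
        subset.extend(rng.sample(items, min(per_task, len(items))))
    return subset  # 11 tasks x 20 samples = 220 items on the full benchmark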
In addition to question-answering, LEGO-Puzzles also incorporates image generation tasks to evaluate whether MLLMs can visually interpret and simulate spatial transformations. We design 5 generation tasks across two main categories: spatial understanding (Rotation*, Multiview*) and single-step sequential reasoning (Position*, Dependency*, Next-Step*). Each task consists of 20 questions, resulting in a total of 100 generation questions. Models are expected to generate an image of the intermediate LEGO configuration that reflects the given instruction. Because standard automated metrics fail to assess spatial fidelity and reasoning consistency, we rely on human evaluation to rate two aspects: appearance similarity and instruction following.
Only Gemini-2.0-Flash and GPT-4o show limited success, with most open-source models failing entirely to follow instructions or preserve structural identity. This reveals critical limitations in spatially grounded image synthesis within current MLLMs.
Table 3. Evaluation on Generation. We conduct human-based evaluation to assess the “Appearance” (App) and “Instruction Following” (IF) scores of Gemini-2.0-Flash, GPT-4o, Emu2, GILL, and Anole, using a scoring scale from 0 to 3 for both dimensions.
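To clarify how the two human-rated dimensions could be turned into the per-model scores reported above, here is a minimal aggregation sketch; the rating-record format and the example scores are assumptions for illustration only.

from statistics import mean

# Hypothetical annotator ratings: each generated image receives an
# Appearance ("app") and an Instruction-Following ("if_") score from 0 to 3.
ratings = [
    {"model": "GPT-4o", "app": 2, "if_": 1},
    {"model": "GPT-4o", "app": 1, "if_": 2},
    {"model": "Emu2", "app": 1, "if_": 0},
]

def aggregate_scores(ratings):
    """Average the 0-3 Appearance and Instruction-Following ratings per model."""
    models = {r["model"] for r in ratings}
    return {
        m: {
            "App": round(mean(r["app"] for r in ratings if r["model"] == m), 2),
            "IF": round(mean(r["if_"] for r in ratings if r["model"] == m), 2),
        }
        for m in models
    }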
To analyze model behavior under increased reasoning depth, we introduce Next-k-Step, a fine-grained extension of our sequential reasoning tasks. It requires predicting the correct assembly result after k consecutive steps. We evaluate performance as k increases (1–5), with and without Chain-of-Thought (CoT) prompting. The results below reveal that most models suffer from performance degradation with more steps, and CoT prompting does not consistently help, especially in long-range spatial reasoning.
Table 4. Evaluation on Next-k-Step. k denotes the number of assembly steps, and CoT refers to adding a “Think step by step before answering” instruction to each QA pair, analogous to standard CoT prompting for LLMs.
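The sketch below shows how a Next-k-Step query with the optional CoT instruction quoted in Table 4 might be assembled; apart from that quoted sentence, the question wording and option template are assumptions rather than the benchmark's exact prompts.

def build_next_k_step_prompt(k, option_labels, use_cot=False):
    """Compose a Next-k-Step question for a given number of assembly steps.

    The model is shown the current LEGO state plus the next k instruction
    steps as images and must pick the option showing the resulting assembly.
    """
    question = (
        f"Given the current assembly state and the next {k} step(s) of the "
        f"instruction manual, which option shows the correct result after "
        f"executing all {k} step(s)? Answer with the option letter."
    )
    if use_cot:
        # CoT variant: prepend the instruction quoted in Table 4.
        question = "Think step by step before answering. " + question
    option_text = "\n".join(f"{label}. <image>" for label in option_labels)
    return f"{question}\n{option_text}"

# Example: a 3-step query with four candidate images and CoT enabled.
print(build_next_k_step_prompt(3, ["A", "B", "C", "D"], use_cot=True))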
@article{tang2025lego,
  title={LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?},
  author={Tang, Kexian and Gao, Junyao and Zeng, Yanhong and Duan, Haodong and Sun, Yanan and Xing, Zhening and Liu, Wenran and Lyu, Kaifeng and Chen, Kai},
  journal={arXiv preprint arXiv:2503.19990},
  year={2025}
}