Despite significant recent progress, Multimodal Large Language Models (MLLMs) still struggle to correctly answer prompts that require a holistic spatio-temporal understanding. Specifically, prompts are challenging when they simultaneously refer to 1) the entirety of the environment in which an MLLM-equipped agent operates and 2) recent actions that just happened and are captured in a video clip. Yet such holistic spatio-temporal understanding is important for agents operating in the real world. To address this challenge, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent methods indeed struggle to correctly answer these prompts. To improve performance, we develop a "spatio-temporal LLM" (ST-LLM), a model equipped with projectors that improve both spatial understanding of an environment and temporal understanding of recent observations. On the collected REA data, we show that the proposed method significantly improves results compared to prior work.
The Reasoning about Environments and Actions (REA) dataset is designed to benchmark the spatio-temporal reasoning capabilities of multimodal large language models (MLLMs). It contains question-answer (QA) pairs organized into five distinct tasks, each targeting a different aspect of spatial and temporal understanding. Answering these questions requires models to reason about both the global 3D scene context (e.g., room layout, object positions) and localized temporal cues from egocentric video (e.g., human-object interactions, motion sequences). Together, these tasks offer a comprehensive challenge for developing MLLMs capable of real-world reasoning. The dataset includes 24,445 training samples and 1,757 validation samples, providing a solid foundation for both supervised learning and performance evaluation. Below, we showcase qualitative examples from each of the five REA tasks.
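For concreteness, the sketch below shows one plausible way a REA-style QA sample could be represented and loaded in Python. The field names (`task`, `question`, `answer`, `clip_path`, `pointcloud_path`) and the JSON split file are illustrative assumptions, not the dataset's actual schema or file layout.

```python
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class REASample:
    """One QA pair; field names are assumed, not REA's actual schema."""
    task: str              # one of the five REA task types
    question: str          # natural-language prompt
    answer: str            # ground-truth answer
    clip_path: str         # egocentric video clip carrying temporal cues
    pointcloud_path: str   # 3D scene point cloud carrying global context

def load_rea_split(json_file: str) -> list[REASample]:
    """Read a hypothetical JSON split file into a list of samples."""
    records = json.loads(Path(json_file).read_text())
    return [REASample(**r) for r in records]

if __name__ == "__main__":
    samples = load_rea_split("rea_train.json")  # assumed file name
    print(f"loaded {len(samples)} samples")     # paper reports 24,445 train / 1,757 val
```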
The REA dataset is constructed via a six-step pipeline that integrates egocentric videos with 3D scene understanding. For each query, we sample video clips from EPIC-KITCHENS and estimate the 3D positions of the person and objects using action annotations and segmentation masks. We then compute spatial relationships, refine navigation paths with a VideoLLM, and reconstruct a dense point cloud using hand-filtered frames. Finally, each video frame is registered to the point cloud to enable precise spatial grounding.
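The skeleton below mirrors those six steps as a plain orchestration sketch. Every helper function is a hypothetical placeholder standing in for the actual tooling (EPIC-KITCHENS annotations, segmentation masks, a VideoLLM, and a reconstruction/registration stack); only the step ordering is taken from the description above.

```python
# Hypothetical skeleton of the six-step REA data pipeline; all helpers are placeholders.

def sample_clips(video_id):
    """Step 1: sample egocentric video clips for one query (placeholder)."""
    return [f"{video_id}_clip_{i}" for i in range(3)]

def estimate_3d_positions(clip):
    """Step 2: estimate 3D positions of the person and objects from
    action annotations and segmentation masks (placeholder)."""
    return {"person": (0.0, 0.0, 0.0), "objects": {"cup": (1.2, 0.3, 0.9)}}

def compute_spatial_relations(positions):
    """Step 3: derive spatial relationships between person and objects (placeholder)."""
    ...

def refine_navigation_path(clip, relations):
    """Step 4: refine the navigation path with a VideoLLM (placeholder)."""
    ...

def reconstruct_point_cloud(frames):
    """Step 5: reconstruct a dense point cloud from the filtered frames (placeholder)."""
    ...

def register_frames(frames, point_cloud):
    """Step 6: register each video frame to the point cloud (placeholder)."""
    ...

def build_query(video_id, frames):
    clips = sample_clips(video_id)                                   # step 1
    positions = [estimate_3d_positions(c) for c in clips]            # step 2
    relations = [compute_spatial_relations(p) for p in positions]    # step 3
    paths = [refine_navigation_path(c, r) for c, r in zip(clips, relations)]  # step 4
    cloud = reconstruct_point_cloud(frames)                          # step 5
    poses = register_frames(frames, cloud)                           # step 6
    return clips, relations, paths, cloud, poses
```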
The proposed Spatio-Temporal LLM (ST-LLM) integrates egocentric video, 3D point cloud, and text inputs for unified reasoning. It leverages a pretrained vision encoder and a point cloud encoder to extract features, which are aligned using a cross-modal Q-Former module with learnable queries. A 3D positional encoding is applied to both video and point cloud features to enhance spatial precision. The fused representation is then passed to an LLM decoder to generate task-specific answers.
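A minimal, self-contained PyTorch sketch of how such a fusion could be wired is shown below. All dimensions, the MLP-based 3D positional encoding, the linear stand-ins for the frozen encoders, and the final projection into the LLM token space are assumptions for illustration; this does not reproduce the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalQFormer(nn.Module):
    """Minimal Q-Former-style block: learnable queries cross-attend to
    concatenated video and point-cloud tokens. Sizes are assumptions."""
    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens):                      # tokens: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        fused, _ = self.cross_attn(q, tokens, tokens)
        return fused + self.ffn(fused)              # (B, num_queries, dim)

class STLLMSketch(nn.Module):
    """Toy stand-in for ST-LLM: pretrained encoders are replaced by linear
    projections, and the LLM decoder is omitted (only its input tokens are produced)."""
    def __init__(self, vid_dim=1024, pc_dim=256, dim=768, llm_dim=4096):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, dim)     # stand-in for vision-encoder features
        self.pc_proj = nn.Linear(pc_dim, dim)       # stand-in for point-cloud-encoder features
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))  # 3D positional encoding
        self.qformer = CrossModalQFormer(dim=dim)
        self.to_llm = nn.Linear(dim, llm_dim)       # projector into the LLM token space

    def forward(self, vid_feat, vid_xyz, pc_feat, pc_xyz):
        # Apply the 3D positional encoding to both modalities before fusion.
        v = self.vid_proj(vid_feat) + self.pos_mlp(vid_xyz)
        p = self.pc_proj(pc_feat) + self.pos_mlp(pc_xyz)
        fused = self.qformer(torch.cat([v, p], dim=1))
        return self.to_llm(fused)                   # tokens handed to the LLM decoder

if __name__ == "__main__":
    model = STLLMSketch()
    vid, vid_xyz = torch.randn(2, 64, 1024), torch.randn(2, 64, 3)
    pc, pc_xyz = torch.randn(2, 512, 256), torch.randn(2, 512, 3)
    print(model(vid, vid_xyz, pc, pc_xyz).shape)    # torch.Size([2, 32, 4096])
```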
@misc{zheng2025spatiotemporalllmreasoningenvironments,
      title={Spatio-Temporal LLM: Reasoning about Environments and Actions},
      author={Haozhen Zheng and Beitong Tian and Mingyuan Wu and Zhenggang Tang and Klara Nahrstedt and Alex Schwing},
      year={2025},
      eprint={2507.05258},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.05258},
}