Despite significant recent progress, Multimodal Large Language Models (MLLMs) still struggle to correctly answer prompts that require a holistic spatio-temporal understanding. Specifically, prompts are challenging when they simultaneously refer to 1) the entirety of the environment in which an MLLM-equipped agent operates and 2) recent actions that just happened and are captured in a video clip. Yet such holistic spatio-temporal understanding is important for agents operating in the real world. To address this challenge, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent methods indeed struggle to correctly answer these prompts. To improve performance, we develop a "spatio-temporal LLM" (ST-LLM), a model equipped with projectors that improve both spatial understanding of an environment and temporal understanding of recent observations. On the collected REA data, we show that the proposed method significantly improves results compared to prior work.
The Reasoning about Environments and Actions (REA) dataset is designed to benchmark the spatio-temporal reasoning capabilities of multimodal large language models (MLLMs). It contains question-answer (QA) pairs organized into five distinct tasks, each targeting a different aspect of spatial and temporal understanding. Answering these questions requires models to reason about both the global 3D scene context (e.g., room layout, object positions) and localized temporal cues from egocentric video (e.g., human-object interactions, motion sequences). Together, these tasks offer a comprehensive challenge for developing MLLMs capable of real-world reasoning. The dataset includes 24,445 training samples and 1,757 validation samples, providing a solid foundation for both supervised learning and performance evaluation. Below, we showcase qualitative examples from each of the five REA tasks.
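For concreteness, the sketch below shows one plausible way a REA-style QA sample could be represented and loaded in Python. The field names (`task`, `question`, `answer`, `clip_path`, `pointcloud_path`) and the JSON split file are illustrative assumptions, not the dataset's actual schema or file layout.

```python
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class REASample:
    """One QA pair; field names are assumed, not REA's actual schema."""
    task: str              # one of the five REA task types
    question: str          # natural-language prompt
    answer: str            # ground-truth answer
    clip_path: str         # egocentric video clip carrying temporal cues
    pointcloud_path: str   # 3D scene point cloud carrying global context

def load_rea_split(json_file: str) -> list[REASample]:
    """Read a hypothetical JSON split file into a list of samples."""
    records = json.loads(Path(json_file).read_text())
    return [REASample(**r) for r in records]

if __name__ == "__main__":
    samples = load_rea_split("rea_train.json")  # assumed file name
    print(f"loaded {len(samples)} samples")     # paper reports 24,445 train / 1,757 val
```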
The REA dataset is constructed via a six-step pipeline that integrates egocentric videos with 3D scene understanding. For each query, we sample video clips from EPIC-KITCHENS and estimate the 3D positions of the person and objects using action annotations and segmentation masks. We then compute spatial relationships, refine navigation paths with a VideoLLM, and reconstruct a dense point cloud using hand-filtered frames. Finally, each video frame is registered to the point cloud to enable precise spatial grounding.
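The skeleton below mirrors those six steps as a plain orchestration sketch. Every helper function is a hypothetical placeholder standing in for the actual tooling (EPIC-KITCHENS annotations, segmentation masks, a VideoLLM, and a reconstruction/registration stack); only the step ordering is taken from the description above.

```python
# Hypothetical skeleton of the six-step REA data pipeline; all helpers are placeholders.

def sample_clips(video_id):
    """Step 1: sample egocentric video clips for one query (placeholder)."""
    return [f"{video_id}_clip_{i}" for i in range(3)]

def estimate_3d_positions(clip):
    """Step 2: estimate 3D positions of the person and objects from
    action annotations and segmentation masks (placeholder)."""
    return {"person": (0.0, 0.0, 0.0), "objects": {"cup": (1.2, 0.3, 0.9)}}

def compute_spatial_relations(positions):
    """Step 3: derive spatial relationships between person and objects (placeholder)."""
    ...

def refine_navigation_path(clip, relations):
    """Step 4: refine the navigation path with a VideoLLM (placeholder)."""
    ...

def reconstruct_point_cloud(frames):
    """Step 5: reconstruct a dense point cloud from the filtered frames (placeholder)."""
    ...

def register_frames(frames, point_cloud):
    """Step 6: register each video frame to the point cloud (placeholder)."""
    ...

def build_query(video_id, frames):
    clips = sample_clips(video_id)                                   # step 1
    positions = [estimate_3d_positions(c) for c in clips]            # step 2
    relations = [compute_spatial_relations(p) for p in positions]    # step 3
    paths = [refine_navigation_path(c, r) for c, r in zip(clips, relations)]  # step 4
    cloud = reconstruct_point_cloud(frames)                          # step 5
    poses = register_frames(frames, cloud)                           # step 6
    return clips, relations, paths, cloud, poses
```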
The proposed Spatio-Temporal LLM (ST-LLM) integrates egocentric video, 3D point cloud, and text inputs for unified reasoning. It leverages a pretrained vision encoder and a point cloud encoder to extract features, which are aligned using a cross-modal Q-Former module with learnable queries. A 3D positional encoding is applied to both video and point cloud features to enhance spatial precision. The fused representation is then passed to an LLM decoder to generate task-specific answers.
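A minimal, self-contained PyTorch sketch of how such a fusion could be wired is shown below. All dimensions, the MLP-based 3D positional encoding, the linear stand-ins for the frozen encoders, and the final projection into the LLM token space are assumptions for illustration; this does not reproduce the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalQFormer(nn.Module):
    """Minimal Q-Former-style block: learnable queries cross-attend to
    concatenated video and point-cloud tokens. Sizes are assumptions."""
    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens):                      # tokens: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        fused, _ = self.cross_attn(q, tokens, tokens)
        return fused + self.ffn(fused)              # (B, num_queries, dim)

class STLLMSketch(nn.Module):
    """Toy stand-in for ST-LLM: pretrained encoders are replaced by linear
    projections, and the LLM decoder is omitted (only its input tokens are produced)."""
    def __init__(self, vid_dim=1024, pc_dim=256, dim=768, llm_dim=4096):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, dim)     # stand-in for vision-encoder features
        self.pc_proj = nn.Linear(pc_dim, dim)       # stand-in for point-cloud-encoder features
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))  # 3D positional encoding
        self.qformer = CrossModalQFormer(dim=dim)
        self.to_llm = nn.Linear(dim, llm_dim)       # projector into the LLM token space

    def forward(self, vid_feat, vid_xyz, pc_feat, pc_xyz):
        # Apply the 3D positional encoding to both modalities before fusion.
        v = self.vid_proj(vid_feat) + self.pos_mlp(vid_xyz)
        p = self.pc_proj(pc_feat) + self.pos_mlp(pc_xyz)
        fused = self.qformer(torch.cat([v, p], dim=1))
        return self.to_llm(fused)                   # tokens handed to the LLM decoder

if __name__ == "__main__":
    model = STLLMSketch()
    vid, vid_xyz = torch.randn(2, 64, 1024), torch.randn(2, 64, 3)
    pc, pc_xyz = torch.randn(2, 512, 256), torch.randn(2, 512, 3)
    print(model(vid, vid_xyz, pc, pc_xyz).shape)    # torch.Size([2, 32, 4096])
```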
@misc{zheng2025spatiotemporalllmreasoningenvironments,
      title={Spatio-Temporal LLM: Reasoning about Environments and Actions},
      author={Haozhen Zheng and Beitong Tian and Mingyuan Wu and Zhenggang Tang and Klara Nahrstedt and Alex Schwing},
      year={2025},
      eprint={2507.05258},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.05258},
}