We present a unified framework capable of solving a broad range of 3D tasks. Our approach features a stateful recurrent model that continuously updates its state representation with each new observation. Given a stream of images, this evolving state can be used to generate metric-scale pointmaps (per-pixel 3D points) for each new input in an online fashion. These pointmaps reside within a common coordinate system, and can be accumulated into a coherent, dense scene reconstruction that updates as new images arrive. Our model, called CUT3R (Continuous Updating Transformer for 3D Reconstruction), captures rich priors of real-world scenes: not only can it predict accurate pointmaps from image observations, but it can also infer unseen regions of the scene by probing at virtual, unobserved views. Our method is simple yet highly flexible, naturally accepting image collections of varying length, which may be either video streams or unordered photo collections, containing both static and dynamic content. We evaluate our method on various 3D/4D tasks and demonstrate competitive or state-of-the-art performance in each.
Our unified framework tackles a variety of 3D tasks; the results below illustrate each of them.
Our method can reconstruct dynamic scenes in a feed-forward, online manner without per-video optimization.
See more dynamic reconstruction results on the DAVIS dataset in our gallery.
Our method can reconstruct 3D scenes from videos in a feed-forward, online manner without global optimization.
See more reconstruction results in our gallery.
Our method can reconstruct 3D scenes from sparse photo collections in an online manner, processing them as sequences.
In addition to reconstructing scenes from image observations, our method can infer unseen structures from virtual viewpoints within the reconstructed scene, predicting colored pointmaps at each virtual view.
Our model continuously updates its state representation as new data arrives. As more observations become available, the state should refine its understanding of the 3D world. In this section, we demonstrate this capability of our method.
When running our method online, the state has access only to the context available up to the current time, without information from future observations. Can the model perform better with additional context? To explore this, we introduce a new experimental setup called revisiting: we first run the method through all images so the state can see the full context, then freeze this final state and reprocess the same set of images to generate predictions—this operation evaluates the 3D understanding captured by the final state. Below, we compare the reconstruction results of the online and revisiting setups, showing that revisiting produces improved reconstructions.
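As a rough illustration, the sketch below contrasts the two protocols under simplified assumptions; `model.init_state` and `model.step` are hypothetical stand-ins for the actual interface, with `step` returning the per-frame prediction together with the updated state.

```python
import torch

@torch.no_grad()
def run_online(model, images):
    """Online: each prediction sees only the context available so far."""
    state = model.init_state()  # hypothetical: returns the initial state tokens
    preds = []
    for img in images:
        pred, state = model.step(img, state)  # readout for this frame + state update
        preds.append(pred)
    return preds, state

@torch.no_grad()
def run_revisiting(model, images):
    """Revisiting: build the final state over the whole sequence first,
    then freeze it and re-read every image against the full context."""
    _, final_state = run_online(model, images)
    preds = []
    for img in images:
        pred, _ = model.step(img, final_state)  # discard the returned state so it stays frozen
        preds.append(pred)
    return preds
```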
The images are provided to the model in this order. Because the first and second images do not overlap (a significant viewpoint change), the model initially produces a suboptimal prediction for the second image, causing the TV and coffee table to overlap (shown below on the left). However, when revisiting the scene with the final state (i.e., with the full context), the predictions for the TV, coffee table, and sofa become more accurate (shown below on the right).
We demonstrate our results on an intriguing illusion video. When observing only the first frame, the scene appears to be a 3D chair due to our prior knowledge. However, with more observations, it becomes evident that it is actually a 2D plane. Our online reconstruction results are presented below. Our method captures rich prior knowledge (predicting the first frame as a 3D chair) but updates its understanding as more observations are processed, ultimately recognizing the scene as a plane. Notably, even when the final frame of the video closely resembles the initial one, the model maintains its belief that the scene is flat. This highlights the state's ability to effectively update based on additional observations.
Given a sequence of images as input, our method performs online dense 3D reconstruction. At the core of our method is a state representation that keeps updating as each new observation comes in. Given the current image, we first use a ViT encoder to encode it into token representations. The image tokens then interact with the state tokens through a joint process of state update and state readout. The final outputs of the state readout are the pointmaps and camera parameters for the current observation. At the same time, the state update module incorporates the current observation to update the state. We repeat this process for every image, and the outputs accumulate into a dense scene reconstruction over time.
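The pseudo-PyTorch sketch below summarizes one step of this recurrent loop. The module names and output heads are illustrative simplifications (for instance, a linear pointmap head rather than a dense prediction head), not the released implementation.

```python
import torch
import torch.nn as nn

class ContinuousPerceptionStep(nn.Module):
    """One recurrence: encode the image, update the state, read out predictions."""
    def __init__(self, dim=768, nhead=12, num_state_tokens=256):
        super().__init__()
        # Stands in for the ViT encoder that turns a frame into image tokens.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True), num_layers=4)
        # Learned initial state tokens.
        self.init_state = nn.Parameter(torch.zeros(1, num_state_tokens, dim))
        # State update: state tokens attend to the current image tokens.
        self.state_update = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead, batch_first=True), num_layers=2)
        # State readout: image tokens attend to the (updated) state.
        self.state_readout = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead, batch_first=True), num_layers=2)
        self.pointmap_head = nn.Linear(dim, 3)  # simplified: one 3D point per token
        self.pose_head = nn.Linear(dim, 7)      # simplified: quaternion + translation

    def forward(self, img_tokens, state):
        feats = self.encoder(img_tokens)                 # (B, N, dim) image tokens
        new_state = self.state_update(state, feats)      # write the observation into the state
        readout = self.state_readout(feats, new_state)   # read scene info for this view
        pointmap = self.pointmap_head(readout)           # (B, N, 3), shared world frame
        pose = self.pose_head(readout.mean(dim=1))       # (B, 7), camera for this frame
        return pointmap, pose, new_state

# Carrying `new_state` forward frame by frame gives the online loop, and the
# per-frame pointmaps accumulate into the growing reconstruction:
#   state = model.init_state.expand(batch, -1, -1)
#   for frame_tokens in stream:
#       pointmap, pose, state = model(frame_tokens, state)
```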
Not only can our method reconstruct the scene from image observations, it can also infer new structures unseen in the input images. Given a query camera, shown as the blue camera, we use its raymap to query the current state and read out the corresponding pointmap. Adding this inferred pointmap to the existing reconstruction makes the reconstruction more complete.
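A minimal sketch of this probing step is shown below, assuming the raymap packs a per-pixel ray origin and unit direction computed from hypothetical intrinsics `K` and a camera-to-world matrix; `model.raymap_encode` and `model.state_readout` are illustrative names for tokenizing the raymap and reading a colored pointmap out of the (unchanged) state.

```python
import torch

@torch.no_grad()
def query_virtual_view(model, state, K, cam_to_world, height, width):
    # Build the raymap: a ray origin and unit direction in the world frame for each pixel.
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # (H, W, 3) homogeneous pixels
    dirs_cam = pix @ torch.linalg.inv(K).T                     # back-project through the intrinsics
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T             # rotate rays into the world frame
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    origins = cam_to_world[:3, 3].expand_as(dirs_world)        # camera center, repeated per pixel
    raymap = torch.cat([origins, dirs_world], dim=-1)          # (H, W, 6)

    # Tokenize the raymap in place of image tokens and read out of the frozen state.
    ray_tokens = model.raymap_encode(raymap)                   # hypothetical tokenizer
    points, colors = model.state_readout(ray_tokens, state)    # hypothetical readout
    return points, colors                                      # colored pointmap at the virtual view
```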
@misc{cut3r,
  author = {Qianqian Wang and Yifei Zhang and Aleksander Holynski and Alexei A. Efros and Angjoo Kanazawa},
  title  = {Continuous 3D Perception Model with Persistent State},
  year   = {2025},
  eprint = {arXiv:2501.12387},
}