Continuous 3D Perception Model with Persistent State

Qianqian Wang1,2*     Yifei Zhang1*     Aleksander Holynski1,2     Alexei A. Efros1     Angjoo Kanazawa1
1UC Berkeley      2Google DeepMind
(*: equal contribution)
TL;DR: An online 3D reasoning framework that solves a broad range of 3D tasks from RGB inputs alone.

Abstract

We present a unified framework capable of solving a broad range of 3D tasks. Our approach features a stateful recurrent model that continuously updates its state representation with each new observation. Given a stream of images, this evolving state can be used to generate metric-scale pointmaps (per-pixel 3D points) for each new input in an online fashion. These pointmaps reside within a common coordinate system and can be accumulated into a coherent, dense scene reconstruction that updates as new images arrive. Our model, called CUT3R (Continuous Updating Transformer for 3D Reconstruction), captures rich priors of real-world scenes: not only can it predict accurate pointmaps from image observations, but it can also infer unseen regions of the scene by probing at virtual, unobserved views. Our method is simple yet highly flexible, naturally accepting image collections of varying length, which may be either video streams or unordered photo collections containing both static and dynamic content. We evaluate our method on various 3D/4D tasks and demonstrate competitive or state-of-the-art performance in each.


Results

Our unified framework tackles a variety of 3D tasks; results for each task are shown below.

Dynamic Scene Reconstruction

Our method can reconstruct dynamic scenes in a feed-forward, online manner without per-video optimization.




See more dynamic reconstruction results on the DAVIS dataset in our gallery.


3D Reconstruction (Video)

Our method can reconstruct 3D scenes from videos in a feed-forward, online manner without global optimization.




See more reconstruction results in our gallery.

3D Reconstruction (Photo Collection)

Our method can reconstruct 3D scenes from sparse photo collections in an online manner, processing them as sequences.



Inferring Unseen Structure

In addition to reconstructing scenes from image observations, our method can infer unseen structures from virtual viewpoints within the reconstructed scene, predicting colored pointmaps at each virtual view.



State Analysis

Our model continuously updates its state representation as new data arrives. As more observations become available, the state should refine its understanding of the 3D world. In this section, we demonstrate this capability of our method.



Online vs. Revisiting

When running our method online, the state has access only to the context available up to the current time, without information from future observations. Can the model perform better with additional context? To explore this, we introduce a new experimental setup called revisiting: we first run the method through all images so the state can see the full context, then freeze this final state and reprocess the same set of images to generate predictions—this operation evaluates the 3D understanding captured by the final state. Below, we compare the reconstruction results of the online and revisiting setups, showing that revisiting produces improved reconstructions.
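For concreteness, the sketch below contrasts the two protocols in Python-style pseudocode. The model interface used here (init_state, encode, update_and_readout) is a hypothetical stand-in for illustration, not the released CUT3R API.

    def run_online(model, images):
        """Online: each frame is predicted using only the state built from past frames."""
        state = model.init_state()
        predictions = []
        for img in images:
            tokens = model.encode(img)
            state, pred = model.update_and_readout(state, tokens)  # state carries forward
            predictions.append(pred)                               # pointmap + camera for this frame
        return state, predictions

    def run_revisiting(model, images):
        """Revisiting: build the final state once, then freeze it and re-predict every frame."""
        final_state, _ = run_online(model, images)                 # first pass: state sees the full context
        predictions = []
        for img in images:
            tokens = model.encode(img)
            _, pred = model.update_and_readout(final_state, tokens)  # frozen state: updated state is discarded
            predictions.append(pred)
        return predictions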


The images are provided to the model in this order. Because the first and second images do not overlap (a significant viewpoint change), the model initially produces a suboptimal prediction for the second image, causing the TV and coffee table to overlap (shown below on the left). However, when revisiting the scene with the final state (i.e., with the full context), the predictions for the TV, coffee table, and sofa become more accurate (shown below on the right).



Online


Revisiting





An illusion example

We demonstrate our results on an intriguing illusion video. When observing only the first frame, the scene appears to be a 3D chair due to our prior knowledge. However, with more observations, it becomes evident that it is actually a 2D plane. Our online reconstruction results are presented below. Our method captures rich prior knowledge (predicting the first frame as a 3D chair) but updates its understanding as more observations are processed, ultimately recognizing the scene as a plane. Notably, even when the final frame of the video closely resembles the initial one, the model maintains its belief that the scene is flat. This highlights the state's ability to effectively update based on additional observations.



Method Overview

Reconstruction

Given a sequence of images as input, our method performs online dense 3D reconstruction. At the core of our method is a state representation that is continuously updated as each new observation arrives. Given the current image, we first use a ViT encoder to encode it into token representations. The image tokens then interact with the state tokens through a joint process of state update and state readout. The final outputs of the state readout are the pointmap and camera parameters for the current observation. At the same time, the state update module incorporates the current observation to update the state. We repeat this process for every image, and the outputs accumulate into a dense scene reconstruction over time.
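To make the recurrent structure concrete, here is a minimal sketch of the reconstruction loop in PyTorch-style pseudocode. The module names (encoder, state_update, state_readout), the token counts, and the separation of update and readout into two sequential calls are illustrative simplifications of the joint interaction described above, not the actual interfaces in our codebase.

    import torch
    import torch.nn as nn

    class ContinuousReconstructor(nn.Module):
        """Illustrative recurrent loop: encode each frame, update the state, read out predictions."""

        def __init__(self, encoder, state_update, state_readout,
                     num_state_tokens=768, dim=1024):
            super().__init__()
            self.encoder = encoder              # ViT: image -> image tokens
            self.state_update = state_update    # image tokens update the state tokens
            self.state_readout = state_readout  # state + image tokens -> pointmap, camera
            self.init_state = nn.Parameter(torch.zeros(num_state_tokens, dim))

        def forward(self, images):
            batch = images[0].shape[0]
            state = self.init_state.unsqueeze(0).expand(batch, -1, -1)
            outputs = []
            for img in images:                        # online: one observation at a time
                img_tokens = self.encoder(img)        # encode the current image
                state = self.state_update(state, img_tokens)               # fold the frame into the state
                pointmap, camera = self.state_readout(state, img_tokens)   # predict in a common frame
                outputs.append((pointmap, camera))    # accumulates into a dense reconstruction
            return outputs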

Inference

Not only can our method reconstruct the scene from image observations, it can also infer new structures unseen in the input images. Given a query camera (shown as the blue camera), we use its raymap to query the current state and read out the corresponding pointmap. Adding this inferred pointmap to the existing reconstruction makes the reconstruction more complete.
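The sketch below illustrates this probing step. The raymap construction is standard geometry; encode_raymap and state_readout are hypothetical names for the components that tokenize the query rays and decode the colored pointmap, used here only for illustration.

    import torch

    def camera_to_raymap(intrinsics, cam_to_world, height, width):
        """Per-pixel ray origins and directions, shape (H, W, 6), for a query camera."""
        ys, xs = torch.meshgrid(
            torch.arange(height, dtype=torch.float32),
            torch.arange(width, dtype=torch.float32),
            indexing="ij",
        )
        pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # homogeneous pixel coordinates
        dirs_cam = pix @ torch.linalg.inv(intrinsics).T            # unproject to camera-frame rays
        dirs_world = dirs_cam @ cam_to_world[:3, :3].T             # rotate rays into the world frame
        dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
        origins = cam_to_world[:3, 3].expand_as(dirs_world)        # camera center, repeated per pixel
        return torch.cat([origins, dirs_world], dim=-1)

    def infer_unseen(model, state, intrinsics, cam_to_world, height, width):
        """Query the current state at a virtual view and decode a colored pointmap."""
        raymap = camera_to_raymap(intrinsics, cam_to_world, height, width)
        ray_tokens = model.encode_raymap(raymap)                   # rays stand in for image tokens
        pointmap, colors = model.state_readout(state, ray_tokens)  # readout only; the state is not updated
        return pointmap, colors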



Acknowledgements

We would like to thank Haiwen Feng, Chung Min Kim, Justin Kerr, Songwei Ge, Chenfeng Xu, Letian Fu, and Ren Wang for helpful discussions, and Brent Yi for his support on the interactive visualization. We especially thank Noah Snavely for his guidance and support. This project is supported in part by DARPA No. HR001123C0021, IARPA DOI/IBC No. 140D0423C0035, NSF CNS-2235013, Bakar Fellows, ONR, MURI, TRI, and BAIR sponsors. The views and conclusions contained herein are those of the authors and do not represent the official policies or endorsements of these institutions.

BibTeX


    @misc{cut3r,
        author = {Qianqian Wang and Yifei Zhang and Aleksander Holynski and Alexei A. Efros and Angjoo Kanazawa},
        title  = {Continuous 3D Perception Model with Persistent State},
        year   = {2025},
        eprint = {arXiv:2501.12387},
    }