February 25, 2026
The flavor of the bittersweet lesson for computer vision
Vincent Sitzmann recently shared a blog post titled “The flavor of the bitter lesson for computer vision.” In it, he argues that computer vision as we currently know it may become obsolete, with its remaining elements being absorbed into end-to-end perception-action models. His post is timely and reflects ongoing discussions in the community.
In this article, I will examine some of Vincent’s arguments and provide an alternative perspective. Others have made overlapping points in this X post, which I highly recommend reading.
Is action the sole purpose of computer vision?
The post suggests that embodied intelligence, specifically action, is the sole purpose of computer vision. But in reality, computer vision has a wide range of applications.
Consider the following use cases, none of which require first-order action execution:
- Metrology: Engineers use computer vision to measure the physical world with high precision.
- Media & Entertainment: Creators leverage tools like face filters and generative video to make media content.
- Virtual Reality: VR artists utilize vision to create 3D content with accurate parallax for stereo headsets.
- Cognitive Science: Researchers apply vision to code and analyze animal behavior at scale.
More broadly, if computer vision is to remain a rigorous scientific discipline, we cannot consider action execution alone, or mandate action as a pre-training task for all problems. Computer vision must provide an action-agnostic understanding of the physical world. After all, the word science derives from the Latin scientia, meaning “knowledge, awareness, understanding.”
Is computer vision obsolete for embodied intelligence?
The post argues that computer vision, as we know it today, is going to go away. This may hold true for narrow, well-defined embodied tasks—such as navigation or basic pick-and-place—where percepts are fully observed and actions are straightforward.
However, complex long-horizon physical tasks—the hallmarks of biological spatial intelligence—remain firmly out of reach. Consider dexterous manipulation: even the most advanced manipulators today are as clumsy as a dog’s jaw or an infant monkey’s hand. They are energy-intensive, slow, struggle with coordination, and frequently require task-specific modifications. Note that clumsy hands can still be useful and create economic value, especially if they are cheap and scalable. But we must acknowledge they are not as capable and general purpose as human hands.
I don’t believe that the currently dominant paradigm of training robots solely on monocular 2D videos (‘raw 2D data’) will help us achieve Artificial General Dexterity (AGD)—robots with Great Ape-level general-purpose hand dexterity. Even at scale, monocular 2D videos lack the necessary signal to capture the intricate contact dynamics and interactions required for fine-grained manipulation.
The path to AGD might actually rest on 3D data from a combination of real-world reconstructions, simulations, 4D world models, and even contact/touch sensing [1]. We see glimpses of this trend in recent work like ManipTrans, SPIDER, or DexMachina, which leverage 3D demonstrations from multi-view datasets like GigaHands (created by my PhD student Rao Fu), or specialized hardware like the DexUMI gloves.
Is 3D a ‘Hand-Crafted’ Representation?
The post calls 3D a ‘hand-crafted’ representation that the Bitter Lesson warns against. However, unlike human-developed linguistic grammars, ‘3D’ is not a hand-crafted inductive bias. It is a measurable physical property of the universe: objects, people, and machines move and interact in 3D/4D.
We must distinguish between two related concepts: 3D as a physical quantity versus the digital representation of that quantity. Spatial properties—including geometry, reflectance, and material composition—are objective physical attributes. These can be measured and encoded through various representations, such as radiance fields, point clouds, meshes, or multi-view images.
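This quantity-versus-representation distinction can be made concrete with a small sketch (a toy example of my own, not from the post; values and resolution are arbitrary): the same physical surface, here a unit sphere, is first sampled as a point cloud and then re-encoded as an occupancy voxel grid. Both encodings describe the same spatial property; neither one is “the” 3D.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Measure" the surface as a point cloud: unit vectors sampled on the sphere.
n = 2048
v = rng.normal(size=(n, 3))
points = v / np.linalg.norm(v, axis=1, keepdims=True)  # shape (n, 3)

# Re-encode the same quantity as an occupancy voxel grid
# (resolution chosen arbitrarily for illustration).
res = 32
edges = np.linspace(-1.0, 1.0, res + 1)
idx = np.clip(np.searchsorted(edges, points, side="right") - 1, 0, res - 1)
grid = np.zeros((res, res, res), dtype=bool)
grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True

# Both encodings carry the same measurable fact: every sample lies at unit
# distance from the origin, and the occupied voxels form a thin shell.
radii = np.linalg.norm(points, axis=1)
```

Converting between such encodings changes nothing about the underlying quantity beyond sampling and quantization error, which is the sense in which 3D is a measurement rather than a hand-crafted abstraction.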
This perspective reframes 3D reconstruction not as a ‘task’ to be solved, but as a way to ‘measure’ the world. In medical imaging, volumetric 3D data, created using non-visible imaging modalities (e.g., CT, MRI), is treated as ‘raw data’ [2], not a ‘hand-crafted’ representation. Similarly, 3D reconstruction in the visible spectrum can be viewed as a measurement process. Once ‘raw 3D data’ is measured, we can leverage computation to solve problems following the Bitter Lesson.
Could 3D be more sample efficient?
If we treat 3D as a measurable physical quantity, it could offer a significant advantage over 2D data: sample efficiency.
Empirically, large models trained on ‘raw 3D data’ seem to do well with much less data than their 2D counterparts. Video generation models require millions of hours of footage, robot world models like DreamDojo are trained on 44k hours of videos, and image generation models are trained on billions of 2D images. In contrast, 3D generation models like TRELLIS are trained on fewer than one million 3D assets. Feedforward 3D reconstruction methods like DUSt3R and VGGT are trained only on 8-20M image-point cloud pairs [3]. A CT foundation model can be trained with just 150k CT scans.
While the sample efficiency of 3D remains an empirical observation requiring further quantification, the implications could be significant. It could make ‘raw 3D data’ more important to the future of computer vision, especially when the marginal cost of acquiring 3D data converges with that of 2D [4].
Bitter and Sweet
The Bitter Lesson has been true in computer vision for some time now; even Sutton recognized it in his original essay. Many tasks that we considered fundamental in vision are turning out to be Girshick’s fake tasks.
Does this render computer vision as a discipline obsolete? I don’t believe so.
On the contrary, there is a flavor of sweetness to the lesson: an exciting opportunity to re-examine our community’s foundational assumptions about tasks, representations, data, and techniques. It is an opportunity to discover how the field can evolve in a world where leveraging computation is the most effective path. The insights born from this re-examination may be exactly what we need to solve problems like general-purpose dexterity, 4D spatial intelligence, and efficient continual learning.
At Brown IVL, we are really excited about what’s to come and curious to hear from others. We are co-organizing the 4D World Models workshop at CVPR, which could offer a venue for further discussion and debate.
Thanks to Rahul Sajnani, Aashish Rai, Arthur Chen, Rao Fu, Sudarshan Harithas, and Gaurav Singh for their feedback on this post.
Notes
- In our V-HOP work, Brown PhD student Hongyu Li showed that touch information together with proprioception can effectively be used as a 3D constraint for 6DoF pose estimation.
- Measurement could involve ‘reconstruction’. CT scans are reconstructed from X-ray projections. Images are reconstructed from photoreceptor responses and demosaicing of the Bayer pattern in color filter arrays.
- Although DUSt3R and VGGT benefit from 2D pre-training that likely helps learn low-level features.
- Methods like CAT3D, Amodal3R, and commercial demos like Marble or Echo can reconstruct 3D from sparse 2D images. We could think of these as methods for learning-based 3D measurement.
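The demosaicing step mentioned in note 2 can be sketched with a toy bilinear interpolation over an RGGB Bayer mosaic (a simplified, NumPy-only illustration; real camera pipelines use far more sophisticated, often learned, interpolation):

```python
import numpy as np

def conv3(x, k):
    """3x3 correlation with zero padding (plain-NumPy helper)."""
    p = np.pad(x, 1)
    h, w = x.shape
    return sum(k[i, j] * p[i:i + h, j:j + w]
               for i in range(3) for j in range(3))

def demosaic_bilinear(raw):
    """Toy bilinear demosaic of an RGGB Bayer mosaic: (H, W) -> (H, W, 3)."""
    h, w = raw.shape
    masks = np.zeros((h, w, 3), dtype=bool)
    masks[0::2, 0::2, 0] = True  # R at even rows, even cols
    masks[0::2, 1::2, 1] = True  # G at even rows, odd cols
    masks[1::2, 0::2, 1] = True  # G at odd rows, even cols
    masks[1::2, 1::2, 2] = True  # B at odd rows, odd cols
    kernel = np.array([[0.25, 0.5, 0.25],
                       [0.5, 1.0, 0.5],
                       [0.25, 0.5, 0.25]])
    out = np.zeros((h, w, 3))
    for c in range(3):
        vals = np.where(masks[:, :, c], raw, 0.0)
        # Normalized convolution: fill each missing sample from its
        # measured same-channel neighbors.
        out[:, :, c] = (conv3(vals, kernel)
                        / conv3(masks[:, :, c].astype(float), kernel))
    return out

# A flat-field mosaic reconstructs to a flat RGB image.
rgb = demosaic_bilinear(np.ones((8, 8)))
```

The point of the sketch is the one made in note 2: the RGB image we treat as ‘raw’ is itself reconstructed from sparser sensor measurements.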