ICRA 2025
ViTa-Zero: Zero-shot Visuotactile Object 6D Pose Estimation
1 Amazon Fulfillment Technologies & Robotics
2 Brown University
3 Northeastern University
Abstract
Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle with generalization due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as its backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring–mass system, where tactile sensors induce attractive forces, and proprioception generates repulsive forces. We validate our framework through experiments on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to the visual models, our approach overcomes some drastic failure modes while tracking the in-hand object pose. In our experiments, our approach shows an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error compared to FoundationPose.
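For readers less familiar with the metrics named above, here is a minimal sketch of ADD and ADD-S using their standard definitions. ADD averages the distance between corresponding model points under the ground-truth and estimated poses. ADD-S averages the distance from each ground-truth point to its closest estimated point, so it does not penalize poses that differ only by an object symmetry. The function names and the brute-force nearest-neighbor search are illustrative choices, not taken from the paper's code. The AUC reported above is computed over a range of distance thresholds and is not shown here.

```python
import numpy as np


def transform(points, R, t):
    """Apply a rigid transform (3x3 rotation R, 3-vector t) to an (N, 3) array."""
    return points @ R.T + t


def add_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding model points."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    return float(np.mean(np.linalg.norm(gt - est, axis=1)))


def add_s_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD-S: mean distance from each ground-truth point to its closest estimated point."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    # Pairwise distances, shape (N, N); brute force is fine for small models.
    pairwise = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(np.mean(pairwise.min(axis=1)))
```

For an object that is symmetric under a 180° rotation, an estimate that is off by exactly that rotation has a large ADD but an ADD-S near zero, which is why both metrics are commonly reported together.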
Methodology
![Method Figure](/assets/images/projects/vita-zero/method.png)
Our framework consists of three modules: visual estimation, feasibility checking, and tactile refinement.
- Initially, a visual model estimates the pose.
- Then, we assess the feasibility of the pose using constraints derived from the tactile signals and proprioception.
- If the pose does not meet these constraints, we refine it through our test-time optimization algorithm using tactile and proprioceptive observations, yielding the final pose estimate.
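The three steps above can be sketched as a toy example. This is not the authors' implementation: it assumes a spherical object (so the signed distance has a closed form), refines translation only, and uses plain gradient descent. All names, tolerances, and spring constants are hypothetical. Tactile contact points act as attractive springs that pull the object surface toward them, and gripper points known from proprioception act as repulsive springs that engage only when they penetrate the object.

```python
import numpy as np


def sdf_sphere(points, center, radius):
    """Signed distance from each point to the sphere surface (negative = inside)."""
    return np.linalg.norm(points - center, axis=1) - radius


def is_feasible(center, radius, contacts, gripper_pts, tol=0.005):
    """Feasibility check: contacts lie on the surface, gripper does not penetrate."""
    touching = np.all(np.abs(sdf_sphere(contacts, center, radius)) <= tol)
    no_penetration = np.all(sdf_sphere(gripper_pts, center, radius) >= -tol)
    return bool(touching and no_penetration)


def spring_gradient(center, radius, contacts, gripper_pts, k_att=1.0, k_rep=1.0):
    """Gradient of the spring energy 0.5 * k * d^2 with respect to the sphere center."""
    grad = np.zeros(3)
    for pts, k, repulsive in ((contacts, k_att, False), (gripper_pts, k_rep, True)):
        diff = pts - center
        dist = np.linalg.norm(diff, axis=1)
        d = dist - radius
        if repulsive:
            d = np.minimum(d, 0.0)  # repulsion acts only on penetrating points
        # d(sdf)/d(center) = -diff / dist, so dE/d(center) = k * d * (-diff / dist)
        grad += np.sum((k * d / np.maximum(dist, 1e-9))[:, None] * -diff, axis=0)
    return grad


def refine(center, radius, contacts, gripper_pts, lr=0.1, steps=200):
    """Test-time optimization: descend the spring energy until the pose is feasible."""
    c = np.asarray(center, dtype=float).copy()
    for _ in range(steps):
        if is_feasible(c, radius, contacts, gripper_pts):
            break
        c -= lr * spring_gradient(c, radius, contacts, gripper_pts)
    return c


def estimate(visual_center, radius, contacts, gripper_pts):
    """Keep the visual estimate if feasible; otherwise refine it."""
    visual_center = np.asarray(visual_center, dtype=float)
    if is_feasible(visual_center, radius, contacts, gripper_pts):
        return visual_center
    return refine(visual_center, radius, contacts, gripper_pts)
```

As a usage example, consider a 5 cm radius sphere held at the origin by two fingertips touching it at x = ±0.05 m. A visual estimate of (0.03, 0, 0) fails the feasibility check, and `estimate` pulls it back to within the 5 mm tolerance of the origin. A full 6D version would also optimize rotation and use the object's actual geometry in place of the sphere.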
Supplementary Video
Citations
@inproceedings{li2025vitazero,
title={ViTa-Zero: Zero-shot Visuotactile Object 6D Pose Estimation},
author={Li, Hongyu and Akl, James and Sridhar, Srinath and Brady, Tye and Padir, Taskin},
booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
year={2025},
}
Acknowledgements
This work was done while Hongyu Li was an intern at Amazon. Taskin Padir holds concurrent appointments as a Professor of Electrical and Computer Engineering at Northeastern University and as an Amazon Scholar. This paper describes work performed at Amazon and is not associated with Northeastern University.
Contact
Hongyu Li (contact email)