CVPR 2023
CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Natural Language
1 Autodesk AI Research
2 Brown University
3 Columbia University
Overview
![CLIP-Sculptor teaser](/assets/images/projects/clip-sculptor/teaser_new.png)
Recent works have demonstrated that natural language can be used to generate and edit 3D shapes. However, these methods generate shapes with limited fidelity and diversity. We introduce CLIP-Sculptor, a method to address these constraints by producing high-fidelity and diverse 3D shapes without the need for (text, shape) pairs during training. CLIP-Sculptor achieves this in a multi-resolution approach that first generates in a low-dimensional latent space and then upscales to a higher resolution for improved shape fidelity. For improved shape diversity, we use a discrete latent space which is modeled using a transformer conditioned on CLIP’s image-text embedding space. We also present a novel variant of classifier-free guidance, which improves the accuracy-diversity trade-off. Finally, we perform extensive experiments demonstrating that CLIP-Sculptor outperforms state-of-the-art baselines.
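For context on the guidance mechanism mentioned above, the following is a minimal sketch of standard classifier-free guidance applied to a transformer's token logits. It does not show the novel variant introduced in CLIP-Sculptor, which this page does not specify. The function names, the example logits, and the guidance scales are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_logits(cond_logits, uncond_logits, scale):
    """Standard classifier-free guidance (not the paper's variant).

    Moves from the unconditional prediction toward the conditional one:
    scale = 0 gives the unconditional logits, scale = 1 the conditional
    logits, and scale > 1 extrapolates past the conditional logits.
    """
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Hypothetical logits over three codebook tokens.
cond = [2.0, 0.5, -1.0]    # model output given the text embedding
uncond = [1.0, 1.0, 1.0]   # model output with the condition dropped

for scale in (0.0, 1.0, 3.0):
    probs = softmax(guided_logits(cond, uncond, scale))
    print(scale, [round(p, 3) for p in probs])
```

Larger scales concentrate probability on the tokens favored by the text condition, which tends to raise accuracy at the cost of sample diversity. This is the trade-off that guidance variants aim to improve.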
Method Overview
The CLIP-Sculptor architecture during training (top) and inference (bottom). CLIP-Sculptor is trained in three stages. In Stage 1, we train two separate VQ-VAE models, one for low-resolution and one for high-resolution voxel grids. In Stage 2, we train a coarse transformer conditioned on a CLIP embedding to generate low-resolution VQ-VAE latent grids. In Stage 3, we train a fine transformer to perform super-resolution on these latent grids. During inference, a text prompt is passed through the CLIP text encoder, and the resulting embedding conditions the coarse transformer to generate a coarse latent grid. This coarse grid then conditions the fine transformer to generate a fine latent grid. Finally, the fine latent grid is passed through the high-resolution VQ-VAE decoder from Stage 1 to generate the output shape.
![CLIP-Sculptor Method Diagram](/assets/images/projects/clip-sculptor/method_diagram.jpg)
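The inference data flow described above can be sketched as follows. This illustrates only the order of the stages. Every function here is a hypothetical placeholder that returns pseudo-random values in place of a trained model, and the embedding size, grid sizes, codebook size, and upscaling factor are assumptions rather than values from the paper.

```python
import random

CODEBOOK_SIZE = 16  # assumed number of discrete latent codes
COARSE_SIDE = 4     # assumed side length of the coarse latent grid
UPSCALE = 2         # assumed per-axis upscaling factor of the fine stage


def clip_text_encoder(prompt, dim=8):
    """Placeholder for the CLIP text encoder: a deterministic fake embedding."""
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def coarse_transformer(text_embedding, seed):
    """Placeholder: samples a flattened coarse grid of codebook indices.

    The condition only influences the random seed here; a real model
    would attend to the embedding while predicting each index.
    """
    rng = random.Random(f"{seed}:{text_embedding}")
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(COARSE_SIDE ** 3)]


def fine_transformer(coarse_grid, seed):
    """Placeholder: samples a higher-resolution grid given the coarse grid."""
    rng = random.Random(f"{seed}:{coarse_grid}")
    n_fine = len(coarse_grid) * UPSCALE ** 3
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(n_fine)]


def high_res_decoder(fine_grid):
    """Placeholder for the high-resolution VQ-VAE decoder.

    Maps each latent index to a binary occupancy value by thresholding.
    """
    return [1 if idx >= CODEBOOK_SIZE // 2 else 0 for idx in fine_grid]


def generate(prompt, seed=0):
    """Text prompt -> coarse latent grid -> fine latent grid -> shape."""
    embedding = clip_text_encoder(prompt)
    coarse = coarse_transformer(embedding, seed)
    fine = fine_transformer(coarse, seed)
    return high_res_decoder(fine)


shape = generate("an egg chair", seed=0)
print(len(shape), sum(shape))
```

Changing `seed` while keeping the prompt fixed yields a different sample for the same text, mirroring how several distinct shapes can be drawn for one prompt.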
Results on ShapeNetCore (13 Categories)
Multiple 3D shapes generated by CLIP-Sculptor for different text inputs. The text inputs are (sub-)category names from ShapeNet13 and phrases with semantic attributes.
"an airplane"
![Airplane](/assets/images/projects/clip-sculptor/combined-gifs/a airplane_combined.gif)
"a ak-47"
![ak-47](/assets/images/projects/clip-sculptor/combined-gifs/a ak-47_combined.gif)
"a delta wing"
![delta-wing](/assets/images/projects/clip-sculptor/combined-gifs/a delta wing_combined.gif)
"a jet"
![Jet](/assets/images/projects/clip-sculptor/combined-gifs/a jet_combined.gif)
"a machine gun"
![machine gun](/assets/images/projects/clip-sculptor/combined-gifs/a machine gun_combined.gif)
"a office chair"
![office chair](/assets/images/projects/clip-sculptor/combined-gifs/a office chair_combined.gif)
"a round shaped lamp"
![round shaped lamp](/assets/images/projects/clip-sculptor/combined-gifs/a round shaped lamp_combined.gif)
"a round table"
![round table](/assets/images/projects/clip-sculptor/combined-gifs/a round table_combined.gif)
"an egg chair"
![egg chair](/assets/images/projects/clip-sculptor/combined-gifs/an egg chair_combined.gif)
Results on ShapeNetCore (55 Categories)
"a truck"
![truck](/assets/images/projects/clip-sculptor/combined-gifs/a truck_combined.gif)
"a bathtub"
![bathtub](/assets/images/projects/clip-sculptor/combined-gifs/a bathtub_combined.gif)
"a bowl"
![bowl](/assets/images/projects/clip-sculptor/combined-gifs/a bowl_combined.gif)
"a formula one car"
![formula one car](/assets/images/projects/clip-sculptor/combined-gifs/a formula one car_combined.gif)
"a motor bike"
![motor bike](/assets/images/projects/clip-sculptor/combined-gifs/a motor bike_combined.gif)
"a round guitar"
![round guitar](/assets/images/projects/clip-sculptor/combined-gifs/a round guitar_combined.gif)
"a round jar"
![round jar](/assets/images/projects/clip-sculptor/combined-gifs/a round jar_combined.gif)
"a trash can"
![trash can](/assets/images/projects/clip-sculptor/combined-gifs/a trash can_combined.gif)
Citation
@InProceedings{sanghi2023clipsculptor,
title={CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Natural Language},
author={Sanghi, Aditya and Fu, Rao and Liu, Vivian and Willis, Karl and Shayani, Hooman and Khasahmadi, Amir Hosein and Sridhar, Srinath and Ritchie, Daniel},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023}}
Acknowledgements
This work was supported by AFOSR grant FA9550-21-1-0214.
Contact
Aditya Sanghi (aditya.sanghi@autodesk.com)
Rao Fu (rao_fu@brown.edu)