TripoSG is a high-fidelity image-to-3D foundation model developed by VAST-AI Research (Tripo) and released in early 2025. It synthesizes detailed 3D shapes from a single input image and prioritizes geometric precision and structural coherence, in contrast to the speed-centric approaches of previous generations. The model aims to bridge the gap between rapid feed-forward reconstruction and high-quality shape synthesis by applying flow-based generative modeling in a learned 3D latent space.
Architecture and Technical Specs
The model is built on a large-scale rectified flow transformer, which models a linear trajectory between noise and data and trains more stably and efficiently than standard diffusion methods. It operates in the latent space of a 3D VAE that represents geometry with Signed Distance Functions (SDFs) and is trained with hybrid supervision, including surface normal guidance and an eikonal loss (a regularizer that pushes the SDF gradient toward unit norm). The primary version of the model contains 1.5 billion parameters and represents each object with 2,048 latent tokens.
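The two training ideas named above can be illustrated with a minimal, dependency-free sketch. It is not TripoSG's code: the function names, the toy sphere SDF, the finite-difference gradient, and the time convention (t = 0 is noise, t = 1 is data) are all assumptions chosen for illustration. The first function builds the straight-line interpolant and constant velocity target that a rectified flow model regresses; the second measures how far an SDF's gradient norm deviates from 1, which is what an eikonal loss penalizes.

```python
import math

def rectified_flow_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus the velocity target.

    Assumed convention: x0 is a noise sample (t = 0), x1 is a data latent (t = 1).
    The target velocity is constant along the path: x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def sphere_sdf(p, radius=1.0):
    """Toy SDF: signed distance from point p to a sphere centered at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius

def gradient_norm(sdf, p, eps=1e-4):
    """Norm of the SDF gradient at p, estimated with central differences."""
    grad = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * eps))
    return math.sqrt(sum(g * g for g in grad))

def eikonal_loss(sdf, points):
    """Mean squared deviation of the gradient norm from 1 over the sample points."""
    return sum((gradient_norm(sdf, p) - 1.0) ** 2 for p in points) / len(points)

if __name__ == "__main__":
    x_t, v = rectified_flow_pair([0.0, 0.0], [2.0, 4.0], t=0.25)
    print("x_t:", x_t, "velocity target:", v)

    samples = [[0.5, 0.2, -0.1], [1.5, 0.0, 0.0]]
    print("eikonal loss, true SDF:", eikonal_loss(sphere_sdf, samples))
    print("eikonal loss, scaled field:",
          eikonal_loss(lambda p: 2.0 * sphere_sdf(p), samples))
```

A true distance field yields a loss near zero, while a field scaled by 2 has gradient norm 2 everywhere and is penalized. In a real training setup the gradient would come from automatic differentiation rather than finite differences, and the velocity target would be compared against a network's prediction.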
Capabilities and Training
TripoSG was trained on a curated dataset of over 2 million Image-SDF pairs, enabling the model to generalize across input styles, including photographs, sketches, and stylized illustrations. The model is particularly noted for its ability to produce meshes with sharp geometric features and complex topologies that remain semantically consistent with the input image. Because its strength is high-resolution geometry, it is frequently used as a geometry engine in pipelines where separate texturing models are applied to the generated mesh.