ID-to-3D: Expressive ID-guided 3D Heads via Score Distillation Sampling

Imperial College London, UK

TL;DR: We introduce ID-to-3D, a new method to generate expressive and identity-consistent 3D human heads from a text prompt and a small set of 1-5 casually captured, in-the-wild images of a subject. ID-to-3D:
   🔥 Generates ArcFace-conditioned 3D head assets using Score Distillation Sampling (SDS)
   🔥 Creates an identity-conditioned expressive representation for each subject, enabling up to 13 unique expressions
   🔥 Produces realistic and plausible normal and albedo maps for the generated 3D assets

Abstract

We propose ID-to-3D, a method to generate identity- and text-guided 3D human heads with disentangled expressions, starting from even a single casually captured in-the-wild image of a subject. The foundation of our approach is anchored in compositionality, alongside the use of task-specific 2D diffusion models as priors for optimization. First, we extend a foundational model with a lightweight expression-aware and ID-aware architecture, and create 2D priors for geometry and texture generation by fine-tuning only 0.2% of its available training parameters. Then, we jointly leverage a neural parametric representation for the expressions of each subject and a multi-stage generation of highly detailed geometry and albedo texture. This combination of strong face identity embeddings and our neural representation enables accurate reconstruction not only of facial features but also of accessories and hair, and can be meshed to provide render-ready assets for gaming and telepresence. Our results achieve an unprecedented level of identity-consistent, high-quality texture and geometry generation, generalizing to a "world" of unseen 3D identities without relying on large captured datasets of 3D human assets.
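To make the optimization concrete, the core of any SDS-based pipeline like the one described above is a gradient of the form w(t)·(ε̂ − ε), where ε̂ is the noise predicted by a 2D diffusion prior (here, the fine-tuned ID-conditioned model) on a noised render of the current 3D parameters. The sketch below is a toy illustration, not the paper's implementation: the forward noising, the `eps_pred_fn` prior, and all shapes are simplified stand-ins.

```python
import numpy as np

def sds_grad(render, eps_pred_fn, t_weight, rng):
    """One toy Score Distillation Sampling step.

    render:      (H, W, 3) image rendered from the current 3D parameters.
    eps_pred_fn: stand-in for the ID-conditioned 2D diffusion prior;
                 it predicts the noise that was added to a noisy render.
    t_weight:    timestep-dependent weighting w(t).
    """
    eps = rng.standard_normal(render.shape)   # injected Gaussian noise
    noisy = render + eps                      # simplified forward process
    eps_hat = eps_pred_fn(noisy)              # prior's noise estimate
    # SDS gradient w(t) * (eps_hat - eps), to be backpropagated
    # through the renderer to the 3D parameters.
    return t_weight * (eps_hat - eps)

rng = np.random.default_rng(0)
img = rng.standard_normal((4, 4, 3))          # toy "render"
grad = sds_grad(img, lambda x: 0.9 * x, t_weight=0.5, rng=rng)
```

In the actual method this gradient would flow through a differentiable renderer into the geometry and texture parameters; here the prior is a placeholder linear map.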

Overview

Our method deploys a novel human parametric expression model in tandem with specialized geometry and albedo guidance. It not only creates intricately detailed head avatars with realistic textures, but also achieves strikingly ID-consistent results across a wide range of expressions, setting a new benchmark against existing SDS techniques.
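One plausible reading of an identity-conditioned parametric expression model is a per-identity neutral geometry plus a weighted sum of identity-specific expression bases (up to the 13 expressions mentioned above). The sketch below assumes this linear-blendshape form with random linear maps standing in for the learned networks; it is an illustration of the idea, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
n_vtx, d_id, n_expr = 10, 8, 13               # toy sizes; 13 expressions as in the paper

# Hypothetical stand-ins for learned networks:
W_base = rng.standard_normal((d_id, n_vtx * 3))          # id -> neutral geometry
W_expr = rng.standard_normal((n_expr, d_id, n_vtx * 3))  # id -> per-expression bases

def expressive_head(id_code, expr_weights):
    """geometry = base(id) + sum_k w_k * basis_k(id)  (assumed form)."""
    base = (id_code @ W_base).reshape(n_vtx, 3)
    bases = np.einsum('i,kij->kj', id_code, W_expr).reshape(n_expr, n_vtx, 3)
    return base + np.tensordot(expr_weights, bases, axes=1)

z = rng.standard_normal(d_id)                  # identity code
neutral = expressive_head(z, np.zeros(n_expr)) # all expression weights zero
smile = expressive_head(z, np.eye(n_expr)[0])  # activate one expression basis
```

Because the bases are themselves functions of the identity code, the same expression weight vector produces identity-consistent but subject-specific deformations.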

[Figure: method overview]

Without relying on 3D captured datasets, which are expensive to collect and typically biased, and without being constrained to a specific geometry template, our method applies to a broad range of subjects with diverse features such as skin tone and hairstyle.

ID-consistent generation

Given an ID embedding as input, ID-to-3D generates high-quality 3D representations of a subject with state-of-the-art ID retention.
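Since the method accepts 1-5 in-the-wild images, the per-image face features must be fused into a single conditioning embedding. A common recipe with ArcFace-style features, and an assumption here rather than a detail from the page, is to L2-normalize each feature vector, average, and renormalize:

```python
import numpy as np

def fuse_id_embedding(per_image_feats):
    """Fuse 1-5 per-image face features into one unit-norm ID embedding.

    Sketch only: assumes ArcFace-style features where identity lives on
    the unit hypersphere, so normalize -> average -> renormalize.
    """
    feats = np.asarray(per_image_feats, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    mean = feats.mean(axis=0)
    return mean / np.linalg.norm(mean)

rng = np.random.default_rng(2)
fake_feats = rng.standard_normal((3, 512))    # 3 images, 512-D ArcFace-like features
emb = fuse_id_embedding(fake_feats)
```

The resulting unit vector would then condition the fine-tuned diffusion priors described in the abstract.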

[Figure: animated ID-conditioned generations with reference images: Craig Robinson, Florence Pugh, Jason Momoa, Mark Ruffalo, Timothee Chalamet]

ID-conditioned examples for celebrity identities. Conditioning images retrieved from Bing Images.

[Figure: animated ID-conditioned generations]

ID-conditioned examples for AI-generated identities and identities extracted from real selfies. From left to right: conditioning created from the test cases of the NPHM dataset, the Stable Diffusion model, the Arc2Face model, or a small set of selfies.

Expressive ID-conditioned Generation

ID-to-3D creates a set of distinct and vivid expressions with robust ID consistency, capturing the fine-grained details that convey a wide range of expressions.

[Figure: expressive generation from selfies and AI-generated images]

Renderings and normal maps in camera coordinates. Identities are created from one set of selfies (top) without text conditioning and from one set of AI-generated images (bottom) paired with the text prompt "with short buzzcut hairstyle".

[Figure: expressive generation for celebrity identities]

Renderings and normal maps in camera coordinates. Celebrity identities: Will Smith, Anya Taylor-Joy, and Kanye West.

Comparisons with SDS methods

ID-to-3D sets a new state-of-the-art in the domain of SDS-based 3D face asset generation.

[Figure: geometry comparison]
[Figure: texture comparison]

Comparisons with recent SDS methods under the same rendering and lighting conditions. We compare against text-to-human-avatar generation methods (HumanNorm, TADA, DreamFace) and methods that leverage both text and images to create 3D assets (Magic123, DreamCraft3D).

ID-consistent text-based customization

ID-to-3D customizes 3D heads with text conditioning without altering the subject's overall identity.

[Figure: text-based customization]

3D heads generated using different identity conditioning and textual prompts.