We deployed a 3D digital human agent on the Apple Vision Pro that interacts with users through an ASR-LLM-TTS pipeline. Facial expressions and gestures are dynamically controlled by an Audio2BS model, allowing the agent to respond naturally with synchronized speech, expressions, and movements.
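As a rough illustration of this conversational loop, the sketch below wires the four stages together. The class names and method signatures (ASR, LLM, TTS, Audio2BS, transcribe, respond, synthesize, infer) are placeholders for illustration, not our actual on-device implementation.

```python
# Hypothetical sketch of the agent's turn-taking loop; all interfaces are assumptions.
from dataclasses import dataclass
from typing import Protocol
import numpy as np


class ASR(Protocol):
    def transcribe(self, audio: np.ndarray) -> str: ...

class LLM(Protocol):
    def respond(self, text: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> np.ndarray: ...  # reply waveform

class Audio2BS(Protocol):
    def infer(self, audio: np.ndarray) -> np.ndarray: ...  # per-frame blendshape weights


@dataclass
class TalkingAgent:
    asr: ASR
    llm: LLM
    tts: TTS
    audio2bs: Audio2BS

    def step(self, user_audio: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
        """One turn: user speech in, reply waveform plus blendshape curves out."""
        text = self.asr.transcribe(user_audio)     # speech -> text
        reply = self.llm.respond(text)             # text -> reply text
        speech = self.tts.synthesize(reply)        # reply text -> waveform
        blendshapes = self.audio2bs.infer(speech)  # waveform -> expression/gesture drivers
        return speech, blendshapes                 # both are fed to the avatar renderer
```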
TaoAvatar achieves state-of-the-art rendering quality while running in real-time across various devices, maintaining 90 FPS on high-definition stereo devices such as the Apple Vision Pro.
Our method can recover high-quality normals, which can be used for real-time image-based relighting.
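For reference, below is a minimal Lambertian relighting sketch that assumes per-pixel normals and an albedo map are available; it illustrates normal-based shading under a new directional light and is not our exact relighting pipeline.

```python
# Minimal relighting sketch (assumed inputs): shade an image under a new
# directional light using predicted normals and a simple Lambertian model.
import numpy as np

def relight_lambertian(albedo: np.ndarray,      # (H, W, 3), linear RGB in [0, 1]
                       normals: np.ndarray,     # (H, W, 3), unit-length, camera space
                       light_dir: np.ndarray,   # (3,), direction toward the light
                       light_rgb: np.ndarray,   # (3,), light color/intensity
                       ambient: float = 0.1) -> np.ndarray:
    l = light_dir / np.linalg.norm(light_dir)
    # Cosine term, clamped to zero for surfaces facing away from the light.
    n_dot_l = np.clip(normals @ l, 0.0, None)[..., None]
    shaded = albedo * (ambient + n_dot_l * light_rgb)
    return np.clip(shaded, 0.0, 1.0)
```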
Realistic 3D full-body talking avatars hold great potential in AR, with applications ranging from e-commerce live streaming to holographic communication. Despite advances in 3D Gaussian Splatting (3DGS) for lifelike avatar creation, existing methods struggle with fine-grained control of facial expressions and body movements in full-body talking tasks. Additionally, they often lack sufficient detail and cannot run in real-time on mobile devices. We present TaoAvatar, a high-fidelity, lightweight, 3DGS-based full-body talking avatar driven by various signals. Our approach starts by creating a personalized clothed human parametric template that binds Gaussians to represent appearances. We then pre-train a StyleUnet-based network to handle complex pose-dependent non-rigid deformation, which can capture high-frequency appearance details but is too resource-intensive for mobile devices. To overcome this, we "bake" the non-rigid deformations into a lightweight MLP-based network using a distillation technique and develop blend shapes to compensate for details. Extensive experiments show that TaoAvatar achieves state-of-the-art rendering quality while running in real-time across various devices, maintaining 90 FPS on high-definition stereo devices such as the Apple Vision Pro.
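To make the baking step concrete, here is a minimal distillation sketch under stated assumptions: a small MLP is supervised to reproduce the per-vertex non-rigid offsets predicted by the frozen teacher network for each training pose. The shapes, pose encoding, and L1 loss are illustrative choices, not our exact configuration.

```python
# Hedged sketch of distilling pose-dependent non-rigid deformation into a small MLP.
import torch
import torch.nn as nn

class StudentDeformMLP(nn.Module):
    def __init__(self, pose_dim: int, num_verts: int, hidden: int = 256):
        super().__init__()
        self.num_verts = num_verts
        self.net = nn.Sequential(
            nn.Linear(pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_verts * 3),  # per-vertex xyz offsets
        )

    def forward(self, pose: torch.Tensor) -> torch.Tensor:
        # pose: (B, pose_dim) -> offsets: (B, num_verts, 3)
        return self.net(pose).view(-1, self.num_verts, 3)

def distill_step(student, teacher_offsets, pose, optimizer):
    """One optimization step matching the student to frozen teacher outputs."""
    pred = student(pose)
    loss = (pred - teacher_offsets).abs().mean()  # L1 distillation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```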
Overall Framework of TaoAvatar. Our pipeline begins by reconstructing (a) a cloth-extended SMPLX mesh with aligned Gaussian textures. To address complex dynamic non-rigid deformations, (b) we employ a teacher StyleUnet to learn pose-dependent non-rigid maps, which are then baked into a lightweight student MLP to infer non-rigid deformations of our template mesh. For high-fidelity rendering, (c) we introduce learnable Gaussian blend shapes to enhance appearance details.
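A hedged sketch of what the learnable Gaussian blend shapes in (c) might look like: per-Gaussian attribute corrections are stored as a learnable basis and blended linearly by per-frame driving coefficients. The attribute set and blending rule here are illustrative assumptions rather than our exact formulation.

```python
# Illustrative Gaussian blend shapes: a linear basis of per-Gaussian offsets.
import torch
import torch.nn as nn

class GaussianBlendShapes(nn.Module):
    def __init__(self, num_gaussians: int, num_blendshapes: int, attr_dim: int):
        super().__init__()
        # One (num_gaussians, attr_dim) offset map per blend shape, learned end-to-end.
        self.basis = nn.Parameter(
            torch.zeros(num_blendshapes, num_gaussians, attr_dim))

    def forward(self, base_attrs: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
        # base_attrs: (N, attr_dim) canonical Gaussian attributes
        # coeffs:     (num_blendshapes,) driving signal for the current frame
        correction = torch.einsum("b,bnd->nd", coeffs, self.basis)
        return base_attrs + correction
```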
We showcase various high-fidelity full-body avatar synthesis results generated by our method, highlighting the versatility and accuracy of our approach.
Our method outperforms state-of-the-art methods by producing clearer clothing dynamics and enhanced facial details.
Our dataset, TalkBody4D, contains eight multi-view image sequences captured with 59 well-calibrated RGB cameras at 20 fps and a resolution of 3000×4000, with sequence lengths ranging from 800 to 1,000 frames. We use this data to evaluate our method for building animatable full-body human avatars.
To request the dataset, please visit our HuggingFace repository, log in, complete the required information, and submit the corresponding request form. We will perform dual verification of your identity and grant access permissions on HuggingFace once validation is complete.
The above model viewer shows examples of the per-frame mesh reconstructions for the actors in the dataset. It also visualizes the camera locations; note that you can adjust the viewpoint with the cursor.