UIKA: Fast Universal Head Avatar from Pose-Free Images

1Nanjing University, 2Ant Group, 3HKUST, 4Xi'an Jiaotong University
*Work done during an internship at Ant Group; Project lead; Corresponding author
Teaser Image

We present UIKA, a novel feed-forward approach for high-fidelity 3D Gaussian head avatar reconstruction from an arbitrary number of input images (e.g., a single portrait image or multi-view captures) without requiring extra camera or expression annotations.

Abstract

We present UIKA, a feed-forward animatable Gaussian head model built from an arbitrary number of unposed inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike traditional avatar methods, which require a studio-level multi-view capture system and reconstruct a subject-specific model through a lengthy optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy in which each input image is associated with a pixel-wise facial correspondence estimate. This correspondence allows us to reproject each valid pixel color from screen space to UV space, independent of camera pose and facial expression. Furthermore, we design learnable UV tokens to which attention can be applied at both the screen and UV levels, and the learned UV tokens are decoded into canonical Gaussian attributes using UV information aggregated from all input views. To train our large avatar model, we additionally prepare a large-scale, identity-rich synthetic training dataset. Our method significantly outperforms existing approaches in both monocular and multi-view settings.
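
To make the UV-guided reprojection concrete, below is a minimal PyTorch sketch (not the released code) of how valid facial pixel colors could be scattered into a shared UV texture given a predicted per-pixel UV map; the function name, texture resolution, and averaging scheme are assumptions made for illustration.

# Illustrative sketch only: scatter valid facial pixel colors from screen space
# into a shared UV texture, given a predicted per-pixel UV correspondence map.
import torch

def reproject_to_uv(image, uv_map, valid_mask, uv_res=256):
    """
    image:      (H, W, 3) pixel colors in [0, 1]
    uv_map:     (H, W, 2) predicted UV coordinates in [0, 1] per pixel
    valid_mask: (H, W)    True where the pixel belongs to the face region
    Returns a (uv_res, uv_res, 3) partial texture and a (uv_res, uv_res) hit count.
    """
    uv_tex = torch.zeros(uv_res, uv_res, 3)
    counts = torch.zeros(uv_res, uv_res)

    colors = image[valid_mask]                   # (N, 3) valid pixel colors
    uv = uv_map[valid_mask]                      # (N, 2) their UV coordinates

    # Quantize continuous UV coordinates to texel indices (row-major).
    ij = (uv.clamp(0, 1) * (uv_res - 1)).long()
    flat = ij[:, 1] * uv_res + ij[:, 0]

    # Accumulate colors; several screen pixels may land on the same texel.
    uv_tex.view(-1, 3).index_add_(0, flat, colors)
    counts.view(-1).index_add_(0, flat, torch.ones_like(flat, dtype=torch.float))

    # Average where at least one pixel landed; empty texels stay zero.
    uv_tex = uv_tex / counts.clamp(min=1).unsqueeze(-1)
    return uv_tex, counts

Because the resulting texture lives in UV space, it is the same regardless of the camera pose or expression of the input frame, which is what allows information from all unposed views to be aggregated in one canonical space.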

Method

Method Image

Pipeline Overview. Given a set of unposed input images, our pipeline begins with a facial correspondence estimator that predicts UV coordinates for valid facial pixels, and the corresponding colors are reprojected onto a shared UV space. The source images (screen space) and reprojected images (UV space) are encoded by two dedicated encoders, producing multi-scale features in both spaces. We then apply screen attention and UV attention to inject these features into learnable UV tokens, which are then decoded into UV Gaussian attribute maps, incorporating the aggregated color and confidence maps. The resulting canonical Gaussian head avatar supports animation via standard linear blend skinning and achieves real-time rendering at 220 FPS.
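
For readers who prefer code, the pipeline described above can be summarized by the PyTorch-style skeleton below. It is an illustrative sketch under assumed module names, token counts, and feature dimensions, not the released implementation (which uses multi-scale features and the confidence-aware decoding described above).

# Illustrative skeleton only; module names, shapes, and channel counts are assumptions.
import torch
import torch.nn as nn

class UVTokenAvatarSketch(nn.Module):
    def __init__(self, dim=256, num_uv_tokens=1024, gaussian_channels=14):
        super().__init__()
        self.screen_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in for a ViT/CNN encoder
        self.uv_encoder = nn.Conv2d(4, dim, kernel_size=16, stride=16)      # reprojected RGB + confidence
        self.uv_tokens = nn.Parameter(torch.randn(num_uv_tokens, dim))      # learnable UV tokens
        self.screen_attn = nn.MultiheadAttention(dim, 8, batch_first=True)  # screen attention
        self.uv_attn = nn.MultiheadAttention(dim, 8, batch_first=True)      # UV attention
        self.decoder = nn.Linear(dim, gaussian_channels)                    # per-token Gaussian attributes

    def forward(self, images, uv_reprojections):
        # images:           (V, 3, H, W) unposed input views (screen space)
        # uv_reprojections: (V, 4, H, W) colors + confidence scattered into UV space
        screen_feat = self.screen_encoder(images).flatten(2).transpose(1, 2)    # (V, Ns, dim)
        uv_feat = self.uv_encoder(uv_reprojections).flatten(2).transpose(1, 2)  # (V, Nu, dim)

        # Concatenate features from all views into one sequence per space.
        screen_feat = screen_feat.reshape(1, -1, screen_feat.shape[-1])
        uv_feat = uv_feat.reshape(1, -1, uv_feat.shape[-1])

        # Inject screen-space and UV-space cues into the learnable UV tokens.
        tokens = self.uv_tokens.unsqueeze(0)                                    # (1, T, dim)
        tokens, _ = self.screen_attn(tokens, screen_feat, screen_feat)
        tokens, _ = self.uv_attn(tokens, uv_feat, uv_feat)

        # Decode each token into canonical Gaussian attributes (e.g. offset,
        # scale, rotation, opacity, color), which are animated via LBS at render time.
        return self.decoder(tokens)                                             # (1, T, gaussian_channels)

In the actual method, the decoded attributes form UV Gaussian attribute maps rather than a flat token list, and the canonical Gaussians are driven by standard linear blend skinning for animation.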

Synthetic dataset visualization

Self-reenactment results

Cross-reenactment results

Ablation study

In-the-wild results

Smartphone-captured results

Related Links

Concurrently with our research, several excellent works on 3D head reconstruction have been introduced.

For single-image inputs, FlexAvatar (Kirschstein et al.) focuses on generating high-fidelity complete avatars, while PercHead and FastAvatar (Liang et al.) explore semantic editing and pose-invariant reconstruction, respectively.

Additionally, FastGHA introduces a method for real-time animatable 3D head avatar reconstruction from four fixed input images, following a setting similar to Avat3r.

Sharing a setting similar to ours, FlexAvatar (Peng et al.) and FastAvatar (Wu et al.) propose feed-forward frameworks for fast reconstruction from an arbitrary number of input images.

BibTeX

@misc{wu2026uikafastuniversalhead,
      title={UIKA: Fast Universal Head Avatar from Pose-Free Images}, 
      author={Zijian Wu and Boyao Zhou and Liangxiao Hu and Hongyu Liu and Yuan Sun and Xuan Wang and Xun Cao and Yujun Shen and Hao Zhu},
      year={2026},
      eprint={2601.07603},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.07603}, 
}