UIKA: Fast Universal Head Avatar from Pose-Free Images

1Nanjing University, 2Ant Group, 3HKUST, 4Xi'an Jiaotong University
*Work done during an internship at Ant Group; Project lead; Corresponding author
Teaser Image

We present UIKA, a novel feed-forward approach for high-fidelity 3D Gaussian head avatar reconstruction from an arbitrary number of input images (e.g., a single portrait image or multi-view captures) without requiring extra camera or expression annotations.

Abstract

We present UIKA, a feed-forward animatable Gaussian head model built from an arbitrary number of unposed inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike traditional avatar methods, which require a studio-level multi-view capture system and reconstruct a subject-specific model through a lengthy optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy in which each input image is associated with a pixel-wise facial correspondence estimate. This correspondence allows us to reproject each valid pixel color from screen space to UV space, independent of camera pose and facial expression. Furthermore, we design learnable UV tokens to which attention can be applied at both the screen and UV levels, and the learned UV tokens are decoded into canonical Gaussian attributes using UV information aggregated from all input views. To train our large avatar model, we additionally prepare a large-scale, identity-rich synthetic training dataset. Our method significantly outperforms existing approaches in both monocular and multi-view settings.
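
To make the UV-guided reprojection concrete, below is a minimal PyTorch sketch (not the released code) of how valid facial pixel colors could be scattered into a shared UV texture given a predicted per-pixel UV map; the function name, texture resolution, and averaging scheme are assumptions made for illustration.

# Illustrative sketch only: scatter valid facial pixel colors from screen space
# into a shared UV texture, given a predicted per-pixel UV correspondence map.
import torch

def reproject_to_uv(image, uv_map, valid_mask, uv_res=256):
    """
    image:      (H, W, 3) pixel colors in [0, 1]
    uv_map:     (H, W, 2) predicted UV coordinates in [0, 1] per pixel
    valid_mask: (H, W)    True where the pixel belongs to the face region
    Returns a (uv_res, uv_res, 3) partial texture and a (uv_res, uv_res) hit count.
    """
    uv_tex = torch.zeros(uv_res, uv_res, 3)
    counts = torch.zeros(uv_res, uv_res)

    colors = image[valid_mask]                   # (N, 3) valid pixel colors
    uv = uv_map[valid_mask]                      # (N, 2) their UV coordinates

    # Quantize continuous UV coordinates to texel indices (row-major).
    ij = (uv.clamp(0, 1) * (uv_res - 1)).long()
    flat = ij[:, 1] * uv_res + ij[:, 0]

    # Accumulate colors; several screen pixels may land on the same texel.
    uv_tex.view(-1, 3).index_add_(0, flat, colors)
    counts.view(-1).index_add_(0, flat, torch.ones_like(flat, dtype=torch.float))

    # Average where at least one pixel landed; empty texels stay zero.
    uv_tex = uv_tex / counts.clamp(min=1).unsqueeze(-1)
    return uv_tex, counts

Because the resulting texture lives in UV space, it is the same regardless of the camera pose or expression of the input frame, which is what allows information from all unposed views to be aggregated in one canonical space.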

Method

Method Image

Pipeline Overview. Given a set of unposed input images, our pipeline begins with a facial correspondence estimator that predicts UV coordinates for valid facial pixels, and the corresponding colors are reprojected onto a shared UV space. The source images (screen space) and reprojected images (UV space) are encoded by two dedicated encoders, producing multi-scale features in both spaces. We then apply screen attention and UV attention to inject these features into learnable UV tokens, which are then decoded into UV Gaussian attribute maps, incorporating the aggregated color and confidence maps. The resulting canonical Gaussian head avatar supports animation via standard linear blend skinning and achieves real-time rendering at 220 FPS.
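
For readers who prefer code, the pipeline described above can be summarized by the PyTorch-style skeleton below. It is an illustrative sketch under assumed module names, token counts, and feature dimensions, not the released implementation (which uses multi-scale features and the confidence-aware decoding described above).

# Illustrative skeleton only; module names, shapes, and channel counts are assumptions.
import torch
import torch.nn as nn

class UVTokenAvatarSketch(nn.Module):
    def __init__(self, dim=256, num_uv_tokens=1024, gaussian_channels=14):
        super().__init__()
        self.screen_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in for a ViT/CNN encoder
        self.uv_encoder = nn.Conv2d(4, dim, kernel_size=16, stride=16)      # reprojected RGB + confidence
        self.uv_tokens = nn.Parameter(torch.randn(num_uv_tokens, dim))      # learnable UV tokens
        self.screen_attn = nn.MultiheadAttention(dim, 8, batch_first=True)  # screen attention
        self.uv_attn = nn.MultiheadAttention(dim, 8, batch_first=True)      # UV attention
        self.decoder = nn.Linear(dim, gaussian_channels)                    # per-token Gaussian attributes

    def forward(self, images, uv_reprojections):
        # images:           (V, 3, H, W) unposed input views (screen space)
        # uv_reprojections: (V, 4, H, W) colors + confidence scattered into UV space
        screen_feat = self.screen_encoder(images).flatten(2).transpose(1, 2)    # (V, Ns, dim)
        uv_feat = self.uv_encoder(uv_reprojections).flatten(2).transpose(1, 2)  # (V, Nu, dim)

        # Concatenate features from all views into one sequence per space.
        screen_feat = screen_feat.reshape(1, -1, screen_feat.shape[-1])
        uv_feat = uv_feat.reshape(1, -1, uv_feat.shape[-1])

        # Inject screen-space and UV-space cues into the learnable UV tokens.
        tokens = self.uv_tokens.unsqueeze(0)                                    # (1, T, dim)
        tokens, _ = self.screen_attn(tokens, screen_feat, screen_feat)
        tokens, _ = self.uv_attn(tokens, uv_feat, uv_feat)

        # Decode each token into canonical Gaussian attributes (e.g. offset,
        # scale, rotation, opacity, color), which are animated via LBS at render time.
        return self.decoder(tokens)                                             # (1, T, gaussian_channels)

In the actual method, the decoded attributes form UV Gaussian attribute maps rather than a flat token list, and the canonical Gaussians are driven by standard linear blend skinning for animation.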

Synthetic dataset visualization

Self-reenactment results

Cross-reenactment results

Ablation study

In-the-wild results

Smartphone-captured results

Related Links

Concurrently with our research, several excellent works on 3D head reconstruction have been introduced.

For single-image inputs, FlexAvatar (Kirschstein et al.) focuses on generating high-fidelity complete avatars, while PercHead and FastAvatar (Liang et al.) explore semantic editing and pose-invariant reconstruction, respectively.

Additionally, FastGHA introduces a method for real-time animatable 3D head avatar reconstruction from four fixed input images, following a setting similar to Avat3r.

Sharing a setting similar to ours, FlexAvatar (Peng et al.) and FastAvatar (Wu et al.) propose feed-forward frameworks for fast reconstruction from an arbitrary number of input images.

BibTeX

@misc{wu2026uikafastuniversalhead,
      title={UIKA: Fast Universal Head Avatar from Pose-Free Images}, 
      author={Zijian Wu and Boyao Zhou and Liangxiao Hu and Hongyu Liu and Yuan Sun and Xuan Wang and Xun Cao and Yujun Shen and Hao Zhu},
      year={2026},
      eprint={2601.07603},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.07603}, 
}