Short Introduction
You can now create highly realistic, lip-synced talking avatar videos — powered by AI — using nothing more than a standard 10GB VRAM GPU. With WanGP 5.4 and the Hunyuan Video Avatar model, anyone can generate stunning 15-second clips that match speech or songs to facial expressions and motion, all locally and without needing enterprise hardware.
Simplified One-Line Flowchart
Voice/song + Portrait Image ➔ WanGP Web UI + Hunyuan Model ➔ 15s Talking Avatar Output
Easy Step-by-Step Method
Step 1: Get the Tool
- Install from GitHub: https://github.com/deepbeepmeep/Wan2GP
- The tool is browser-based, lightweight, and supports multiple video models including Hunyuan; see the install sketch below.
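If you prefer to script the setup, here is a minimal sketch driven from Python. It assumes git and pip are available on your system; the requirements.txt file and the wgp.py entry point reflect the current Wan2GP repository layout and are assumptions that may change.

```python
# Minimal install-and-launch sketch. The wgp.py entry point and requirements.txt
# path are assumptions based on the current Wan2GP repository layout.
import subprocess
import sys

subprocess.run(["git", "clone", "https://github.com/deepbeepmeep/Wan2GP.git"], check=True)
subprocess.run([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"],
               cwd="Wan2GP", check=True)
# Starts the local web UI; open the printed URL in your browser.
subprocess.run([sys.executable, "wgp.py"], cwd="Wan2GP", check=True)
```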
Step 2: Prepare Inputs
- A voice or song clip up to 15 seconds long for best stability.
- A portrait image resized to exactly 480x832 to prevent tensor mismatch errors during generation (see the resize sketch below).
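To avoid the tensor mismatch error, you can pre-crop the portrait with a short Pillow script. This is only a sketch: the file names are placeholders, and it assumes Pillow is installed (pip install pillow).

```python
# Center-crop and resize a portrait to exactly 480x832 with Pillow.
# Input and output file names are placeholders.
from PIL import Image, ImageOps

portrait = Image.open("portrait.jpg")
# ImageOps.fit crops to the 480:832 aspect ratio around the center, then resizes.
prepared = ImageOps.fit(portrait, (480, 832), method=Image.Resampling.LANCZOS)
prepared.save("portrait_480x832.png")
```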
Step 3: Launch the Interface and Set Parameters
- Open WanGP in your browser.
- Upload your voice/audio file and image.
- Select Hunyuan Video Avatar as your model.
- Recommended settings:
- Steps: 10–30 (20 is a good balance)
- Resolution: 512x512
- Attention: auto or sage2
- Data Type: BF16
- Quantization: Scaled
Step 4: Optimize for Smooth Generation
- If generation is sluggish or fails:
- Use PyTorch 2.7.1 with video/audio support
- Add triton and sage-attention
- Avoid uploading full-length songs; trim them into small chunks (5–10s), as in the sketch below
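The trimming step is easy to script. The sketch below assumes pydub is installed (pip install pydub) and an ffmpeg binary is on your PATH; it also prints the installed PyTorch version so you can confirm you are on the recommended 2.7.1 build. File names are placeholders.

```python
# Pre-flight sketch: check the PyTorch version, then split a long song into
# 10-second chunks for upload. Assumes pydub + ffmpeg; file names are placeholders.
import torch
from pydub import AudioSegment

print("PyTorch:", torch.__version__)          # the guide recommends 2.7.1
print("CUDA available:", torch.cuda.is_available())

CHUNK_MS = 10_000  # 10 s per chunk, within the 5-10 s range suggested above
song = AudioSegment.from_file("full_song.mp3")
for i, start in enumerate(range(0, len(song), CHUNK_MS)):
    chunk = song[start:start + CHUNK_MS]
    chunk.export(f"song_chunk_{i:02d}.wav", format="wav")
```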
Step 5: Optional Add-ons & Advanced Use
- GGUF-prepped model version for manual handling: https://huggingface.co/lym00/HunyuanVideo-Avatar-GGUF
- If installing via Docker, you can change ports or patch DNS issues manually; see the Docker sketch below.
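For example, the port remap and DNS override can be set when the container starts. Treat the sketch below as a guess at a typical setup: the image name wan2gp:latest and host port 8080 are placeholders, 7860 is Gradio's usual default port, and -p, --dns, and --gpus are standard Docker flags.

```python
# Hypothetical Docker launch with a remapped port and an explicit DNS server,
# driven from Python via subprocess. Image name and host port are placeholders.
import subprocess

subprocess.run([
    "docker", "run", "--gpus", "all",
    "-p", "8080:7860",      # host port 8080 -> container port 7860 (Gradio default)
    "--dns", "8.8.8.8",     # work around DNS resolution failures inside the container
    "wan2gp:latest",
], check=True)
```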