WildVidFit

Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models

ECCV 2024

Zijian He, Peixin Chen, Guangrun Wang, Guanbin Li*, Philip H.S. Torr, Liang Lin

Abstract

Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos. Traditional image-based methods, which rely on warping and blending, struggle with complex human movements and occlusions, limiting their effectiveness in video try-on applications. Moreover, video-based models require extensive, high-quality data and substantial computational resources. To tackle these issues, we reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion. Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach. This model, conditioned on specific garments and individuals, is trained on still images rather than videos. It leverages diffusion guidance from pre-trained models, including a video masked autoencoder that improves segment smoothness and a self-supervised model that aligns features of adjacent frames in the latent space. This integration markedly boosts the model's ability to maintain temporal coherence, enabling more effective video try-on within an image-based framework. Our experiments on the VITON-HD and DressCode datasets, along with tests on the VVT and TikTok datasets, demonstrate WildVidFit's capability to generate fluid and coherent videos.

WildVidFit Framework

Overview of our WildVidFit framework. Our method contains two modules, i.e., a one-stage image try-on network and a guidance module. At timestep \( t\), we crop the garment area and decode the latent \(Z_t\) into the frame sequence \(\mathbf{I_t}\). Between adjacent frames \(I^{j+1}_t\) and \(I^j_t\), the similarity loss \(L_{SIM}\) is calculated using cosine distance. Additionally, we randomly mask the sequence \(\mathbf{I_t}\) into \(\hat{\mathbf{I}}_t\), which is then fed into VideoMAE for reconstruction. \(L_{MAE}\) measures the distance between \(\mathbf{I_t}\) and the reconstruction of \(\hat{\mathbf{I}}_t\); we assume that a lower reconstruction loss corresponds to a smoother sequence. \(L_{SIM}\) and \(L_{MAE}\) together constitute the temporal loss, which guides the sampling step from \(Z_t\) to \(Z_{t-1}\) (see the sketch below).
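To make the guidance module more concrete, the following is a minimal PyTorch-style sketch of one guided denoising step under the temporal loss, not the released implementation. The callables `decode`, `denoise`, `feature_extractor`, and `videomae_reconstruct`, the masking ratio, and the weighting and guidance-scale values are placeholders we assume for exposition.

```python
import torch
import torch.nn.functional as F


def temporal_guidance_step(z_t, t, decode, denoise, feature_extractor,
                           videomae_reconstruct, mask_ratio=0.5,
                           lambda_sim=1.0, lambda_mae=1.0, guidance_scale=0.1):
    """One guided sampling step from Z_t to Z_{t-1} (illustrative).

    decode: maps latents (B, T, C, H, W) to the cropped frame sequence I_t
    denoise: the base (unguided) diffusion update, z_t -> z_{t-1}
    feature_extractor: self-supervised image encoder used for frame features
    videomae_reconstruct: masked-autoencoder reconstruction of a masked sequence
    """
    z_t = z_t.detach().requires_grad_(True)

    # Decode latents into the garment-area frame sequence I_t.
    frames = decode(z_t)                                     # (B, T, C, H, W)
    b, T = frames.shape[:2]

    # L_SIM: cosine distance between features of adjacent frames.
    feats = feature_extractor(frames.flatten(0, 1)).view(b, T, -1)
    cos = F.cosine_similarity(feats[:, 1:], feats[:, :-1], dim=-1)
    l_sim = (1.0 - cos).mean()

    # L_MAE: randomly mask the sequence and measure how well VideoMAE
    # reconstructs it; a lower loss suggests a smoother sequence.
    mask = (torch.rand(b, T, 1, 1, 1, device=frames.device) < mask_ratio).float()
    recon = videomae_reconstruct(frames * (1.0 - mask))
    l_mae = F.mse_loss(recon, frames)

    # Temporal loss and its gradient with respect to the latent.
    l_temp = lambda_sim * l_sim + lambda_mae * l_mae
    grad = torch.autograd.grad(l_temp, z_t)[0]

    # Nudge the unguided update along the negative temporal-loss gradient.
    return denoise(z_t.detach(), t) - guidance_scale * grad
```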

One-stage Image Try-on Network

Overview of the proposed one-stage image try-on network. First, we extract the person representation and the garment representation during preprocessing. The person representation includes the cloth-agnostic image \(A\) and the human pose \(P\), while the garment representation includes the garment image \(G\) and the edge map \(E_g\). These two representations then condition the diffusion model via hierarchical fusion in the UNet decoder and via cross-attention, respectively (a sketch of one conditioned decoder block follows).
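The block below is a minimal sketch of how one UNet decoder block could consume both representations: person features (from \(A\) and \(P\)) enter by hierarchical fusion at matching resolution, while garment tokens (encoding \(G\) and \(E_g\)) enter through cross-attention. The class, argument names, and layer choices are our own illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn


class ConditionedDecoderBlock(nn.Module):
    """One UNet decoder block conditioned on person and garment representations."""

    def __init__(self, channels, person_channels, garment_dim, heads=4):
        super().__init__()
        # `channels` is assumed divisible by the number of norm groups and heads.
        self.fuse = nn.Conv2d(channels + person_channels, channels, kernel_size=1)
        self.norm = nn.GroupNorm(8, channels)
        self.cross_attn = nn.MultiheadAttention(
            channels, heads, kdim=garment_dim, vdim=garment_dim, batch_first=True)

    def forward(self, x, person_feat, garment_tokens):
        # Hierarchical fusion: concatenate person features at this decoder
        # resolution, then project back to `channels`.
        x = self.fuse(torch.cat([x, person_feat], dim=1))

        # Cross-attention: queries come from spatial locations of the decoder
        # feature map; keys/values come from garment tokens.
        b, c, h, w = x.shape
        q = self.norm(x).flatten(2).transpose(1, 2)           # (B, H*W, C)
        attn, _ = self.cross_attn(q, garment_tokens, garment_tokens)
        return x + attn.transpose(1, 2).view(b, c, h, w)


# Example shapes (hypothetical): 64-channel decoder features at 32x32,
# 32-channel person features, and 77 garment tokens of dimension 256.
block = ConditionedDecoderBlock(channels=64, person_channels=32, garment_dim=256)
out = block(torch.randn(2, 64, 32, 32),
            torch.randn(2, 32, 32, 32),
            torch.randn(2, 77, 256))
```

Keeping the garment pathway in cross-attention while fusing person cues spatially reflects the intuition that garment identity is a global appearance condition, whereas pose and the cloth-agnostic image are spatially aligned with the output.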

Qualitative Comparison on the TikTok Dataset

Virtual Try-on Results on Real-life TikTok Videos

Cross-dataset Video Try-on Results

Qualitative Comparison on the VVT Dataset

Qualitative Comparison on the VITON-HD Dataset

Qualitative Comparison on the DressCode Dataset