Hotshot - ACT 1

Direct Text-to-Video Synthesis with Enhanced Motion Dynamics and Large-Scale Text-Video Pair Training

Hotshot Research

Text-to-Video Examples

Abstract

ACT 1 (Advanced Cinematic Transformer) is a state-of-the-art direct text-to-video synthesis system developed by Hotshot to empower the world to share their imagination through video.

ACT 1 produces high-definition videos at a variety of aspect ratios and without watermarks, creating an engaging user experience. Recently, latent diffusion models have enabled high-quality image synthesis, propelled in part by the abundance of public text-image pair data. Unfortunately, accessible video datasets of the same fidelity and scale have remained few and far between, so video creation has not seen the same advances. Furthermore, publicly available multimodal datasets heavily feature "conceptual captions," and models trained on them are unaware of many of the people, places, characters, and things that the average interested user most wants to generate. We address this problem by training a tailor-made cascaded video captioner that annotates videos while taking special care to note actions, interesting common-knowledge elements, and the everyday language one would use to describe each video.
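To make the idea of a cascaded captioner concrete, the following is a minimal sketch, not Hotshot's implementation. It assumes a cascade means a sequence of stages, each of which enriches the annotation produced by the one before it. The names Annotation, describe_actions, tag_known_entities, and run_cascade are hypothetical, and the two stages are placeholders that return fixed strings where a real system would call trained models.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    """Caption state handed from one stage of the cascade to the next."""
    video_id: str
    notes: List[str] = field(default_factory=list)


# A stage reads the annotation so far and returns an enriched copy.
Stage = Callable[[Annotation], Annotation]


def describe_actions(ann: Annotation) -> Annotation:
    # Hypothetical placeholder for a model that describes what happens in the clip.
    return Annotation(ann.video_id, ann.notes + ["a dog catches a frisbee"])


def tag_known_entities(ann: Annotation) -> Annotation:
    # Hypothetical placeholder for a model that names recognizable people,
    # places, characters, or things.
    return Annotation(ann.video_id, ann.notes + ["in Central Park"])


def run_cascade(video_id: str, stages: List[Stage]) -> str:
    """Apply each stage in order and join the notes into one caption."""
    ann = Annotation(video_id)
    for stage in stages:
        ann = stage(ann)
    return " ".join(ann.notes)


caption = run_cascade("clip-001", [describe_actions, tag_known_entities])
print(caption)
```

Keeping each stage as a plain function makes it easy to add, remove, or reorder annotation passes without touching the rest of the pipeline.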

We conduct a variety of architecture and dataset-size experiments and find that using a large-scale, high-resolution text-video corpus is crucial to high-fidelity spatial alignment, temporal alignment, and aesthetic quality.

Comparisons with Other Methods

[Side-by-side video comparison: Hotshot - ACT 1, Pika 1.0, Runway ML]