Generating Videos from Text and Images with HunyuanVideo 1.5

Creating videos from written prompts or still images is one of the most impressive and distinctive uses of deep learning technology. Almost anything imaginable, from entirely fictional scenes to everyday activities, can now be visualized with only a few keystrokes.

Video carries a sense of realism that static images often cannot provide. It adds motion, timing, and continuity, creating an effect that earlier technologies struggled to reproduce, even with advanced CGI. Image generators can already create nearly anything we can describe, and with additional editing tools and enough time, those results can be refined further. Video generators, however, offer much broader creative flexibility because they can produce complex movement, transitions, effects, and evolving scenes rather than a single static frame.

This article introduces one of the latest state-of-the-art open source deep learning models for video generation: HunyuanVideo 1.5. Released recently, this model performs at a level comparable to closed-source systems such as Wan2.5 and Sora 2, while avoiding many of the access and usage restrictions that can come with proprietary models.

With GPU-based cloud infrastructure, HunyuanVideo 1.5 can be run using popular tools such as ComfyUI and DiffSynth-Studio. In this guide, we look at what makes HunyuanVideo 1.5 powerful and then explain how to run the model on cloud GPU infrastructure. For the demonstration, the setup uses an NVIDIA H200-powered GPU server with ComfyUI.

Prerequisites

Access to an NVIDIA GPU server
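
One way to confirm this prerequisite on the server is to query the driver with nvidia-smi. The sketch below assumes the NVIDIA driver is already installed. The helper name vram_at_least and the 24000 MiB threshold are illustrative choices, not requirements stated by the model authors.

```shell
# Hypothetical helper: reads "name, memory" CSV lines on stdin, in the format
# printed by: nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
# Succeeds if at least one GPU reports $1 MiB or more.
vram_at_least() {
  awk -F', ' -v min="$1" '
    { split($2, m, " "); if (m[1] + 0 >= min) ok = 1 }
    END { exit !ok }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader \
    | vram_at_least 24000 \
    && echo "GPU with enough memory found" \
    || echo "No GPU met the illustrative 24000 MiB threshold"
else
  echo "nvidia-smi not found: install the NVIDIA driver first"
fi
```

Adjust the threshold to the precision and resolution you plan to use.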

Key Takeaways

  • HunyuanVideo 1.5 is a collection of text-to-video, image-to-video, and video super-resolution models that can compete with leading closed-source models such as Wan2.5 and Sora 2.
  • With only 8.3 billion parameters, the model is efficient enough to run inference on consumer-grade GPUs.
  • Using NVIDIA H200-powered cloud GPU infrastructure, 720p videos can be generated within minutes.

HunyuanVideo 1.5

HunyuanVideo 1.5 is a compact but powerful video generation system that delivers state-of-the-art visual quality and strong motion consistency with only 8.3 billion parameters. This makes efficient inference possible even on consumer-grade GPUs. Its performance is based on several important components: strict data curation, an advanced DiT architecture with selective and sliding tile attention, improved bilingual capabilities through glyph-aware text encoding, a progressive pre-training and post-training process, and an efficient video super-resolution module. Combined, these elements create a unified framework for high-quality text-to-video and image-to-video generation across different durations and resolutions.

Training

The training process of HunyuanVideo 1.5 is defined by two central characteristics: careful data curation and the use of the Muon optimizer. During data acquisition, the focus was placed on both diversity and quality. Video material was gathered from a variety of sources and then prepared for efficient training by splitting it into clips between 2 and 10 seconds long. The dataset was then filtered for visual quality, aesthetics, and basic properties such as video borders.
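
The clip-length constraint can be illustrated with plain shell arithmetic. This is not the authors' pipeline: the sketch assumes fixed 10-second cuts and drops a trailing remainder shorter than 2 seconds, and the function name clip_ranges is hypothetical.

```shell
# Print start-end ranges (in whole seconds) for a video of length $1.
# Assumed defaults: clips of at most 10 s, and at least 2 s.
clip_ranges() (
  total=$1; max=${2:-10}; min=${3:-2}; start=0
  while [ "$start" -lt "$total" ]; do
    end=$(( start + max ))
    if [ "$end" -gt "$total" ]; then end=$total; fi
    # Keep the clip only if it satisfies the minimum length.
    if [ $(( end - start )) -ge "$min" ]; then
      echo "$start-$end"
    fi
    start=$end
  done
)

clip_ranges 25   # prints 0-10, 10-20 and 20-25 on separate lines
```

A production pipeline would more likely cut at scene boundaries than at fixed intervals, but the length filter works the same way.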

For captioning the videos, the same approach used for HunyuanImage 3.0 was applied. This process includes “(1) a hierarchical schema for structured image description, (2) a compositional synthesis strategy for diverse data augmentation, and (3) specialized agents for factual grounding.” (Source). Together, these methods create a reliable system for captioning each video effectively and efficiently before training.

The actual training was carried out in three stages. First, the model was trained on the text-to-image task at 256p and then 512p. This text-to-image phase helped the model learn semantic alignment between text and images. The researchers found that this step improved the later text-to-video and image-to-video stages by speeding up convergence and improving performance.

During pre-training, a blended strategy combines T2I, T2V, and I2V tasks in a 1:6:3 ratio to balance semantic understanding with video-specific modeling. Large-scale T2I datasets strengthen the model’s understanding of visual semantics and increase generative variety, while the T2V and I2V tasks provide the video generation capabilities. A structured multi-stage process, listed as Stages III to VI in Table 2 of the source report, begins at 256p resolution with 16 fps and gradually increases to 480p and 720p at 24 fps, with video durations ranging from 2 to 10 seconds. This gradual increase in spatial and temporal resolution supports stable convergence and improves the model’s ability to generate detailed, coherent videos. (Source).

For post-training, several connected stages of continued training, reinforcement learning, and supervised fine-tuning are applied separately to the I2V and T2V tasks. These stages produce the final I2V and T2V models.
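
To make the 1:6:3 task mix concrete, the following sketch splits a batch of samples across the three tasks. It assumes the ratio is applied per batch, which the source does not specify, and the function name task_mix is hypothetical.

```shell
# Split $1 samples across T2I, T2V and I2V using the 1:6:3 ratio.
task_mix() (
  n=$1
  total=$(( 1 + 6 + 3 ))
  t2i=$(( n * 1 / total ))
  t2v=$(( n * 6 / total ))
  i2v=$(( n - t2i - t2v ))   # remainder goes to I2V so the counts sum to n
  echo "T2I=$t2i T2V=$t2v I2V=$i2v"
)

task_mix 1000   # prints: T2I=100 T2V=600 I2V=300
```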

Architecture

The unified Diffusion Transformer architecture defines the path the model follows when generating a video during inference. For example, “for the I2V task, the reference image is integrated into the model via two complementary strategies: (1) VAE-based encoding, where the image latent is concatenated with the noisy latent along the channel dimension to leverage its exceptional detail reconstruction capacity; and (2) SigLip-based feature extraction, where semantic embeddings are concatenated sequentially to enhance semantic alignment and strengthen instruction adherence in I2V generation. A learnable type embedding is introduced to explicitly distinguish between different types of conditions.” (Source).

The Variational AutoEncoder, or VAE, is a “causal 3D transformer architecture designed for joint image-video encoding, which achieves a spatial compression ratio of 16× and a temporal compression ratio of 4×, with a latent channel dimension of 32.” The text encoder is a multimodal LLM (MLLM) based on Qwen 2.5 VL. The additional integration of Glyph ByT5 improves the model’s ability to understand and render text in different languages. SigLip, a model that aligns images and text in a shared representation space, supplies the semantic image features used for I2V conditioning.
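
The compression ratios translate directly into the size of the latent the transformer works on. The sketch below applies them to an illustrative 1280x720 clip of 121 frames. The temporal formula is an assumption: causal video VAEs commonly encode the first frame on its own, giving (frames - 1) / 4 + 1 latent frames, whereas the source only states a 4× ratio. The function name latent_shape is hypothetical.

```shell
# Estimate the latent shape for a video of width $1, height $2, frames $3.
latent_shape() (
  width=$1; height=$2; frames=$3
  lw=$(( width / 16 ))             # 16x spatial compression
  lh=$(( height / 16 ))
  lt=$(( (frames - 1) / 4 + 1 ))   # assumed causal 4x temporal compression
  echo "channels=32 frames=$lt height=$lh width=$lw"
)

latent_shape 1280 720 121   # prints: channels=32 frames=31 height=45 width=80
```

Under these assumptions, a 121-frame 720p clip shrinks to a 31 × 45 × 80 grid with 32 channels, which helps explain why inference remains practical on a single GPU.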

To process this information across multiple modalities, the model uses a new attention mechanism called Selective and Sliding Tile Attention, or SSTA. “The SSTA algorithm comprises four key steps: 3D Block Partition, Selective Mask Generation, STA Mask Generation and Block-Sparse Attention. They propose an engineered acceleration toolkit for sparse attention mechanisms, utilizing the ThunderKittens framework to efficiently implement the flex_block_attention algorithm.” (Source).

How to Run HunyuanVideo 1.5 on a Cloud GPU Server

To run HunyuanVideo 1.5 on a cloud GPU server, first create a GPU-powered server with SSH access. Then configure VS Code or Cursor so that the Simple Browser feature can open ComfyUI, which runs on the remote machine’s GPU, from your local machine. An NVIDIA H200 GPU is recommended for this tutorial.
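
If the Simple Browser route is not available, plain SSH port forwarding is a common alternative. The sketch below only builds the command. It assumes ComfyUI listens on its default port 8188. The user name and the address 203.0.113.10 are placeholders, and the function name tunnel_cmd is hypothetical.

```shell
# Print an ssh command that forwards a local port to ComfyUI on the server.
tunnel_cmd() (
  user=$1; host=$2; port=${3:-8188}
  echo "ssh -L ${port}:127.0.0.1:${port} ${user}@${host}"
)

tunnel_cmd root 203.0.113.10
# prints: ssh -L 8188:127.0.0.1:8188 root@203.0.113.10
```

Run the printed command from your local machine, then open http://127.0.0.1:8188 in a local browser once ComfyUI has started.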

After the GPU server has been created, connect to it from your local terminal using SSH. Switch to the working directory you want to use, and then paste the following commands into the terminal. The commands clone the ComfyUI repository, download the required models, and start ComfyUI.

# Clone ComfyUI and enter the repository
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
# Install venv and pip support without prompting (prefix with sudo if you are not root)
apt update && apt install -y python3-venv python3-pip
# Create and activate a virtual environment, then install ComfyUI's dependencies
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Vision encoder used for image-to-video conditioning
cd models/clip_vision
wget https://huggingface.co/Comfy-Org/HunyuanVideo_1.5_repackaged/resolve/main/split_files/clip_vision/sigclip_vision_patch14_384.safetensors
# Text encoders: Glyph ByT5 and Qwen 2.5 VL
cd ../text_encoders
wget https://huggingface.co/Comfy-Org/HunyuanVideo_1.5_repackaged/resolve/main/split_files/text_encoders/byt5_small_glyphxl_fp16.safetensors
wget https://huggingface.co/Comfy-Org/HunyuanVideo_1.5_repackaged/resolve/main/split_files/text_encoders/qwen_2.5_vl_7b.safetensors
# VAE
cd ../vae
wget https://huggingface.co/Comfy-Org/HunyuanVideo_1.5_repackaged/resolve/main/split_files/vae/hunyuanvideo15_vae_fp16.safetensors
# Diffusion models: text-to-video, image-to-video, and super-resolution
cd ../diffusion_models
wget https://huggingface.co/Comfy-Org/HunyuanVideo_1.5_repackaged/resolve/main/split_files/diffusion_models/hunyuanvideo1.5_720p_t2v_fp16.safetensors
wget https://huggingface.co/Comfy-Org/HunyuanVideo_1.5_repackaged/resolve/main/split_files/diffusion_models/hunyuanvideo1.5_720p_i2v_fp16.safetensors
wget https://huggingface.co/Comfy-Org/HunyuanVideo_1.5_repackaged/resolve/main/split_files/diffusion_models/hunyuanvideo1.5_1080p_sr_distilled_fp16.safetensors
# Return to the ComfyUI root and start the server
cd ../..
python main.py
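
Because the downloads are large, an interrupted wget can leave a file missing or empty. The following sketch lists any expected model file that is absent, using the same paths as the commands above. The function name missing_models is hypothetical, and the check does not verify checksums.

```shell
# List expected model files (relative to the ComfyUI directory $1) that are
# missing or empty. Prints nothing when every file is present.
missing_models() (
  root=${1:-.}
  for f in \
    models/clip_vision/sigclip_vision_patch14_384.safetensors \
    models/text_encoders/byt5_small_glyphxl_fp16.safetensors \
    models/text_encoders/qwen_2.5_vl_7b.safetensors \
    models/vae/hunyuanvideo15_vae_fp16.safetensors \
    models/diffusion_models/hunyuanvideo1.5_720p_t2v_fp16.safetensors \
    models/diffusion_models/hunyuanvideo1.5_720p_i2v_fp16.safetensors \
    models/diffusion_models/hunyuanvideo1.5_1080p_sr_distilled_fp16.safetensors
  do
    # -s is true only for files that exist and are not empty.
    [ -s "$root/$f" ] || echo "$f"
  done
)

missing_models .
```

Run it from the ComfyUI directory, for example in a second terminal while the server is running.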

Next, copy the URL shown in the terminal and paste it into the Simple Browser in VS Code or Cursor. Then select the arrow button in the upper-right corner to open ComfyUI in your browser. Download the workflow JSON from the ComfyUI examples page and open it in ComfyUI. For the image-to-video workflow, use the corresponding workflow file.

You can now begin generating videos by entering your prompt. Adjust the height, width, step count, and number of frames to change the generated output. The workflow also supports video super-resolution upscaling: to enable it, un-bypass the purple, grayed-out nodes in the lower section of the workflow.
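
When choosing these values, it helps to know how long a clip a given frame count produces and which sizes divide cleanly. The sketch below assumes playback at 24 fps, the rate named for the 480p and 720p training stages. It also assumes that width and height should be multiples of 16 to match the VAE's spatial compression. Both helper names are hypothetical, and the workflow may enforce different constraints.

```shell
# Seconds of video for $1 frames at $2 fps (assumed default: 24).
clip_seconds() {
  awk -v f="$1" -v r="${2:-24}" 'BEGIN { printf "%.2f\n", f / r }'
}

# Round a pixel dimension down to a multiple of 16 (assumed constraint).
snap16() {
  echo $(( $1 / 16 * 16 ))
}

clip_seconds 121   # prints: 5.04
snap16 1000        # prints: 992
```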

The quality is excellent, even in the down-scaled GIF version of the original output. Overall, this is a strong model for generating videos in many different styles, including 3D, animation, realism, and more. On an H200 GPU, these videos can be generated within minutes. ComfyUI is highly recommended for generating videos with HunyuanVideo 1.5.

Closing Thoughts

HunyuanVideo 1.5 is an impressive video generation model with capabilities that can rival systems such as Sora 2 in pure video generation quality. Because of its innovative training strategy, future releases may have an even greater impact on the open source video generation ecosystem. Users are encouraged to try the model on GPU-powered cloud infrastructure.

Source: digitalocean.com
