Diffusion Transformers (DiT): How AI Video Models Work

A technical deep dive into the Diffusion Transformer architecture that replaced U-Net and now powers Sora, Veo, Kling, and every major AI video model in 2026.

Lychee Team | June 27, 2026 | 12 min read
[Figure: Technical diagram showing how Diffusion Transformer architecture processes video through patchification and attention layers]

Every major AI video model released in 2026 (Sora 2, Veo 3, Kling 3.0, Seedance 2.0, Hailuo, WAN, HunyuanVideo, CogVideoX, LTX-Video) shares the same core architecture. It is not a U-Net. It is a Diffusion Transformer, commonly called DiT. This single architectural shift, introduced by William Peebles and Saining Xie in the paper "Scalable Diffusion Models with Transformers" (posted as a preprint in late 2022 and published in 2023), has become the defining technical decision behind modern AI video generation.

If you have read about how diffusion models generate video, you already understand the denoising process: start with noise, iteratively refine it into coherent frames. What changed is the neural network doing the denoising. The backbone swapped from convolutional U-Nets to transformer networks — and that swap unlocked longer videos, higher resolutions, better prompt adherence, and more predictable scaling with compute.

This post explains what DiT is, why it replaced U-Net, and how it actually processes video data from noise to finished frames.

Why U-Net Hit a Ceiling

U-Net architectures served as the backbone for diffusion models from 2020 through early 2024. They worked well for images. The U-shaped structure — encoding inputs to progressively smaller feature maps, then decoding them back up with skip connections — gave the model both local detail (from early layers) and global context (from the bottleneck).

But video exposed three structural limitations.

Limited receptive field. U-Net relies on convolutional layers that see only a small neighborhood of pixels at a time. Stacking many layers expands the effective receptive field, but the expansion is gradual. For video, where a camera pan in frame one should influence the composition of frame ninety-six, the convolutional approach struggles to propagate information across long temporal distances without adding expensive, specialized temporal layers.

Awkward scaling. Convolutional networks do not scale as predictably as transformers. Doubling the parameters in a U-Net does not reliably produce a proportional gain in output quality. Adding depth or resolution levels also means reworking the matching encoder and decoder stages and their skip connections, and architecture changes (wider layers, more blocks, different normalization) often require manual tuning. This made it difficult to invest in larger models with confidence that quality would improve in step.

Fixed resolution bias. U-Nets are typically designed and trained around fixed spatial dimensions. Handling videos at different resolutions or aspect ratios required resizing, padding, or training separate model variants. Transformers, by contrast, naturally handle variable-length sequences: you simply change the number of input tokens.

These limitations did not matter much for 512x512 images. They became critical barriers when the goal shifted to generating 1080p video clips lasting ten or more seconds.

How DiT Replaces U-Net

The Diffusion Transformer keeps the overall diffusion framework intact: latent encoding, iterative denoising, text conditioning, and decoding. The only change is what sits inside the denoising loop. Where U-Net used convolutional encoder-decoder blocks with skip connections, DiT uses a stack of transformer blocks operating on sequences of tokens.

The key steps are patchification, self-attention, conditioning, and depatchification.

Patchification: Turning Video into Tokens

Raw video — even in compressed latent form — is a 3D grid of values: height, width, and time. Transformers do not process grids. They process sequences of tokens.

DiT bridges this gap through patchification. The latent video representation is sliced into small spatiotemporal cubes — patches that span a few pixels spatially and a few frames temporally. Each patch is flattened into a vector and linearly projected into the transformer's embedding dimension, producing a token.
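A minimal NumPy sketch of this slicing step may help make it concrete. The latent shape, patch size, embedding width, and the names patchify and W_embed are all illustrative assumptions, not values taken from any particular model.

```python
import numpy as np

def patchify(latent, pt, ph, pw):
    """Split a latent video of shape (T, H, W, C) into flattened spatiotemporal patches.

    Returns an array of shape (num_patches, pt * ph * pw * C).
    """
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "grid must divide evenly"
    # Expose the patch grid and the within-patch offsets as separate axes.
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Put the patch-grid axes first, then the within-patch axes.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * C)

rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 16, 16, 4))   # hypothetical latent: 8 frames, 16x16, 4 channels
patches = patchify(latent, pt=2, ph=2, pw=2)   # shape (256, 32)

embed_dim = 64                                  # hypothetical embedding width
W_embed = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
tokens = patches @ W_embed                      # one row per token, shape (256, 64)
```

In a trained model the projection weights are learned; here they are random, since the point is only the reshaping from a grid to a token sequence.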

A 10-second video at 24 frames per second, after VAE compression at 16x spatial and 4x temporal, might produce a latent grid that patchifies into roughly 10,000 to 50,000 tokens. Each token represents a small chunk of video in both space and time.
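The arithmetic behind that range can be sketched in a few lines. The resolutions and patch sizes below are assumptions chosen for illustration, and the function uses plain ceiling division, which simplifies how real causal VAEs treat the first frame.

```python
import math

def token_count(width, height, seconds, fps, spatial_ds, temporal_ds, patch=(1, 2, 2)):
    """Estimate the DiT sequence length for a clip (simplified: ceiling division only)."""
    pt, ph, pw = patch
    frames = seconds * fps
    lat_t = math.ceil(frames / temporal_ds)   # latent frames after temporal compression
    lat_h = math.ceil(height / spatial_ds)    # latent rows after spatial compression
    lat_w = math.ceil(width / spatial_ds)     # latent columns after spatial compression
    return math.ceil(lat_t / pt) * math.ceil(lat_h / ph) * math.ceil(lat_w / pw)

# Hypothetical settings: 10 s at 24 fps, 16x spatial and 4x temporal compression.
print(token_count(832, 480, 10, 24, 16, 4))                    # 23400
print(token_count(1280, 720, 10, 24, 16, 4, patch=(2, 2, 2)))  # 27600
```

Both results land inside the 10,000 to 50,000 range, and small changes to the patch size move the count substantially.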

Position embeddings are added to each token so the transformer knows where it came from in the original grid. Some architectures use learned positional embeddings; others use rotary position embeddings (RoPE) extended to 3D coordinates, which generalize better to resolutions and durations not seen during training.
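As a rough sketch of the 3D RoPE idea, one option is to split each query or key vector into three chunks and rotate each chunk by angles derived from one coordinate (time, height, width). The equal three-way split and the names rope_1d and rope_3d are assumptions for illustration; real models choose their own splits and frequency bases.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate feature pairs of x (N, D) by angles proportional to pos (N,). D must be even."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # (D/2,) one frequency per feature pair
    angles = pos[:, None] * freqs[None, :]      # (N, D/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, coords):
    """Apply rope_1d per axis; the feature dim is split into three equal chunks for (t, h, w)."""
    d = x.shape[-1] // 3
    parts = [rope_1d(x[:, i * d:(i + 1) * d], coords[:, i]) for i in range(3)]
    return np.concatenate(parts, axis=-1)

# Hypothetical 2 x 3 x 4 token grid with 12-dimensional query vectors.
T, H, W = 2, 3, 4
t, h, w = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
coords = np.stack([t.ravel(), h.ravel(), w.ravel()], axis=1).astype(float)  # (24, 3)
rng = np.random.default_rng(0)
q = rng.standard_normal((T * H * W, 12))
q_rot = rope_3d(q, coords)
```

Because each chunk is only rotated, the dot product between a rotated query and a rotated key depends on the difference between their coordinates rather than on absolute position, which is one reason this scheme tends to extrapolate to unseen grid sizes.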

Self-Attention: Processing All Tokens Together

Once patchified, the token sequence passes through a stack of transformer blocks. Each block applies multi-head self-attention followed by a feedforward network.

Self-attention is what gives DiT its power over U-Net. Every token can attend to every other token in a single operation. A patch from the first frame can directly influence a patch from the last frame. A foreground object can attend to its background. A character's face in the upper-left can attend to their hand in the lower-right, four seconds later.

This is fundamentally different from convolutions, where information must flow through many intermediate layers to travel long distances. In a transformer, the information flow is direct — every patch "sees" every other patch at every layer.

The computational cost is quadratic in the number of tokens, which is why practical implementations use various efficiency techniques. Full 3D attention (where every token attends to every other token across space and time) produces the highest quality but is expensive. We will return to the attention design choices in the section on full 3D versus factored attention below.
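To see where the quadratic cost comes from, here is a single-head self-attention sketch in NumPy. The sizes and weight matrices are hypothetical, and production models add multiple heads, residual connections, and fused kernels on top of this.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention over a (N, D) token sequence."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, N): one score per pair of tokens
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
N, D = 64, 32                                 # hypothetical sequence length and width
tokens = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
out, weights = self_attention(tokens, Wq, Wk, Wv)
```

The (N, N) score matrix is the quadratic term: every token is compared with every other token, regardless of how far apart they sit in space or time.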

Conditioning: Injecting Text and Timestep Information

The transformer needs to know two things beyond the noisy video tokens: what the text prompt says and what timestep of the denoising process it is on.

DiT handles conditioning primarily through adaptive layer normalization (AdaLN). Instead of using fixed normalization parameters, the scale and shift values in each layer norm are computed as functions of the timestep embedding and, in many architectures, the text embedding. This is computationally elegant — it does not add extra attention layers for conditioning, yet it modulates every layer's behavior based on the guidance signal.
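A simplified sketch of adaptive layer normalization follows. It assumes a single conditioning vector (for example, a timestep embedding summed with a pooled text embedding) and hypothetical projection weights W and b. Published DiT variants often add a learned gate as well, which is omitted here.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension, with no learned parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln(x, cond, W, b):
    """x: (N, D) tokens; cond: (Dc,) conditioning vector; W: (Dc, 2D); b: (2D,)."""
    scale, shift = np.split(cond @ W + b, 2)   # per-feature scale and shift predicted from cond
    return layer_norm(x) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
N, D, Dc = 16, 8, 4                            # hypothetical sizes
x = rng.standard_normal((N, D))
cond = rng.standard_normal(Dc)
W = rng.standard_normal((Dc, 2 * D)) * 0.1
b = np.zeros(2 * D)
y = adaln(x, cond, W, b)
```

With W and b set to zero, the layer reduces to a plain layer norm, so the conditioning signal acts purely as a learned perturbation of the normalization.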

For more detailed text conditioning, most video DiT models add cross-attention layers where video tokens attend to text tokens from a language encoder (typically T5-XXL or a CLIP model). This allows fine-grained spatial alignment between text descriptions and visual regions.
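Cross-attention differs from self-attention only in where the keys and values come from. The sketch below uses hypothetical sizes and random weights; the text token width is deliberately different from the video token width to show that the two streams need not match.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(video_tokens, text_tokens, Wq, Wk, Wv):
    """Queries come from video tokens; keys and values come from text tokens."""
    q = video_tokens @ Wq                                # (N_video, d)
    k = text_tokens @ Wk                                 # (N_text, d)
    v = text_tokens @ Wv                                 # (N_text, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N_video, N_text)
    return weights @ v, weights

rng = np.random.default_rng(0)
video = rng.standard_normal((256, 64))   # hypothetical video tokens
text = rng.standard_normal((12, 48))     # hypothetical text encoder outputs
Wq = rng.standard_normal((64, 32)) * 0.1
Wk = rng.standard_normal((48, 32)) * 0.1
Wv = rng.standard_normal((48, 32)) * 0.1
out, weights = cross_attention(video, text, Wq, Wk, Wv)
```

Each row of weights describes how strongly one video patch draws on each word of the prompt, which is the mechanism behind region-level text alignment.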

The combination of AdaLN for global conditioning and cross-attention for detailed text alignment gives DiT strong prompt adherence — one of the most noticeable quality improvements over U-Net-based predecessors.

Depatchification: Reassembling the Video

After the transformer stack processes all tokens through the specified number of layers, the output tokens are reshaped back into the spatiotemporal latent grid. This is essentially the reverse of patchification: each token is projected back to its patch dimensions, and the patches are stitched back into a contiguous latent volume.
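The reverse reshaping can be sketched as follows. The grid size, patch size, channel count, and the output projection W_out are illustrative assumptions.

```python
import numpy as np

def unpatchify(tokens, W_out, grid, patch, channels):
    """Project tokens back to patch values and stitch them into a (T, H, W, C) latent."""
    nt, nh, nw = grid       # number of patches along time, height, width
    pt, ph, pw = patch      # patch size along time, height, width
    patches = tokens @ W_out                                # (N, pt * ph * pw * C)
    x = patches.reshape(nt, nh, nw, pt, ph, pw, channels)
    # Interleave each grid axis with its within-patch axis.
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(nt * pt, nh * ph, nw * pw, channels)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4 * 8 * 8, 64))               # hypothetical transformer output
W_out = rng.standard_normal((64, 2 * 2 * 2 * 4)) * 0.02     # hypothetical output projection
latent = unpatchify(tokens, W_out, grid=(4, 8, 8), patch=(2, 2, 2), channels=4)  # (8, 16, 16, 4)
```

The transpose is the exact inverse of the axis ordering used when patchifying, so each token lands back in the cube of the latent volume it originally came from.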

The VAE decoder then decompresses this latent volume into full-resolution video frames.

Attention Design: Full 3D vs. Factored Approaches

Not all DiT implementations handle attention the same way, and the design choice has significant implications for quality and speed.

Full 3D attention treats all spatiotemporal tokens equally. Every token attends to every other token regardless of whether they share a frame, a spatial location, or neither. Models like HunyuanVideo, CogVideoX, and LTX-Video use this approach. The advantage is maximum expressiveness: the model can learn arbitrary relationships between any two points in the video. The cost is quadratic scaling with token count, which limits practical resolution and duration.

Factored (separated) attention decomposes the problem. Spatial attention is applied within each frame (tokens attend to other tokens at the same timestep), and temporal attention is applied across frames (tokens at the same spatial position attend to each other across time). Open-Sora and some earlier architectures use this pattern. It is dramatically cheaper — each attention operation covers a smaller token set — but it restricts what the model can learn. A motion that involves simultaneous spatial and temporal dependencies (like a rotating object moving through space) must be captured indirectly across multiple layers.
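A back-of-the-envelope comparison shows how large the saving can be. The function counts query-key pairs only, and the frame and token counts are hypothetical.

```python
def attention_pairs(frames, tokens_per_frame):
    """Count query-key pairs for full 3D attention versus factored attention."""
    n = frames * tokens_per_frame
    full = n * n                                   # every token attends to every token
    spatial = frames * tokens_per_frame ** 2       # within-frame attention, once per frame
    temporal = tokens_per_frame * frames ** 2      # across-frame attention, once per position
    return full, spatial + temporal

# Hypothetical clip: 30 latent frames with 920 tokens each.
full, factored = attention_pairs(frames=30, tokens_per_frame=920)
print(full)                          # 761760000
print(factored)                      # 26220000
print(round(full / factored, 1))     # 29.1
```

Under these assumptions factored attention scores roughly 29 times fewer pairs, which is the budget the model trades against its ability to relate arbitrary points in space and time directly.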

Hybrid approaches combine the two. Some architectures alternate between full 3D attention blocks and factored blocks. Others use sliding-window or tiled attention, where full attention is applied within local spatiotemporal neighborhoods and sparser attention connects distant regions. Research from Tencent's HunyuanVideo team demonstrated that selective and sliding tile attention (SSTA) can approach full 3D attention quality while reducing memory requirements by over 60%.
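To illustrate the local-neighborhood idea in general terms, the sketch below builds a boolean mask that allows attention only between tokens within a fixed window along each axis. This is a generic local window, not an implementation of SSTA or of any specific model's scheme, and the function name and window sizes are assumptions.

```python
import numpy as np

def local_window_mask(T, H, W, wt, wh, ww):
    """Boolean (N, N) mask: True where two tokens lie within the window on every axis."""
    t, h, w = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([t.ravel(), h.ravel(), w.ravel()], axis=1)   # (N, 3) grid coordinates
    diff = np.abs(coords[:, None, :] - coords[None, :, :])         # (N, N, 3) pairwise offsets
    return (diff <= np.array([wt, wh, ww])).all(axis=-1)

mask = local_window_mask(T=4, H=4, W=4, wt=1, wh=1, ww=1)   # hypothetical 4x4x4 token grid
kept = mask.mean()                                           # fraction of pairs still attended
```

In an attention layer, scores where the mask is False would be set to negative infinity before the softmax. Practical implementations avoid materializing the full mask and compute only the allowed blocks.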

The trend in 2026 is toward full 3D attention for flagship models (where quality is paramount) and hybrid approaches for models designed to run on consumer hardware.

The VAE Layer: How Video Gets Compressed

DiT does not operate on raw pixels. It operates on latent representations produced by a Variational Autoencoder (VAE) — specifically, a 3D causal VAE designed for video.

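The compression ratios are substantial. HunyuanVideo 1.5 uses a 3D causal VAE that compresses video by 16x spatially and 4x temporally. Wan 2.2 uses a high-compression 3D VAE with similar ratios. In practice, this means a 1080p video at 24fps is reduced to a latent grid with roughly 1/1000th as many spatiotemporal positions as the raw pixel grid (16 x 16 x 4 = 1,024) before the diffusion transformer ever touches it. Each latent position usually carries more channels than the three of an RGB pixel, so the reduction in raw values is smaller than that figure suggests.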
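The compression arithmetic is easy to check. The sketch below separates the reduction in grid positions from the reduction in stored values; the latent channel count of 32 is an assumption for illustration, not a figure from the models named above.

```python
def compression_ratios(spatial_ds, temporal_ds, rgb_channels=3, latent_channels=32):
    """Return (reduction in grid positions, reduction in stored values)."""
    position_ratio = spatial_ds * spatial_ds * temporal_ds
    value_ratio = position_ratio * rgb_channels / latent_channels
    return position_ratio, value_ratio

positions, values = compression_ratios(spatial_ds=16, temporal_ds=4)
print(positions)   # 1024
print(values)      # 96.0
```

The position ratio is what determines the token count seen by the transformer, which is why it is the number usually quoted.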

The "causal" in causal VAE means the encoder processes frames in temporal order, where each frame's encoding depends only on itself and previous frames (not future frames). This design enables autoregressive generation — producing video frame-by-frame rather than all at once — which is important for generating videos longer than the model's native window.
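The causal property usually comes from padding temporal convolutions on the past side only. Here is a one-dimensional sketch under that assumption; the function name and kernel values are hypothetical, and a real 3D causal VAE applies the same idea inside 3D convolutions.

```python
import numpy as np

def causal_temporal_conv(x, kernel):
    """x: (T,) signal over time; kernel: (K,). Output at t depends only on x[t-K+1 .. t]."""
    kernel = np.asarray(kernel, dtype=float)
    K = len(kernel)
    padded = np.concatenate([np.zeros(K - 1), x])   # pad the past side only, never the future
    return np.array([padded[t:t + K] @ kernel for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = causal_temporal_conv(x, [0.25, 0.25, 0.5])

# Changing later frames leaves earlier outputs untouched.
x_future_changed = x.copy()
x_future_changed[4:] = 100.0
y2 = causal_temporal_conv(x_future_changed, [0.25, 0.25, 0.5])
```

Because no output reads from future frames, frames can be encoded as they arrive, and the first frame can be encoded on its own.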

The VAE quality has a direct ceiling effect on output quality. A VAE that loses fine texture detail during compression cannot recover it during decoding, regardless of how sophisticated the DiT is. This is why recent models have invested heavily in VAE architecture. The shift from 2D image VAEs (which encoded each frame independently) to native 3D video VAEs (which exploit temporal redundancy) was a major quality inflection point, producing smoother motion and more consistent textures across frames. For a deeper look at how video data is broken into processable units, see our post on video tokenization.

Scaling Laws: Why Bigger DiT Means Better Video

One of the strongest arguments for DiT over U-Net is predictable scaling. Research published at CVPR 2025 ("Towards Precise Scaling Laws for Video Diffusion Transformers") demonstrated that video DiT models follow clear power-law relationships between model size, training compute, dataset size, and output quality.

This means you can predict, before training, how much quality improvement a given increase in parameters or compute will yield. For organizations investing millions in training runs, this predictability is critical.
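In practice, this kind of prediction comes from fitting a line in log-log space. The sketch below uses synthetic data generated from a made-up power law, so the numbers carry no empirical meaning. Published scaling laws typically use more elaborate forms, such as an added irreducible-loss term.

```python
import numpy as np

# Synthetic, hypothetical measurements: loss = 5.0 * compute^(-0.1).
compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs for small pilot runs
loss = 5.0 * compute ** -0.1

# A power law is a straight line in log-log space, so fit a degree-1 polynomial.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)

def predict_loss(c):
    """Extrapolate the fitted power law to a new compute budget."""
    return np.exp(intercept) * c ** slope

predicted = predict_loss(1e22)                  # forecast for a 10x larger run
```

The workflow is the useful part: train several small models, fit the curve, then decide whether the forecast gain justifies the larger run.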

The numbers illustrate the trend. Early DiT models for video had 1-2 billion parameters. CogVideoX scaled to approximately 5 billion. HunyuanVideo reached 13 billion in its original release and optimized to 8.3 billion in version 1.5. Wan 2.2 uses a mixture-of-experts (MoE) architecture with 27 billion total parameters (14 billion active per forward pass). Research prototypes have explored DiT-MoE architectures scaling to 16.5 billion parameters.

Each scale jump brought measurable improvements: longer temporal coherence, better physics simulation, more accurate character consistency, and stronger prompt adherence. The temporal coherence improvements in particular have been dramatic — where 2024-era models struggled to maintain consistent scenes beyond 3-4 seconds, current DiT models hold coherence for 15-20 seconds.

Mixture-of-experts (MoE) has emerged as the preferred scaling strategy for the largest models. Instead of making every layer larger, MoE keeps many parallel "expert" sub-networks and routes each token to only a few of them. This increases total parameter count (and the knowledge stored in the model) without proportionally increasing the compute needed per token. Wan 2.2 applies a coarser version of the same idea: it has two 14B-parameter experts and activates only one per forward pass, selecting by denoising stage rather than routing individual tokens.
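The token-level routing described above can be sketched as a top-k router. Everything here is an illustrative assumption: each expert is a single weight matrix, the router is a linear layer, and the loop favors readability over speed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, W_router, experts, k=2):
    """tokens: (N, D); W_router: (D, E); experts: (E, D, D). Each token uses its top-k experts."""
    logits = tokens @ W_router                    # (N, E) routing scores
    topk = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k best experts per token
    out = np.zeros_like(tokens)
    for n in range(tokens.shape[0]):
        selected = topk[n]
        gates = softmax(logits[n, selected])      # renormalize over the chosen experts only
        for gate, e in zip(gates, selected):
            out[n] += gate * (tokens[n] @ experts[e])
    return out, topk

rng = np.random.default_rng(0)
N, D, E = 5, 8, 4                                 # hypothetical sizes
tokens = rng.standard_normal((N, D))
W_router = rng.standard_normal((D, E))
experts = rng.standard_normal((E, D, D)) * 0.1
out, topk = moe_layer(tokens, W_router, experts, k=2)
```

With k=2 and four experts, each token touches half of the expert parameters, so parameter count and per-token compute are decoupled.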

What DiT Means for Practical Video Quality

Understanding the architecture helps explain several behaviors that users encounter when working with AI video tools.

Why prompts matter more now. DiT's cross-attention mechanism aligns text and video at every layer. Vague prompts produce generic results not because the model lacks capability, but because there is insufficient signal for the attention mechanism to differentiate regions. Detailed, specific prompts activate distinct attention patterns that guide different spatial regions toward different visual outcomes.

Why resolution and duration trade off. The token count in a DiT grows with both spatial resolution and temporal duration. Higher resolution means more spatial patches; longer duration means more temporal patches. Since self-attention cost scales quadratically with token count, doubling the token count (for example, by doubling the duration or the pixel count per frame) roughly quadruples the attention compute, and doubling both width and height quadruples the token count and raises attention compute about sixteenfold. This is why most tools offer resolution-duration tradeoffs rather than maximizing both simultaneously.
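The tradeoff can be expressed as a small calculation. This sketch considers only the quadratic attention term and ignores the parts of the network that scale linearly with token count; the function name is hypothetical.

```python
def attention_cost_multiplier(width_scale=1.0, height_scale=1.0, duration_scale=1.0):
    """How much the quadratic attention term grows when the clip dimensions are scaled."""
    token_multiplier = width_scale * height_scale * duration_scale
    return token_multiplier ** 2

print(attention_cost_multiplier(duration_scale=2))                # 4.0
print(attention_cost_multiplier(width_scale=2, height_scale=2))   # 16.0
```

Doubling the duration and doubling both spatial dimensions are therefore very different requests from the model's point of view.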

Why some models are faster than others. The choice between full 3D attention and factored attention directly impacts generation speed. A model using full 3D attention on 50,000 tokens will be significantly slower than one using factored attention on the same input. Smaller models like Wan's 1.3B variant can run on consumer GPUs with 8GB of VRAM specifically because they use fewer parameters and more efficient attention patterns — but they sacrifice some quality to get there.

Why consistency improved suddenly. The jump in character and scene consistency between 2024 and 2026 models is largely attributable to DiT's ability to maintain attention across long token sequences. When every frame token can directly attend to every other frame token, the model has a structural mechanism for enforcing consistency. U-Net had to propagate consistency through many intermediate convolutional layers, losing signal along the way.

Looking Ahead

The DiT architecture has settled into a dominant position, but it is still evolving. Active research directions include dynamic token pruning (dropping uninformative tokens mid-generation to save compute), progressive generation (starting at low resolution and progressively adding detail), and architectural distillation (training smaller DiT models to mimic larger ones).

The most impactful near-term development may be improved VAE architectures. Current 3D causal VAEs achieve impressive compression ratios, but they still introduce artifacts at extreme compression levels. Research into higher-fidelity video autoencoders could raise the quality ceiling for all DiT-based models simultaneously — a single improvement that propagates across the entire ecosystem.

For teams building video workflows today, the practical takeaway is that the architecture race is largely settled. The differentiation between tools increasingly comes from training data quality, VAE design, inference optimization, and the workflow layer on top — areas where tools like Lychee focus their engineering effort on making the underlying technology accessible without requiring users to understand patchification or attention mechanisms.
