Z-Image (造相): a fast, 6B single-stream diffusion transformer you can run locally (with a full code guide)

Z-Image is Tongyi-MAI’s efficient 6B image generation family built on a Scalable Single-Stream DiT (S3-DiT). This post explains what it is, what models exist (Turbo, Base, Edit), why it is fast, and how to generate images using the official repo and Diffusers, plus low VRAM options like stable-diffusion.cpp.


What is Z-Image?

Z-Image is the official open-source image generation project from Tongyi-MAI. The headline is simple:
a 6B-parameter diffusion transformer family designed to be powerful but efficient, with a strong focus on
photorealism, bilingual text rendering (Chinese and English), and fast inference.

Most people start with the flagship checkpoint, Z-Image-Turbo, a distilled variant built to generate high-quality images in
only 8 NFEs (Number of Function Evaluations). It is explicitly positioned as a fast, practical model that fits in
16GB of VRAM on consumer GPUs and reaches sub-second latency on high-end data-center GPUs.

This post is written so you can go from “What is Z-Image?” to “I can generate images locally” in one sitting, with code you can run.

Official repo: https://github.com/Tongyi-MAI/Z-Image
Hugging Face model: https://huggingface.co/Tongyi-MAI/Z-Image-Turbo


Model lineup: Turbo vs Base vs Edit (and Omni-Base)

According to the project README, the Z-Image family consists of these variants, some of which are not yet released:

  • Z-Image-Turbo: distilled + RL post-training, 8 NFEs, built for fast generation and strong photorealism and bilingual text rendering.
  • Z-Image-Base: the non-distilled foundation model, intended to unlock community fine-tuning and custom development (listed as “to be released”).
  • Z-Image-Edit: an editing-focused variant fine-tuned for image-to-image editing with strong instruction following (listed as “to be released”).
  • Z-Image-Omni-Base: described as a generation + editing base (listed as “to be released”).

If you want something you can download and run today with the official pipeline, start with Z-Image-Turbo.


Showcase: realism, text rendering, reasoning, editing

One thing the README does well is show what the model is meant to be good at. Below are the same showcase images embedded directly from the repo.

Figure: Photorealistic quality: examples from Z-Image-Turbo.
Figure: Accurate bilingual text rendering: complex Chinese and English text examples.
Figure: Prompt enhancing and reasoning: the README highlights a Prompt Enhancer that helps the model go beyond surface descriptions.
Figure: Creative image editing: examples shown for Z-Image-Edit (weights listed as to be released in the model zoo).

Architecture: Scalable Single-Stream DiT (S3-DiT)

Z-Image adopts a Scalable Single-Stream DiT (S3-DiT) architecture. Instead of separating modalities into different streams,
it concatenates text tokens, visual semantic tokens, and image VAE tokens at the sequence level into a
single unified input stream. The README’s key claim is that this setup improves parameter efficiency compared to dual-stream approaches.

Figure: Single-stream S3-DiT overview from the official README.
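
To make the single-stream idea concrete, here is a toy sketch of how three token groups can end up in one sequence. It is not the model's code: the concatenation order, the placeholder token counts, the 8x VAE downsampling factor, and the 2x2 patch size are all illustrative assumptions, not values taken from the README.

```python
def image_token_count(height, width, vae_factor=8, patch_size=2):
    """Number of image tokens after VAE encoding and patchification.

    vae_factor and patch_size are assumed values for illustration only.
    """
    stride = vae_factor * patch_size
    if height % stride or width % stride:
        raise ValueError(f"height and width must be multiples of {stride}")
    return (height // stride) * (width // stride)


def build_single_stream(text_tokens, semantic_tokens, image_tokens):
    """Concatenate all modalities along the sequence axis into one stream.

    Returns the unified sequence plus the (start, end) span of each modality,
    so a caller could later slice the image tokens back out.
    """
    stream = []
    spans = {}
    groups = (("text", text_tokens), ("semantic", semantic_tokens), ("image", image_tokens))
    for name, tokens in groups:
        start = len(stream)
        stream.extend(tokens)
        spans[name] = (start, len(stream))
    return stream, spans


n_img = image_token_count(1024, 1024)  # 64 * 64 = 4096 under the assumed factors
stream, spans = build_single_stream(
    [f"t{i}" for i in range(32)],     # 32 placeholder text tokens
    [f"s{i}" for i in range(16)],     # 16 placeholder visual semantic tokens
    [f"v{i}" for i in range(n_img)],  # placeholder image VAE tokens
)
print(len(stream), spans)
```

The point of the sketch is that one transformer stack sees every token in one sequence, whereas a dual-stream design keeps separate weights per modality. That shared stack is what the README's parameter-efficiency claim refers to.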

Performance notes and leaderboards

The README notes that Z-Image-Turbo has been validated on multiple independent benchmarks and highlights two public leaderboards:

  • Artificial Analysis Text-to-Image Leaderboard (the README states Z-Image-Turbo ranked 8th overall and was the top open-source model at the time of posting).
  • Alibaba AI Arena Elo-based human preference evaluation (the README states competitive performance vs proprietary models).

You can check the current Text-to-Image leaderboard here:
Artificial Analysis: Text-to-Image


Quick start: official repo (PyTorch native)

The official repo includes a “PyTorch native inference” path:

# 1) Clone the repo
git clone https://github.com/Tongyi-MAI/Z-Image.git
cd Z-Image

# 2) Create and activate an environment (example: venv)
python -m venv .venv

# Linux or macOS
source .venv/bin/activate

# Windows PowerShell
# .venv\Scripts\Activate.ps1

# 3) Install in editable mode (installs repo dependencies)
pip install -U pip
pip install -e .

# 4) Run the example inference script
python inference.py

That is the simplest “use the official code” route.
If you prefer a more standard pipeline API and fast iteration, Diffusers is usually the smoothest experience.


Diffusers guide (recommended)

The README recommends installing Diffusers from source: Z-Image support was added through recent PRs, so a source install ensures you have the pipeline and its latest features.

pip install git+https://github.com/huggingface/diffusers

Then you can run Z-Image-Turbo like this (this follows the README example):

import torch
from diffusers import ZImagePipeline

# 1) Load the pipeline
# Use bfloat16 for optimal performance on supported GPUs
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("cuda")

# [Optional] Attention Backend
# Diffusers uses SDPA by default. Switch to Flash Attention for better efficiency if supported:
# pipe.transformer.set_attention_backend("flash")    # Flash-Attention-2
# pipe.transformer.set_attention_backend("_flash_3") # Flash-Attention-3

# [Optional] Model Compilation
# pipe.transformer.compile()

# [Optional] CPU Offloading (for memory-constrained devices)
# pipe.enable_model_cpu_offload()

prompt = (
    "Young Chinese woman in red Hanfu, intricate embroidery. Impeccable makeup, "
    "red floral forehead pattern. Elaborate high bun, golden phoenix headdress, red flowers, beads. "
    "Holds round folding fan with lady, trees, bird. Neon lightning-bolt lamp (⚡️), bright yellow glow, "
    "above extended left palm. Soft-lit outdoor night background, silhouetted tiered pagoda (西安大雁塔), "
    "blurred colorful distant lights."
)

# 2) Generate
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # This results in 8 DiT forwards for Turbo
    guidance_scale=0.0,     # Guidance should be 0 for the Turbo models
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

image.save("example.png")

If you want the official Diffusers documentation page for the Z-Image pipeline, see:
Diffusers: Z-Image pipeline docs


Speed and memory tips (practical checklist)

Z-Image-Turbo is already designed for few-step generation. The biggest wins you can stack on top are:

  • bfloat16 on GPUs that support it (the README uses it in the example).
  • Flash Attention (set attention backend to flash or _flash_3 if supported).
  • torch.compile for the transformer (faster after warm-up; first run takes longer).
  • CPU offload if you are memory constrained (slower, but can make the difference between OOM and working).
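
One way to stack these options is a small helper that applies each knob only when asked and falls back cleanly. This is a sketch, not part of the official repo: apply_speed_options is a hypothetical name, it only wraps the calls shown in the README example, and it assumes set_attention_backend raises an exception when the requested backend is unavailable.

```python
def apply_speed_options(pipe, attention_backend=None, compile_transformer=False, cpu_offload=False):
    """Apply the optional speed/memory knobs from the README example to a loaded pipeline.

    Returns the list of options that were actually applied.
    """
    applied = []
    if attention_backend is not None:
        try:
            pipe.transformer.set_attention_backend(attention_backend)
            applied.append(f"attention:{attention_backend}")
        except Exception as exc:
            # Assumption: an unavailable backend raises; keep the default SDPA path.
            print(f"Keeping default attention backend ({exc})")
    if compile_transformer:
        # First generation after this will be slower while compilation warms up.
        pipe.transformer.compile()
        applied.append("compile")
    if cpu_offload:
        # Trades speed for lower VRAM use.
        pipe.enable_model_cpu_offload()
        applied.append("cpu_offload")
    return applied


# Usage with a pipeline loaded as in the Diffusers guide above:
# print(apply_speed_options(pipe, attention_backend="flash", compile_transformer=True))
```

Enable one option at a time when debugging, so that a failure can be traced to a single knob.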

Common OOM fixes:

  • Lower resolution (try 768×768 or 512×512 while testing).
  • Disable compilation first (get it working, then optimize).
  • Use CPU offload as a fallback.
  • Run one image at a time (batch size of 1).
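
The "lower the resolution" fix can be automated with a retry loop. The sketch below is generic on purpose: generate_with_fallback is a hypothetical helper, generate is any callable you supply, and the default exception type is Python's built-in MemoryError. With PyTorch on CUDA you would pass torch.cuda.OutOfMemoryError instead.

```python
def generate_with_fallback(generate, sizes=(1024, 768, 512), oom_errors=(MemoryError,)):
    """Try each square resolution in turn, moving to a smaller one on out-of-memory.

    generate: callable taking (height, width) and returning an image.
    oom_errors: tuple of exception types treated as out-of-memory.
    Returns (image, size_used).
    """
    last_error = None
    for size in sizes:
        try:
            return generate(size, size), size
        except oom_errors as exc:
            last_error = exc
            print(f"Out of memory at {size}x{size}, trying a smaller size")
    raise RuntimeError("All resolutions ran out of memory") from last_error


# Usage with the Diffusers pipeline (assumes `pipe` and `prompt` already exist):
# import torch
# image, used = generate_with_fallback(
#     lambda h, w: pipe(prompt=prompt, height=h, width=w,
#                       num_inference_steps=9, guidance_scale=0.0).images[0],
#     oom_errors=(torch.cuda.OutOfMemoryError,),
# )
```

In a real CUDA setup it may also help to call torch.cuda.empty_cache() between attempts. That is left out here to keep the sketch free of dependencies.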

Batch generation script (copy-paste runnable)

Below is a simple batch generator you can save as zimage_batch.py. It uses Diffusers, generates a small gallery, and writes images to an output folder.

import os
import time
import torch
from diffusers import ZImagePipeline

PROMPTS = [
    "A photorealistic portrait photo of a chef plating ramen, studio lighting, shallow depth of field, 85mm lens.",
    "A cozy reading nook with rain on the window, soft lamp light, ultra-detailed, cinematic.",
    "A product photo of a smartwatch on black stone, dramatic rim light, high contrast, sharp details.",
    "A futuristic street market at night with bilingual neon signs in English and Chinese, reflections on wet pavement.",
]

def main():
    out_dir = "zimage_outputs"
    os.makedirs(out_dir, exist_ok=True)

    pipe = ZImagePipeline.from_pretrained(
        "Tongyi-MAI/Z-Image-Turbo",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=False,
    ).to("cuda")

    # Optional: speed knobs
    # pipe.transformer.set_attention_backend("_flash_3")
    # pipe.transformer.compile()

    # Turbo guidance should be 0
    guidance_scale = 0.0

    for i, prompt in enumerate(PROMPTS, start=1):
        seed = 1000 + i
        g = torch.Generator("cuda").manual_seed(seed)

        # perf_counter is a monotonic clock, better suited to timing than time.time()
        start = time.perf_counter()
        img = pipe(
            prompt=prompt,
            height=1024,
            width=1024,
            num_inference_steps=9,
            guidance_scale=guidance_scale,
            generator=g,
        ).images[0]
        elapsed = time.perf_counter() - start

        path = os.path.join(out_dir, f"zimage_{i:02d}_seed{seed}.png")
        img.save(path)
        print(f"[{i}/{len(PROMPTS)}] saved {path} in {elapsed:.2f}s")

if __name__ == "__main__":
    main()

Tip: if you are building an app, keep the pipeline loaded and reuse it across requests. Most “slow” complaints come from reloading weights every time.
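
A minimal sketch of that reuse pattern is a holder that loads the pipeline on first use and then returns the same object to every caller. PipelineHolder and load_zimage are hypothetical names for illustration. The loader is passed in as a callable, so the holder itself does not depend on Diffusers.

```python
import threading


class PipelineHolder:
    """Load a pipeline once on first use and hand the same object to every caller."""

    def __init__(self, loader):
        self._loader = loader          # zero-argument callable that builds the pipeline
        self._pipe = None
        self._lock = threading.Lock()

    def get(self):
        # Double-checked locking so concurrent first requests trigger only one load.
        if self._pipe is None:
            with self._lock:
                if self._pipe is None:
                    self._pipe = self._loader()
        return self._pipe


# Usage in an app (same loading code as the batch script above):
# def load_zimage():
#     import torch
#     from diffusers import ZImagePipeline
#     return ZImagePipeline.from_pretrained(
#         "Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16, low_cpu_mem_usage=False
#     ).to("cuda")
#
# holder = PipelineHolder(load_zimage)
# image = holder.get()(prompt="...", num_inference_steps=9, guidance_scale=0.0).images[0]
```

The lock only guards loading. Whether concurrent calls into the same pipeline are safe is a separate question, so a simple app may still want to serialize generation requests.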


Run Z-Image on 4GB VRAM with stable-diffusion.cpp

The Z-Image README highlights that stable-diffusion.cpp supports fast, memory-efficient Z-Image inference across backends (CUDA, Vulkan, etc.),
and mentions running on as little as 4GB of VRAM.

One example workflow is documented on the stable-diffusion.cpp wiki page:
How to Use Z-Image on a GPU with Only 4GB VRAM

The wiki provides an example command similar to the following (shown here as a reference format; adjust filenames to what you downloaded):

.\bin\Release\sd-cli.exe ^
  --diffusion-model z_image_turbo-Q3_K.gguf ^
  --vae ae.safetensors ^
  --llm Qwen3-4B-Instruct-2507-Q4_K_M.gguf ^
  -p "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night..." ^
  --cfg-scale 1.0 -v --offload-to-cpu --diffusion-fa -H 1024 -W 512

Low VRAM flags called out in the wiki:

  • --offload-to-cpu: load weights into VRAM only during compute to reduce VRAM usage.
  • --diffusion-fa: Flash Attention for speed and memory efficiency.

If you are working with big resolutions, the wiki also mentions optional flags like --vae-tiling and --vae-conv-direct to reduce decoding memory.
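
If you drive sd-cli from a script, building the argument list in code avoids shell-quoting problems with long prompts. The sketch below uses only the flags quoted above. build_sd_cli_args is a hypothetical helper, and the executable name, file names, and sample prompt are placeholders to adjust to your own setup.

```python
def build_sd_cli_args(diffusion_model, vae, llm, prompt, height=1024, width=512,
                      cfg_scale=1.0, offload_to_cpu=True, flash_attention=True,
                      vae_tiling=False, vae_conv_direct=False, exe="sd-cli"):
    """Build an argument list for stable-diffusion.cpp's CLI from the flags in the wiki example."""
    args = [
        exe,
        "--diffusion-model", diffusion_model,
        "--vae", vae,
        "--llm", llm,
        "-p", prompt,
        "--cfg-scale", str(cfg_scale),
        "-H", str(height),
        "-W", str(width),
    ]
    # Low VRAM switches; each is a plain on/off flag.
    if offload_to_cpu:
        args.append("--offload-to-cpu")
    if flash_attention:
        args.append("--diffusion-fa")
    if vae_tiling:
        args.append("--vae-tiling")
    if vae_conv_direct:
        args.append("--vae-conv-direct")
    return args


args = build_sd_cli_args(
    "z_image_turbo-Q3_K.gguf",
    "ae.safetensors",
    "Qwen3-4B-Instruct-2507-Q4_K_M.gguf",
    "A rain-slicked city street at night",  # placeholder prompt
    vae_tiling=True,
)
print(" ".join(args))
```

To launch it, pass the list to subprocess.run(args, check=True) from the standard library once the binary and model files are in place. Passing a list rather than one string means the prompt needs no extra quoting.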


Community ecosystem

The README lists several community projects that are already extending Z-Image in useful directions:

  • Cache-DiT: inference acceleration for Z-Image (and Z-Image-ControlNet) via DBCache, context parallelism, and tensor parallelism.
  • stable-diffusion.cpp: C++ inference engine supporting Z-Image across multiple backends.
  • LeMiCa: training-free timestep-level acceleration method for faster inference.
  • ComfyUI ZImageLatent: a helper node to use the official Z-Image resolutions more easily in ComfyUI workflows.
  • DiffSynth-Studio: broader support including LoRA training, full training, distillation training, and low VRAM inference.
  • vllm-omni: serving framework adding Z-Image support.
  • SGLang-Diffusion: performance-oriented acceleration for diffusion generation, supporting Z-Image.

If you are optimizing throughput or deploying at scale, those projects are worth scanning before you reinvent tooling.


Citation and license

The Z-Image repo is published under the Apache-2.0 license, which permits commercial use.

If you are writing research or want to cite the work, the README provides BibTeX entries, including:

@article{team2025zimage,
  title={Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer},
  author={Z-Image Team},
  journal={arXiv preprint arXiv:2511.22699},
  year={2025}
}

@article{liu2025decoupled,
  title={Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield},
  author={Dongyang Liu and Peng Gao and David Liu and Ruoyi Du and Zhen Li and Qilong Wu and Xin Jin and Sihan Cao and Shifeng Zhang and Hongsheng Li and Steven Hoi},
  journal={arXiv preprint arXiv:2511.22677},
  year={2025}
}

@article{jiang2025distribution,
  title={Distribution Matching Distillation Meets Reinforcement Learning},
  author={Jiang, Dengyang and Liu, Dongyang and Wang, Zanyi and Wu, Qilong and Jin, Xin and Liu, David and Li, Zhen and Wang, Mengmeng and Gao, Peng and Yang, Harry},
  journal={arXiv preprint arXiv:2511.13649},
  year={2025}
}

FAQ

Why does Turbo use guidance_scale = 0?

The official README explicitly calls out that guidance should be 0 for the Turbo models.
Turbo is trained for few-step generation under that setup, so pushing CFG can degrade results or behave unexpectedly.

Why does num_inference_steps = 9 become 8 steps?

The README notes that num_inference_steps=9 “actually results in 8 DiT forwards.”
Treat this as an implementation detail of the scheduler/pipeline configuration used for Turbo.

Do I need Diffusers from source?

The README says Z-Image support landed via Diffusers PRs and recommends installing Diffusers from source to ensure you have the latest Z-Image pipeline support.

Can I run Z-Image locally on consumer GPUs?

Yes. The README describes Z-Image-Turbo as fitting comfortably within 16GB of VRAM on consumer devices, and community paths
(such as quantized weights for stable-diffusion.cpp) target very low VRAM setups.


Bonus: the “acceleration magic” behind Z-Image (Decoupled-DMD and DMDR)

The README attributes Turbo’s few-step performance to a distillation pipeline centered on Decoupled-DMD, and then introduces DMDR,
which combines Distribution Matching Distillation with Reinforcement Learning during post-training.

Figure: Decoupled-DMD diagram from the official repository.
Figure: DMDR diagram from the official repository.

Star history chart

The README includes a star history chart showing GitHub stars over time for Tongyi-MAI/Z-Image.


Author update

I will add side-by-side quality comparisons and prompt tips. If you want a generator tested, tell me which one.
