Qwen-Image-Layered

Model Introduction

This model is trained on top of Qwen/Qwen-Image-Layered with the artplus/PrismLayersPro dataset, enabling text-controlled extraction of individual image layers.

For more details about training strategies and implementation, feel free to check our technical blog.

Usage Tips

  • The model architecture has been modified from multi-image output to single-image output, producing only the layer relevant to the textual description.
  • The model was trained exclusively on English text but inherits Chinese language understanding capabilities from the base model.
  • The native training resolution is 1024x1024; however, inference at other resolutions is supported.
  • The model struggles to separate multiple overlapping entities (e.g., the cartoon skeleton and hat in the examples).
  • The model excels at decomposing poster-like images but performs poorly on photographic images, especially those involving complex lighting and shadows.
  • Negative prompts are supported; use them to specify content you want excluded from the output (see the sketch after this list).
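
A minimal sketch of excluding content with a negative prompt. It assumes the pipeline accepts a negative_prompt argument, as other DiffSynth-Studio pipelines do; pipe and input_image are set up as in the Inference Code section below.

# Extract the hat-and-head layer while steering the output away from the gift box.
# negative_prompt is an assumption here; verify the argument name against your
# installed DiffSynth-Studio version.
images = pipe(
    "A purple hat and a head",
    negative_prompt="gift box, lettering",
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=1024, width=1024,
    layer_input_image=input_image,
    layer_num=0,
)
images[0].save("hat_layer.png")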

Demo Examples

Some demo images contain white text on light backgrounds. Users of the ModelScope community site can click the "☀︎" icon at the top-right corner to switch to dark mode for better visibility.

Example 1

Input Image (likely the trick_or_treat.png poster downloaded in the inference code below; all images are rendered on the original model page)

Prompts, each extracting the described layer as its output image:
  • A solid, uniform color with no distinguishable features or objects
  • Text 'TRICK'
  • Cloud
  • Text 'TRICK OR TREAT'
  • A cartoon skeleton character wearing a purple hat and holding a gift box
  • Text 'TRICK OR'
  • A purple hat and a head
  • A gift box

Example 2

Input Image (rendered on the original model page)

Prompts, each extracting the described layer as its output image:
  • 蓝天,白云,一片花园,花园里有五颜六色的花 (blue sky, white clouds, and a garden full of colorful flowers)
  • 五彩的精致花环 (a delicate, colorful flower garland)
  • 少女、花环、小猫 (a girl, a garland, and a kitten)
  • 少女、小猫 (a girl and a kitten)

Example 3

Input Image (rendered on the original model page)

Prompts, each extracting the described layer as its output image:
  • 一片湛蓝的天空和波涛汹涌的大海 (an azure sky and a surging sea)
  • 文字“向往的生活” (the text "向往的生活", "the life we long for")
  • 一只海鸥 (a seagull)
  • 文字“生活” (the text "生活", "life")

Inference Code

Install DiffSynth-Studio:

git clone https://github.com/modelscope/DiffSynth-Studio.git  
cd DiffSynth-Studio
pip install -e .

Model Inference:

from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from PIL import Image
import torch, requests

# Assemble the pipeline: the layered-control DiT weights, the Qwen-Image text
# encoder, the layered VAE, and the Qwen-Image-Edit processor.
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
)
prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box"

# Download the demo input and convert it to RGBA at the native 1024x1024 training resolution.
input_image = requests.get("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/images/trick_or_treat.png", stream=True).raw
input_image = Image.open(input_image).convert("RGBA").resize((1024, 1024))
input_image.save("image_input.png")

# Extract the single layer described by the prompt from the input image.
images = pipe(
    prompt,
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=1024, width=1024,
    layer_input_image=input_image,
    layer_num=0,  # this variant outputs a single prompt-matched layer
)
images[0].save("image.png")
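
To decompose an image into several layers, call the pipeline once per layer description. A short sketch reusing pipe and input_image from above, with prompts taken from Example 1:

# One call per layer; each prompt extracts the matching layer from the same input.
layer_prompts = [
    "A solid, uniform color with no distinguishable features or objects",  # background
    "A cartoon skeleton character wearing a purple hat and holding a gift box",
    "Text 'TRICK OR TREAT'",
]
for i, layer_prompt in enumerate(layer_prompts):
    images = pipe(
        layer_prompt,
        seed=0,
        num_inference_steps=30, cfg_scale=4,
        height=1024, width=1024,
        layer_input_image=input_image,
        layer_num=0,
    )
    images[0].save(f"layer_{i}.png")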