Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

SynLayers: a data-centric study of fully synthetic supervision for editable graphic-design layer decomposition.

Comparison between SynLayers and non-synthetic or partially synthetic layered design data.
SynLayers explores whether purely synthetic layered graphic-design data can train effective decomposition models, compared with non-synthetic and partially synthetic alternatives.

Abstract

Recent image generation models produce high-quality visuals, but their outputs are flattened images that entangle foreground objects, background, and text. This makes flexible post-generation editing difficult. Existing layered-design approaches rely on scarce proprietary assets or partially synthetic data with limited structural priors, leaving scalability as a central bottleneck.

We investigate whether purely synthetic layered data can improve graphic design decomposition. Building on the CLD baseline, we construct SynLayers, generate textual supervision with vision-language models, and automate inference inputs with VLM-predicted bounding boxes. Our study shows that purely synthetic data can outperform PrismLayersPro at the same 18K scale, that gains stabilize around medium data scales, and that synthetic data gives better control over layer-count distributions.

Key Findings

Synthetic data works

At the matched 18K setting, SynLayers improves layer PSNR from 26.22 to 27.23 and composite PSNR from 30.52 to 31.35 over the PrismLayersPro-trained CLD baseline.

Medium scale is enough

Performance does not increase monotonically with more data. Layer FID is best at 20K, while composite FID is best at 30K, with gains saturating around moderate scales.

Layer-count balance helps

Because layer counts are controllable during synthesis, SynLayers improves robustness across different design complexities, especially high-layer-count cases.

SynLayers Construction

SynLayers recombines multi-source assets, including base designs, RGBA/RGB foreground objects, rendered text, and backgrounds. A low-overlap placement strategy generates composite images, ground-truth RGBA layers, bounding boxes, and raw spatial descriptions. A VLM then refines grid-based captions into coherent whole-image supervision.

Overview of the SynLayers construction pipeline.
Construction pipeline: multi-source assets are placed into synthetic layered designs, serialized into boxes and captions, then used to train the decomposition model and the VLM detector.
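The low-overlap placement strategy is not specified in detail here; as an assumed sketch of one plausible realization, boxes can be rejection-sampled until every pair stays below an IoU threshold. The function names `iou` and `place_layers` and the specific threshold are ours, not from the SynLayers pipeline.

```python
import random

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def place_layers(canvas_w, canvas_h, sizes, max_iou=0.1, tries=100, seed=0):
    """Rejection-sample one box per layer so pairwise IoU stays <= max_iou.

    sizes: list of (w, h) for each foreground layer to place.
    Returns a box per layer, or None when no low-overlap spot was found.
    """
    rng = random.Random(seed)
    boxes = []
    for w, h in sizes:
        for _ in range(tries):
            box = (rng.randint(0, canvas_w - w), rng.randint(0, canvas_h - h), w, h)
            if all(iou(box, b) <= max_iou for b in boxes if b is not None):
                boxes.append(box)
                break
        else:
            boxes.append(None)  # give up on this layer rather than force overlap
    return boxes
```

The same sampled boxes can then drive both the composite rendering and the ground-truth bounding-box supervision, which is what makes layer counts controllable during synthesis.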

Synthetic Dataset Samples

Examples from the SynLayers synthetic layered dataset.
SynLayers samples are fully synthetic while retaining high-quality RGBA supervision and diverse layout structures.

Qualitative Results

Compared with Qwen-Image-Layered and the original PrismLayersPro-trained CLD baseline, the SynLayers-trained model produces cleaner semantic separations, sharper typography, and more accurate object boundaries. The real-world comparison also uses the trained Qwen3-VL detector to predict captions and layer boxes automatically from each raster image.

Qualitative comparison between CLD baseline, SynLayers, Qwen-Image-Layered, and ground truth.
Qualitative comparison on the layer-decomposition benchmark. SynLayers gives sharper text and cleaner layer boundaries than the original CLD baseline and Qwen-Image-Layered.
Qualitative comparison on out-of-distribution real-world examples.
Out-of-distribution real-world comparison. From a single raster input, the detector predicts caption and boxes, then the decomposition model reconstructs editable layers.
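Once the decomposition model emits a background and a bottom-to-top stack of RGBA foreground layers, rebuilding the editable composite is standard "over" alpha compositing. A minimal NumPy sketch (the function name `composite` is ours, not from the CLD codebase):

```python
import numpy as np

def composite(background_rgb, layers_rgba):
    """Alpha-composite RGBA foreground layers over an RGB background, bottom to top.

    background_rgb: (H, W, 3) uint8; layers_rgba: list of (H, W, 4) uint8.
    """
    out = background_rgb.astype(np.float32) / 255.0
    for layer in layers_rgba:
        rgb = layer[..., :3].astype(np.float32) / 255.0
        alpha = layer[..., 3:4].astype(np.float32) / 255.0
        out = rgb * alpha + out * (1.0 - alpha)  # standard "over" operator
    return (out * 255.0 + 0.5).astype(np.uint8)  # round back to uint8
```

Editing then amounts to modifying or reordering individual layers before re-running this compositing step.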

Quantitative Results

Training set # Samples Layer PSNR ↑ Layer SSIM ↑ Layer FID ↓ Mask IoU ↑ Mask F1 ↑ Composite PSNR ↑ Composite SSIM ↑ Composite FID ↓
PrismLayersPro 18K 26.22 0.865 6.62 0.910 0.948 30.52 0.944 12.50
SynLayers 18K 27.23 0.879 6.18 0.919 0.954 31.35 0.950 13.21
SynLayers 20K 27.16 0.880 5.97 0.919 0.953 30.82 0.948 12.00
SynLayers 30K 26.60 0.873 6.30 0.912 0.949 30.30 0.947 10.35
SynLayers 50K 26.82 0.875 6.23 0.920 0.954 30.29 0.949 10.93
SynLayers 500K 26.75 0.873 6.12 0.916 0.953 30.89 0.947 12.45
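For reference, the Layer/Composite PSNR columns follow the standard peak signal-to-noise ratio definition. A minimal implementation, assuming 8-bit images; the official evaluation script may differ in masking or averaging details:

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped uint8 images."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```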

Best CLD-based checkpoint

27.16 / 0.880 / 5.97

Layer PSNR / SSIM / FID for SynLayers 20K.

OOD real-world test

29.35 PSNR, 35.40 FID

Composite-only evaluation on 147 real images.

Detector-caption quality

80.77 / 100

GPT-4.1 judged whole-caption score on 200 samples.

Layer Decomposition Examples

Browse the predicted background and foreground RGBA layers from bottom to top. Each example starts from a raster image and predicted bounding boxes.

Example 1

8 foreground layers: raster input, detected boxes, predicted background, and foreground layers 0–7.

Example 2

6 foreground layers: raster input, detected boxes, predicted background, and foreground layers 0–5.

Example 3

6 foreground layers: raster input, detected boxes, predicted background, and foreground layers 0–5.

Example 4

5 foreground layers: raster input, detected boxes, predicted background, and foreground layers 0–4.