Five diffusion papers worth reading today (June 3, 2026)

Wednesday's batch (cs.CV + cs.LG, June 3) breaks into two practical clusters: papers that cut compute or step counts without redesigning the backbone, and papers that fix supervision or guidance failures that have been quietly plaguing practitioners. ByG eliminates the paired-data bottleneck for flow matching editing; FocusDiT beats PixArt-α on FID and GenEval with about a fifth of its training budget; the reward guidance theory paper explains why guided samples overfit the reward function and ships a fix; Video-Mirai teaches causal autoregressive video generators to plan ahead at zero inference cost; Flicker-DDPM beats the FID of 500-step white-noise DDPM in 150 steps by changing the noise color instead of the solver.

1. Bootstrap Your Generator (ByG): unpaired flow matching editing

ArXiv: 2606.03911 | NVIDIA / Tel Aviv University | cs.CV

Peer-review status: Preprint. No code repository linked at time of writing.

Training a flow matching model to edit images or videos normally requires paired source–target examples — expensive to collect for images, largely prohibitive for video. ByG eliminates this dependency by turning the base model into its own supervisor. One branch of the model follows the editing instruction; a parallel branch reconstructs the original from the same starting point. The difference between those two outputs provides the training signal — no external paired data, no frozen teacher, no separate reward model. 1

Figure: ByG's three-way comparison of training paradigms. Supervised training (top) requires explicit paired samples; external model guidance (middle) is bounded by the teacher's quality ceiling; ByG's intrinsic signal (bottom) derives supervision from the base model's own edit-vs-reconstruct divergence. 1

The framework is built on FLUX.1-dev with Musubi-Tuner as the video backbone and is architecture-agnostic by design. On cartoon↔photorealism video editing benchmarks, ByG achieves CLIP directional similarity 0.104 ± 0.005 versus Ditto's 0.091 ± 0.007, and DINO feature similarity 0.718 ± 0.012 versus Ditto's 0.536 ± 0.017 — a substantial source-preservation margin while also winning on edit success. Temporal flickering (0.967) and aesthetic quality (0.574) are competitive with Ditto (0.972 and 0.585 respectively), with motion fidelity 0.715 vs Ditto 0.616. 1
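
For readers unfamiliar with the first metric: CLIP directional similarity is commonly defined as the cosine between the change in image embeddings and the change in text embeddings. The sketch below assumes the embeddings have already been produced by some CLIP encoder; the function name and the toy vectors are illustrative, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def clip_directional_similarity(img_src, img_edit, txt_src, txt_edit):
    """Cosine between the image-space edit direction and the text-space edit direction."""
    d_img = np.asarray(img_edit, dtype=float) - np.asarray(img_src, dtype=float)
    d_txt = np.asarray(txt_edit, dtype=float) - np.asarray(txt_src, dtype=float)
    denom = np.linalg.norm(d_img) * np.linalg.norm(d_txt)
    if denom == 0.0:
        raise ValueError("an edit direction has zero length")
    return float(d_img @ d_txt / denom)

# Toy 3-d "embeddings" (hypothetical values, not real CLIP outputs).
img_src = np.array([1.0, 0.0, 0.0])
img_edit = np.array([1.0, 1.0, 0.0])
txt_src = np.array([0.0, 0.0, 1.0])
txt_edit = np.array([0.0, 2.0, 1.0])

aligned = clip_directional_similarity(img_src, img_edit, txt_src, txt_edit)
opposed = clip_directional_similarity(img_edit, img_src, txt_src, txt_edit)
print(aligned, opposed)
```

A score near 1 means the image moved in the direction the text change asks for; the DINO similarity reported alongside it measures how much of the source is preserved, which is why the two are read together.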

Code/resources: Not released at submission. Base models: black-forest-labs/FLUX.1-dev, kohya-ss/musubi-tuner.

Why read it: Paired video data collection is the practical ceiling for most labs working on video editing. ByG's self-supervised formulation removes that ceiling. If it generalizes to other flow matching architectures, the same bootstrapping idea could unlock editing pipelines across video domains without collecting paired data.

2. FocusDiT: learnable query masking cuts DiT training compute to about a fifth

ArXiv: 2606.02090 | Zhejiang University / Westlake University | cs.CV

Peer-review status: Preprint (v2, June 2). No code repository linked at time of writing.

FocusDiT starts from an observation about the FFN layer in Diffusion Transformers: it acts as a key-value vocabulary for visual semantics, and only a subset of query tokens actually need to retrieve from it at any given denoising step. The rest are retrieving noise. The paper's proposed fix is Q-MaskGen, a lightweight mask-prediction network that learns which query tokens are critical — based on the current hidden state and diffusion timestep — and amplifies their contribution while suppressing the rest. 2
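
To make the idea concrete, here is a minimal sketch of a timestep-conditioned per-token gate on an FFN's output. It is not the paper's Q-MaskGen: the gate form (a single linear layer with a 2·sigmoid so values above 1 amplify and below 1 suppress), the parameter names, and the shapes are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ffn(hidden, params):
    """Plain two-layer ReLU feed-forward block."""
    return np.maximum(hidden @ params["w_in"], 0.0) @ params["w_out"]

def gated_ffn(hidden, t_emb, params):
    """hidden: (tokens, dim); t_emb: (dim,) timestep embedding.

    Returns the residual output and the per-token scale in (0, 2).
    """
    # Hypothetical mask predictor: one score per token from hidden state + timestep.
    logits = (hidden + t_emb) @ params["w_mask"] + params["b_mask"]
    scale = 2.0 * sigmoid(logits)              # >1 amplifies, <1 suppresses
    return hidden + scale[:, None] * ffn(hidden, params), scale

rng = np.random.default_rng(0)
tokens, dim, inner = 5, 8, 16
params = {
    "w_in": rng.standard_normal((dim, inner)) * 0.1,
    "w_out": rng.standard_normal((inner, dim)) * 0.1,
    "w_mask": rng.standard_normal(dim) * 0.1,
    "b_mask": 0.0,
}
hidden = rng.standard_normal((tokens, dim))
t_emb = rng.standard_normal(dim)
out, scale = gated_ffn(hidden, t_emb, params)
print(out.shape, scale.round(2))
```

With the mask weights set to zero the scale is exactly 1 for every token and the block reduces to an ordinary residual FFN, which makes the gate easy to ablate.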

Figure: FocusDiT's FFN-vocabulary analysis, which motivates the learned mask. Top row: generated images with heatmaps of query token utilization (red = high, blue = low). Bottom left: utilization-ratio curves across shallow, intermediate, and deep layers, showing that deep-layer utilization is concentrated in a small fraction of query tokens. Bottom right: percentage of FFN vocabulary entries retrieved per query token; shallow layers retrieve broadly, deep layers focus narrowly. 2

Training took 156 A100 days versus PixArt-α's 753 A100 days, about 21% of the compute. Despite that, FocusDiT reports a lower (better) FID of 27.81 versus PixArt-α's 30.50, and GenEval 0.57 versus 0.48. CLIPScore (0.307 vs 0.315) is within noise. ImageReward lags clearly (0.29 vs 0.74); because ImageReward is a learned proxy for human preference, that gap points to a real deficit in perceived quality, even though FID and compositional accuracy surpass the full-compute baseline. 2
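
The compute comparison is easy to check from the two figures quoted above; the variable names are just for illustration.

```python
focusdit_days = 156   # A100 days, as quoted above
pixart_days = 753     # A100 days, as quoted above

fraction = focusdit_days / pixart_days   # share of PixArt-alpha's budget
reduction = pixart_days / focusdit_days  # how many times less compute

print(f"{fraction:.1%} of the budget, a {reduction:.2f}x reduction")
```

So the budget is just over a fifth of PixArt-α's, and the reduction factor is closer to 5× than to 4×.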

Code/resources: Not released at submission.

Why read it: The diagnostic, that most query tokens in a DiT are retrieving redundant vocabulary entries, is actionable independent of FocusDiT's specific implementation. Any group training a DiT-family model with budget constraints should examine query utilization distributions across layers; the roughly 4.8× gap in training compute between FocusDiT and PixArt-α (156 vs 753 A100 days) suggests there is systematic headroom in standard DiT training.

3. The mechanics of reward guidance: why it hacks, and a fix

ArXiv: 2606.02884 | Sanjit Dandapanthula, Nicholas M. Boffi | cs.LG

Peer-review status: Preprint. Code: github.com/sanjitdp/reward-guidance.

Reward hacking — where guided diffusion samples score well on the reward model but degrade visually — has been treated as an empirical nuisance. This paper identifies the precise mathematical cause: finite-particle plug-in estimation of the Doob h-transform. When you compute the guidance direction using a finite sample estimate of the reward-weighted score, you introduce a systematic bias that over-concentrates probability mass within already-high-reward modes. The model stops exploring across modes and collapses into a narrow slice of the reward function's support. 3
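
A toy calculation shows why a finite-particle plug-in estimate is biased at all. The sketch below is not the paper's analysis or its fix: it assumes Gaussian rewards, ignores any temperature, and only demonstrates the standard fact that the log of a sample mean underestimates the log of the expectation, with the gap shrinking as particles are added. All names are illustrative.

```python
import numpy as np

def plug_in_log_h(rewards):
    """Plug-in estimate of log E[exp(r)] from particle rewards (last axis = particles)."""
    r = np.asarray(rewards, dtype=float)
    m = r.max(axis=-1, keepdims=True)                      # stabilise the exponentials
    log_mean = m + np.log(np.mean(np.exp(r - m), axis=-1, keepdims=True))
    return log_mean.squeeze(-1)

def mean_bias(n_particles, sigma=2.0, trials=5000, seed=0):
    """Average (estimate - exact) over many independent particle sets."""
    rng = np.random.default_rng(seed)
    rewards = rng.normal(0.0, sigma, size=(trials, n_particles))
    exact = sigma ** 2 / 2.0                               # log E[exp(r)] for r ~ N(0, sigma^2)
    return float(plug_in_log_h(rewards).mean() - exact)

bias_few = mean_bias(4)
bias_many = mean_bias(64)
print(f"bias with 4 particles:  {bias_few:.3f}")
print(f"bias with 64 particles: {bias_many:.3f}")
```

Both biases come out negative, and the 4-particle one is larger in magnitude. How this estimation error turns into over-concentration within a mode is the part the paper works out.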

Figure: Reward damping on FLUX.1-dev, three columns per prompt: unguided FLUX.1 (left), naively reward-guided (middle), and reward-damped (right). Naively guided samples homogenize and over-saturate, collapsing toward a narrow mode; damped guidance retains visual diversity while still improving reward-relevant attributes. 3

The paper also isolates a second failure mode: plug-in guidance cannot allocate probability mass across multiple high-reward modes — it can only concentrate within whichever mode it lands in first. Best-of-n sampling can achieve near-optimal inter-mode selection where guidance alone fails. The proposed fix, reward damping, scales the guidance signal by a data-dependent damping factor derived from the analysis, provably correcting the within-mode bias. Experiments across ImageReward, HPSv2, PickScore, and CLIP-based rewards on FLUX.1-dev confirm the fix holds across reward types. 3

Why read it: Practically every reward-guided inference pipeline (aesthetic scoring, safety filtering, alignment fine-tuning) hits reward hacking. This paper gives a principled explanation — not a heuristic patch — and the resulting damping fix is lightweight enough to drop into existing pipelines. The inter-mode analysis also gives a clear reason why best-of-n should be layered on top of, not replaced by, reward guidance.

4. Video-Mirai: teaching causal video generators to see ahead

ArXiv: 2606.03971 | University of Tokyo / NII / Peking University | cs.CV

Peer-review status: Preprint. No code repository linked at time of writing.

Causal autoregressive video diffusion generates segments sequentially, conditioning each one on its predecessors through a KV cache. The problem is that each segment's hidden state is optimized for local reconstruction, so it discards information the next segment will need. Video-Mirai calls this the representation-level planning gap and fixes it with a training-time auxiliary objective: a frozen MLP readout must predict future-frame RGB from the current hidden state. To predict the future accurately, the representation must retain long-range information. 4
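
The shape of such an objective can be sketched as a main loss plus a weighted future-prediction term. Everything specific here is an assumption for illustration: the two-layer ReLU readout, the mean-squared-error form, the `weight` value, and the function names are not taken from the paper.

```python
import numpy as np

def foresight_loss(hidden, future_rgb, w1, w2):
    """MSE between a small MLP readout of the hidden state and future-frame pixels.

    hidden: (tokens, dim); future_rgb: (tokens, out_dim); w1, w2: readout weights.
    """
    pred = np.maximum(hidden @ w1, 0.0) @ w2
    return float(np.mean((pred - future_rgb) ** 2))

def total_loss(diffusion_loss, hidden, future_rgb, w1, w2, weight=0.1):
    """Main training loss plus the auxiliary foresight term (weight is hypothetical)."""
    return diffusion_loss + weight * foresight_loss(hidden, future_rgb, w1, w2)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 6))
w1 = rng.standard_normal((6, 6))
w2 = rng.standard_normal((6, 3))
future_rgb = rng.standard_normal((4, 3))
print(total_loss(0.5, hidden, future_rgb, w1, w2))
```

Because the auxiliary term exists only in the training loss, dropping the readout at inference leaves the generator's forward pass unchanged.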

Figure: Video-Mirai readout vs. baseline, in four columns: current frame (blue border), baseline causal readout, Video-Mirai readout (red border), and ground-truth future frame. Three rows show different scene types (animated characters, silhouettes, fantasy). Baseline readouts are blurry or structurally incorrect; the foresight-trained readouts closely match the ground truth. 4

The foresight module uses bidirectional attention during training, giving the representation access to future context. At inference, the MLP readout is discarded entirely — no architecture changes, no additional compute. VBench results with foresight window {0, 1} (current + one segment ahead): Total 84.62 versus baseline Causal-Forcing 83.82, with Quality improving from 84.54 to 85.38 and Semantic from 80.93 to 81.59. Extending to a {0, 1, 2} window pushes Semantic to 82.00 but slightly reduces Quality to 85.11. 4

Code/resources: Not released at submission.

Why read it: The planning gap is a structural property of any causal generative model that trains segment-by-segment. The foresight fix is clean and zero-cost at inference — which means it should be applicable to any KV-cache-based causal video architecture. The same auxiliary readout idea could extend to autoregressive audio or long-form text generation.

5. Flicker-DDPM: 1/f colored noise halves the steps you need

ArXiv: 2606.03393 | Wuhan University | cs.LG

Peer-review status: Preprint. Code: github.com/Mao-Kexiang/Flicker_DDPM.

Natural images have 1/k^α power spectra: low spatial frequencies carry more energy than high ones. Standard DDPM corrupts them with white noise (flat spectrum), so the forward process works against the data's natural structure before the model can even start learning the reverse. Flicker-DDPM replaces white noise with 1/f colored noise, whose power-law spectrum is matched to the data. The forward process becomes x_t = α_t x_0 + β_t Lε, where L imposes a power-law covariance on the noise in place of the identity; here α_t is the schedule coefficient, not the spectral exponent α. 5
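
One common way to draw noise with a power-law spectrum is to shape white noise in the Fourier domain. The sketch below does that and plugs the result into a forward step of the form quoted above. It is not the authors' code: the `exponent` parameter is the slope of the noise power spectrum as defined here, its mapping to the paper's η is not assumed, and the unit-variance normalization is a choice made for this example.

```python
import numpy as np

def power_law_noise(size, exponent, rng):
    """Sample a size x size Gaussian field with power spectrum ~ k^(-exponent)."""
    freqs = np.fft.fftfreq(size)
    k = np.sqrt(freqs[:, None] ** 2 + freqs[None, :] ** 2)
    k[0, 0] = 1.0                        # placeholder so the DC mode does not divide by zero
    amplitude = k ** (-exponent / 2.0)   # amplitude is the square root of power
    amplitude[0, 0] = 0.0                # drop the DC mode so the field has zero mean
    white = rng.standard_normal((size, size))
    field = np.fft.ifft2(np.fft.fft2(white) * amplitude).real
    return field / field.std()           # normalise to unit variance

def forward_sample(x0, a_t, b_t, exponent, rng):
    """x_t = a_t * x0 + b_t * colored_noise, with a_t and b_t as schedule coefficients."""
    return a_t * x0 + b_t * power_law_noise(x0.shape[0], exponent, rng)

def neighbour_corr(field):
    """Correlation between horizontally adjacent pixels."""
    return float(np.corrcoef(field[:, :-1].ravel(), field[:, 1:].ravel())[0, 1])

rng = np.random.default_rng(0)
white = power_law_noise(64, 0.0, rng)      # exponent 0 recovers white noise
colored = power_law_noise(64, 2.0, rng)    # steeper spectrum, spatially correlated
x_t = forward_sample(np.zeros((64, 64)), 0.8, 0.6, 2.0, rng)
print(neighbour_corr(white), neighbour_corr(colored))
```

Adjacent pixels are essentially uncorrelated in the white field and clearly correlated in the colored one, which is the blotchy structure that distinguishes the two noise types.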

Figure: White noise vs. 1/f colored noise. Panel (a): white noise on a 32×32 lattice, with uncorrelated salt-and-pepper pixels. Panel (b): colored noise with η=0.2, showing the spatially correlated, blotchy structure characteristic of natural image statistics. Panel (c): log-log power spectra of flat white noise (blue, Σ=I) and power-law colored noise (red, η=0.2, α=1.89), the latter matching the theoretical P(k)~k^(−α) curve. 5

The paper derives a universal formula — η = (3−α)/2 — for the optimal noise spectral exponent η given the data's spectral exponent α. For CIFAR-10 (α ≈ 1.89), the optimal η = 0.555. Results on CIFAR-10 (10k samples): 5

Steps (T)	Flicker-DDPM FID	White-noise DDPM FID	Reduction
100	22.57	36.17	−37.6%
150	12.24	25.36	−51.7%
200	11.57	18.08	−36.0%
500	11.96	13.02	−8.1%

At T=150, Flicker-DDPM FID (12.24) beats white-noise DDPM at T=500 (13.02) — using 70% fewer steps.
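
The Reduction column and the step saving can be recomputed directly from the FID values in the table; the dictionary layout is just a convenience for this check.

```python
# FID values copied from the table above: steps -> (Flicker-DDPM, white-noise DDPM)
fid = {
    100: (22.57, 36.17),
    150: (12.24, 25.36),
    200: (11.57, 18.08),
    500: (11.96, 13.02),
}

reductions = {}
for steps, (flicker, white) in fid.items():
    reductions[steps] = (flicker - white) / white   # relative FID change at equal steps
    print(f"T={steps}: {reductions[steps]:+.1%}")

# Flicker at T=150 vs white noise at T=500: fewer steps and a lower FID.
step_saving = (500 - 150) / 500
beats_long_baseline = fid[150][0] < fid[500][1]
print(f"step saving: {step_saving:.0%}, lower FID: {beats_long_baseline}")
```

The gain is largest at T=150 and shrinks to single digits by T=500, so the benefit is concentrated in the low-step regime.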

Why read it: Most DDPM acceleration research targets the sampler or the network. Flicker-DDPM changes the noise itself — which means it is orthogonal to, and potentially combinable with, distillation, stride-sampling, or ODE solver improvements. The η formula is immediately applicable to any image dataset: estimate the spectral exponent of your training data, compute η, and drop it in.
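
The recipe in the last sentence can be sketched as a log-log fit to a radially averaged power spectrum, followed by the formula η = (3−α)/2. The estimator below is one simple way to do the fit and is not taken from the paper's repository: it assumes a square single-channel image, uses unweighted least squares, and the synthetic test field stands in for real training data.

```python
import numpy as np

def estimate_spectral_exponent(image):
    """Fit P(k) ~ k^(-alpha) to the radially binned power spectrum of a square 2D array."""
    n = image.shape[0]
    power = np.abs(np.fft.fft2(image - image.mean())) ** 2
    freqs = np.fft.fftfreq(n)
    k = np.sqrt(freqs[:, None] ** 2 + freqs[None, :] ** 2)
    bins = np.rint(k * n).astype(int)          # integer radius in units of 1/n
    max_bin = n // 2                           # stop at the Nyquist radius
    valid = (bins >= 1) & (bins <= max_bin)    # skip the DC mode and the corners
    sums = np.bincount(bins[valid], weights=power[valid], minlength=max_bin + 1)
    counts = np.bincount(bins[valid], minlength=max_bin + 1)
    radii = np.arange(1, max_bin + 1)
    mean_power = sums[1:] / counts[1:]
    slope, _ = np.polyfit(np.log(radii), np.log(mean_power), 1)
    return float(-slope)

def optimal_eta(alpha):
    """Formula quoted in the article: eta = (3 - alpha) / 2."""
    return (3.0 - alpha) / 2.0

# Synthetic stand-in for a training image: a field built to have alpha = 2.
rng = np.random.default_rng(0)
n = 128
f = np.fft.fftfreq(n)
k = np.sqrt(f[:, None] ** 2 + f[None, :] ** 2)
k[0, 0] = 1.0
amplitude = 1.0 / k
amplitude[0, 0] = 0.0
image = np.fft.ifft2(np.fft.fft2(rng.standard_normal((n, n))) * amplitude).real

alpha_hat = estimate_spectral_exponent(image)
print(f"estimated alpha: {alpha_hat:.2f}, eta: {optimal_eta(alpha_hat):.3f}")
```

On a real dataset one would average the binned spectrum over many images (and channels) before fitting, since a single image gives a noisy estimate at low frequencies.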

Quick reference

Paper	ArXiv ID	Core method	Key result	Code
Bootstrap Your Generator (ByG)	2606.03911	Self-supervised flow matching editing; edit/reconstruct divergence as training signal	CLIP dir. sim. 0.104 vs Ditto 0.091; DINO sim. 0.718 vs 0.536	Not released
FocusDiT	2606.02090	Learnable query importance masks in DiT FFN layers	FID 27.81, GenEval 0.57 at <21% of PixArt-α compute (156 vs 753 A100 days)	Not released
Reward guidance mechanics	2606.02884	Theory of plug-in h-transform bias; reward damping fix	Reduces reward hacking across ImageReward / HPSv2 / PickScore / CLIP on FLUX.1-dev	GitHub
Video-Mirai	2606.03971	Foresight training via auxiliary MLP readout; zero inference overhead	VBench Total 84.62 vs baseline 83.82; Quality 85.38 vs 84.54	Not released
Flicker-DDPM	2606.03393	1/f colored noise replacing white noise; universal η=(3−α)/2 formula	FID 12.24 at T=150 vs white-noise FID 25.36; beats T=500 white-noise FID with 70% fewer steps	GitHub

Two threads run through these papers. First, cost: ByG removes paired-data collection for editing, FocusDiT cuts training compute by roughly 4.8×, and Flicker-DDPM beats the 500-step white-noise baseline on FID with 70% fewer steps. Second, the gap between what a model optimizes during training and what practitioners need at deployment: ByG's reconstruction branch bridges the unpaired gap; reward damping corrects the plug-in guidance bias; Video-Mirai's foresight objective bridges the causal segment-level gap. In each of these three cases, the paper leaves the backbone alone and changes the signal that the training objective or the guidance term sees.

Cover image: AI-generated illustration