Five diffusion papers worth reading today (June 2, 2026)

Tuesday's batch (cs.CV + cs.LG, June 2) spans five themes that rarely appear together in a single day: reinforcement learning for multi-turn image editing, a theoretical account of score-function hallucinations with an open-source fix, compact-model distillation that preserves classifier-free guidance fidelity, concept erasure for Rectified Flow models, and a unification framework for reward-based fine-tuning. Three of the five have code released at submission.

1. MT-EditFlow: RL for multi-turn image editing with flow matching

ArXiv: 2606.01985 | Ying Nian Wu group (UCLA), Mingyuan Zhou group (UT Austin) | cs.CV

Peer-review status: Preprint. No code repository confirmed at time of writing.

Single-turn training for image editing produces capable models, but deploying them in multi-turn sequences exposes two compounding failure modes. First, any single missed instruction causes the entire downstream trajectory to be graded as failed: an all-or-nothing reward signal that discourages the model from attempting difficult intermediate steps. Second, the gap between training (clean reference images) and inference (edited images from prior turns) accumulates as exposure bias: errors introduced early in the sequence are compounded by each subsequent editing step. 1

MT-EditFlow addresses both failure modes through a flow-matching RL framework that reasons at the trajectory level rather than the turn level. The method integrates a multi-turn perspective with a multi-reward formulation, providing a unified structure applicable to both GRPO (Group Relative Policy Optimization) and NFT-based RL approaches. 1 The key design choice is advantage broadcasting: the aggregated advantage signal is distributed across the entire editing trajectory, so each turn's update is informed by the global multi-turn outcome rather than just its local reward.
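To make advantage broadcasting concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function name, the mean aggregation of per-turn rewards, and the group normalization by standard deviation are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def broadcast_advantages(group_turn_rewards, eps=1e-8):
    """Turn per-turn rewards into trajectory-level advantages.

    group_turn_rewards: one list of per-turn rewards per sampled trajectory.
    Returns one list of per-turn advantages per trajectory, where every turn
    in a trajectory receives the same trajectory-level advantage.
    """
    # Aggregate each trajectory to one scalar (plain mean: an assumption).
    traj_scores = [mean(turns) for turns in group_turn_rewards]
    # Group-relative normalization across the sampled trajectories.
    mu = mean(traj_scores)
    sigma = pstdev(traj_scores)
    traj_adv = [(s - mu) / (sigma + eps) for s in traj_scores]
    # Broadcast: copy the trajectory advantage to each of its turns.
    return [[a] * len(turns) for a, turns in zip(traj_adv, group_turn_rewards)]

group = [
    [0.9, 0.8, 0.7],  # all three edits succeed
    [0.9, 0.2, 0.1],  # turn 1 succeeds, later turns fail
]
advantages = broadcast_advantages(group)
```

In this toy group, every turn of the first trajectory gets an advantage of about +1 and every turn of the second gets about -1. Turn 1 of the second trajectory is penalized despite its high local reward, which is the behavior the trajectory-level view is meant to produce.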

Three further contributions sharpen the framework. The paper analyzes turn-level aggregation strategies for scoring — how to combine per-turn quality assessments into a trajectory-level signal without masking individual failures. It characterizes the bias-variance trade-off in VLM reasoning mode when VLMs are used as reward models (longer chain-of-thought reasoning increases variance, shorter reasoning increases bias). And it identifies advantage fusion level as a design handle for preventing reward hacking across turns.
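The aggregation question is easy to see on a toy example. The sketch below compares three generic ways of combining per-turn scores; these strategies are illustrative assumptions and not necessarily the ones the paper analyzes.

```python
from math import prod
from statistics import mean

def aggregate(turn_scores, how):
    """Combine per-turn scores into one trajectory-level score."""
    if how == "mean":
        return mean(turn_scores)   # averages failures away
    if how == "min":
        return min(turn_scores)    # the worst turn decides
    if how == "product":
        return prod(turn_scores)   # any zero zeroes the trajectory
    raise ValueError(f"unknown aggregation: {how}")

scores = [1.0, 1.0, 0.0]  # two correct edits, one missed instruction
summary = {how: aggregate(scores, how) for how in ("mean", "min", "product")}
```

With one failed turn out of three, the mean still reports roughly 0.67 while min and product both report 0.0. That is the masking effect: an averaged score can look acceptable even though the trajectory contains a failed edit.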

Qualitative multi-turn editing comparison: FLUX 2-Klein-Base-9B (left) vs MT-EditFlow (right), with three consecutive edits on two image sequences. The base model accumulates errors: in the first sequence the sheep survives past Turn 1, and in the second the horse becomes a deer at Turn 2. MT-EditFlow maintains correct edits across all three turns in both sequences. 1

Evaluated on FLUX.1-Kontext-dev, MT-EditFlow achieves a +6.85-point gain in turn-3 overall performance, surpassing Qwen-Image-Edit, the current leading open-source multi-turn editing model. 1

Code/resources: Not released at time of writing.

Why read it: Most RL-for-image-generation papers optimize single-step quality. MT-EditFlow is one of the first to treat multi-turn editing as a sequential decision problem with trajectory-level credit assignment. The advantage broadcasting mechanism directly addresses why single-turn RL training fails in multi-step deployment — a gap that will become more relevant as interactive image editing interfaces extend to longer conversation histories. The unified GRPO/NFT framing also makes the work accessible to groups already experimenting with either RL variant.

2. Score-Control: suppressing hallucinations by modulating score smoothness

ArXiv: 2606.00377 | David Doermann lab (University at Buffalo) | cs.CV

Peer-review status: Preprint. Code and datasets: github.com/bhosalems/VSM.

Diffusion model hallucinations — generating objects, textures, or structures that have no basis in the conditioning signal — are typically attributed to training data artifacts or classifier-free guidance artifacts. This paper takes a different position: hallucinations are a direct consequence of score smoothness. When the learned score function is too smooth (low Lipschitz constant), probability mass leaks off the support of the training distribution, appearing as plausible-looking but groundless content. 2

The authors formalize this for 1D Gaussian mixtures — a clean setting where the true score and the true support are both computable — and prove a relationship between hallucination probability mass and the Lipschitz constant of the learned score. The mechanism extends naturally to higher dimensions, with L2 weight regularization and small training sets both confirmed as smoothness sources: more regularization smooths the score and increases hallucination; more data sharpens it and reduces hallucination.
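The 1D setting is simple enough to reproduce by hand. The sketch below is an illustration, not the paper's code: it computes the exact score of a two-component mixture and a finite-difference estimate of its Lipschitz constant, then compares it with a deliberately oversmoothed score. The mixture parameters and the single-Gaussian stand-in for a smooth learned score are assumptions.

```python
import math

def gmm_score(x, means=(-2.0, 2.0), sigma=0.5):
    """Exact score d/dx log p(x) of an equal-weight 1D Gaussian mixture."""
    # Unnormalized component weights (shared constants cancel in the ratio).
    w = [math.exp(-0.5 * ((x - m) / sigma) ** 2) for m in means]
    total = sum(w)
    return sum(wi * (-(x - m) / sigma**2) for wi, m in zip(w, means)) / total

def smooth_score(x, var=0.5**2 + 2.0**2):
    """Score of a single Gaussian with the mixture's overall variance."""
    return -x / var

def lipschitz_estimate(score, lo=-4.0, hi=4.0, n=4001):
    """Largest finite-difference slope of `score` on a grid over [lo, hi]."""
    h = (hi - lo) / (n - 1)
    xs = [lo + i * h for i in range(n)]
    return max(abs(score(b) - score(a)) / h for a, b in zip(xs, xs[1:]))

L_true = lipschitz_estimate(gmm_score)
L_smooth = lipschitz_estimate(smooth_score)
```

Here the true score has a Lipschitz estimate of roughly 60, driven by the steep slope between the two modes, while the oversmoothed score sits at about 0.24. A sampler following the smooth score places substantial mass near x = 0, where the true mixture has almost none.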

VSM score-smoothness analysis on 1D Gaussian mixtures. Panel (a): learned scores under L2 regularization strengths λ = 0.005, 0.002, and 0.0, plotted against the true score (solid blue); stronger regularization smooths the learned score, and the corresponding density histograms show probability mass leaking between mixture components. Panel (b): the same effect as the training set shrinks (N = 50k, 25k, 10k). Panel (c): VSM modulation (ρ = 0.005, 0.01) tightens the learned score toward the true score, eliminates the inter-component leakage, and restores correct density support. 2

The practical fix is Variance-Guided Score Modulation (VSM): a training-time intervention that controls the score Jacobian directly, sharpening the score function without changing the model architecture. Applied across Hands-11K, MNIST, Cards, Shapes, ChessImages, and ImageNet-1K, VSM reduces hallucinations by up to ~25% while maintaining fidelity and diversity (C-FID, FID, FLD). 2 The paper also introduces Cards and ChessImages as new benchmark datasets with extreme semantic variation — designed specifically to stress-test hallucination in structured-content generation.

Code/resources: github.com/bhosalems/VSM — code and benchmark datasets released.

Why read it: The theoretical framing is the paper's main contribution: mapping hallucination rate to a computable property of the score function (its Lipschitz constant) gives researchers something concrete to optimize against, rather than treating hallucinations as an emergent nuisance. VSM's Jacobian control is a training-time mechanism, which means it can be applied independently of architecture choices. The two new benchmarks are also a practical contribution for groups that want systematic hallucination evaluation rather than relying on FID as a proxy.

3. DASH: dual-branch score distillation for guidance-preserving compact models

ArXiv: 2606.00798 | cs.CV, cross-listed cs.LG

Peer-review status: Preprint. Code: github.com/C-loud-Nine/DASH_Dual-Branch-Score-Distillation.

Knowledge distillation for diffusion models faces a specific problem when classifier-free guidance (CFG) is involved. Standard output-level distillation trains the student on the teacher's conditional predictions, but the guidance computation, which subtracts the unconditional prediction from the conditional one, requires both branches. When the student's unconditional branch is left unsupervised, the student has no objective forcing it to produce meaningfully different conditional and unconditional outputs. The two branches collapse toward the same prediction, the CFG gap shrinks toward zero, and guidance fidelity breaks because the guidance scale no longer changes the output. 3
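The collapse is easy to demonstrate with the standard CFG combination rule. In this small sketch the prediction vectors are made-up numbers; only the formula itself is the usual classifier-free guidance extrapolation.

```python
def cfg(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: extrapolate from unconditional toward conditional."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

scales = (1.0, 3.0, 7.5)
eps_cond = [0.30, -0.10]

# Healthy student: the two branches disagree, so the scale matters.
healthy_uncond = [0.10, 0.20]
healthy = [cfg(eps_cond, healthy_uncond, s) for s in scales]

# Collapsed student: the unconditional branch copies the conditional one.
collapsed_uncond = list(eps_cond)
collapsed = [cfg(eps_cond, collapsed_uncond, s) for s in scales]
```

In the collapsed case every guidance scale returns the conditional prediction unchanged, so turning the guidance knob does nothing. In the healthy case each scale yields a different guided prediction.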

DASH (Dual-Branch Score Distillation) fixes this with independent supervision targets for both branches. The conditional branch receives an imitation loss against the teacher's conditional output; the unconditional branch receives a separate unconditional loss. An anchor term regularizes conditional predictions toward ground-truth noise, preventing the conditional branch from drifting. 3
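A rough sketch of how three such terms could be combined is shown below. The mean-squared-error form, the choice of the teacher's unconditional output as the target for the unconditional loss, and the weights `w_uncond` and `w_anchor` are all assumptions for illustration; the repository is the place to check the actual formulation.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dual_branch_loss(student_cond, student_uncond,
                     teacher_cond, teacher_uncond, noise,
                     w_uncond=1.0, w_anchor=0.1):
    """Imitation + unconditional + anchor terms (hypothetical weights)."""
    imitation = mse(student_cond, teacher_cond)          # conditional branch
    unconditional = mse(student_uncond, teacher_uncond)  # unconditional branch
    anchor = mse(student_cond, noise)                    # pull toward true noise
    return imitation + w_uncond * unconditional + w_anchor * anchor

teacher_cond, teacher_uncond = [1.0, 0.0], [0.0, 0.0]
noise = [1.0, 0.0]
# A collapsed student: both branches reproduce the teacher's conditional output.
collapsed_loss = dual_branch_loss(teacher_cond, teacher_cond,
                                  teacher_cond, teacher_uncond, noise)
blind_loss = dual_branch_loss(teacher_cond, teacher_cond,
                              teacher_cond, teacher_uncond, noise, w_uncond=0.0)
```

With the unconditional term switched off (`w_uncond=0.0`), the collapsed student scores a perfect zero loss. With it switched on, the same student is penalized (0.5 in this toy case), which is the gradient signal that keeps the two branches apart.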

DASH distillation pipeline. The frozen teacher (left) produces conditional and unconditional predictions through two passes and supervises both student branches independently. The trainable student (right) receives three losses: an imitation loss on the conditional branch, an unconditional loss on the unconditional branch, and an anchor loss pulling conditional predictions toward ground-truth noise ε. TIRT Transfer (top arrow) copies the teacher's converged per-timestep importance weights to the student as a frozen curriculum prior. 3

A second contribution, TIRT Transfer (Teacher Importance Re-use Transfer), copies the teacher's converged per-timestep importance curriculum directly to the student as a frozen prior. The teacher has already learned which timesteps contribute most to generation quality; re-learning that curriculum from scratch on a limited distillation budget is inefficient. Transferring it lets the student allocate its distillation compute to content, not scheduling. 3
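One plausible reading of a frozen timestep curriculum is a fixed set of per-timestep weights applied to the student's losses. The sketch below assumes exactly that; whether DASH uses the weights for loss weighting, timestep sampling, or something else is not stated here, and the names and numbers are hypothetical.

```python
def weighted_distillation_loss(per_timestep_losses, teacher_weights):
    """Average per-timestep losses using frozen teacher importance weights."""
    total_weight = sum(teacher_weights)
    weighted = sum(w * l for w, l in zip(teacher_weights, per_timestep_losses))
    return weighted / total_weight

# Copied once from the (hypothetical) converged teacher; a tuple so it is
# never updated during distillation.
teacher_weights = (1.0, 3.0)
student_losses = [2.0, 4.0]  # made-up student losses at two timesteps

loss = weighted_distillation_loss(student_losses, teacher_weights)
```

The second timestep carries three times the weight of the first, so the combined loss is 3.5 rather than the unweighted average of 3.0. The student's optimizer never touches the weights, which is what makes the curriculum a prior rather than something learned again.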

Ablation studies quantify the contribution of each component: unconditional supervision alone accounts for more than 60% of total distillation gain. Curriculum transfer and the anchor regularizer provide complementary increments. On CIFAR-10 and CIFAR-100, DASH achieves 5.9× compression with 50-step DDIM FID remaining within 4 points of the teacher, substantially better than training from scratch at the same parameter count. 3

Code/resources: github.com/C-loud-Nine/DASH_Dual-Branch-Score-Distillation.

Why read it: The diagnosis — unconditional branch supervision is the missing piece in CFG-preserving distillation — is immediately actionable. Any group compressing a CFG-trained diffusion model using output-level distillation should check whether their student's unconditional branch has an explicit objective. The ablation that attributes 60%+ of the gain to unconditional supervision alone is a strong result and suggests the fix is more fundamental than an architecture choice. TIRT Transfer is a clean engineering contribution on top of that: if teacher importance weights are available, using them as a frozen curriculum prior costs nothing at distillation time.

4. GEM: geometry-based concept erasure for Rectified Flow models

ArXiv: 2606.00140 | Anna Rohrbach, Marcus Rohrbach (TU Darmstadt) | cs.LG

Peer-review status: Preprint. No code repository linked at time of writing.

Most concept erasure methods were designed for DDPM/DDIM-based U-Net diffusion models. As the field shifts toward Rectified Flow Transformers (Flux, SD3, Stable Video Diffusion 3), these methods face two problems: the mathematical formalism of score-based diffusion does not translate directly to rectified flows, and the inference efficiency assumptions differ. GEM (Geometric Erasure by Contrastive Velocity Matching) is designed from the ground up for Rectified Flow architectures. 4

The theoretical contribution is a bridge between two previously disconnected erasure paradigms. Trajectory-based unlearning (methods based on Generative Flow Networks) defines erasure as suppressing paths in the generative trajectory that lead to target concepts. Teacher-guided erasure defines it as pushing the model's output away from a reference teacher's output for specific prompts. GEM shows these are equivalent in the Rectified Flow setting and derives a single geometric guidance objective combining complementary attraction signals (toward safe content) and repulsion signals (away from target concepts) from the teacher. 4
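The attraction/repulsion idea can be sketched with plain velocity vectors. This is a generic illustration and not GEM's actual objective: the combination rule, the neutral `v_base` reference, and the `attract` and `repel` coefficients are all assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def guided_velocity(v_base, v_safe, v_concept, attract=1.0, repel=1.0):
    """Move a base velocity toward the safe direction and away from the concept.

    v_base:    teacher velocity under a neutral prompt (hypothetical)
    v_safe:    teacher velocity under a safe replacement prompt
    v_concept: teacher velocity under the prompt naming the erased concept
    """
    return [b + attract * (s - b) - repel * (c - b)
            for b, s, c in zip(v_base, v_safe, v_concept)]

v_base = [0.0, 0.0]
v_safe = [1.0, 0.0]
v_concept = [0.0, 1.0]
target = guided_velocity(v_base, v_safe, v_concept)
```

In this toy setup the resulting target velocity has a positive component along the safe direction and a negative one along the concept direction. A student trained to match such a target for the concept prompt would be steered toward safe content and away from the erased concept at the same time.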

GEM concept erasure on Flux. Left panel: original Flux generations showing copyrighted characters (Goku, Stitch), violent imagery (gore, shark attack), and explicit content (blurred). Right panel: after GEM erasure, the same prompts produce safe replacements, with the original characters replaced by generic anime-style figures and violent imagery replaced by benign variants, while non-targeted content (the gemstone in the center) is preserved unchanged. 4

On the Flux model (one of the leading open Rectified Flow text-to-image models), GEM achieves concept erasure at 5× the speed of the prior state-of-the-art while maintaining benign generation quality. 4 Anna Rohrbach and Marcus Rohrbach (TU Darmstadt), who previously held positions at UC Berkeley and Meta AI, have published extensively on multimodal safety and model interpretability — the authorship is aligned with the problem domain.

Code/resources: No repository linked. The authors have not released code at submission.

Why read it: As Flux and SD3-based systems become the infrastructure layer for deployed image generation, the erasure tooling for them needs to catch up. GEM gives both the theory (the equivalence proof between trajectory-based and teacher-guided erasure in Rectified Flow) and the practical result (5× speedup on Flux). The geometric framing is also portable: the attraction/repulsion objective can in principle be applied to any Rectified Flow model, not just Flux. The missing piece is code — the method's accessibility depends on whether the authors release an implementation.

5. Reward Score Matching: a unified framework for reward-based fine-tuning

ArXiv: 2604.17415 | Jong Chul Ye lab (KAIST) | cs.LG, cross-listed to cs.CV on June 2

Peer-review status: Preprint. Code: github.com/jaylee2000/rsm.

Reward-based fine-tuning for diffusion and flow models has produced a fragmented literature: DDPO uses policy gradient with denoising step rollouts, DPOK adds KL regularization, GFlowNet-based methods define rewards over full trajectories, RLCM uses consistency models for efficiency, Flow-GRPO adapts group relative policy optimization to flow matching. Each was motivated independently, with different assumptions about what "alignment" means for generative models. 5

RSM (Reward Score Matching) proposes that all of these can be written under a common framework in which alignment is defined as reward-guided score matching: the fine-tuned model's score function is matched to the pretrained score shifted by a reward-derived value guidance term. Under this framing, the primary differences between methods reduce to two design choices: how the value guidance estimator is constructed, and how optimization intensity is distributed across timesteps. 5
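One common way to make reward-guided score matching concrete, assumed here rather than taken from the paper, is the reward-tilted density p*(x) ∝ p_pre(x) · exp(r(x) / β), whose score is the pretrained score plus the scaled reward gradient. The sketch below checks that identity in a 1D case with a closed form; the Gaussian prior, the linear reward, and the values of `a` and `beta` are illustrative choices.

```python
A = 2.0     # slope of the hypothetical linear reward r(x) = A * x
BETA = 0.5  # temperature of the reward tilt

def pretrained_score(x):
    """Score of the pretrained density N(0, 1)."""
    return -x

def reward_grad(x):
    """Gradient of r(x) = A * x (constant in x)."""
    return A

def guided_score(x):
    """Pretrained score shifted by the scaled reward gradient."""
    return pretrained_score(x) + reward_grad(x) / BETA

def tilted_score(x):
    """Closed-form score of N(0, 1) * exp(r(x) / BETA), which is N(A / BETA, 1)."""
    return -(x - A / BETA)

xs = [-3.0, 0.0, 1.5, 4.0]
max_gap = max(abs(tilted_score(x) - guided_score(x)) for x in xs)
```

The two expressions agree at every test point. At noisy timesteps the reward term generally has no closed form and must be estimated, which is presumably where the choice of value guidance estimator enters.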

RSM temporal optimization strength: value guidance h(t) schedules compared across nine reward fine-tuning methods in three panels. Panel (a): first-order methods (DDPO, PCPO-Diffusion) achieve successful alignment by concentrating value guidance reduction at low-SNR timesteps (t near 0). Panel (b): improved zeroth-order methods (VGG-Flow, Flow-GRPO, GRPO-Guard, PCPO-Flow, TempFlow-GRPO) concentrate value guidance reduction at high-SNR timesteps (t near 1). Panel (c): Residual ∇-DB (the proposed RSM redesign) enforces a substantially stronger trust-region constraint specifically at low-SNR timesteps, where the trust-region violation risk is highest. 5

The unification lets the authors audit existing designs against the bias-variance-compute trilemma: some methods carry more bias because their value guidance estimator drops critical gradient terms; others incur excess compute to avoid variance without meaningful alignment gains. Based on this analysis, the paper derives simpler redesigns — including Residual ∇-DB, a first-order estimator with explicit trust-region constraints at low-SNR timesteps — that improve alignment and compute efficiency in both differentiable-reward and black-box-reward settings. 5 Jong Chul Ye (KAIST) is among the most cited researchers in diffusion models and inverse problems; the RSM paper continues his group's pattern of producing theoretical frameworks that reorganize large bodies of empirical work.

Code/resources: github.com/jaylee2000/rsm.

Why read it: If you are currently running any reward fine-tuning experiment, RSM is directly relevant: it tells you whether the gradient terms your method is dropping are hurting you, and where in the timestep schedule your method is likely to diverge. The Residual ∇-DB redesign is a concrete, lighter-weight alternative to GRPO-based approaches that the authors validate in both reward settings. For groups working on theoretical foundations, the score-matching formulation is clean enough to derive new methods from rather than just auditing existing ones.

Quick reference

Paper	ArXiv ID	Core method	Code
MT-EditFlow	2606.01985	Flow-matching RL with trajectory-level advantage broadcasting; +6.85 pts on FLUX.1-Kontext-dev turn-3	Not released
Score-Control (VSM)	2606.00377	Score Jacobian modulation reducing hallucinations up to ~25%; two new benchmark datasets	github.com/bhosalems/VSM
DASH	2606.00798	Dual-branch supervision + TIRT Transfer; 5.9× compression, FID within 4 of teacher	github.com/C-loud-Nine/DASH_Dual-Branch-Score-Distillation
GEM	2606.00140	Contrastive velocity matching for concept erasure in Rectified Flows; 5× faster than prior SOTA on Flux	Not released
RSM	2604.17415	Reward score matching unification; bias-variance-compute audit + Residual ∇-DB redesign	github.com/jaylee2000/rsm

Today's five papers converge on a shared concern: how do you control what a diffusion or flow model produces without breaking everything else? MT-EditFlow asks this across a sequence of edits. Score-Control asks it at the level of the score function's geometry. DASH asks it under the constraints of model compression. GEM asks it for targeted concept removal. RSM asks it as a theoretical accounting question — what are existing reward fine-tuning methods actually optimizing, and what are they silently ignoring? Each paper frames the control problem differently, but they all arrive at the same observation: the default training or distillation objective leaves some part of the model's generative behavior underdetermined, and that underdetermination is where failures live.

Cover image: AI-generated illustration
