Interview Prep · Diffusion Generative Modeling

Diffusion Foundations Interview Cheat Sheet

DDPM / Score / DDIM / EDM / CFG / Consistency Models · derivations + from-scratch code + 25 high-frequency questions (L1 must-know · L2 advanced · L3 top-lab)

By Ruofeng Yang (杨若峰), Shanghai Jiao Tong University

Source: docs/tutorials/diffusion_foundations_tutorial.md SHA256: 95c1efc6f929 Rendered: 2026-05-19 05:40 UTC

§0 TL;DR

Diffusion fundamentals in 9 sentences

The core interview points on one page (see §1–§13 for the derivations).

DDPM (Ho 2020): the forward $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) I)$ can be sampled in closed form; the reverse $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta, \Sigma_\theta)$ learns a Gaussian for each backward step; the ELBO simplifies to $L_\text{simple} = \mathbb{E}\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ ($\epsilon$-prediction).
Three equivalent views: DDPM's $\epsilon$, the score $s = \nabla \log p_t$, and flow matching's $v$ are linearly interchangeable on a Gaussian path: $s_\theta = -\epsilon_\theta / \sigma_t$, $v = \alpha'x_0 + \sigma'\epsilon$.
Tweedie's formula: $\mathbb{E}[x_0 | x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t)$ (for $x_t = x_0 + \sigma_t\epsilon$), a one-line link between the denoiser and the score.
Score SDE (Song 2021): VP-SDE / VE-SDE give a unified framework; the reverse-time SDE and the probability flow ODE share the same family of marginals, and the ODE form directly yields the FM vector field.
DDIM (Song 2020 / ICLR 2021): a non-Markovian forward process yields a deterministic sampler whose marginals match DDPM while the sampling path becomes controllable ($\eta=0$ is deterministic; $\eta=1$ with all $T$ steps reduces to DDPM ancestral sampling, whereas with skipped steps it only matches the DDPM variance and is not strictly equivalent).
EDM (Karras 2022): preconditioning keeps the network's input and training target at unit variance: $D_\theta(x;\sigma) = c_\text{skip}(\sigma) x + c_\text{out}(\sigma) F_\theta(c_\text{in}(\sigma) x, c_\text{noise}(\sigma))$; together with the $\sigma$-schedule and a 2nd-order Heun sampler it reaches SOTA FID with NFE down to 18-35.
CFG (Ho-Salimans 2022): during training the condition is dropped with probability $p_\text{drop}$, so one network learns both conditional and unconditional predictions; at inference $\tilde\epsilon = (1+w)\epsilon_\theta(x,c) - w\epsilon_\theta(x,\emptyset)$, with $w \in [3, 7]$ the workhorse range for text-to-image.
Production: SD/SDXL use a VAE latent + UNet; SD3 / FLUX.1 switch to Rectified Flow + MM-DiT; ControlNet adds a trainable side branch to a frozen UNet; DiT replaces the UNet with a Transformer.
Acceleration: DPM-Solver++ brings NFE down to 10-20; Consistency Models learn $f_\theta(x_t, t) \mapsto x_0$ for 1-4 steps; LCM / LCM-LoRA / SDXL-Turbo (ADD) / SD3-Turbo (LADD) make distillation usable across the Stable Diffusion family.

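§1 Intuition & Three Views

1.1　One-Sentence Intuition

Diffusion = learning to denoise: gradually add noise to clean data until it becomes pure Gaussian noise (forward), then learn to undo this and recover data from noise step by step (reverse). Diffusion papers differ in three choices:

how the forward process adds noise (schedule, SDE type VP/VE)
what the network predicts ($\epsilon$ / $x_0$ / $v$ / score / $D$)
how the reverse process samples (Markov ancestral / DDIM / DPM-Solver / EDM Heun / Consistency one-step)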

1.2　The Three Views Side by Side

                            Unified framework (Song et al. 2021)

       Discrete view (DDPM)      Continuous view (Score SDE)   Flow view (FM/RF)
       ────────────────────      ───────────────────────────   ─────────────────
       q(x_t|x_{t-1})       →    dx = f(x,t)dt+g(t)dW      →   dx = u_t(x) dt
       closed-form q(x_t|x_0)    forward SDE                   ODE (deterministic)
               ↓                           ↓                          ↓
       ε-prediction              score s = ∇ log p_t           vector field v_t
               ↘                           ↓                          ↙
                     all linearly interchangeable (on a Gaussian path)
                     s = -ε/σ_t,   v = α'x_0 + σ'ε,   ε = -σ s

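Interview one-liner

"DDPM is the discrete-time special case of the VP-SDE; score-based modeling is an equivalent continuous-time parameterization; on a VP/VE path Flow Matching carries the same information as score matching, only parameterized as $v$ instead of $s$. Rectified Flow steps outside the SDE framework and learns the ODE vector field directly on a linear path."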

1.3　Conventions (used throughout)

Symbol	Meaning
$x_0$	clean data sample
$x_t$, $t \in \{1,\dots,T\}$ or $t \in [0,T]$	noised sample
$\epsilon \sim \mathcal{N}(0, I)$	standard Gaussian noise
$\alpha_t, \beta_t = 1 - \alpha_t$	DDPM single-step forward coefficients
$\bar\alpha_t = \prod_{s=1}^t \alpha_s$	DDPM cumulative coefficient
$\sigma_t$	standard deviation (the "noise level" in the NCSN / EDM view)
$s_\theta(x_t, t) \approx \nabla_{x_t}\log p_t(x_t)$	score
$\epsilon_\theta(x_t, t) \approx \epsilon$	noise predicted by DDPM
$D_\theta(x; \sigma) \approx x_0$	EDM denoiser output

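Time-direction pitfall

In the DDPM paper the forward process runs $t = 0 \to T$ (data to pure noise) and the reverse runs $T \to 0$; FM papers often put noise at $t = 0$ and data at $t = 1$. Before writing code in an interview, state which time direction you use, otherwise the sampler can easily run backwards.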

§2 DDPM Forward Process

2.1　Single Step and Closed Form

The DDPM forward process is a Markov chain:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I), \quad t = 1, \dots, T$$

Define $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^t \alpha_s$. The key property is that $q(x_t | x_0)$ is a closed-form Gaussian, so we can jump from $x_0$ to any $t$ in one step (the core of training efficiency):

$$\boxed{\; q(x_t | x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar\alpha_t}\, x_0,\; (1-\bar\alpha_t) I\right) \;}$$

Equivalent reparameterization:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

2.2　Deriving the Closed Form (must-know, comes up repeatedly)

Start from the reparameterization $x_t = \sqrt{\alpha_t} x_{t-1} + \sqrt{\beta_t} z_t$ with independent $z_t \sim \mathcal{N}(0, I)$, and unroll:

$$ \begin{aligned} x_t &= \sqrt{\alpha_t} x_{t-1} + \sqrt{\beta_t} z_t \\ &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}} x_{t-2} + \sqrt{\beta_{t-1}} z_{t-1}\right) + \sqrt{\beta_t} z_t \\ &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \underbrace{\sqrt{\alpha_t \beta_{t-1}} z_{t-1} + \sqrt{\beta_t} z_t}_{\text{sum of independent Gaussians}} \end{aligned} $$

The variance of the sum of the two independent Gaussians is $\alpha_t \beta_{t-1} + \beta_t = \alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$, so they merge into a single Gaussian term $\sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar z$ with $\bar z \sim \mathcal{N}(0, I)$. By induction over $t$ steps:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$$

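Intuition behind the trick

Every step of the chain is a linear-Gaussian transition, so the composition is still Gaussian; the forward process can therefore be sampled without a network, and training never has to simulate the whole chain.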
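The closed form above can be checked numerically. The following is a minimal NumPy sketch, assuming the linear schedule of §4.1; the helper names `linear_beta_schedule` and `q_sample` are illustrative, not taken from any library. It draws $x_t$ in one shot and verifies that unrolling the per-step variance recursion reproduces $1 - \bar\alpha_T$.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # beta_1 .. beta_T, linearly spaced (Ho 2020 defaults).
    return np.linspace(beta_start, beta_end, T, dtype=np.float64)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) directly; t is 1-indexed as in the text."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)

# Unroll the Markov chain's variance: v_t = alpha_t * v_{t-1} + beta_t, v_0 = 0.
chain_var = 0.0
for b in betas:
    chain_var = (1.0 - b) * chain_var + b

rng = np.random.default_rng(0)
x0 = np.ones((4, 8))
x_T, _ = q_sample(x0, 1000, alpha_bar, rng)
snr_T = alpha_bar[-1] / (1.0 - alpha_bar[-1])
print(alpha_bar[-1], snr_T, chain_var)
```

The recursion and the cumulative product agree to floating-point precision, which is the induction step of §2.2 in code.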

2.3　Boundaries and Limits

$t = 0$: $\bar\alpha_0 = 1$, so $x_t$ is $x_0$ itself; this is the start of the forward process
$t = T$ (DDPM uses 1000): we need $\bar\alpha_T \approx 0$ so that $x_T \approx \epsilon \sim \mathcal{N}(0, I)$; the end of the forward process is close to the Gaussian prior

SNR (Signal-to-Noise Ratio) at the end of the schedule

SNR$(t) = \bar\alpha_t / (1-\bar\alpha_t)$; under the linear schedule $\bar\alpha_T \approx 4\times 10^{-5}$ at $t=T$, i.e. SNR $\approx 4\times 10^{-5}$. This is tiny but not exactly 0, so the prior does not perfectly match $\mathcal{N}(0,I)$; this is one of the motivations behind the cosine schedule and "v-prediction"-style fixes.

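§3 DDPM Reverse Process & Training

3.1　Why the Reverse Step Can Be Gaussian

In theory $q(x_{t-1} | x_t)$ is not Gaussian (it depends on the whole data distribution). When $\beta_t$ is small enough, however, the reverse conditional is approximately Gaussian (Feller 1949 / Sohl-Dickstein 2015), so we parameterize: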

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)$$

3.2　ELBO Derivation

DDPM optimizes the evidence lower bound (as in a VAE):

$$ \begin{aligned} \log p_\theta(x_0) &\ge \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] \\ &= -\underbrace{\mathbb{E}_q[\text{KL}(q(x_T|x_0) \,\Vert\, p(x_T))]}_{L_T \text{ (constant, prior matching)}} \\ &\quad - \sum_{t=2}^T \underbrace{\mathbb{E}_q[\text{KL}(q(x_{t-1}|x_t, x_0) \,\Vert\, p_\theta(x_{t-1}|x_t))]}_{L_{t-1}} \\ &\quad + \underbrace{\mathbb{E}_q[\log p_\theta(x_0 | x_1)]}_{-L_0 \text{ (decoder log-likelihood)}} \end{aligned} $$

Key fact: $q(x_{t-1} | x_t, x_0)$ is a closed-form Gaussian (by Bayes' rule):

$$q(x_{t-1} | x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \tilde\mu_t(x_t, x_0),\; \tilde\beta_t I\right)$$

where:

$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}} \beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} x_t, \quad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$$

3.3　Simplifying to $L_\text{simple}$ (must-know derivation)

Substitute $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\epsilon) / \sqrt{\bar\alpha_t}$ into $\tilde\mu_t$:

$$\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon\right)$$

Parameterize $\mu_\theta(x_t, t)$ in the same form ($\epsilon$-prediction):

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t, t)\right)$$

Fix $\Sigma_\theta = \sigma_t^2 I$ (with $\sigma_t^2 = \beta_t$ or $\tilde\beta_t$). The KL between the two Gaussians, up to a constant independent of $\theta$, is:

$$L_{t-1} = \mathbb{E}\left[\frac{1}{2\sigma_t^2} \| \tilde\mu_t - \mu_\theta \|^2\right] = \mathbb{E}\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)} \|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

Ho 2020's engineering trick: drop the leading coefficient and all constants, and use the unweighted version directly:

$$\boxed{\; L_\text{simple}(\theta) = \mathbb{E}_{t \sim \mathcal{U}\{1,\dots,T\},\; x_0,\; \epsilon}\Big[\big\|\epsilon - \epsilon_\theta\!\big(\sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\epsilon,\; t\big)\big\|^2\Big] \;}$$

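Why does dropping the coefficient still work?

Ho 2020's empirical observation: relative to the ELBO weighting, the unweighted loss puts more weight on low-SNR (large $t$) terms, which improves sample quality. The price is that the objective is no longer a bound on $\log$-likelihood, so "good FID" ≠ "good likelihood". Improved DDPM (Nichol-Dhariwal 2021) later introduced the hybrid loss $L_\text{hybrid} = L_\text{simple} + \lambda L_\text{vlb}$ ($\lambda = 0.001$) and learns $\Sigma_\theta$ as well.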
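As a rough illustration of $L_\text{simple}$, the sketch below estimates the loss for one minibatch in NumPy. It assumes the linear schedule and averages the squared error over elements rather than summing it; `eps_model` stands for any callable `(x_t, t) -> eps_hat`, and the `zero_model` used here is a hypothetical placeholder, not a trained network.

```python
import numpy as np

def l_simple(eps_model, x0, alpha_bar, rng):
    """Monte-Carlo estimate of L_simple for a batch x0 of shape (B, D)."""
    B, T = x0.shape[0], alpha_bar.shape[0]
    t = rng.integers(1, T + 1, size=B)              # t ~ Uniform{1, ..., T}
    a = alpha_bar[t - 1][:, None]                   # broadcast over feature dim
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps  # closed-form forward sample
    return np.mean((eps - eps_model(x_t, t)) ** 2)  # per-element mean squared error

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4096, 16))

def zero_model(x_t, t):
    # Placeholder "network" that always predicts zero noise.
    return np.zeros_like(x_t)

loss = l_simple(zero_model, x0, alpha_bar, rng)
print(loss)
```

A model that always predicts zero scores about 1 per element (the variance of $\epsilon$), which is a handy sanity baseline: a trained $\epsilon_\theta$ should land well below it.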

3.4　Converting Between Prediction Targets (memorize)

Given $x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\epsilon$, the mainstream parameterizations are linearly interchangeable:

$$ \begin{aligned} \epsilon\text{-pred} &:\quad \epsilon_\theta(x_t, t) \approx \epsilon \\ x_0\text{-pred} &:\quad \hat x_0(x_t, t) = \frac{x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta}{\sqrt{\bar\alpha_t}} \\ v\text{-pred (Salimans-Ho 2022)} &:\quad v_\theta = \sqrt{\bar\alpha_t}\, \epsilon - \sqrt{1-\bar\alpha_t}\, x_0 \\ \text{score} &:\quad s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}} \end{aligned} $$

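Why is v-prediction more stable?

$\epsilon$-pred degenerates as $t \to T$ (high noise): $x_t \approx \epsilon$, so the prediction is nearly trivial, and recovering $\hat x_0$ divides by $\sqrt{\bar\alpha_t} \to 0$, which amplifies small errors. $x_0$-pred degenerates as $t \to 0$ (low noise): $x_t \approx x_0$, and the implied $\epsilon$ divides by $\sqrt{1-\bar\alpha_t} \to 0$. $v$-pred interpolates between the two, so the target stays well scaled at every $t$; this is why Imagen Video and the SD 2.x "v" checkpoints use it, and EDM's preconditioning reaches a similar balance through $c_\text{skip}$ / $c_\text{out}$.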
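The conversions in §3.4 are plain linear algebra. Below is a small NumPy sketch of them; the function names are made up for illustration, and `a_bar` denotes $\bar\alpha_t$. The round trip at the end confirms that $\epsilon$, $x_0$ and $v$ carry the same information.

```python
import numpy as np

def eps_to_x0(x_t, eps, a_bar):
    return (x_t - np.sqrt(1.0 - a_bar) * eps) / np.sqrt(a_bar)

def eps_to_v(x_t, eps, a_bar):
    x0 = eps_to_x0(x_t, eps, a_bar)
    return np.sqrt(a_bar) * eps - np.sqrt(1.0 - a_bar) * x0

def v_to_eps(x_t, v, a_bar):
    # Solve the 2x2 linear system {x_t, v} -> {x0, eps}; it is a rotation.
    return np.sqrt(a_bar) * v + np.sqrt(1.0 - a_bar) * x_t

def v_to_x0(x_t, v, a_bar):
    return np.sqrt(a_bar) * x_t - np.sqrt(1.0 - a_bar) * v

def eps_to_score(eps, a_bar):
    return -eps / np.sqrt(1.0 - a_bar)

rng = np.random.default_rng(0)
x0, eps, a_bar = rng.standard_normal(5), rng.standard_normal(5), 0.3
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
v = eps_to_v(x_t, eps, a_bar)
print(np.allclose(v_to_eps(x_t, v, a_bar), eps), np.allclose(v_to_x0(x_t, v, a_bar), x0))
```

Because $(x_t, v)$ is a rotation of $(x_0, \epsilon)$ by the angle whose cosine is $\sqrt{\bar\alpha_t}$, no division appears in `v_to_eps` or `v_to_x0`, which is one way to see why $v$-pred stays well conditioned at both ends.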

§4 Schedule：linear / cosine / EDM

4.1　Linear (Ho 2020)

$$\beta_t = \beta_\text{start} + \frac{t-1}{T-1}(\beta_\text{end} - \beta_\text{start}), \quad \beta_\text{start} = 10^{-4},\; \beta_\text{end} = 0.02$$

$T = 1000$. Simple and stable, but the terminal SNR is not exactly 0 ($\bar\alpha_T \approx 4 \times 10^{-5}$, i.e. SNR $\approx 4 \times 10^{-5}$, whereas an ideal prior needs it closer to 0).

4.2　Cosine (Nichol-Dhariwal 2021)

$$\bar\alpha_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos^2\!\left(\frac{(t/T) + s}{1 + s} \cdot \frac{\pi}{2}\right), \quad s = 0.008$$

$\beta_t = 1 - \bar\alpha_t / \bar\alpha_{t-1}$ (then clipped to $[0, 0.999]$ to avoid numerical issues). The offset $s = 0.008$ keeps $\beta_t$ from becoming too small near $t = 0$.

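Why is the cosine schedule better?

The linear schedule destroys information too quickly, so the model spends a large share of its timesteps in a region that is already almost pure noise and learns little there. The cosine schedule adds noise slowly at small $t$, faster in the middle, and its terminal SNR really does approach 0. Improved DDPM reports roughly a 20% FID improvement of cosine over linear on ImageNet 64.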

4.3　EDM σ-schedule (Karras 2022)

EDM reparameterizes the $\beta$ schedule as a $\sigma$ schedule (using $\sigma$ itself as time). At sampling time:

$$\sigma_i = \left(\sigma_\text{max}^{1/\rho} + \frac{i}{N-1}\left(\sigma_\text{min}^{1/\rho} - \sigma_\text{max}^{1/\rho}\right)\right)^\rho, \quad i = 0, \dots, N-1$$

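Defaults: $\sigma_\text{min} = 0.002$, $\sigma_\text{max} = 80$, $\rho = 7$. The value $\rho = 7$ came out of Karras et al.'s sweeps; it beats both linear and logarithmic spacing because it allocates more steps to the small-$\sigma$ (high-SNR) region, where the result is more sensitive to discretization error.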
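The cosine schedule of §4.2 and the $\sigma$-schedule of §4.3 can be written in a few lines. This is a sketch under the defaults quoted above; the function names are illustrative rather than taken from an existing codebase.

```python
import numpy as np

def cosine_alpha_bar(T=1000, s=0.008):
    # Returns alpha_bar_0 .. alpha_bar_T (length T + 1), with alpha_bar_0 = 1.
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]

def cosine_betas(T=1000, s=0.008, max_beta=0.999):
    ab = cosine_alpha_bar(T, s)
    return np.clip(1.0 - ab[1:] / ab[:-1], 0.0, max_beta)

def karras_sigmas(N=18, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    # Interpolate linearly in sigma^(1/rho), then raise back to the power rho.
    i = np.arange(N)
    lo, hi = sigma_min ** (1.0 / rho), sigma_max ** (1.0 / rho)
    return (hi + i / (N - 1) * (lo - hi)) ** rho

betas = cosine_betas()
sigmas = karras_sigmas()
print(betas[:3], betas[-1])
print(np.round(sigmas, 4))
```

Printing `sigmas` makes the effect of $\rho = 7$ visible: the first few steps cover most of the range from 80 downward, while the majority of steps sit below $\sigma \approx 2$.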

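Discrete vs continuous schedules

DDPM's $\beta$ array corresponds to the VP-SDE with $\beta(t) = T \beta_{\lfloor tT \rfloor}$; EDM's $\sigma$-schedule corresponds to the VE-SDE with $\sigma(t) = t$ (time linear in $\sigma$). The two are related by a reparameterization of $t$ together with a rescaling of $x$ (the VP sample divided by $\sqrt{\bar\alpha_t}$ follows a VE path with $\sigma = \sqrt{(1-\bar\alpha_t)/\bar\alpha_t}$), so they carry the same information. EDM's contribution is a rule for choosing the $\sigma_i$ that is more robust in practice.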

§5 The Score-Based View

5.1　Score and score matching (Hyvärinen 2005)

Define $s(x) = \nabla_x \log p(x)$. If we learn $s_\theta \approx s$, we can sample with Langevin dynamics:

$$x_{k+1} = x_k + \frac{\eta}{2} s_\theta(x_k) + \sqrt{\eta}\, z_k, \quad z_k \sim \mathcal{N}(0, I)$$

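The direct score matching loss $\mathbb{E}_p\|s_\theta - \nabla\log p\|^2$ cannot be computed (we do not know $\nabla \log p$). Hyvärinen 2005 derives implicit score matching, which removes $\nabla \log p$ via integration by parts (up to a constant independent of $\theta$):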

$$\mathbb{E}_p\left[\|s_\theta(x)\|^2 + 2 \operatorname{tr}(\nabla_x s_\theta(x))\right]$$

However, $\operatorname{tr}(\nabla_x s_\theta)$ (the trace of the Jacobian of $s_\theta$, i.e. a Hessian trace) is too expensive in high dimensions.

5.2　Denoising Score Matching (Vincent 2011)

For each data point $x_0$, add noise $\tilde x = x_0 + \sigma \epsilon$ and define the perturbed distribution $p_\sigma(\tilde x) = \int p(x_0) \mathcal{N}(\tilde x; x_0, \sigma^2 I) dx_0$. Vincent 2011 proves:

$$\mathbb{E}_{p_\sigma(\tilde x)}\|s_\theta(\tilde x) - \nabla \log p_\sigma(\tilde x)\|^2 = \mathbb{E}_{x_0, \tilde x}\left\|s_\theta(\tilde x) - \nabla_{\tilde x} \log q(\tilde x | x_0)\right\|^2 + \text{const}$$

and the score of $q(\tilde x | x_0) = \mathcal{N}(x_0, \sigma^2 I)$ is available in closed form:

$$\nabla_{\tilde x} \log q(\tilde x | x_0) = -\frac{\tilde x - x_0}{\sigma^2} = -\frac{\epsilon}{\sigma}$$

After multiplying by the weight $\sigma^2$, the training loss simplifies to:

$$\boxed{\; L_\text{DSM}(\theta) = \mathbb{E}_{x_0, \sigma, \epsilon}\left\| \sigma\, s_\theta(\tilde x; \sigma) + \epsilon \right\|^2 \;}$$

This is the NCSN / SMLD training objective (up to a weighting).

5.3　Tweedie's Formula (must-know derivation)

Statement: for additive Gaussian noise $x_t = x_0 + \sigma_t \epsilon$ (VE view, $\epsilon \sim \mathcal{N}(0,I)$):

$$\boxed{\; \mathbb{E}[x_0 | x_t] = x_t + \sigma_t^2\, \nabla_{x_t} \log p_t(x_t) \;}$$

Derivation: $p_t(x_t) = \int p_0(x_0) \mathcal{N}(x_t; x_0, \sigma_t^2 I)\, dx_0$. Take the gradient with respect to $x_t$:

$$\nabla_{x_t} p_t(x_t) = \int p_0(x_0) \cdot \nabla_{x_t} \mathcal{N}(x_t; x_0, \sigma_t^2 I)\, dx_0 = \int p_0(x_0) \cdot \mathcal{N}(x_t; x_0, \sigma_t^2 I) \cdot \frac{x_0 - x_t}{\sigma_t^2}\, dx_0$$

Divide both sides by $p_t(x_t)$:

$$\nabla_{x_t} \log p_t(x_t) = \frac{1}{p_t(x_t)} \int p_0(x_0) \mathcal{N}(x_t | x_0) \frac{x_0 - x_t}{\sigma_t^2}\, dx_0 = \mathbb{E}_{p_0(x_0 | x_t)}\left[\frac{x_0 - x_t}{\sigma_t^2}\right]$$

That is:

$$\sigma_t^2 \nabla_{x_t} \log p_t(x_t) = \mathbb{E}[x_0 | x_t] - x_t \quad \Rightarrow \quad \mathbb{E}[x_0 | x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t) \quad \square$$

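Tweedie is the "Rosetta Stone" connecting all diffusion parameterizations

The optimal output of a denoiser network (the MMSE estimator) is the identity map plus $\sigma_t^2$ times the score. Every conversion between $\epsilon$-pred / score-pred / $x_0$-pred / $v$-pred is a one-line rearrangement of Tweedie's formula.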
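Tweedie's formula can be verified on a case where everything is known in closed form. The sketch below assumes 1-D Gaussian data $x_0 \sim \mathcal{N}(0, s^2)$ (the values of `s_data` and `sigma_t` are arbitrary illustrative choices), for which $p_t = \mathcal{N}(0, s^2 + \sigma_t^2)$ and $\mathbb{E}[x_0 | x_t] = \frac{s^2}{s^2 + \sigma_t^2} x_t$.

```python
import numpy as np

s_data, sigma_t = 2.0, 0.7           # illustrative values, not from any paper
rng = np.random.default_rng(0)
x0 = s_data * rng.standard_normal(200_000)
x_t = x0 + sigma_t * rng.standard_normal(x0.shape)

def score(x):
    # Exact score of p_t = N(0, s_data^2 + sigma_t^2).
    return -x / (s_data**2 + sigma_t**2)

# Right-hand side of Tweedie's formula.
tweedie = x_t + sigma_t**2 * score(x_t)

# Posterior-mean slope: analytic value vs. least-squares fit of x0 on x_t.
slope_exact = s_data**2 / (s_data**2 + sigma_t**2)
slope_mc = np.dot(x0, x_t) / np.dot(x_t, x_t)
print(slope_exact, slope_mc)
```

The analytic slope and the regression slope estimated from samples agree to about three decimals, and `tweedie` equals `slope_exact * x_t` exactly, which is the formula specialized to Gaussian data.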

5.4　NCSN / SMLD (Song-Ermon 2019)

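Noise-Conditional Score Network: train one shared network $s_\theta(x, \sigma)$ with DSM over a set of noise levels $\sigma_1 > \sigma_2 > \dots > \sigma_L$ simultaneously. Sampling uses annealed Langevin dynamics: run Langevin at the large $\sigma_1$ first (to explore the whole space), then gradually decrease to $\sigma_L$ (to refine details).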

$$x \leftarrow x + \frac{\epsilon_i}{2} s_\theta(x, \sigma_i) + \sqrt{\epsilon_i}\, z, \quad \epsilon_i = \eta \cdot (\sigma_i / \sigma_L)^2$$

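Run $T$ Langevin steps at each $\sigma_i$ (here $T$ is the per-level step count, not the diffusion horizon, and $\epsilon_i$ above is a step size, not the noise variable), then switch to the next level $\sigma_{i+1}$.

Why is a single $\sigma$ not enough?

A score trained at small $\sigma$ is badly wrong far from the data manifold (in the "empty regions" between modes $p(x) \approx 0$ and the score gives no useful direction). The point of multiple noise levels is to use large $\sigma$ to "fill" the space and provide good starting positions for small $\sigma$.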

§6 Score SDE: Unified Framework + Probability Flow ODE

6.1　Forward SDE

Song et al. 2021 (ICLR) write every diffusion model as a forward SDE:

$$dx = f(x, t)\, dt + g(t)\, dW$$

Type	$f(x, t)$	$g(t)$	Discrete counterpart / note
VP-SDE (variance preserving)	$-\frac{1}{2}\beta(t) x$	$\sqrt{\beta(t)}$	DDPM
VE-SDE (variance exploding)	$0$	$\sqrt{d[\sigma^2(t)]/dt}$	SMLD / EDM
sub-VP	$-\frac{1}{2}\beta(t) x$	$\sqrt{\beta(t)(1-e^{-2\int_0^t \beta(s)ds})}$	variance bounded above by VP's; better likelihoods

The VP-SDE keeps $\text{Var}[x_t] \le 1$ when the data variance is at most 1 (variance preserving), while the VE-SDE lets the variance grow without bound (variance exploding).

6.2　Reverse SDE (Anderson 1982)

For any forward SDE there is a reverse-time SDE:

$$\boxed{\; dx = \left[f(x, t) - g^2(t)\, \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar W \;}$$

$d\bar W$ is a reverse-time Wiener process. Sampling: start from $x_T \sim p_T$ (close to the prior) and integrate down to $t = 0$ with an SDE solver (Euler-Maruyama / predictor-corrector).

6.3　Probability Flow ODE (the bridge to FM)

Key theorem (Song et al. 2021, "Score-Based Generative Modeling through SDEs"): the following deterministic ODE shares the marginals $p_t$ of the reverse SDE at every time:

$$\boxed{\; \frac{dx}{dt} = f(x, t) - \frac{1}{2} g^2(t)\, \nabla_x \log p_t(x) \;}$$

This is the probability flow ODE. With the learned score plugged in, it plays the role of the Flow Matching vector field:

$$u_t(x) = f(x, t) - \tfrac{1}{2} g^2(t)\, s_\theta(x, t)$$

How the three samplers relate

           forward SDE (training: score matching)
                       ↓
           ┌──────────────────────┐
           ↓                      ↓
     reverse SDE              probability flow ODE
     (stochastic)             (deterministic, ⇔ FM)
           ↓                      ↓
   DDPM ancestral sampler   DDIM (η=0) / EDM / DPM-Solver

Proof sketch: write the Fokker-Planck equation of the forward SDE:

$$\frac{\partial p_t}{\partial t} = -\nabla \cdot (f p_t) + \frac{1}{2} g^2 \Delta p_t$$

Use $\Delta p_t = \nabla \cdot (p_t \nabla \log p_t)$ to rewrite the diffusion term in transport form:

$$\frac{\partial p_t}{\partial t} = -\nabla \cdot \left[\left(f - \tfrac{1}{2} g^2 \nabla \log p_t\right) p_t\right]$$

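This is exactly the continuity equation of the ODE $dx/dt = f - \frac{1}{2} g^2 \nabla \log p_t$, so the two share the same $p_t$.

6.4　Advantages of the ODE View

Advantage	Explanation
Deterministic	same noise → same sample; enables image editing / interpolation
NFE-friendly	higher-order ODE solvers (Heun / RK4 / DPM-Solver) need few steps
Computable likelihood	$\log p_0(x_0) = \log p_T(x_T) + \int_0^T \nabla \cdot v_t(x(t))\, dt$ (instantaneous change of variables for the PF-ODE, Chen et al. 2018), with the divergence estimated by the Hutchinson trace estimator
Bridge to FM	RF / SD3 / FLUX follow this line

The SDE vs ODE trade-off

The random perturbations in SDE sampling can "correct" earlier errors, which often gives higher sample quality at the cost of more NFE; the ODE is deterministic but accumulates solver error and benefits from higher-order solvers. EDM proposes a compromise: a base ODE plus a small amount of stochastic churn ("$S_\text{churn}$"), which improves FID in some settings.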

§7 DDIM: Non-Markovian Forward → Deterministic Sampler

7.1　Motivation

DDPM ancestral sampling has to walk all $T = 1000$ steps of the Markov chain. Can we sample in fewer steps without retraining? DDIM (Song et al. 2020 arXiv / ICLR 2021) answers yes: the key is to make the forward process non-Markovian while keeping the same marginal $q(x_t | x_0)$ as DDPM.

7.2　Non-Markovian Forward

DDIM defines a family of forward distributions controlled by a parameter $\eta \in [0, 1]$:

$$q_\sigma(x_{t-1} | x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \sqrt{\bar\alpha_{t-1}}\, x_0 + \sqrt{1 - \bar\alpha_{t-1} - \sigma_t^2}\, \frac{x_t - \sqrt{\bar\alpha_t} x_0}{\sqrt{1-\bar\alpha_t}},\; \sigma_t^2 I\right)$$

where $\sigma_t^2 = \eta^2 \cdot \tilde\beta_t = \eta^2 \cdot \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t} \beta_t$.

Key property (DDIM Theorem 1): under this forward process, $q(x_t | x_0)$ is still $\mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) I)$, identical to DDPM. An $\epsilon_\theta$ trained with DDPM can therefore be used for DDIM sampling as is.

7.3　DDIM Sampling Formula

Replace $x_0$ by $\hat x_0 = (x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(x_t, t)) / \sqrt{\bar\alpha_t}$:

$$\boxed{\; x_{t-1} = \sqrt{\bar\alpha_{t-1}}\, \hat x_0 + \sqrt{1 - \bar\alpha_{t-1} - \sigma_t^2}\, \epsilon_\theta(x_t, t) + \sigma_t\, z, \quad z \sim \mathcal{N}(0, I) \;}$$

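$\eta = 0$ (DDIM): $\sigma_t = 0$, deterministic; the same $x_T$ always maps to the same $x_0$ (convenient for latent-space interpolation)
$\eta = 1$ (all $T$ steps): $\sigma_t = \sqrt{\tilde\beta_t}$, reduces to the standard DDPM ancestral sampler; with $S < T$ skipped steps it only matches the variance scale and is not strictly equivalent to 1000-step DDPM
Intermediate $\eta \in (0, 1)$: tunable stochasticity

7.4　Skip Steps (few-step sampling)

There is no need to go $t \to t-1$ one step at a time. Pick a sub-sequence $\tau_0 < \tau_1 < \dots < \tau_S = T$ and apply: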

$$x_{\tau_{i-1}} = \sqrt{\bar\alpha_{\tau_{i-1}}}\, \hat x_0 + \sqrt{1 - \bar\alpha_{\tau_{i-1}} - \sigma_{\tau_i}^2}\, \epsilon_\theta(x_{\tau_i}, \tau_i) + \sigma_{\tau_i}\, z$$

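Classic baseline: in the DDIM paper's CIFAR-10 / CelebA experiments, $S = 50$ DDIM steps already come close to the FID of 1000-step sampling.

DDIM = a discretization of the probability flow ODE

With $\eta = 0$ and the time grid taken to the continuous limit, DDIM becomes a first-order (Euler-type) discretization of the probability flow ODE of the VP-SDE. This is why deterministic DDIM sits on the same line as ODE-based samplers (DPM-Solver, EDM Heun).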
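The DDIM update with skipped steps fits in one function. The sketch below is a NumPy illustration, not a reference implementation: `ddim_step` and `oracle_eps` are hypothetical names, and instead of a trained network it uses the exact $\epsilon$ for a toy data distribution that is a single point `x_star`, so the result can be checked.

```python
import numpy as np

def ddim_step(x_t, eps_hat, a_t, a_prev, eta=0.0, rng=None):
    """One DDIM step from alpha_bar = a_t to alpha_bar = a_prev."""
    x0_hat = (x_t - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
    # sigma^2 = eta^2 * (1 - a_prev) / (1 - a_t) * (1 - a_t / a_prev)
    sigma = eta * np.sqrt((1.0 - a_prev) / (1.0 - a_t) * (1.0 - a_t / a_prev))
    x_prev = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev - sigma**2) * eps_hat
    if eta > 0:
        x_prev = x_prev + sigma * rng.standard_normal(x_t.shape)
    return x_prev

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x_star = np.array([1.0, -2.0, 0.5])          # toy dataset consisting of one point

def oracle_eps(x_t, a_t):
    # Exact noise when p_data is a point mass at x_star (stands in for eps_theta).
    return (x_t - np.sqrt(a_t) * x_star) / np.sqrt(1.0 - a_t)

taus = np.linspace(999, 0, 50).round().astype(int)   # 50 of the 1000 indices
x = np.random.default_rng(0).standard_normal(3)
for i, idx in enumerate(taus):
    a_t = alpha_bar[idx]
    a_prev = alpha_bar[taus[i + 1]] if i + 1 < len(taus) else 1.0   # alpha_bar_0 = 1
    x = ddim_step(x, oracle_eps(x, a_t), a_t, a_prev)
print(x)
```

With the oracle, 50 deterministic steps land on `x_star`; with a real $\epsilon_\theta$ the same loop applies, only the call to `oracle_eps` changes.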

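§8 EDM: The Karras 2022 Design Space

8.1　Motivation

Karras 2022 ("Elucidating the Design Space of Diffusion-Based Generative Models") separates the design degrees of freedom of diffusion models (parameterization, loss weighting, sampler, schedule), sweeps them one by one, and arrives at a SOTA recipe: CIFAR-10 FID 1.79 (class-conditional, 35 NFE) and ImageNet 64 FID 1.36.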

8.2　Preconditioning (must-know derivation)

EDM takes the VE view: $x = x_0 + \sigma \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, with $\sigma$ itself as the noise level (there is no $\alpha$).

Denoiser parameterization:

$$\boxed{\; D_\theta(x;\, \sigma) = c_\text{skip}(\sigma)\, x + c_\text{out}(\sigma)\, F_\theta\!\left(c_\text{in}(\sigma)\, x,\; c_\text{noise}(\sigma)\right) \;}$$

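where $F_\theta$ is the underlying network and the four $c$ functions are hand-designed, closed-form functions of $\sigma$. Karras et al. derive them as follows:

Derivation: the unit-variance argument

Goal: keep both the input and the training target of $F_\theta$ at $\mathcal{O}(1)$ variance for every $\sigma$.

Input side: the network sees $c_\text{in} x$. Since $\text{Var}[x] = \sigma_\text{data}^2 + \sigma^2$ (data variance + noise variance):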

$$c_\text{in}(\sigma) = \frac{1}{\sqrt{\sigma_\text{data}^2 + \sigma^2}} \quad \Rightarrow \quad \text{Var}[c_\text{in} x] = 1$$

Output side: the ideal denoiser is $D^*(x; \sigma) = \mathbb{E}[x_0 | x]$ (Tweedie). We let the network learn a residual rather than the full output, and define the effective target

$$F^*(x; \sigma) = \frac{1}{c_\text{out}(\sigma)}\left[D^*(x;\sigma) - c_\text{skip}(\sigma)\, x\right]$$

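We require this target to have unit variance. Replacing $D^*$ by the clean sample $x_0$ (the actual regression target) and using $x = x_0 + \sigma\epsilon$, the target is $\frac{1}{c_\text{out}}\left[(1-c_\text{skip})\,x_0 - c_\text{skip}\,\sigma\epsilon\right]$, so unit variance gives $c_\text{out}^2 = (1-c_\text{skip})^2\sigma_\text{data}^2 + c_\text{skip}^2\sigma^2$.

$c_\text{skip}$ is then chosen to minimize $c_\text{out}$, so that errors made by $F_\theta$ are amplified as little as possible. Setting the derivative of the right-hand side with respect to $c_\text{skip}$ to zero and substituting back gives: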

$$c_\text{skip}(\sigma) = \frac{\sigma_\text{data}^2}{\sigma^2 + \sigma_\text{data}^2}, \quad c_\text{out}(\sigma) = \frac{\sigma \cdot \sigma_\text{data}}{\sqrt{\sigma^2 + \sigma_\text{data}^2}}$$

Intuition:

$\sigma \to 0$ (low noise): $c_\text{skip} \to 1, c_\text{out} \to 0$; the output is essentially the input (the denoiser has almost nothing to do)
$\sigma \to \infty$ (high noise): $c_\text{skip} \to 0, c_\text{out} \to \sigma_\text{data}$; the output is determined entirely by the network (the input is all noise)

Noise embedding: $c_\text{noise}(\sigma) = \frac{1}{4} \ln \sigma$ (log scale, covering the wide dynamic range $\sigma \in [\sigma_\text{min}, \sigma_\text{max}]$).
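The four preconditioning functions and the unit-variance identity they satisfy can be checked directly. This is a sketch that assumes EDM's default $\sigma_\text{data} = 0.5$; `edm_precond` and `denoise` are illustrative names, and `F` is any callable standing in for the raw network.

```python
import numpy as np

SIGMA_DATA = 0.5    # EDM's default; treat it as an assumption for other datasets

def edm_precond(sigma, sigma_data=SIGMA_DATA):
    sigma = np.asarray(sigma, dtype=np.float64)
    total = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / total
    c_out = sigma * sigma_data / np.sqrt(total)
    c_in = 1.0 / np.sqrt(total)
    c_noise = 0.25 * np.log(sigma)
    return c_skip, c_out, c_in, c_noise

def denoise(F, x, sigma):
    """D(x; sigma) = c_skip * x + c_out * F(c_in * x, c_noise)."""
    c_skip, c_out, c_in, c_noise = edm_precond(sigma)
    return c_skip * x + c_out * F(c_in * x, c_noise)

sigmas = np.array([0.002, 0.5, 80.0])
c_skip, c_out, c_in, c_noise = edm_precond(sigmas)
# Unit-variance condition on the target: c_out^2 = (1 - c_skip)^2 s_d^2 + c_skip^2 s^2
rhs = (1.0 - c_skip) ** 2 * SIGMA_DATA**2 + c_skip**2 * sigmas**2
print(np.allclose(c_out**2, rhs), c_skip, c_out)
```

At $\sigma = 0.002$ the skip weight is essentially 1, and at $\sigma = 80$ it is essentially 0 while $c_\text{out}$ approaches $\sigma_\text{data}$, matching the two limits listed above.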

8.3　Training Loss

EDM uses a weighted L2 loss:

$$L_\text{EDM}(\theta) = \mathbb{E}_{\sigma, x_0, \epsilon}\Big[\lambda(\sigma)\, \big\| D_\theta(x_0 + \sigma\epsilon;\, \sigma) - x_0 \big\|^2\Big]$$

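The weight $\lambda(\sigma) = (\sigma^2 + \sigma_\text{data}^2) / (\sigma \cdot \sigma_\text{data})^2 = 1/c_\text{out}^2$ makes this equivalent to training $F_\theta$ with an unweighted L2 loss (the target has unit variance at every $\sigma$, so the loss magnitude is uniform).

Sampling $\sigma$ during training: $\ln \sigma \sim \mathcal{N}(P_\text{mean}, P_\text{std}^2)$, with defaults $P_\text{mean} = -1.2$, $P_\text{std} = 1.2$ (this concentrates $\sigma$ around $e^{-1.2} \approx 0.3$, the "hardest to learn" SNR region according to Karras et al.'s sweeps).

8.4　Heun 2nd-order sampler

EDM samples with a 2nd-order Heun ODE solver by default, plus optional stochastic churn. For the VE-SDE, with $f = 0, g(t) = \sqrt{d\sigma^2/dt}$, the probability flow ODE is: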

$$\frac{dx}{d\sigma} = -\sigma\, \nabla_x \log p_\sigma(x) = \frac{x - D_\theta(x; \sigma)}{\sigma}$$

(By Tweedie, $\nabla \log p_\sigma = (D - x)/\sigma^2$; substitute into $dx/d\sigma = -\sigma \nabla \log p$.)

Heun step $i$ ($\sigma_i \to \sigma_{i+1}$, $\Delta\sigma = \sigma_{i+1} - \sigma_i$):

d_i  = (x_i - D_θ(x_i, σ_i)) / σ_i
x_*  = x_i + Δσ · d_i                       # Euler step (predictor)
if σ_{i+1} > 0:                             # skip the corrector on the last step
    d_*  = (x_* - D_θ(x_*, σ_{i+1})) / σ_{i+1}
    x_{i+1} = x_i + Δσ · (d_i + d_*) / 2     # Heun trapezoidal (corrector)
else:
    x_{i+1} = x_*

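Each step costs 2 NFE but is second-order accurate: more evaluations per step than first-order Euler, but more accurate for the same budget. EDM's CIFAR-10 setting uses 35 NFE = 18 steps (17 Heun steps plus a first-order final step), reaching FID 1.79.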
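A runnable version of the Heun loop above might look as follows. It is a sketch under stated assumptions: the $\sigma$ grid is the $\rho$-schedule of §4.3 with a final $\sigma = 0$ appended, and the network is replaced by `gaussian_denoiser`, the exact $\mathbb{E}[x_0|x]$ for toy data $\mathcal{N}(0, \sigma_\text{data}^2 I)$, so that the ODE has a known solution to compare against.

```python
import numpy as np

def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    i = np.arange(n)
    lo, hi = sigma_min ** (1.0 / rho), sigma_max ** (1.0 / rho)
    return np.append((hi + i / (n - 1) * (lo - hi)) ** rho, 0.0)   # last sigma = 0

def heun_sample(denoiser, x, sigmas):
    nfe = 0
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        d_cur = (x - denoiser(x, s_cur)) / s_cur            # dx/dsigma at (x, s_cur)
        x_euler = x + (s_next - s_cur) * d_cur              # Euler predictor
        nfe += 1
        if s_next > 0:                                      # trapezoidal corrector
            d_next = (x_euler - denoiser(x_euler, s_next)) / s_next
            x = x + (s_next - s_cur) * 0.5 * (d_cur + d_next)
            nfe += 1
        else:                                               # final step: Euler only
            x = x_euler
    return x, nfe

SIGMA_DATA = 0.5

def gaussian_denoiser(x, sigma):
    # Exact posterior mean when p_data = N(0, SIGMA_DATA^2 I); stands in for D_theta.
    return x * SIGMA_DATA**2 / (SIGMA_DATA**2 + sigma**2)

x_init = 80.0 * np.random.default_rng(0).standard_normal(5)
x_out, nfe = heun_sample(gaussian_denoiser, x_init, karras_sigmas(18))
# For this toy denoiser the ODE solution is x(sigma) proportional to sqrt(SIGMA_DATA^2 + sigma^2).
x_exact = x_init * SIGMA_DATA / np.sqrt(SIGMA_DATA**2 + 80.0**2)
print(nfe, x_out / x_exact)
```

The printed NFE is 35 for 18 steps, matching the count quoted above, and the ratio to the exact solution moves toward 1 as the number of steps grows.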

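Stochastic churn (optional)

At the start of each step, temporarily raise $\sigma_i$ to $\hat\sigma_i = (1+\gamma_i)\sigma_i$ ($\gamma_i$ is the small churn amount for this step) and inject the matching extra noise: $\hat x_i = x_i + \sqrt{\hat\sigma_i^2 - \sigma_i^2}\, z$, where $\sqrt{\hat\sigma_i^2 - \sigma_i^2} = \sigma_i\sqrt{2\gamma_i + \gamma_i^2}$; stepping from $\hat\sigma_i$ back down to $\sigma_{i+1}$ then acts like a small SDE step. In EDM's experiments a small amount of churn improves (lowers) FID on ImageNet 64, where the headline 1.36 is obtained with the stochastic sampler.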

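§9 Higher-Order Samplers: DPM-Solver / DPM-Solver++

9.1　Motivation

DDIM is a first-order (Euler-type) ODE solver. DPM-Solver (Lu et al. 2022, NeurIPS) exploits the semi-linear structure of the diffusion ODE to build higher-order expansions. Rewriting the VP-SDE probability flow ODE with $\epsilon$-pred: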

$$\frac{dx}{dt} = f(t)\, x + g(t)\, \epsilon_\theta(x, t)$$

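where $f(t) = -\frac{1}{2}\beta(t)$ and $g(t) = +\frac{1}{2}\beta(t)/\sqrt{1-\bar\alpha_t}$ (from $-\frac{1}{2}g_\text{SDE}^2 \cdot s = +\frac{1}{2}\beta\cdot \epsilon/\sqrt{1-\bar\alpha_t}$, since $s = -\epsilon/\sqrt{1-\bar\alpha_t}$). Note that this $g$ is the coefficient of $\epsilon_\theta$, not the SDE diffusion coefficient $g_\text{SDE}$.

Integrate the linear part exactly (exponential integrator) and Taylor-expand the remainder.

9.2　DPM-Solver-2 / 3 (core idea)

Let $\lambda_t = \log(\sqrt{\bar\alpha_t} / \sqrt{1-\bar\alpha_t})$ (half the log-SNR) and use $\lambda$ as the time variable. For a step from time $s$ to time $t$ the exact solution of the ODE reads: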

$$x_{t} = \frac{\sqrt{\bar\alpha_t}}{\sqrt{\bar\alpha_s}} x_s - \sqrt{\bar\alpha_t} \int_{\lambda_s}^{\lambda_t} e^{-\lambda} \hat\epsilon_\theta(x_\tau, \tau)\, d\lambda$$

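Taylor-expand $\hat\epsilon_\theta$ in $\lambda$ to order $k$; the linear part stays exact (through the exponential weight), and the remainder is approximated to the chosen order:

DPM-Solver-1 = DDIM (first order)
DPM-Solver-2: 2 NFE per step, second order
DPM-Solver-3: 3 NFE per step, third order

10-15 NFE reach the quality of 50-NFE DDIM.

9.3　DPM-Solver++ (CFG-friendly version, Lu et al. 2023)

The original DPM-Solver is unstable under CFG ($\epsilon_\theta$ amplified by CFG leaves the training range, so the Taylor expansion error grows). DPM-Solver++ switches to $x_0$-prediction: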

$$x_t = \frac{\sigma_t}{\sigma_s} x_s + \sigma_t \int_{\lambda_s}^{\lambda_t} e^{\lambda} \hat x^0_\theta(x_\tau, \tau)\, d\lambda$$

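(Using $x_0$-pred instead of $\epsilon$-pred keeps the CFG-amplified quantity in a more stable range; here $\sigma_t = \sqrt{1-\bar\alpha_t}$.)

With 15-20 NFE at CFG=7 the quality is close to 100-NFE DDIM. It is one of the default sampler choices for SDXL / SD3.

9.4　Sampler Comparison

Cheat sheet of common sampler choices

Ordered by NFE / quality / applicability (image generation).

DDPM ancestral: T=1000 steps, used as a baseline; rarely used today
DDIM ($\eta = 0$): 50-100 NFE, simple and stable, supports interpolation
PLMS / PNDM: 50 NFE, linear multistep, the old AUTOMATIC1111 default
EDM Heun: 18-35 NFE, deterministic second-order ODE, the SOTA baseline in the literature
DPM-Solver / DPM-Solver++: 10-20 NFE, recommended in HuggingFace diffusers
UniPC (Zhao 2023): predictor-corrector framework, can outperform DPM-Solver
Consistency Models (one-step / two-step): 1-4 NFE, requires distillation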

§10 Conditioning：Classifier Guidance & CFG

10.1　Classifier Guidance (Dhariwal-Nichol 2021)

Train a separate classifier $p_\phi(c | x_t)$ (on noisy data) and apply Bayes' rule:

$$\nabla_{x_t} \log p(x_t | c) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p_\phi(c | x_t)$$

In practice the classifier gradient is multiplied by a scale $w$ (which controls guidance strength):

$$\tilde\epsilon = \epsilon_\theta(x_t, t) - w \sqrt{1-\bar\alpha_t}\, \nabla_{x_t} \log p_\phi(c | x_t)$$

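Drawbacks of classifier guidance

(a) It requires training an extra noisy classifier, which is an engineering burden; (b) classifier gradients tend to be "adversarial" and degrade away from the training distribution; (c) it does not suit open-ended conditions such as text in text-to-image. CFG has essentially replaced it.

10.2　Classifier-Free Guidance (Ho-Salimans 2022)

Training: with probability $p_\text{drop}$ (typically 0.1) replace $c$ by $\emptyset$ (the null embedding), so that one network learns both the conditional and the unconditional model: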

$$L_\text{CFG}(\theta) = \mathbb{E}\big[\|\epsilon - \epsilon_\theta(x_t, t, c \text{ or } \emptyset)\|^2\big]$$

Inference: with $w$ denoting the guidance scale:

$$\boxed{\; \tilde\epsilon = \epsilon_\theta(x_t, t, \emptyset) + (1 + w)\big[\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset)\big] \;}$$

Equivalent form (common in Imagen / SD implementations):

$$\tilde\epsilon = (1 + w)\, \epsilon_\theta(x_t, t, c) - w\, \epsilon_\theta(x_t, t, \emptyset)$$

Two conventions for the CFG $w$

The Ho-Salimans 2022 paper writes $\tilde\epsilon = \epsilon_\text{uncond} + (1+w)(\epsilon_\text{cond} - \epsilon_\text{uncond})$, so $w = 0$ is unguided and $w > 0$ strengthens guidance. HuggingFace and SD UIs usually use $w' = w + 1$, so $w' = 1$ is unguided and $w' = 7.5$ is a common strength. State your convention explicitly in interview code.
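The two conventions differ only by an offset of 1, which is easy to get wrong in code. A minimal NumPy sketch (function names are illustrative; `guidance_scale` plays the role of $w'$):

```python
import numpy as np

def cfg_paper(eps_cond, eps_uncond, w):
    """Paper convention: w = 0 means no guidance."""
    return (1.0 + w) * eps_cond - w * eps_uncond

def cfg_scale(eps_cond, eps_uncond, guidance_scale):
    """UI-style convention (w' = w + 1): guidance_scale = 1 means no guidance."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_c, eps_u = rng.standard_normal(4), rng.standard_normal(4)
print(np.allclose(cfg_paper(eps_c, eps_u, 6.5), cfg_scale(eps_c, eps_u, 7.5)))
```

Both functions reduce to the conditional prediction at their respective "no guidance" value, which is a quick unit test to run before trusting a sampler.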

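10.3　Geometric Meaning of CFG

In score form ($s = -\epsilon/\sigma_t$), CFG replaces the conditional score by a guided score that is pulled further along the "conditional gradient" direction:

$$\tilde s(x_t, c) = \nabla_{x_t} \log p(x_t) + (1 + w)\, \nabla_{x_t} \log \frac{p(x_t | c)}{p(x_t)} = \nabla_{x_t} \log p(x_t | c) + w\, \nabla_{x_t} \log \frac{p(x_t | c)}{p(x_t)}$$

The extra term is the difference between the conditional and unconditional scores; it pushes samples toward regions where the conditional likelihood is high and the unconditional likelihood is comparatively low, which intuitively "amplifies text alignment".

CFG is central to text-image alignment in SD/SDXL/FLUX

A guidance scale of roughly 3 to 7.5 (UI convention $w'$) is the empirical sweet spot for Stable Diffusion; values above 10 tend to over-saturate (saturated colors, artifacts). FLUX.1 [dev] folds CFG into distillation ("guidance-distilled"), so a single forward pass achieves the CFG effect; this is one key to its inference speed.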

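§11 Production: From LDM to FLUX

11.1　Latent Diffusion (LDM, Rombach 2022 CVPR)

Core idea: run diffusion in the latent space of a VAE instead of pixel space.

Train a VAE $E, D$: $z = E(x), \hat x = D(z)$, where $z$ is about 8× smaller than $x$ per spatial side (e.g. $512^2 \times 3 \to 64^2 \times 4$)
Train the diffusion model on $z$ (parameters, memory and compute all drop substantially)
At generation time: sample from $z_T$ down to $z_0$, then decode back to pixels with $D(z_0)$

Stable Diffusion (SD) = LDM + CLIP text encoder + UNet on a $64 \times 64 \times 4$ latent; it was the most practical open-source T2I model of its time.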

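11.2　SDXL (Podell et al. 2023 arXiv / ICLR 2024 spotlight)

Main improvements from SD 1.5 to SDXL:

Larger UNet: parameters grow from ~860M to ~2.6B, with more cross-attention layers
Two-stage architecture: base + refiner (the refiner adds detail in the low-noise range)
Better text encoders: OpenCLIP ViT-bigG/14 and CLIP-L/14, concatenated
Multi-scale / multi-aspect-ratio training: native support for 1024×1024 and other aspect ratios
Micro-conditioning: original resolution, crop offset and aspect ratio are fed to the UNet as conditions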

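11.3　DiT (Peebles-Xie 2023 ICCV)

Replace the UNet with a pure Transformer:

Cut the latent into patches (e.g. $2 \times 2$) to form a token sequence
Standard Transformer blocks (self-attention + MLP)
Conditioning is injected through adaptive LayerNorm (adaLN): $\text{LN}(x) \cdot \gamma(c, t) + \beta(c, t)$, where $\gamma, \beta$ come from an MLP of $c, t$

DiT's experiments show better scaling behavior than UNets, with FID decreasing steadily as parameters grow. SD3 / FLUX / Sora are all built on the DiT family.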

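11.4　SD3 (Esser 2024 ICML): replacing diffusion with Rectified Flow

SD3 makes two key changes:

Rectified Flow instead of DDPM: the training target becomes $\|v_\theta - (x_1 - x_0)\|^2$ (FM framework; in this FM convention $x_0$ is the noise sample and $x_1$ the data sample, unlike the DDPM convention of §1.3)
MM-DiT: a multimodal DiT in which text tokens and image tokens attend to each other inside the same Transformer (joint attention rather than cross-attention)

Why switch to RF? Esser 2024's ablations: trajectories on the linear path are straighter than on the cosine path, which helps few-step sampling; logit-normal $t$ sampling puts more weight on mid-range noise levels and improves quality.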

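11.5　FLUX.1 (Black Forest Labs 2024)

Builds on SD3 + MM-DiT; main updates:

12B parameters (the open-weight dev version)
Guidance-distilled (the [dev] variant): CFG is distilled into a single forward pass, so inference needs no 2× CFG forward
Adversarial distillation for the [schnell] variant (similar in spirit to SD3-Turbo / ADD), producing images in 1-4 steps

11.6　ControlNet (Zhang 2023 ICCV)

Add a trainable copy to a frozen SD UNet, connected through zero-convs:

Original UNet (frozen)              Control signal (canny / depth / pose)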
     ↓                                       ↓
[encoder blocks]                      [trainable copy of encoder]
     ↓ ──────── zero-conv ──────────────────↓
[mid block]                           [trainable mid]
     ↓ ──────── zero-conv ──────────────────↓
[decoder blocks (frozen)]    +   [trainable copy outputs]
     ↓
   output

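Zero-conv = a 1×1 convolution initialized with zero weights → at the start of training ControlNet does not change the original UNet output (SD's abilities are preserved), and condition control is learned gradually as training proceeds.

Training efficiency of ControlNet

The original UNet (most of the parameters) is frozen and only the trainable copy (~half the parameters) is trained; this fits on a single GPU and is a key enabler for the open-source ecosystem.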

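§12 Distillation: 1-step / Few-step Generation

12.1　Progressive Distillation (Salimans-Ho 2022)

Iterative distillation: one student step $\approx$ two teacher steps, so $\log_2 N$ rounds compress $N$ steps to 1. The key point is that each round only halves the step count, which keeps distribution drift under control.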

12.2　Consistency Models (Song 2023 ICML)

Idea: directly learn a network $f_\theta(x_t, t)$ that satisfies, for every $t$:

$$f_\theta(x_t, t) \approx x_0$$

That is, the network is the consistency function of the probability flow ODE: any $x_t$ on a trajectory maps to that trajectory's $x_0$. One-step sampling: $x_0 = f_\theta(x_T, T)$.

Training objective (Consistency Distillation, CD):

$$L_\text{CD}(\theta) = \mathbb{E}\left[d\big(f_\theta(x_{t_{n+1}}, t_{n+1}),\; f_{\theta^-}(\hat x_{t_n}, t_n)\big)\right]$$

where:

$\theta^-$ is the EMA target
$\hat x_{t_n}$ is obtained by one step of the teacher's ODE solver from $x_{t_{n+1}}$ ($\hat x_{t_n} = \text{ODE-step}(x_{t_{n+1}})$)
$d$ is a metric (L2 / LPIPS)

Boundary condition: we need $f_\theta(x_{\sigma_\text{min}}, \sigma_\text{min}) = x_{\sigma_\text{min}}$ (self-consistency at the lowest noise level), enforced with an EDM-style preconditioning:

$$f_\theta(x, \sigma) = c_\text{skip}(\sigma) x + c_\text{out}(\sigma) F_\theta(x, \sigma)$$

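$c_\text{skip}$ and $c_\text{out}$ are designed so that $c_\text{skip}(\sigma_\text{min}) = 1$ and $c_\text{out}(\sigma_\text{min}) = 0$, hence $f_\theta \equiv x$ at $\sigma = \sigma_\text{min}$.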
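One way to satisfy the boundary condition is to shift EDM's coefficients by $\sigma_\text{min}$. The sketch below uses $c_\text{skip}(\sigma) = \sigma_\text{data}^2 / ((\sigma - \sigma_\text{min})^2 + \sigma_\text{data}^2)$ and $c_\text{out}(\sigma) = \sigma_\text{data}(\sigma - \sigma_\text{min}) / \sqrt{\sigma_\text{data}^2 + \sigma^2}$, which is the form I recall from the Consistency Models paper; treat the exact expressions and the constants as assumptions. `F` is an arbitrary callable standing in for the network.

```python
import numpy as np

SIGMA_DATA, SIGMA_MIN = 0.5, 0.002    # assumed defaults

def cm_coeffs(sigma):
    c_skip = SIGMA_DATA**2 / ((sigma - SIGMA_MIN) ** 2 + SIGMA_DATA**2)
    c_out = SIGMA_DATA * (sigma - SIGMA_MIN) / np.sqrt(SIGMA_DATA**2 + sigma**2)
    return c_skip, c_out

def consistency_fn(F, x, sigma):
    """f(x, sigma) = c_skip * x + c_out * F(x, sigma)."""
    c_skip, c_out = cm_coeffs(sigma)
    return c_skip * x + c_out * F(x, sigma)

def wild_network(x, sigma):
    # Deliberately badly behaved stand-in for F_theta.
    return 100.0 * np.sin(x) + sigma

x = np.random.default_rng(0).standard_normal(4)
print(np.array_equal(consistency_fn(wild_network, x, SIGMA_MIN), x))
```

Whatever the network outputs, the function is the identity at $\sigma_\text{min}$, so the boundary condition holds by construction rather than by training.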

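CT (Consistency Training) vs CD (Consistency Distillation)

CT trains entirely from scratch (no teacher; the consistency loss is applied directly to $x_0 + \sigma_n \epsilon$ and $x_0 + \sigma_{n+1} \epsilon$ with the same $\epsilon$); CD distills from a pretrained teacher. In the original paper CD gives better quality than CT; the later iCT work (Song 2024) brings CT close to CD.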

12.3　LCM / LCM-LoRA (Luo 2023)

Latent Consistency Model: apply Consistency Models to latent diffusion (SD 1.5 / SDXL):

Teacher = pretrained SD (with DDIM as the ODE solver)
Student = LCM, producing images in 4-8 steps

LCM-LoRA: express LCM training as a LoRA adapter, so a single LoRA file lets any SD 1.5 / SDXL fine-tune generate in 4 steps. The ecosystem value is large: users do not need to switch base models.

12.4　Adversarial Diffusion Distillation (ADD): SDXL-Turbo / SD3-Turbo (Sauer 2023/2024)

ADD training objective (SD3-Turbo uses the latent-space variant, LADD):

$$L_\text{ADD} = L_\text{adv}(\text{student}) + \lambda L_\text{distill}(\text{student}, \text{teacher})$$

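$L_\text{adv}$: uses a pretrained vision model (DINOv2) as the discriminator backbone
$L_\text{distill}$: the student's denoised sample, after re-noising, should match the teacher's denoised prediction

Results: SDXL-Turbo generates 512 px images in 1 step, SD3-Turbo generates 1024 px images in 4 steps. Quality is slightly below multi-step sampling, but generation is real-time (~100ms / image).

§13 The Bridge to Flow Matching

13.1　Score vs Vector Field: Same Information, Different Parameterization

Within the VP-SDE / VE-SDE framework, FM learning $v$ and score-based models learning $s$ are two parameterizations of the same information: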

$$v_\theta(t, x) = f(x, t) - \tfrac{1}{2} g^2(t)\, s_\theta(t, x)$$

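Specializing to the VP (DDPM) path, write $\alpha_t = \sqrt{\bar\alpha_t}, \sigma_t = \sqrt{1-\bar\alpha_t}$ (in this section $\alpha_t$ denotes the cumulative signal scale, not the single-step coefficient of §1.3); the conditional vector field (same form as Salimans-Ho 2022 $v$-prediction) is: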

$$v_\theta^\text{VP}(t, x_t) = \alpha_t'\, x_0 + \sigma_t'\, \epsilon$$

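Substituting $x_0 = (x_t - \sigma_t \epsilon)/\alpha_t$ and rearranging shows that $v_\theta$ is a linear combination of $x_t$ and $\epsilon$ (or the score), with coefficients that depend on the time derivatives of $\alpha_t, \sigma_t$.

In fact, for a Gaussian path with any $\alpha(t), \sigma(t)$, the three quantities $\{\epsilon_\theta, s_\theta, v_\theta\}$ are fully equivalent. Training DDPM, training a score-based model, and training FM on a VP/VE path are therefore the same thing.

13.2　Why Do SD3 / FLUX Switch to Rectified Flow?

The Rectified Flow path: $x_t = (1-t) x_0 + t x_1$ (linear interpolation from noise $x_0$ to data $x_1$), with $v_t = x_1 - x_0$.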
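The straight-path property can be seen in a toy setting. The sketch below builds the RF training pair and runs an Euler sampler; `oracle_velocity` is a hypothetical stand-in for $v_\theta$ and assumes the data distribution is a single point `x_star`, in which case the marginal velocity is $(x^\star - x)/(1 - t)$ and the trajectory is exactly straight.

```python
import numpy as np

def rf_training_pair(x0_noise, x1_data, t):
    """Return (x_t, v_target) for the linear path; FM convention: t=0 noise, t=1 data."""
    x_t = (1.0 - t) * x0_noise + t * x1_data
    return x_t, x1_data - x0_noise

def euler_sample(velocity, x, n_steps):
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * velocity(x, t_cur)
    return x

x_star = np.array([0.3, -1.2])       # toy dataset consisting of one point

def oracle_velocity(x, t):
    # E[x1 - x0 | x_t = x] when p_data is a point mass at x_star.
    return (x_star - x) / (1.0 - t)

noise = np.random.default_rng(0).standard_normal(2)
x_1step = euler_sample(oracle_velocity, noise, n_steps=1)
x_4step = euler_sample(oracle_velocity, noise, n_steps=4)
print(x_1step, x_4step)
```

Here even a single Euler step lands on the data point because the path is a straight line; real data distributions give curved marginal trajectories, which is what reflow and distillation try to straighten.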

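Aspect	RF (linear)	VP/VE (curved)
ODE trajectory	straight	curved (needs higher-order solvers)
Target $v_t$	independent of $t$	depends on $t$ (VP cosine path)
Few-step sampling	Euler usable at 4-8 steps	Euler needs 30+ steps
Reflow can compress to 1-2 steps	✓ (InstaFlow)	✗
Training stability	stable with logit-normal $t$ + RF	needs a carefully tuned noise schedule

One-sentence summary of the SD3 ablations

"With the same DiT backbone, RF + logit-normal $t$ improves ImageNet 256 FID by about 0.5-1.0 over VP + uniform $t$, and gives clearly better text alignment on the T2I GenEval benchmark."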

13.3　DDPM/DDIM/EDM/RF/CM at a Glance

                Training target        Sampling                   Typical NFE
                ───────────────        ────────                   ───────────
 DDPM         ε-pred (MSE)          ancestral / DDIM           1000 / 50
 Score SDE    score (DSM)           reverse SDE / PF-ODE       500 / 30
 DDIM         (reuses DDPM weights) deterministic ODE step     20-50
 EDM          D_θ (Tweedie)         Heun ODE 2nd-order         18-35
 RF / SD3     v = x_1-x_0           Euler ODE                  4-50
 FLUX         v + CFG-distill       Euler                      1-4 (schnell)
 ConsistMod   f_θ(x_t,t)→x_0        direct map                 1-4
 LCM-LoRA     consistency on SD     direct                     4-8

§14 25 High-Frequency Interview Questions (L1 must-know · L2 advanced · L3 top-lab)

L1 must-know questions (fair game in any ML interview that touches diffusion)

Q1. Write down DDPM's forward $q(x_t | x_0)$ and reverse $p_\theta(x_{t-1}|x_t)$.

Forward, closed form: $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t)I)$, $\bar\alpha_t = \prod_{s=1}^t (1-\beta_s)$
Reverse, parameterized: $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta)$
$\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}} \epsilon_\theta(x_t, t)\right)$ ($\epsilon$-prediction)

Common mistakes: mixing up symbols (e.g. $\sqrt{\alpha_t}$ vs $\sqrt{\bar\alpha_t}$); forgetting that $\bar\alpha$ is a cumulative product.

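Q2. How does the DDPM ELBO simplify to $L_\text{simple}$?

The ELBO splits into $L_T + \sum L_{t-1} + L_0$; $L_T$ is a constant (prior matching, no trainable parameters)
$L_{t-1} = \text{KL}(q(x_{t-1}|x_t, x_0) \,\Vert\, p_\theta(x_{t-1}|x_t))$; both are Gaussian, so the KL has a closed form
Substituting $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\epsilon)/\sqrt{\bar\alpha_t}$ into $\tilde\mu$ and $\mu_\theta$ gives $L_{t-1} = c_t \cdot \mathbb{E}\|\epsilon - \epsilon_\theta\|^2$ with a $t$-dependent coefficient $c_t$
Ho 2020 drop the coefficient to get $L_\text{simple} = \mathbb{E}\|\epsilon - \epsilon_\theta\|^2$

Common pitfalls: only saying "L_simple predicts the noise" without being able to derive it; not knowing that dropping the coefficient is a re-weighting across SNR levels.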

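Q3. Why does $L_\text{simple}$ still work after dropping the coefficient?

The ELBO coefficient $\beta_t^2 / [2\sigma_t^2 \alpha_t (1-\bar\alpha_t)]$ is large at small $t$ (high SNR) and small at large $t$ (low SNR)
Dropping it raises the relative weight of the low-SNR (large $t$) steps, which are the ones that decide semantic structure
Empirically: the unweighted loss gives clearly better FID than the ELBO-weighted one
Cost: the objective is no longer a lower bound on $\log p$ (good FID does not imply good likelihood)

Common pitfalls: not knowing that the price of dropping the coefficient is a likelihood vs. sample-quality trade-off.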
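
As a quick numeric illustration of the first bullet, the sketch below evaluates the ELBO coefficient on the linear schedule, assuming the common choice $\sigma_t^2 = \beta_t$ (so the coefficient reduces to $\beta_t / [2\alpha_t(1-\bar\alpha_t)]$). The function name `elbo_weight` is illustrative.

```python
T = 1000
# Linear beta schedule from Ho 2020: 1e-4 -> 0.02.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

alpha_bar, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bar.append(running)

def elbo_weight(i):
    """beta_t^2 / (2 sigma_t^2 alpha_t (1 - alpha_bar_t)) with sigma_t^2 = beta_t."""
    b = betas[i]
    return b / (2.0 * (1.0 - b) * (1.0 - alpha_bar[i]))

weights = [elbo_weight(i) for i in range(T)]
for i in (0, 1, 9, 99, 499, 999):
    print(f"t={i + 1:4d}  ELBO weight={weights[i]:.4f}")
```

The weight is by far the largest at the first few (high-SNR) steps, so setting every weight to 1 shifts emphasis toward the noisier steps.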

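Q4. Linear vs. cosine schedule?

Linear: $\beta_t$ linearly interpolated over $[10^{-4}, 0.02]$, as in the original DDPM paper
Problem (Nichol-Dhariwal 2021): on low-resolution images it destroys information too quickly; with $\bar\alpha_T \approx 4\times 10^{-5}$ the late part of the chain is almost pure noise and contributes little
Cosine: $\bar\alpha_t = f(t)/f(0)$ with $f(t) = \cos^2\!\big(\tfrac{\pi}{2}\cdot\tfrac{t/T + s}{1+s}\big)$, $s=0.008$; $\bar\alpha_t$ decays more gradually in the middle and reaches $\approx 0$ at $t = T$ ($\beta_t$ is clipped at 0.999)
Empirically: cosine improves ImageNet 64 FID by roughly 20% (Nichol-Dhariwal 2021)

Common pitfalls: only saying "cosine is better" without being able to write the formula; forgetting that the $s=0.008$ offset keeps $\beta_t$ from getting too close to 0 near $t = 0$.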
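
The two schedules can be compared directly in a few lines. This is a minimal stdlib sketch; `linear_alpha_bar` and `cosine_alpha_bar` are illustrative names, and the cosine version omits the $\beta_t \le 0.999$ clipping for brevity.

```python
import math

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for the linear beta schedule (t = 1..T)."""
    out, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t = f(t)/f(0), f(t) = cos^2(pi/2 * (t/T + s)/(1 + s)), t = 1..T."""
    def f(u):
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t / T) / f(0.0) for t in range(1, T + 1)]

T = 1000
lin, cos_ = linear_alpha_bar(T), cosine_alpha_bar(T)
for t in (1, 250, 500, 750, 1000):
    print(f"t={t:4d}  linear={lin[t - 1]:.3e}  cosine={cos_[t - 1]:.3e}")
```

At the midpoint the cosine schedule still keeps a substantial fraction of the signal, while the linear one has already dropped below 10%.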

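Q5. How do $\epsilon$-pred / $x_0$-pred / $v$-pred / score convert into one another?

Given $x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t} \epsilon$, all of them are linearly invertible once $x_t$ and $t$ are known
$\hat x_0 = (x_t - \sqrt{1-\bar\alpha_t}\epsilon_\theta) / \sqrt{\bar\alpha_t}$
$v = \sqrt{\bar\alpha_t}\epsilon - \sqrt{1-\bar\alpha_t} x_0$ (Salimans-Ho 2022)
$s = -\epsilon / \sqrt{1-\bar\alpha_t}$ (from Tweedie, or from $\nabla_{x_t} \log q(x_t|x_0)$)

Common pitfalls: not knowing that the four predictions are different parameterizations of the same information; confusing the $v$-prediction target with the flow-matching velocity.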
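
The conversions above fit in two small functions. This is a scalar sketch with illustrative names (`from_eps`, `from_v`); the inverse formulas $x_0 = \sqrt{\bar\alpha_t}\,x_t - \sqrt{1-\bar\alpha_t}\,v$ and $\epsilon = \sqrt{1-\bar\alpha_t}\,x_t + \sqrt{\bar\alpha_t}\,v$ follow from the two definitions by direct substitution.

```python
import math

def from_eps(x_t, eps, alpha_bar_t):
    """Convert an eps-prediction into (x0, v, score) at a given alpha_bar_t."""
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    x0 = (x_t - s * eps) / a
    v = a * eps - s * x0
    score = -eps / s
    return x0, v, score

def from_v(x_t, v, alpha_bar_t):
    """Recover (x0, eps) from a v-prediction."""
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return a * x_t - s * v, s * x_t + a * v

# Round trip on one scalar example.
alpha_bar_t, x0_true, eps_true = 0.37, 0.8, -1.1
x_t = math.sqrt(alpha_bar_t) * x0_true + math.sqrt(1 - alpha_bar_t) * eps_true
x0_hat, v, score = from_eps(x_t, eps_true, alpha_bar_t)
x0_back, eps_back = from_v(x_t, v, alpha_bar_t)
print(x0_hat, v, score, x0_back, eps_back)
```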

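Q6. DDIM vs. DDPM?

DDPM ancestral sampling is a stochastic Markov chain that adds noise $\sigma_t z$ at every step and has to run all $T$ steps
DDIM uses a non-Markovian forward process that shares the same $q(x_t|x_0)$ as DDPM, so DDPM-trained weights can be reused directly
$\eta = 0$ is deterministic and supports latent interpolation; $\eta = 1$ with all $T$ steps reduces to DDPM ancestral sampling (with skipped steps it only matches the variance)
DDIM can skip steps: 50 steps give roughly the quality of 1000-step DDPM

Common pitfalls: only saying "DDIM is the few-step version of DDPM" without knowing the marginals are identical; not knowing that $\eta$ controls the stochasticity.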

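Q7. How is CFG (Classifier-Free Guidance) trained and used?

Training: with probability $p_\text{drop}=0.1$ replace $c$ by $\emptyset$ (a null embedding), so one network learns both the conditional and the unconditional model
Inference has two conventions (keep them apart):
- HF / SD style (written $s$): $\tilde\epsilon = \epsilon_\theta(x,\emptyset) + s\,[\epsilon_\theta(x,c) - \epsilon_\theta(x,\emptyset)]$; $s=1$ means no guidance, $s\in[3, 7.5]$ is the usual SD range
- Ho-Salimans 2022 (written $w$): $\tilde\epsilon = (1+w)\,\epsilon_\theta(x,c) - w\,\epsilon_\theta(x,\emptyset)$; $w=0$ means no guidance, equivalent to $s = w + 1$
Larger $s$ → stronger text alignment but less diversity; $s>10$ → oversaturated colors

Common pitfalls: writing the formula without knowing the $w$/$s$ convention; not knowing the condition-dropout training trick; saying CFG needs a separately trained classifier (that is classifier guidance).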
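
A tiny check of the two conventions, written on plain Python lists so it runs without a model. The function names are illustrative; with real networks the same arithmetic is applied elementwise to tensors.

```python
def cfg_hf(eps_cond, eps_uncond, s):
    """HF / SD convention: s = 1 means no guidance."""
    return [u + s * (c - u) for c, u in zip(eps_cond, eps_uncond)]

def cfg_ho_salimans(eps_cond, eps_uncond, w):
    """Ho-Salimans convention: w = 0 means no guidance."""
    return [(1 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

eps_c = [0.2, -0.5, 1.0]   # stand-in for eps_theta(x, c)
eps_u = [0.1, -0.1, 0.4]   # stand-in for eps_theta(x, null)

guided_hf = cfg_hf(eps_c, eps_u, s=7.5)
guided_hs = cfg_ho_salimans(eps_c, eps_u, w=6.5)   # same guidance, since s = w + 1
unguided = cfg_hf(eps_c, eps_u, s=1.0)
print(guided_hf, guided_hs, unguided)
```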

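Q8. What is Tweedie's formula, and why does it matter?

$\mathbb{E}[x_0 | x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t)$ (VE view; for VP with $x_t = \alpha_t x_0 + \sigma_t \epsilon$ the right-hand side is divided by $\alpha_t$)
Derivation: take $\nabla_{x_t} \log$ of $p_t(x_t) = \int p_0(x_0) \mathcal{N}(x_t; x_0, \sigma_t^2 I)\, dx_0$
Meaning: the optimal denoiser output = input + scaled score. It is the "Rosetta stone" for converting between all parameterizations ($\epsilon, x_0, v, s$)

Common pitfalls: memorizing the formula without being able to derive it; not knowing that it links the score to the denoiser.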
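
Tweedie's formula can be verified exactly in a case where everything is closed-form. The sketch below assumes a 1-D Gaussian prior $x_0 \sim \mathcal{N}(m, s_0^2)$ (chosen only because the posterior mean is then known analytically); `tweedie_denoise` is an illustrative name.

```python
def tweedie_denoise(x_t, sigma, score_fn):
    """E[x0 | x_t] = x_t + sigma^2 * score(x_t) in the VE setting x_t = x0 + sigma * eps."""
    return x_t + sigma ** 2 * score_fn(x_t)

m, s0, sigma = 2.0, 0.5, 1.5   # prior mean, prior std, noise level

# The noisy marginal is N(m, s0^2 + sigma^2), so its score is linear in x.
def score(x):
    return -(x - m) / (s0 ** 2 + sigma ** 2)

# Standard Gaussian posterior mean, derived independently of Tweedie.
def posterior_mean(x):
    return (s0 ** 2 * x + sigma ** 2 * m) / (s0 ** 2 + sigma ** 2)

for x_t in (-3.0, 0.0, 2.0, 5.5):
    print(x_t, tweedie_denoise(x_t, sigma, score), posterior_mean(x_t))
```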

Q9. VP-SDE vs. VE-SDE?

VP (variance preserving): $dx = -\frac{1}{2}\beta(t) x\, dt + \sqrt{\beta(t)}\, dW$, corresponds to DDPM; $\text{Var}[x_t] \le 1$ (for data with variance at most 1)
VE (variance exploding): $dx = \sqrt{d\sigma^2/dt}\, dW$, corresponds to SMLD/EDM; $\text{Var}[x_t]$ grows up to $\sigma_\text{max}^2$
For VP, $x_T \approx \mathcal{N}(0, I)$; for VE, $x_T | x_0 \sim \mathcal{N}(x_0, \sigma_\text{max}^2 I)$, and the sampling prior is $\mathcal{N}(0, \sigma_\text{max}^2 I)$
EDM picks VE because the preconditioning derivation is cleaner; DDPM picks VP because the $\mathcal{N}(0,I)$ prior is natural

Common pitfalls: only saying "variance preserving / exploding" without being able to write the SDEs; not knowing that EDM is VE.

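Q10. What is the probability flow ODE?

For any forward SDE $dx = f\, dt + g\, dW$ there is a deterministic ODE $dx/dt = f - \frac{1}{2}g^2 \nabla \log p_t$ that shares the marginals $p_t$ at every time
Note that the reverse SDE has drift $f - g^2 \nabla \log p_t$ (the full score correction), while the PF-ODE uses only $\frac{1}{2} g^2$; it is not simply "the reverse SDE with the noise term removed"
Practical meaning: ODE solvers (DDIM, Heun, RK4, DPM-Solver) can sample in few steps
It is the bridge between score-based models and Flow Matching: $v_t = f - \frac{1}{2}g^2 s$

Common pitfalls: knowing the formula but not that the PF-ODE and reverse-SDE score coefficients differ by a factor of two; not knowing that it is what makes deterministic sampling possible.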
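
The "shared marginals" claim can be checked numerically in a toy case. The sketch assumes a VE process with $\sigma(t) = t$ (so $f = 0$, $g^2 = 2t$) and 1-D Gaussian data $\mathcal{N}(m, s_0^2)$, for which the score is known exactly; the setup and the name `pf_ode_backward` are illustrative.

```python
import math

def pf_ode_backward(x_T, T, n_steps, m, s0):
    """Euler-integrate dx/dt = f - 0.5 g^2 * score from t = T down to t = 0."""
    x, h = x_T, T / n_steps
    for k in range(n_steps):
        t = T - k * h
        score = -(x - m) / (s0 ** 2 + t ** 2)   # exact score of N(m, s0^2 + t^2)
        drift = -t * score                      # f = 0 and 0.5 * g^2 = t
        x = x - h * drift                       # step from t to t - h
    return x

m, s0, T = 1.0, 0.5, 10.0
x_T = 8.0
x_0 = pf_ode_backward(x_T, T, n_steps=20000, m=m, s0=s0)

# Exact ODE solution: (x - m) scales like sqrt(s0^2 + t^2), which maps
# N(m, s0^2 + T^2) at time T onto N(m, s0^2) at time 0.
x_0_exact = m + (x_T - m) * s0 / math.sqrt(s0 ** 2 + T ** 2)
print(x_0, x_0_exact)
```

The deterministic map shrinks each sample's deviation from the mean by exactly the ratio of the two standard deviations, which is why the marginals match without injecting any noise.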

L2 advanced questions (research-oriented · requires familiarity with diffusion details)

Q11. What is the unit-variance argument behind EDM preconditioning?

Make the network input $c_\text{in} x$ unit-variance: $c_\text{in} = 1/\sqrt{\sigma_\text{data}^2 + \sigma^2}$
Make the effective target $F^* = (D^* - c_\text{skip} x)/c_\text{out}$ unit-variance: $c_\text{skip} = \sigma_\text{data}^2/(\sigma^2+\sigma_\text{data}^2)$, $c_\text{out} = \sigma \sigma_\text{data} / \sqrt{\sigma^2 + \sigma_\text{data}^2}$
Intuition: as $\sigma \to 0$, $c_\text{skip} \to 1$ (identity); as $\sigma \to \infty$, $c_\text{skip} \to 0$ and $c_\text{out} \to \sigma_\text{data}$ (everything comes from the network)
Effect: the loss has a consistent numeric range at every $\sigma$, so training is more stable

Common pitfalls: memorizing the formulas without the reason; not knowing that $\sigma_\text{data}$ is the data standard deviation (about 0.5 for normalized images).

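Q12. What is gained by learning $\Sigma_\theta$ in Improved DDPM?

DDPM fixes $\Sigma_\theta = \beta_t I$ or $\tilde\beta_t I$
Nichol-Dhariwal 2021 learn $\Sigma_\theta$ as a log-domain interpolation between $\tilde\beta_t$ and $\beta_t$: $\Sigma_\theta = \exp(v \log\beta_t + (1-v) \log\tilde\beta_t)$, with $v$ predicted by the network
Benefit: better log-likelihood, and much better few-step sampling (about 50 steps approach the 1000-step fixed-$\Sigma$ level)
Hybrid loss $L_\text{hybrid} = L_\text{simple} + \lambda \cdot L_\text{vlb}$ ($L_\text{vlb}$ supplies the learning signal for $\Sigma_\theta$)
$\lambda = 0.001$ keeps $L_\text{vlb}$ from dominating

Common pitfalls: not knowing the hybrid loss; thinking the learned $\Sigma_\theta$ only matters for likelihood (it also drives the few-step sampling gains).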
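
The interpolation formula is easy to sanity-check on the linear schedule. In this sketch `v` is a plain number standing in for the network output, and the helper names are illustrative. It skips $t = 1$, where $\tilde\beta_1 = 0$ and the logarithm is undefined (implementations typically clip the log-variance there).

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bar.append(running)

def beta_tilde(i):
    """Posterior variance beta_t * (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t), for index i >= 1."""
    return betas[i] * (1.0 - alpha_bar[i - 1]) / (1.0 - alpha_bar[i])

def learned_variance(v, i):
    """exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t)): a geometric interpolation."""
    return math.exp(v * math.log(betas[i]) + (1.0 - v) * math.log(beta_tilde(i)))

i = 10  # an early step, where beta_t and beta_tilde_t differ noticeably
lo, hi = beta_tilde(i), betas[i]
mid = learned_variance(0.5, i)
print(lo, mid, hi)
```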

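Q13. What is the core difference between DPM-Solver and DDIM?

DDIM is a first-order solver with 1 NFE per step (it coincides with DPM-Solver-1)
DPM-Solver exploits the semi-linear structure of the diffusion ODE, $dx/dt = f(t)\, x + \frac{g^2(t)}{2\sigma_t}\, \epsilon_\theta(x, t)$, and integrates the linear part exactly (exponential integrator)
The nonlinear part ($\epsilon_\theta$) is Taylor-expanded to order $k$ in $\lambda_t = \log(\alpha_t/\sigma_t)$ (half the log-SNR)
DPM-Solver-2 uses 2 NFE per step and is second-order; DPM-Solver-3 uses 3 NFE per step and is third-order
10-15 NFE reach roughly the quality of DDIM at 50 NFE
DPM-Solver++ switches to $x_0$-prediction (data prediction), which behaves better under CFG

Common pitfalls: not knowing what an exponential integrator is; thinking DPM-Solver is a cruder approximation (it is a higher-order expansion of the same ODE solution).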
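
To see why DDIM counts as the first-order member of this family, the sketch below compares one DDIM ($\eta = 0$) step with the first-order exponential-integrator update $x_s = \frac{\alpha_s}{\alpha_t} x_t - \sigma_s (e^{h} - 1)\,\epsilon$, $h = \lambda_s - \lambda_t$. Scalars stand in for tensors and `eps` is a fixed number instead of a network call, both assumptions made for illustration.

```python
import math

def ddim_step(x_t, eps, a_t, s_t, a_s, s_s):
    """Deterministic DDIM step from (alpha_t, sigma_t) to (alpha_s, sigma_s)."""
    x0_hat = (x_t - s_t * eps) / a_t
    return a_s * x0_hat + s_s * eps

def dpm_solver_1_step(x_t, eps, a_t, s_t, a_s, s_s):
    """First-order exponential integrator in lambda = log(alpha / sigma)."""
    h = math.log(a_s / s_s) - math.log(a_t / s_t)
    return (a_s / a_t) * x_t - s_s * math.expm1(h) * eps

abar_t, abar_s = 0.3, 0.6                      # stepping toward less noise
a_t, s_t = math.sqrt(abar_t), math.sqrt(1 - abar_t)
a_s, s_s = math.sqrt(abar_s), math.sqrt(1 - abar_s)
x_t, eps = 0.9, -0.4

x_ddim = ddim_step(x_t, eps, a_t, s_t, a_s, s_s)
x_dpm1 = dpm_solver_1_step(x_t, eps, a_t, s_t, a_s, s_s)
print(x_ddim, x_dpm1)
```

The higher-order variants keep the same exact treatment of the linear term and only add correction terms for how $\epsilon_\theta$ changes with $\lambda$.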

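Q14. What is the training objective of Consistency Models, and how do they get to 1 step?

Goal: $f_\theta(x_t, t) \approx x_0$ for every $t$, i.e. all points on one PF-ODE trajectory map to the same output
Consistency loss: $d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(\hat x_{t_n}, t_n))$, where $\hat x_{t_n}$ comes from one step of the teacher ODE
$\theta^-$ is an EMA of $\theta$, similar to BYOL; the metric $d$ is $\ell_2$, $\ell_1$, or LPIPS (LPIPS worked best in Song 2023)
Boundary: $f_\theta(x, \sigma_\text{min}) \equiv x$, enforced with EDM-style $c_\text{skip}, c_\text{out}$
1-step sampling: $x_0 = f_\theta(x_T, T)$
2-step refinement: first $x_0 = f_\theta(x_T, T)$, then re-noise to an intermediate $t$ and apply $f_\theta$ again

Common pitfalls: only saying "it learns the map $x_t \to x_0$" without knowing how the consistency constraint is defined; not knowing the EMA target / teacher ODE / boundary condition.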
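
The boundary condition is enforced structurally rather than through the loss. The sketch below uses the skip/out coefficients as I recall them from Song 2023 ($c_\text{skip} = \sigma_\text{data}^2 / ((\sigma-\sigma_\text{min})^2 + \sigma_\text{data}^2)$, $c_\text{out} = \sigma_\text{data}(\sigma-\sigma_\text{min}) / \sqrt{\sigma_\text{data}^2 + \sigma^2}$); treat the exact form as an assumption and check it against the official code. `raw_net` is a hypothetical stand-in for the unconstrained network.

```python
import math

def cm_coefficients(sigma, sigma_min=0.002, sigma_data=0.5):
    """Skip / output scalings chosen so that c_skip = 1 and c_out = 0 at sigma_min."""
    c_skip = sigma_data ** 2 / ((sigma - sigma_min) ** 2 + sigma_data ** 2)
    c_out = sigma_data * (sigma - sigma_min) / math.sqrt(sigma_data ** 2 + sigma ** 2)
    return c_skip, c_out

def consistency_fn(raw_net, x, sigma, sigma_min=0.002, sigma_data=0.5):
    """f(x, sigma) = c_skip * x + c_out * raw_net(x, sigma)."""
    c_skip, c_out = cm_coefficients(sigma, sigma_min, sigma_data)
    return c_skip * x + c_out * raw_net(x, sigma)

def raw_net(x, sigma):
    """Hypothetical network output; any function works for the boundary check."""
    return 123.0 * x - 7.0

x = 0.42
at_boundary = consistency_fn(raw_net, x, sigma=0.002)
at_high_noise = consistency_fn(raw_net, x, sigma=80.0)
print(at_boundary, at_high_noise)
```

At $\sigma_\text{min}$ the output equals the input no matter what the network predicts, while at large $\sigma$ almost everything comes from the network.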

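Q15. Why did SD3 move from DDPM to Rectified Flow?

The RF path $x_t = (1-t)x_0 + tx_1$ is a straight line → straight ODE trajectories → small error with few sampling steps
The target $v_t = x_1 - x_0$ does not depend on $t$ (given $(x_0, x_1)$), which is numerically well behaved
Combined with logit-normal $t$ sampling (concentrated around $t=0.5$) it improves results
Esser 2024 ablation: with the same backbone, RF + logit-normal clearly beats VP-cosine + uniform on GenEval text alignment
Further distillation brings this down to 1-4 steps (FLUX.1 [schnell] and SD3-Turbo use adversarial distillation; reflow-based few-step models such as InstaFlow are a separate line)

Common pitfalls: only saying "RF is more stable" without knowing it is because the path is straight; not knowing that logit-normal sampling is an additional trick.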

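Q16. How does DiT inject the condition? adaLN vs. cross-attention?

adaLN-Zero (DiT default): an MLP maps $c, t$ to $\gamma, \beta, \alpha$, and $\text{out} = x + \alpha \cdot \text{block}(\text{LN}(x) \cdot (1+\gamma) + \beta)$; $\alpha$ is zero-initialized, so at the start of training each DiT block leaves its input unchanged
Cross-attention: image tokens are Q, text/condition tokens are K/V
Token concatenation (MM-DiT, SD3): text tokens and image tokens form a single sequence and all tokens attend to each other
Empirically: adaLN-Zero scales best (DiT paper); cross-attention gives strong text control (SD UNet); MM-DiT is the best overall (SD3 / FLUX)

Common pitfalls: only knowing cross-attention; not knowing that the "zero-init" in adaLN-Zero is the key trick.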

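Q17. What is ControlNet's zero-conv, and why is it needed?

A 1×1 conv whose weight and bias are both initialized to 0
At the start of training the trainable copy's output passes through the zero-conv and becomes 0, so the original UNet output is unchanged → the pretrained SD capability is preserved
As training proceeds the zero-conv learns non-zero weights and gradually injects the conditioning signal
Why not random init: it would perturb the intermediate features of the frozen UNet and damage the pretrained representation

Common pitfalls: only saying "add a ControlNet module" without knowing zero-conv; thinking zero-conv is a special variant of 1×1 convolution (it is only an initialization).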

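Q18. What is the trade-off between SDE and ODE sampling?

SDE: the reverse SDE contains the stochastic term $g(t)\, d\bar W$; fresh noise at every step can correct earlier errors
ODE (probability flow): deterministic; solver error accumulates with no way back
SDE sampling often gives better FID; ODE sampling needs fewer NFE and is deterministic (supports interpolation)
EDM's compromise: a Heun ODE base plus a small amount of stochastic churn (a small noise injection before each step); Karras 2022 report FID gains over the pure ODE on some datasets, with the size of the gain depending on dataset and model

Common pitfalls: only saying "SDE is stochastic, ODE is deterministic" without knowing the trade-off; not knowing EDM's churn.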

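Q19. LCM vs. SDXL-Turbo?

LCM: Consistency Distillation applied to latent diffusion, 4-8 steps; pure distillation loss
LCM-LoRA: packages LCM training as a LoRA adapter that plugs into any SD 1.5 / SDXL fine-tune
SDXL-Turbo (ADD): adversarial loss + distillation loss, 1-4 steps; the discriminator is built on DINOv2 features
LCM tends to be more stable, ADD sharper (the adversarial loss gives crisper textures)
LCM was open-sourced earlier and has the broader ecosystem; Turbo-style models are mostly trained and released by the vendors themselves (SAI / BFL)

Common pitfalls: not knowing that LoRA portability is LCM-LoRA's killer feature; thinking Turbo = LCM.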

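Q20. How is the training noise level $\sigma$ sampled?

DDPM: $t \sim \mathcal{U}\{1, \dots, T\}$, discrete uniform
EDM: $\ln \sigma \sim \mathcal{N}(P_\text{mean}, P_\text{std}^2)$ with $P_\text{mean}=-1.2, P_\text{std}=1.2$, concentrated around $\sigma \approx 0.3$
SD3 / RF: $t = \text{sigmoid}(\tau), \tau \sim \mathcal{N}(0, 1)$, concentrated around $t = 0.5$
Shared idea: intermediate noise levels are the hardest to learn, so sampling them more often improves results

Common pitfalls: only saying "uniform sampling" without knowing that EDM and SD3 switched to log-normal / logit-normal; not knowing why the mass is concentrated in the middle.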
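
The two non-uniform samplers are one-liners. This sketch uses the standard-library `random` module with a fixed seed; the function names are illustrative, and a training loop would draw one value per batch element with the framework's own RNG.

```python
import math
import random

def sample_edm_sigma(rng, p_mean=-1.2, p_std=1.2):
    """EDM: ln(sigma) ~ N(p_mean, p_std^2)."""
    return math.exp(rng.gauss(p_mean, p_std))

def sample_logit_normal_t(rng, loc=0.0, scale=1.0):
    """SD3 / RF: t = sigmoid(tau), tau ~ N(loc, scale^2)."""
    tau = rng.gauss(loc, scale)
    return 1.0 / (1.0 + math.exp(-tau))

rng = random.Random(0)
n = 20001
sigmas = sorted(sample_edm_sigma(rng) for _ in range(n))
ts = sorted(sample_logit_normal_t(rng) for _ in range(n))

median_sigma = sigmas[n // 2]   # should sit near exp(-1.2)
median_t = ts[n // 2]           # should sit near 0.5
print(median_sigma, math.exp(-1.2), median_t)
```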

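L3 top-tier diffusion / video generation track (deep derivations + distillation + production integration)

Q21. Derive $L_\text{simple} = \|\epsilon - \epsilon_\theta\|^2$ from the ELBO, listing every intermediate approximation and every dropped term.

Derivation chain + list of approximations:

Step 1 (exact as a bound, no approximation): $\log p_\theta(x_0) \ge \mathbb{E}_q\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right]$, the variational lower bound from Jensen's inequality
Step 2 ($L_T$ treated as a constant and ignored): the ELBO splits as $L = L_T + \sum_{t=2}^T L_{t-1} + L_0$. $L_T = \text{KL}(q(x_T|x_0)\,\lVert\, p(x_T))$ has no trainable parameters; it is not exactly 0, but it is small when $\bar\alpha_T \approx 0$
Step 3 ($L_0$ ignored / absorbed): $L_0 = -\mathbb{E}[\log p_\theta(x_0 | x_1)]$ is a small contribution; Ho 2020 model it with a discretized Gaussian decoder, and in $L_\text{simple}$ it is absorbed as the $t=1$ term
Step 4 (closed-form KL; the constant $C$ is ignored when $\Sigma_\theta$ is fixed): $L_{t-1} = \mathbb{E}_q[\text{KL}(q(x_{t-1}|x_t, x_0) \,\lVert\, p_\theta(x_{t-1}|x_t))]$. Both are Gaussian → the KL has a closed form. With fixed $\Sigma_\theta = \sigma_t^2 I$: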

$$L_{t-1} = \mathbb{E}\left[\frac{1}{2\sigma_t^2}\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2\right] + C$$

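The constant $C$ comes from the variance terms (trace and log-determinant); with $\Sigma$ fixed it does not depend on $\theta$ and vanishes from the gradient.

Step 5 (exact rewrite in $\epsilon$-pred form): substitute $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\epsilon)/\sqrt{\bar\alpha_t}$ into $\tilde\mu_t$, and write $\mu_\theta$ in the same $\epsilon$-pred parameterization: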

$$L_{t-1} = \mathbb{E}\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)} \|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

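This is exact as long as $\mu_\theta$ uses the $\epsilon$-pred form of Ho 2020.

Step 6 (drop the $t$-dependent coefficient): $L_\text{simple}$ sets the coefficient $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)}$ to 1 for every $t$. This is a re-weighting across $t$: at small $t$ (high SNR) the original coefficient is large, so the simple loss lowers the relative weight; at large $t$ (low SNR) the original coefficient is small, so the simple loss raises it.
Step 7 (uniform sampling of $t$): the sum over discrete $t$ becomes an expectation over $t \sim \mathcal{U}\{1,\dots,T\}$, so every timestep is sampled equally often rather than in proportion to its ELBO weight.

Result: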

$$L_\text{simple} = \mathbb{E}_{t \sim \mathcal{U}\{1,\dots,T\},\, x_0,\, \epsilon}\big[\|\epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\epsilon, t)\|^2\big]$$

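Costs:

No longer a lower bound on $\log p$ (sample quality improves, but the loss value no longer corresponds directly to likelihood)
$L_T$ and $L_0$ are ignored by default
Information about $\Sigma_\theta$ is lost (Improved DDPM brings it back through $L_\text{vlb}$)

Common pitfalls: not knowing which terms are dropped; thinking $L_\text{simple}$ follows directly from the KL; overlooking the roles of $L_T$ and $L_0$.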

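Q22. Prove that DDIM ($\eta=0$) shares the marginal $q(x_t|x_0)$ with DDPM.

Statement: DDIM defines a non-Markovian forward process $q_\sigma(x_{1:T}|x_0)$ such that $q_\sigma(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) I)$, exactly as in DDPM. The argument holds for any admissible $\sigma_t$, so $\eta = 0$ is a special case.

Proof (by induction, from $t = T$ downward):

Base case: $q_\sigma(x_T|x_0) = \mathcal{N}(\sqrt{\bar\alpha_T} x_0, (1-\bar\alpha_T) I)$ holds directly by the DDIM definition
Assume $q_\sigma(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) I)$. DDIM defines: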

$$q_\sigma(x_{t-1}|x_t, x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_{t-1}} x_0 + \sqrt{1-\bar\alpha_{t-1} - \sigma_t^2}\cdot \frac{x_t - \sqrt{\bar\alpha_t} x_0}{\sqrt{1-\bar\alpha_t}},\; \sigma_t^2 I\right)$$

Compute $q_\sigma(x_{t-1}|x_0) = \int q_\sigma(x_{t-1}|x_t, x_0)\, q_\sigma(x_t|x_0)\, dx_t$ (marginalizing a linear-Gaussian pair)
By the Gaussian marginalization rule: if $x_t | x_0 \sim \mathcal{N}(\mu_t, \Sigma_t)$ and $x_{t-1}|x_t, x_0 \sim \mathcal{N}(A x_t + b, \Sigma_{t-1|t})$, then:

$$x_{t-1}|x_0 \sim \mathcal{N}\!\left(A \mu_t + b,\; A \Sigma_t A^\top + \Sigma_{t-1|t}\right)$$

Here $A = \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}/\sqrt{1-\bar\alpha_t}$ and $b = \sqrt{\bar\alpha_{t-1}} x_0 - A \sqrt{\bar\alpha_t} x_0$. Substituting:
- mean $= A \sqrt{\bar\alpha_t} x_0 + \sqrt{\bar\alpha_{t-1}} x_0 - A \sqrt{\bar\alpha_t} x_0 = \sqrt{\bar\alpha_{t-1}} x_0$
- variance $= A^2 (1-\bar\alpha_t) + \sigma_t^2 = (1-\bar\alpha_{t-1}-\sigma_t^2) + \sigma_t^2 = 1 - \bar\alpha_{t-1}$
Hence $q_\sigma(x_{t-1}|x_0) = \mathcal{N}(\sqrt{\bar\alpha_{t-1}} x_0, (1-\bar\alpha_{t-1}) I)$, identical to DDPM $\square$

Meaning: DDIM can directly reuse an $\epsilon_\theta$ trained with DDPM, because training only depends on the marginal $q(x_t|x_0)$ and the marginals agree; only the sampling path differs (deterministic vs. stochastic).

Common pitfalls: not being able to state the Gaussian marginalization rule; not knowing that the key identity is $A^2(1-\bar\alpha_t) + \sigma_t^2 = 1-\bar\alpha_{t-1}$.
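
The key identity can be checked numerically for several values of $\eta$. The sketch below uses the standard DDIM choice $\sigma_t^2 = \eta^2 \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\left(1 - \frac{\bar\alpha_t}{\bar\alpha_{t-1}}\right)$ and returns the coefficient of $x_0$ in the mean together with the variance of $q_\sigma(x_{t-1}|x_0)$; the function names are illustrative.

```python
import math

def ddim_sigma_sq(eta, abar_t, abar_prev):
    """sigma_t^2 for a given eta (eta = 0: deterministic, eta = 1: DDPM posterior variance)."""
    return eta ** 2 * (1 - abar_prev) / (1 - abar_t) * (1 - abar_t / abar_prev)

def marginal_after_step(eta, abar_t, abar_prev):
    """Mean coefficient on x0 and variance of q_sigma(x_{t-1} | x_0) after marginalizing x_t."""
    sig2 = ddim_sigma_sq(eta, abar_t, abar_prev)
    A = math.sqrt(1 - abar_prev - sig2) / math.sqrt(1 - abar_t)
    b_coeff = math.sqrt(abar_prev) - A * math.sqrt(abar_t)   # b = b_coeff * x0
    mean_coeff = A * math.sqrt(abar_t) + b_coeff             # A * mu_t + b
    variance = A ** 2 * (1 - abar_t) + sig2                  # A Sigma_t A^T + Sigma_{t-1|t}
    return mean_coeff, variance

abar_t, abar_prev = 0.35, 0.50
results = {eta: marginal_after_step(eta, abar_t, abar_prev) for eta in (0.0, 0.5, 1.0)}
print(results, math.sqrt(abar_prev), 1 - abar_prev)
```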

Q23. Derive $c_\text{skip}$ and $c_\text{out}$ in EDM preconditioning.

Setup: VE view, $x = x_0 + \sigma \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, $\text{Var}[x_0] = \sigma_\text{data}^2$. Denoiser parameterization:

$$D_\theta(x; \sigma) = c_\text{skip}(\sigma) x + c_\text{out}(\sigma) F_\theta(c_\text{in} x, c_\text{noise})$$

Effective target for $F_\theta$:

$$F^*(x_0, \sigma, \epsilon) = \frac{1}{c_\text{out}(\sigma)}\big[x_0 - c_\text{skip}(\sigma) x\big] = \frac{1}{c_\text{out}}\big[(1 - c_\text{skip}) x_0 - c_\text{skip} \sigma \epsilon\big]$$

Goal: choose $c_\text{skip}, c_\text{out}$ so that $\text{Var}[F^*] = 1$ (variance taken over $x_0$ and $\epsilon$, which are independent).

$$\text{Var}[F^*] = \frac{1}{c_\text{out}^2}\big[(1-c_\text{skip})^2 \sigma_\text{data}^2 + c_\text{skip}^2 \sigma^2\big] = 1$$

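Normalization alone leaves a one-parameter family of solutions. Second criterion (Karras 2022): make $c_\text{out}$ as small as possible, because a larger $c_\text{out}$ amplifies the output of $F_\theta$ and with it the network's errors. This is equivalent to solving: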

$$\min_{c_\text{skip}}\;\; c_\text{out}^2(c_\text{skip}) = (1-c_\text{skip})^2 \sigma_\text{data}^2 + c_\text{skip}^2 \sigma^2$$

Setting the derivative with respect to $c_\text{skip}$ to 0:

$$-2(1 - c_\text{skip}) \sigma_\text{data}^2 + 2 c_\text{skip} \sigma^2 = 0 \quad \Rightarrow \quad c_\text{skip} = \frac{\sigma_\text{data}^2}{\sigma_\text{data}^2 + \sigma^2}$$

Substituting back into the constraint $\text{Var}[F^*] = 1$:

$$c_\text{out}^2 = (1-c_\text{skip})^2 \sigma_\text{data}^2 + c_\text{skip}^2 \sigma^2 = \frac{\sigma^4 \sigma_\text{data}^2}{(\sigma^2+\sigma_\text{data}^2)^2} + \frac{\sigma_\text{data}^4 \sigma^2}{(\sigma^2+\sigma_\text{data}^2)^2} = \frac{\sigma^2 \sigma_\text{data}^2}{\sigma^2 + \sigma_\text{data}^2}$$

$$\boxed{\; c_\text{out}(\sigma) = \frac{\sigma \cdot \sigma_\text{data}}{\sqrt{\sigma^2 + \sigma_\text{data}^2}} \;}$$

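Input normalization: $c_\text{in}(\sigma) = 1/\sqrt{\sigma_\text{data}^2 + \sigma^2}$ gives $\text{Var}[c_\text{in} x] = 1$.

Conclusion: $c_\text{skip}, c_\text{out}, c_\text{in}$ are fully determined by $\sigma_\text{data}$ with no tunable parameters ($c_\text{noise} = \tfrac{1}{4}\ln\sigma$ is an empirical choice). In practice $\sigma_\text{data}$ is computed from the data and is about 0.5 for normalized images.

Common pitfalls: memorizing the formulas without being able to derive them; not knowing that $c_\text{skip}$ comes from minimizing $c_\text{out}$; thinking the $c$ functions have free parameters.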
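
Both criteria of the derivation can be verified numerically. This stdlib sketch checks that the closed-form coefficients give a unit-variance target and input, and that perturbing $c_\text{skip}$ in either direction increases the required $c_\text{out}^2$; the function names are illustrative.

```python
import math

def edm_coefficients(sigma, sigma_data=0.5):
    """Closed-form c_skip, c_out, c_in from the derivation above."""
    denom = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / denom
    c_out = sigma * sigma_data / math.sqrt(denom)
    c_in = 1.0 / math.sqrt(denom)
    return c_skip, c_out, c_in

def required_c_out_sq(c_skip, sigma, sigma_data=0.5):
    """c_out^2 needed for Var[F*] = 1 at a given c_skip."""
    return (1 - c_skip) ** 2 * sigma_data ** 2 + c_skip ** 2 * sigma ** 2

def target_variance(sigma, sigma_data=0.5):
    """Var[F*] when the closed-form c_skip and c_out are used."""
    c_skip, c_out, _ = edm_coefficients(sigma, sigma_data)
    return required_c_out_sq(c_skip, sigma, sigma_data) / c_out ** 2

for sigma in (0.002, 0.5, 80.0):
    c_skip, c_out, c_in = edm_coefficients(sigma)
    input_var = c_in ** 2 * (0.5 ** 2 + sigma ** 2)   # Var[c_in * x]
    print(sigma, c_skip, c_out, target_variance(sigma), input_var)
```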

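Q24. What is the Consistency Distillation training procedure? Why is the EMA target $\theta^-$ needed?

Procedure:

Take a pretrained teacher diffusion model $\epsilon_\phi$ and a PF-ODE solver for it (e.g. EDM Heun)
Take a noise grid $\sigma_\text{min} = \sigma_1 < \sigma_2 < \dots < \sigma_N = \sigma_\text{max}$ (typically $N = 18$)
Train a student $f_\theta(x_\sigma, \sigma) \to x_0$, initialized with $\theta = \phi$ (warm start)
For each batch:
- Sample $x_0$ and $\sigma_n$ (uniformly over $n \in \{1, \dots, N-1\}$)
- Add noise: $x_{\sigma_{n+1}} = x_0 + \sigma_{n+1} \epsilon$
- One teacher ODE step: starting from $x_{\sigma_{n+1}}$, take a Heun step with the teacher $\epsilon_\phi$ down to the lower noise level, giving $\hat x_{\sigma_n}$
- Loss: $d(f_\theta(x_{\sigma_{n+1}}, \sigma_{n+1}), f_{\theta^-}(\hat x_{\sigma_n}, \sigma_n))$
Update $\theta$, then update the EMA $\theta^- \leftarrow \mu \theta^- + (1-\mu)\theta$

Why the EMA target?

The target branch must not receive gradients: if both sides of the loss are trained jointly, the loss can be lowered by pulling the two outputs toward each other rather than toward the ODE solution (a constant $f$ satisfies self-consistency), and only the boundary condition $f(x, \sigma_\text{min}) = x$ ties the output to the data
The EMA $\theta^-$ lags behind $\theta$ and provides a stable target, so the student does not chase itself
Similar to the target networks in BYOL / MoCo self-supervised setups
$\mu = 0.999 \sim 0.99995$ (depends on the number of training steps)

Recent improvement (iCT, Song-Dhariwal 2024): remove the EMA target (compute the target with the same $\theta$ under stop-gradient, no separate $\theta^-$), switch to a pseudo-Huber loss, and combine a lognormal noise schedule with a curriculum that increases the number of discretization steps; CT then approaches CD quality.

Common pitfalls: not knowing the trivial-solution issue; confusing the EMA target $\theta^-$ with the pretrained teacher $\epsilon_\phi$; not knowing what the teacher is for (it supplies the one-step ODE solve).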

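Q25. Why can the SD3 / FLUX line get down to 4-step / 1-step generation?

Core path: RF (linear path) + distillation, with reflow as an optional straightening stage. Step by step:

RF makes trajectories straight: with $x_t = (1-t)x_0 + tx_1$ the ideal ODE solution is a straight line (linear interpolation), so first-order Euler has small error even with long steps (in contrast to the high mid-$t$ curvature of a cosine path)
Reflow makes them straighter: after the first training run, run the ODE to produce coupled $(x_0, x_1)$ pairs and train again; trajectories move closer to straight lines. Liu 2022 prove that reflow does not increase the transport cost. InstaFlow uses this stage; the SD3 paper does not report one
CFG distillation: distill CFG's 2× forward passes (cond + uncond) into a single pass (FLUX.1 [dev] does this); NFE is halved
Adversarial distillation (ADD / LADD): SDXL-Turbo uses a DINOv2-based discriminator plus a distillation loss; SD3-Turbo and FLUX.1 [schnell] use the latent variant (LADD); 4-step output approaches 30-step quality

Compared with the DDPM route: a DDPM trajectory has high curvature at mid-$t$ (cosine path), so first-order Euler is unusable below about 5 steps; getting to 4 steps needs second-order DPM-Solver-2 plus consistency distillation. RF is much friendlier to engineer, since a first-order sampler is enough.

FLUX.1 [schnell] at 1 step: RF + heavy (latent adversarial) distillation; a single forward pass at 1024px already produces an image, at roughly 100 ms/image (hardware-dependent). Cost: slightly lower controllability and diversity; prompt following is slightly less precise than multi-step sampling.

Common pitfalls: only saying "RF is faster than DDPM" without knowing why; not knowing that straight paths and distillation work together; thinking FLUX's 1-step generation is due to RF alone (distillation is also required).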

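§A Appendix: Core PyTorch Code (from scratch)

Teaching version

The focus is on demonstrating the math; for production use diffusers or the official EDM implementation, which include mixed precision / EMA / DDP / VAE / xformers / fused kernels.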

A.1　DDPM forward $q(x_t | x_0)$ + simplified loss

import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def linear_beta_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 0.02):
    return torch.linspace(beta_start, beta_end, T, dtype=torch.float64)


def cosine_beta_schedule(T: int, s: float = 0.008):
    """Nichol-Dhariwal 2021"""
    ts = torch.arange(T + 1, dtype=torch.float64) / T
    f = torch.cos(((ts + s) / (1 + s)) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999)


class DDPMSchedule:
    """Caches frequently used quantities such as sqrt(α_bar) and sqrt(1-α_bar)."""
    def __init__(self, betas: torch.Tensor):
        self.T = len(betas)
        self.betas = betas
        alphas = 1.0 - betas
        self.alphas = alphas
        self.alpha_bar = torch.cumprod(alphas, dim=0)
        self.sqrt_alpha_bar = torch.sqrt(self.alpha_bar)
        self.sqrt_one_minus_alpha_bar = torch.sqrt(1.0 - self.alpha_bar)
        # for sampling: α_bar_{t-1}, with α_bar_0 := 1 (same dtype/device as betas)
        one = torch.ones(1, dtype=betas.dtype, device=betas.device)
        self.alpha_bar_prev = torch.cat([one, self.alpha_bar[:-1]])
        self.posterior_variance = betas * (1.0 - self.alpha_bar_prev) / (1.0 - self.alpha_bar)

    def to(self, device):
        for k, v in self.__dict__.items():
            if isinstance(v, torch.Tensor):
                setattr(self, k, v.to(device))
        return self


def q_sample(sched: DDPMSchedule, x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor = None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(α_bar_t) x_0, (1-α_bar_t) I)."""
    if noise is None:
        noise = torch.randn_like(x0)
    sa = sched.sqrt_alpha_bar[t].view(-1, *([1] * (x0.dim() - 1))).to(x0.dtype)
    so = sched.sqrt_one_minus_alpha_bar[t].view(-1, *([1] * (x0.dim() - 1))).to(x0.dtype)
    return sa * x0 + so * noise


def ddpm_simple_loss(model: nn.Module, sched: DDPMSchedule, x0: torch.Tensor):
    """L_simple = E ‖ε - ε_θ(x_t, t)‖²"""
    B = x0.shape[0]
    t = torch.randint(0, sched.T, (B,), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = q_sample(sched, x0, t, noise)
    eps_pred = model(x_t, t)
    return F.mse_loss(eps_pred, noise)

A.2　DDPM ancestral sampling

@torch.no_grad()
def ddpm_sample(model, sched: DDPMSchedule, shape, device, x_T=None):
    """Run the full T-step ancestral chain starting from x_T ~ N(0, I)."""
    x = torch.randn(shape, device=device) if x_T is None else x_T.to(device)
    for t in reversed(range(sched.T)):
        t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_pred = model(x, t_b)

        alpha_t = sched.alphas[t]
        alpha_bar_t = sched.alpha_bar[t]
        beta_t = sched.betas[t]

        # reverse-process mean (ε-pred form)
        mean = (x - beta_t / torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)

        if t > 0:
            sigma_t = torch.sqrt(sched.posterior_variance[t])
            noise = torch.randn_like(x)
            x = mean + sigma_t * noise
        else:
            x = mean  # no noise is added at the last step
    return x

A.3　DDIM sampling (with $\eta$)

@torch.no_grad()
def ddim_sample(
    model,
    sched: DDPMSchedule,
    shape,
    device,
    num_steps: int = 50,
    eta: float = 0.0,          # 0 = deterministic DDIM; η=1 recovers the DDPM variance when all T steps are used
    x_T=None,
):
    """Pick a sub-sequence of about num_steps timesteps and run the DDIM reverse process."""
    # choose the sub-sequence (evenly spaced)
    step_size = max(sched.T // num_steps, 1)   # guard against num_steps > T
    timesteps = list(range(0, sched.T, step_size))
    timesteps = timesteps + [sched.T - 1]
    timesteps = sorted(set(timesteps))  # dedupe / sort

    x = torch.randn(shape, device=device) if x_T is None else x_T.to(device)

    for i in reversed(range(1, len(timesteps))):
        t = timesteps[i]
        t_prev = timesteps[i - 1]
        t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)

        alpha_bar_t = sched.alpha_bar[t]
        alpha_bar_prev = sched.alpha_bar[t_prev]

        eps_pred = model(x, t_b)

        # 1) estimate x_0 from the ε-prediction (Tweedie)
        x0_hat = (x - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)

        # 2) σ_t² = η² · (1-α_bar_prev)/(1-α_bar_t) · (1 - α_bar_t/α_bar_prev)
        sigma_t_sq = (eta ** 2) * (1 - alpha_bar_prev) / (1 - alpha_bar_t) * \
                     (1 - alpha_bar_t / alpha_bar_prev)
        sigma_t = torch.sqrt(sigma_t_sq.clamp(min=0))

        # 3) DDIM step
        dir_xt = torch.sqrt((1 - alpha_bar_prev - sigma_t_sq).clamp(min=0)) * eps_pred
        noise = torch.randn_like(x) if eta > 0 else 0
        x = torch.sqrt(alpha_bar_prev) * x0_hat + dir_xt + sigma_t * noise

    # final output: the x_0 estimate from the last iteration (no noise added)
    return x0_hat

A.4　Classifier-Free Guidance 训练 + 采样

class ConditionedEpsNet(nn.Module):
    """Demo only: the condition is a class-label embedding, dropped with probability p_drop during training.
       In a real project, replace self.backbone with a UNet / DiT and feed it c_emb together with t_emb."""
    def __init__(self, dim, num_classes, p_drop=0.1, backbone: nn.Module = None):
        super().__init__()
        self.p_drop = p_drop
        # the NULL class uses index num_classes (the "empty" embedding)
        self.cls_emb = nn.Embedding(num_classes + 1, dim)
        self.null_idx = num_classes
        self.backbone = backbone   # placeholder: self.backbone(x, t, c_emb) should return ε

    def forward(self, x, t, c=None):
        # during training, randomly replace the condition with NULL
        if self.training and c is not None:
            mask = torch.rand(c.shape[0], device=c.device) < self.p_drop
            c = torch.where(mask, torch.full_like(c, self.null_idx), c)
        elif c is None:
            c = torch.full((x.shape[0],), self.null_idx, device=x.device, dtype=torch.long)

        c_emb = self.cls_emb(c)
        # the backbone (UNet / DiT) combines c_emb with the timestep embedding
        eps_pred = self.backbone(x, t, c_emb)
        return eps_pred


@torch.no_grad()
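def ddim_sample_cfg(model, sched, shape, device, cond, guidance_scale=7.5, num_steps=50):
    """CFG-DDIM: two forward passes per step (cond + uncond), combined into ε_tilde.
       Call model.eval() first; otherwise the training-time condition dropout stays active."""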
    step_size = sched.T // num_steps
    timesteps = sorted(set(list(range(0, sched.T, step_size)) + [sched.T - 1]))
    x = torch.randn(shape, device=device)

    null_cond = torch.full_like(cond, model.null_idx)
    for i in reversed(range(1, len(timesteps))):
        t, t_prev = timesteps[i], timesteps[i - 1]
        t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)

        eps_cond = model(x, t_b, cond)
        eps_uncond = model(x, t_b, null_cond)
        # CFG: mind the convention; guidance_scale is the HF-style s (s=1 means unguided)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

        alpha_bar_t = sched.alpha_bar[t]
        alpha_bar_prev = sched.alpha_bar[t_prev]
        x0_hat = (x - torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)
        dir_xt = torch.sqrt(1 - alpha_bar_prev) * eps
        x = torch.sqrt(alpha_bar_prev) * x0_hat + dir_xt   # η=0 deterministic
    return x0_hat

A.5　EDM preconditioning + Heun 二阶 sampler

class EDMDenoiser(nn.Module):
    """D_θ(x; σ) = c_skip(σ) x + c_out(σ) F_θ(c_in(σ) x, c_noise(σ))"""
    def __init__(self, backbone: nn.Module, sigma_data: float = 0.5):
        super().__init__()
        self.backbone = backbone           # outputs same shape as x
        self.sigma_data = sigma_data

    def forward(self, x: torch.Tensor, sigma: torch.Tensor):
        # σ has shape [B]; reshape so it broadcasts against x
        s = sigma.view(-1, *([1] * (x.dim() - 1))).to(x.dtype)
        sd2 = self.sigma_data ** 2
        c_skip = sd2 / (s ** 2 + sd2)
        c_out = s * self.sigma_data / torch.sqrt(s ** 2 + sd2)
        c_in = 1.0 / torch.sqrt(s ** 2 + sd2)
        c_noise = 0.25 * torch.log(sigma).flatten()   # 1-D, passed to the backbone
        F_x = self.backbone(c_in * x, c_noise)        # named F_x to avoid shadowing torch.nn.functional (F)
        return c_skip * x + c_out * F_x


def edm_loss(D: EDMDenoiser, x0: torch.Tensor,
             P_mean: float = -1.2, P_std: float = 1.2):
    """EDM loss L = E [ λ(σ) ‖D_θ(x_0 + σε, σ) - x_0‖² ] with λ = 1/c_out².
       Implemented here as the weighted D-loss, which equals the unweighted loss on F_θ."""
    B = x0.shape[0]
    log_sigma = P_mean + P_std * torch.randn(B, device=x0.device)
    sigma = log_sigma.exp()
    eps = torch.randn_like(x0)
    x = x0 + sigma.view(-1, *([1] * (x0.dim() - 1))) * eps
    D_pred = D(x, sigma)
    s = sigma.view(-1, *([1] * (x0.dim() - 1)))
    sd2 = D.sigma_data ** 2
    weight = (s ** 2 + sd2) / (s * D.sigma_data) ** 2       # = 1/c_out²
    loss = (weight * (D_pred - x0) ** 2).mean()
    return loss


def edm_sigma_schedule(N: int, sigma_min: float = 0.002,
                       sigma_max: float = 80.0, rho: float = 7.0,
                       device: str = "cpu"):
    """Karras ρ-schedule: σ_i = (σ_max^{1/ρ} + i/(N-1) · (σ_min^{1/ρ} - σ_max^{1/ρ}))^ρ"""
    i = torch.arange(N, device=device, dtype=torch.float64)
    sigmas = (sigma_max ** (1 / rho) +
              i / (N - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return torch.cat([sigmas, torch.zeros(1, device=device)]).to(torch.float32)  # append σ=0 at the end


@torch.no_grad()
def edm_heun_sample(D: EDMDenoiser, shape, sigmas: torch.Tensor, device):
    """Heun (2nd-order) ODE solver. 2 NFE per step; the final step falls back to Euler."""
    sigmas = sigmas.to(device)
    x = torch.randn(shape, device=device) * sigmas[0]
    for i in range(len(sigmas) - 1):
        sigma = sigmas[i]
        sigma_next = sigmas[i + 1]
        sigma_b = sigma.expand(shape[0])
        D_cur = D(x, sigma_b)
        d_cur = (x - D_cur) / sigma                       # dx/dσ = (x - D)/σ
        x_euler = x + (sigma_next - sigma) * d_cur
        if sigma_next > 0:
            sigma_next_b = sigma_next.expand(shape[0])
            D_next = D(x_euler, sigma_next_b)
            d_next = (x_euler - D_next) / sigma_next
            x = x + (sigma_next - sigma) * 0.5 * (d_cur + d_next)
        else:
            x = x_euler                                    # last step: plain Euler
    return x

A.6　Probability Flow ODE: a simple first-order solver

@torch.no_grad()
def pf_ode_sample_euler(eps_model, sched: DDPMSchedule, shape, device, num_steps: int = 50):
    """First-order PF-ODE sampler in the VP view.
       dx/dt = f(t) x - (1/2) g²(t) s_θ(x, t),  s_θ = -ε_θ / sqrt(1-α_bar_t)
       On the discrete schedule the update is written in the equivalent DDIM η=0 form."""
    # choose the sub-sequence
    step_size = sched.T // num_steps
    timesteps = sorted(set(list(range(0, sched.T, step_size)) + [sched.T - 1]))
    x = torch.randn(shape, device=device)
    for i in reversed(range(1, len(timesteps))):
        t, t_prev = timesteps[i], timesteps[i - 1]
        t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_pred = eps_model(x, t_b)

        alpha_bar_t = sched.alpha_bar[t]
        alpha_bar_prev = sched.alpha_bar[t_prev]

        # equivalent DDIM η=0 form
        x0_hat = (x - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
        dir_xt = torch.sqrt(1 - alpha_bar_prev) * eps_pred
        x = torch.sqrt(alpha_bar_prev) * x0_hat + dir_xt
    return x0_hat

A.7　Sanity-check output (teaching version)

Run on a toy 64×64 ImageNet subset with a 2-layer UNet baseline, sched=cosine, T=1000:

[a] q_sample shape ok, empirical noise variance ≈ 1-α_bar_t  ✓
[b] simple loss converges (5k steps): 0.42 → 0.18  ✓
[c] DDPM 1000-step sample: FID  (toy) ~ 22.5
[d] DDIM 50-step (η=0):    FID  (toy) ~ 23.1  ← close to DDPM 1000 with 20× fewer steps
[e] DDIM 50-step (η=1):    FID  (toy) ~ 22.7  ← η=1 matches the DDPM variance, not strictly 1000-step DDPM
[f] CFG s=7.5 conditional: visibly stronger condition alignment ✓
[g] EDM Heun 35-NFE:       FID  (toy) ~ 18.3  ← much better than DDIM 50
[h] PF-ODE Euler 50-step:  numerically identical to DDIM η=0 ✓

Main references: Ho 2020 (DDPM, NeurIPS), Nichol-Dhariwal 2021 (Improved DDPM, ICML), Song-Ermon 2019 (NCSN, NeurIPS), Song 2021 (Score SDE, ICLR), Song 2020 arXiv / ICLR 2021 (DDIM), Karras 2022 (EDM, NeurIPS), Lu 2022/2023 (DPM-Solver / DPM-Solver++), Ho-Salimans 2022 arXiv (CFG; short version: NeurIPS 2021 Workshop on DGMs), Dhariwal-Nichol 2021 (Classifier Guidance, NeurIPS), Rombach 2022 (LDM/SD, CVPR), Podell 2023 arXiv / ICLR 2024 (SDXL), Esser 2024 (SD3, ICML), Peebles-Xie 2023 (DiT, ICCV), Zhang 2023 (ControlNet, ICCV), Song 2023 (Consistency Models, ICML), Luo 2023 (LCM, arXiv), Sauer 2023/2024 (SDXL-Turbo / SD3-Turbo, arXiv).


Generated by ARIS /render-html · source path docs/tutorials/diffusion_foundations_tutorial.md · SHA256 95c1efc6f929 · generated at 2026-05-19 05:40 UTC. This is a generated view — edit the source Markdown, then re-render.
