[Out-of-Distribution Detection] "Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One" (ICLR 2020)

https://arxiv.org/pdf/1912.03263v3.pdf

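Standard classifiers model $p_{\theta}(y \mid \mathbf{x})$. This paper reinterprets a classification model from an energy-based perspective and obtains a hybrid of a generative model and a discriminative model. The hybrid models both $p_{\theta}(y \mid \mathbf{x})$ and $p_{\theta}(\mathbf{x})$, and the paper reports strong results on both classification accuracy and sample quality.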

This paper is also commonly used as a baseline for OOD detection.

Joint Energy-based Model (JEM)

First, an overview of the model structure:

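Call the values that a neural network classifier feeds into the softmax function $f_{\theta}(x)$. A conventional classifier passes $f_{\theta}(x)$ through the softmax to estimate $p(y \mid \mathbf{x})$; this paper additionally uses the same $f_{\theta}(x)$ to estimate $p(\mathbf{x}, y)$ and $p(\mathbf{x})$.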

Method

EBM

Energy-based model:
$p_{\theta}(\mathrm{x})=\frac{\exp \left(-E_{\theta}(\mathrm{x})\right)}{Z(\theta)} \tag{1}$
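where $E_{\theta}(\mathrm{x}): \mathbb{R}^{D} \rightarrow \mathbb{R}$ is the energy function and $Z(\theta)=\int_{\mathbf{x}} \exp \left(-E_{\theta}(\mathbf{x})\right)$ is the partition function (we never need to evaluate it directly). To train the model we can maximize the log-likelihood; its gradient with respect to $\theta$ is (this is one of the two losses used in the paper):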
$\frac{\partial \log p_{\theta}(\mathrm{x})}{\partial \theta}=\mathbb{E}_{p_{\theta}\left(\mathrm{x}^{\prime}\right)}\left[\frac{\partial E_{\theta}\left(\mathrm{x}^{\prime}\right)}{\partial \theta}\right]-\frac{\partial E_{\theta}(\mathrm{x})}{\partial \theta} \tag{2}$
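
Eq. (2) follows from Eq. (1) in a few lines. The steps below fill in that derivation, assuming differentiation and integration can be exchanged; the integration variable $\mathrm{x}^{\prime}$ is the same one that appears in Eq. (2).

$\log p_{\theta}(\mathrm{x})=-E_{\theta}(\mathrm{x})-\log Z(\theta)$

$\frac{\partial \log Z(\theta)}{\partial \theta}=\frac{1}{Z(\theta)} \int_{\mathrm{x}^{\prime}} \exp \left(-E_{\theta}\left(\mathrm{x}^{\prime}\right)\right)\left(-\frac{\partial E_{\theta}\left(\mathrm{x}^{\prime}\right)}{\partial \theta}\right)=-\mathbb{E}_{p_{\theta}\left(\mathrm{x}^{\prime}\right)}\left[\frac{\partial E_{\theta}\left(\mathrm{x}^{\prime}\right)}{\partial \theta}\right]$

Subtracting the second line from the derivative of $-E_{\theta}(\mathrm{x})$ gives Eq. (2): the data term pushes the energy of the observed $\mathrm{x}$ down, and the expectation term pushes the energy of model samples up.
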
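The hard part is sampling from $p_{\theta}(x)$. Earlier work trained EBMs with classical MCMC samplers; this paper uses the more recent Stochastic Gradient Langevin Dynamics (SGLD), a gradient-based sampler: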
$\mathbf{x}_{0} \sim p_{0}(\mathbf{x}), \quad \mathbf{x}_{i+1}=\mathbf{x}_{i}-\frac{\alpha}{2} \frac{\partial E_{\theta}\left(\mathbf{x}_{i}\right)}{\partial \mathbf{x}_{i}}+\epsilon, \quad \epsilon \sim \mathcal{N}(0, \alpha)\tag{3}$
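The update resembles PGD: intuitively, each step moves the sample $x$ toward lower energy, with Gaussian noise added. Each training iteration runs $N$ such sampling steps. Recent work suggests that samples drawn this way approximate the expectation in Eq. (2) well enough for training.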
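
To make the SGLD update in Eq. (3) concrete, here is a minimal sketch on a toy problem. It is not the paper's implementation: the 1-D quadratic energy, the uniform initial distribution $p_0$, the step size `alpha`, and all function names are assumptions chosen so that the target density is a known Gaussian and the result can be checked by eye.

```python
import math
import random


def energy_grad(x):
    """Analytic dE/dx for the toy energy E(x) = (x - 2)^2 / 2, i.e. p(x) = N(2, 1)."""
    return x - 2.0


def sgld_sample(n_steps, alpha, rng):
    """Run one SGLD chain following Eq. (3) and return its final state."""
    x = rng.uniform(-1.0, 1.0)  # x_0 ~ p_0(x); uniform is an assumption
    for _ in range(n_steps):
        # eps ~ N(0, alpha): alpha is the variance, so the std is sqrt(alpha)
        noise = rng.gauss(0.0, math.sqrt(alpha))
        x = x - 0.5 * alpha * energy_grad(x) + noise
    return x


rng = random.Random(0)
samples = [sgld_sample(n_steps=200, alpha=0.1, rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean ~ {mean:.2f}, sample variance ~ {var:.2f}")
```

The chains start far from the mode at 2 and drift toward it, so the sample mean and variance should land near those of the target Gaussian; the finite step size leaves a small bias in the variance. For a neural network energy, `energy_grad` would come from automatic differentiation with respect to the input instead of a closed form.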

The proposed JEM

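Consider a $K$-class classification problem with $f_{\theta}: \mathbb{R}^{D} \rightarrow \mathbb{R}^{K}$, which maps each data point $\mathbf{x} \in \mathbb{R}^{D}$ to $K$ real values known as logits. Through the softmax transfer function, these logits parameterize a categorical distribution: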
$p_{\theta}(y \mid \mathbf{x})=\frac{\exp \left(f_{\theta}(\mathbf{x})[y]\right)}{\sum_{y^{\prime}} \exp \left(f_{\theta}(\mathbf{x})\left[y^{\prime}\right]\right)} \tag{4}$
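where $f_{\theta}(x)[y]$ is the $y$-th component of the network's output vector. Using the same logits, and without changing the model, we can define an energy-based model of the joint distribution of $\mathbf{x}$ and $y$: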
$p_{\theta}(\mathbf{x}, y)=\frac{\exp \left(f_{\theta}(\mathbf{x})[y]\right)}{Z(\theta)} \tag{5}$
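Marginalizing over $y$ (a sum, since $y$ is discrete) also gives a density model for $\mathbf{x}$, known only up to the normalizing constant $Z(\theta)$: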
$p_{\theta}(\mathbf{x})=\sum_{y} p_{\theta}(\mathbf{x}, y)=\frac{\sum_{y} \exp \left(f_{\theta}(\mathbf{x})[y]\right)}{Z(\theta)}\tag{6}$
The energy of a data point $\mathbf{x}$ is therefore:
$E_{\theta}(\mathbf{x})=-\operatorname{LogSumExp}_{y}\left(f_{\theta}(\mathbf{x})[y]\right)=-\log \sum_{y} \exp \left(f_{\theta}(\mathbf{x})[y]\right)\tag{7}$
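
The relationship between Eqs. (4) to (7) can be checked numerically. The sketch below uses made-up logits for a hypothetical 3-class problem (the values and function names are assumptions for illustration) and shows two things: dividing Eq. (5) by Eq. (6) recovers the softmax of Eq. (4) because $Z(\theta)$ cancels, and shifting every logit by a constant leaves $p_{\theta}(y \mid \mathbf{x})$ unchanged while changing the energy.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def energy(logits):
    """E(x) = -log sum_y exp(f(x)[y]), as in Eq. (7)."""
    return -log_sum_exp(logits)


def softmax(logits):
    """p(y | x) as in Eq. (4)."""
    lse = log_sum_exp(logits)
    return [math.exp(v - lse) for v in logits]


logits = [2.0, -1.0, 0.5]  # hypothetical f(x) for a 3-class problem
e_x = energy(logits)
probs = softmax(logits)

# Eq. (5) / Eq. (6): exp(f(x)[y]) / exp(-E(x)); Z(theta) has cancelled.
ratio = [math.exp(v + e_x) for v in logits]

# Adding a constant c to every logit keeps p(y | x) fixed but lowers E(x) by c.
shifted = [v + 3.0 for v in logits]
probs_shifted = softmax(shifted)
e_shifted = energy(shifted)
```

The second check is the degree of freedom the method relies on: a softmax classifier ignores the overall scale of the logits, so that scale is free to encode $\log p_{\theta}(\mathbf{x})$ up to a constant.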
With the model defined, we turn to training. The objective is to maximize the likelihood $p_{\theta}(\mathbf{x}, y)$, which factorizes as:
$\log p_{\theta}(\mathbf{x}, y)=\log p_{\theta}(\mathbf{x})+\log p_{\theta}(y \mid \mathbf{x}) \tag{8}$
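We reach the objective by optimizing the two terms on the right-hand side: $\log p_{\theta}(y \mid \mathbf{x})$ is optimized with the standard cross-entropy loss, and $\log p_{\theta}(\mathbf{x})$ is optimized with the gradient in Eq. (2), using SGLD (Eq. (3)) to draw the samples needed for the expectation.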
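
One way to read Eq. (2) and Eq. (8) together as a single loss is sketched below. It assumes the expectation in Eq. (2) is replaced by one SGLD sample $\tilde{\mathbf{x}}$ from Eq. (3) that is treated as a constant with respect to $\theta$; the symbols $\mathcal{L}$, $\mathcal{L}_{\mathbf{x}}$ and $\tilde{\mathbf{x}}$ are introduced here for illustration only.

$\mathcal{L}_{\mathbf{x}}(\theta)=E_{\theta}(\mathbf{x})-E_{\theta}(\tilde{\mathbf{x}}), \qquad -\frac{\partial \mathcal{L}_{\mathbf{x}}(\theta)}{\partial \theta}=\frac{\partial E_{\theta}(\tilde{\mathbf{x}})}{\partial \theta}-\frac{\partial E_{\theta}(\mathbf{x})}{\partial \theta} \approx \frac{\partial \log p_{\theta}(\mathbf{x})}{\partial \theta}$

$\mathcal{L}(\theta)=\mathcal{L}_{\mathbf{x}}(\theta)-\log p_{\theta}(y \mid \mathbf{x})$

Minimizing $\mathcal{L}$ lowers the energy of real data, raises the energy of sampled points, and applies the usual cross-entropy term to the labels.
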

That is the proposed method; the equations that matter in practice are (2), (3), and (8).

Applications

Beyond classification, the hybrid model proposed in the paper supports several other uses. Here are the three main ones:

It is still a generative model

It can generate samples:

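After a long discussion with Jiaming, a senior labmate, and after reading the paper's code, my guess is that the images are generated with Eq. (3), that is, they are the SGLD samples themselves.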

OOD detection

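With an energy function available, using it for anomaly detection is a natural step. Instead of relying only on $E_{\theta}(\mathbf{x})$ directly, the paper also proposes the following score: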
$s_{\theta}(\mathbf{x})=-\left\|\frac{\partial \log p_{\theta}(\mathbf{x})}{\partial \mathbf{x}}\right\|_{2}$
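
The score above can be sketched for a toy model. The example assumes a hypothetical linear classifier $f(\mathbf{x}) = W\mathbf{x} + b$ so that the input gradient has a closed form; the weights, the input, and the function names are all made up for illustration. Because $Z(\theta)$ does not depend on $\mathbf{x}$, the gradient of $\log p_{\theta}(\mathbf{x})$ equals the gradient of $\operatorname{LogSumExp}_y f_{\theta}(\mathbf{x})[y]$, which for a linear model is $\sum_y p_{\theta}(y \mid \mathbf{x})\, W[y]$.

```python
import math


def logits(W, b, x):
    """f(x) = W x + b for a hypothetical linear classifier (K rows, D columns)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]


def softmax(f):
    """p(y | x) computed stably from the logits."""
    m = max(f)
    exps = [math.exp(v - m) for v in f]
    total = sum(exps)
    return [e / total for e in exps]


def neg_energy(W, b, x):
    """-E(x) = log sum_y exp(f(x)[y]) = log p(x) + log Z."""
    f = logits(W, b, x)
    m = max(f)
    return m + math.log(sum(math.exp(v - m) for v in f))


def grad_log_px(W, b, x):
    """d log p(x) / dx = sum_y p(y | x) * W[y]; Z does not depend on x."""
    p = softmax(logits(W, b, x))
    return [sum(p[k] * W[k][d] for k in range(len(W))) for d in range(len(x))]


def ood_score(W, b, x):
    """s(x) = -|| d log p(x) / dx ||_2."""
    g = grad_log_px(W, b, x)
    return -math.sqrt(sum(v * v for v in g))


W = [[1.0, -2.0], [0.5, 0.3], [-1.5, 1.0]]  # hypothetical weights, K=3, D=2
b = [0.1, -0.2, 0.0]
x = [0.4, -0.7]
grad = grad_log_px(W, b, x)
score = ood_score(W, b, x)
```

The score is never positive, and it is closer to zero where the log-density is flatter around $\mathbf{x}$. For a deep network the gradient would be obtained by automatic differentiation with respect to the input instead of the closed form used here.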
Results:

Robustness

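The SGLD procedure in Eq. (3) closely resembles PGD: it produces many samples that are not real data and feeds them into training, so improved robustness is a plausible consequence.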