1. A Simple Neural Network
(a)
First, write out the forward pass:
$$
\begin{aligned}
z^{[1]} &= W^{[1]} x + W_{0}^{[1]} \\
h &= \sigma\left(z^{[1]}\right) \\
z^{[2]} &= W^{[2]} h + W_{0}^{[2]} \\
o &= \sigma\left(z^{[2]}\right)
\end{aligned}
$$
The loss function is:
$$
\ell = \frac{1}{m} \sum_{i=1}^{m}\left(o^{(i)}-y^{(i)}\right)^{2} = \frac{1}{m} \sum_{i=1}^{m} J^{(i)}
$$
For a single example, apply the chain rule: first compute the derivative with respect to $W^{[2]}$, then reuse those intermediate results to compute the derivative with respect to $W^{[1]}$:
$$
\frac{\partial J}{\partial W^{[1]}}
= \frac{\partial J}{\partial o} \frac{\partial o}{\partial z^{[2]}} \frac{\partial z^{[2]}}{\partial h} \frac{\partial h}{\partial z^{[1]}} \frac{\partial z^{[1]}}{\partial W^{[1]}}
= 2(o-y)\, o(1-o)\, x^{T} h \cdot (1 - h) \cdot W^{[2]}
$$
其中"
⋅
\cdot
⋅"表示element-wise乘法。
Then $w^{[1]}_{1,2}$ is the $(1,2)$ entry of the $2 \times 3$ matrix represented by the RHS. Adding the superscript $(i)$ to denote the $i$-th example:
$$
\frac{\partial J}{\partial w^{[1]}_{1,2}} = 2\left(o^{(i)}-y^{(i)}\right) \cdot o^{(i)}\left(1-o^{(i)}\right) \cdot w_{2}^{[2]} \cdot h_{2}^{(i)}\left(1-h_{2}^{(i)}\right) \cdot x_{1}^{(i)}
$$
where
$$
h_{2} = \sigma\left(w_{1,2}^{[1]} x_{1} + w_{2,2}^{[1]} x_{2} + w_{0,2}^{[1]}\right)
$$
Therefore the derivative of $\ell$ is:
$$
\frac{\partial \ell}{\partial w^{[1]}_{1,2}} = \frac{2}{m} \sum_{i=1}^{m}\left(o^{(i)}-y^{(i)}\right) \cdot o^{(i)}\left(1-o^{(i)}\right) \cdot w_{2}^{[2]} \cdot h_{2}^{(i)}\left(1-h_{2}^{(i)}\right) \cdot x_{1}^{(i)}
$$
The update rule for $w^{[1]}_{1,2}$ is therefore:
$$
w_{1,2}^{[1]} := w_{1,2}^{[1]} - \alpha \frac{2}{m} \sum_{i=1}^{m}\left(o^{(i)}-y^{(i)}\right) \cdot o^{(i)}\left(1-o^{(i)}\right) \cdot w_{2}^{[2]} \cdot h_{2}^{(i)}\left(1-h_{2}^{(i)}\right) \cdot x_{1}^{(i)}
$$
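As a quick sanity check of the derivative above (not part of the original solution), the sketch below builds a small 2-3-1 sigmoid network with squared loss, evaluates the closed-form expression for $\partial J / \partial w^{[1]}_{1,2}$ on one example, and compares it against a finite-difference estimate. The network shape, the random weights, and the example are illustrative assumptions; `W1[1, 0]` holds $w^{[1]}_{1,2}$ under 0-based indexing.

```python
import numpy as np

# A minimal sketch (not the assignment's starter code): 2 inputs, 3 sigmoid hidden
# units, 1 sigmoid output, squared loss. Used to check d J / d w^{[1]}_{1,2}.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # hidden layer weights / bias
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)   # output layer weights / bias
x, y = np.array([0.3, -0.7]), 1.0                       # one (assumed) example

def loss(W1_):
    h = sigmoid(W1_ @ x + b1)
    o = sigmoid(W2 @ h + b2)[0]
    return (o - y) ** 2

# Analytic gradient for w^{[1]}_{1,2} (input x_1 -> hidden unit 2), stored at W1[1, 0].
h = sigmoid(W1 @ x + b1)
o = sigmoid(W2 @ h + b2)[0]
analytic = 2 * (o - y) * o * (1 - o) * W2[0, 1] * h[1] * (1 - h[1]) * x[0]

# Central finite-difference estimate of the same entry.
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[1, 0] += eps
W1m[1, 0] -= eps
numeric = (loss(W1p) - loss(W1m)) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely
```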
(b)
Yes, it is possible. The three hidden neurons can be viewed as three independent linear classifiers, each representing a hyperplane that lies along one of the three triangle edges separating the 0 and 1 classes in the scatter plot. Passing an example through a neuron's linear weights yields a positive or negative value indicating on which side of the hyperplane the point lies, and the step function turns this into a binary indicator. The output layer's linear combination then checks whether the three neurons place the point on the "1" (or "0") side of their respective hyperplanes; if so, the combination is positive (otherwise non-positive), and a final step function assigns the class. 100% accuracy is achievable because each boundary is linearly separable. One such set of weights follows:
```python
def optimal_step_weights():
    """Return the optimal weights for the neural network with a step activation function.

    This function will not be graded if there are no optimal weights.
    See the PDF for instructions on what each weight represents.

    The hidden layer weights are notated by [1] on the problem set and
    the output layer weights are notated by [2].

    This function should return a dict with elements for each weight, see example_weights above.
    """
    w = example_weights()

    # *** START CODE HERE ***
    # Hyperplane x1 = 0.5
    w['hidden_layer_0_1'] = 0.5
    w['hidden_layer_1_1'] = -1
    w['hidden_layer_2_1'] = 0
    # Hyperplane x2 = 0.5
    w['hidden_layer_0_2'] = 0.5
    w['hidden_layer_1_2'] = 0
    w['hidden_layer_2_2'] = -1
    # Hyperplane x1 + x2 = 4
    w['hidden_layer_0_3'] = -4
    w['hidden_layer_1_3'] = 1
    w['hidden_layer_2_3'] = 1
    # Points for which all three hidden units output 0 are labeled class 0
    w['output_layer_0'] = -0.5
    w['output_layer_1'] = 1
    w['output_layer_2'] = 1
    w['output_layer_3'] = 1
    # *** END CODE HERE ***

    return w
```
```python
def example_weights():
    """This is an example function that returns weights.

    Use this function as a template for optimal_step_weights and optimal_sigmoid_weights.
    You do not need to modify this class for this assignment.
    """
    w = {}

    w['hidden_layer_0_1'] = 0
    w['hidden_layer_1_1'] = 0
    w['hidden_layer_2_1'] = 0
    w['hidden_layer_0_2'] = 0
    w['hidden_layer_1_2'] = 0
    w['hidden_layer_2_2'] = 0
    w['hidden_layer_0_3'] = 0
    w['hidden_layer_1_3'] = 0
    w['hidden_layer_2_3'] = 0

    w['output_layer_0'] = 0
    w['output_layer_1'] = 0
    w['output_layer_2'] = 0
    w['output_layer_3'] = 0

    return w
```
```python
example_w = optimal_step_weights()
example_w
```
```
{'hidden_layer_0_1': 0.5,
 'hidden_layer_1_1': -1,
 'hidden_layer_2_1': 0,
 'hidden_layer_0_2': 0.5,
 'hidden_layer_1_2': 0,
 'hidden_layer_2_2': -1,
 'hidden_layer_0_3': -4,
 'hidden_layer_1_3': 1,
 'hidden_layer_2_3': 1,
 'output_layer_0': -0.5,
 'output_layer_1': 1,
 'output_layer_2': 1,
 'output_layer_3': 1}
```
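As a small illustrative check (not required by the assignment), the sketch below runs the step network with `example_w` on a few hand-picked points that are consistent with the hyperplanes chosen above: the first point lies inside the triangle $x_1 > 0.5$, $x_2 > 0.5$, $x_1 + x_2 < 4$ (class 0), the others lie outside (class 1). The test points are my own assumptions, not the actual dataset from the problem.

```python
import numpy as np

# Step activation: 1 if z >= 0 else 0.
step = lambda z: (z >= 0).astype(float)

def predict(x, w):
    # Hidden layer: one step unit per hyperplane.
    h = step(np.array([
        w['hidden_layer_0_1'] + w['hidden_layer_1_1'] * x[0] + w['hidden_layer_2_1'] * x[1],
        w['hidden_layer_0_2'] + w['hidden_layer_1_2'] * x[0] + w['hidden_layer_2_2'] * x[1],
        w['hidden_layer_0_3'] + w['hidden_layer_1_3'] * x[0] + w['hidden_layer_2_3'] * x[1],
    ]))
    # Output layer: fires if any hidden unit fires.
    o = step(w['output_layer_0']
             + w['output_layer_1'] * h[0]
             + w['output_layer_2'] * h[1]
             + w['output_layer_3'] * h[2])
    return int(o)

# Hand-picked points: the first lies inside the triangle, the rest outside.
for x, expected in [((2.0, 1.0), 0), ((0.2, 2.0), 1), ((1.0, 0.3), 1), ((3.0, 3.0), 1)]:
    assert predict(x, example_w) == expected
```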
(c)
No, it is not possible. This is covered in the lecture notes: without a nonlinear activation function at each layer, all layers collapse into a single linear operation. In this problem, if the hidden-layer activation is linear (the identity), then:
$$
\begin{aligned}
o &= \sigma\left(z^{[2]}\right) \\
&= \sigma\left(W^{[2]} h + W_{0}^{[2]}\right) \\
&= \sigma\left(W^{[2]}\left(W^{[1]} x + W_{0}^{[1]}\right) + W_{0}^{[2]}\right) \\
&= \sigma\left(W^{[2]} W^{[1]} x + W^{[2]} W_{0}^{[1]} + W_{0}^{[2]}\right) \\
&= \sigma\left(\tilde{W} x + \tilde{W}_{0}\right)
\end{aligned}
$$
where $\tilde{W} = W^{[2]} W^{[1]}$ and $\tilde{W}_{0} = W^{[2]} W_{0}^{[1]} + W_{0}^{[2]}$. This is equivalent to performing a single linear classification, and the scatter plot shows the dataset is not linearly separable, so the goal of 100% accuracy cannot be achieved.
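A short numeric illustration of this collapse (my own sketch, with arbitrary random weights): composing two affine layers with an identity activation gives exactly one affine map with $\tilde{W}$ and $\tilde{W}_0$ as defined above.

```python
import numpy as np

# With a linear (identity) hidden activation, the two-layer map equals one affine map.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)

W_tilde = W2 @ W1        # \tilde{W}   = W^{[2]} W^{[1]}
b_tilde = W2 @ b1 + b2   # \tilde{W}_0 = W^{[2]} W^{[1]}_0 + W^{[2]}_0

x = rng.normal(size=2)
two_layer = W2 @ (W1 @ x + b1) + b2
one_layer = W_tilde @ x + b_tilde
assert np.allclose(two_layer, one_layer)
```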
2. KL divergence and Maximum Likelihood
(a)
The key is to apply Jensen's inequality:
$$
\begin{aligned}
D_{\mathrm{KL}}(P \| Q) &= \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} \\
&= -\sum_{x \in \mathcal{X}} P(x) \log \frac{Q(x)}{P(x)} \\
&= E\left[-\log \frac{Q(x)}{P(x)}\right] \\
&\geq -\log E\left[\frac{Q(x)}{P(x)}\right] \\
&= -\log \left(\sum_{x \in \mathcal{X}} P(x) \frac{Q(x)}{P(x)}\right) \\
&= -\log \sum_{x \in \mathcal{X}} Q(x) \\
&= -\log 1 \\
&= 0
\end{aligned}
$$
For the equality case: if $P = Q$, then $D_{\mathrm{KL}}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log 1 = 0$.
Conversely, if $D_{\mathrm{KL}}(P \| Q) = 0$, then by the equality condition of Jensen's inequality the ratio $\frac{Q(x)}{P(x)}$ must be constant, so $\frac{Q(x)}{P(x)} = E\left[\frac{Q(x)}{P(x)}\right] = \sum_{x \in \mathcal{X}} P(x) \frac{Q(x)}{P(x)} = \sum_{x \in \mathcal{X}} Q(x) = 1$, i.e. $P = Q$.
Therefore $D_{\mathrm{KL}}(P \| Q) = 0$ if and only if $P = Q$.
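A tiny numeric illustration of the result (my own example, two distributions on an assumed 4-element support):

```python
import numpy as np

# KL divergence between two discrete distributions with full support.
def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])
print(kl(p, q))         # > 0 since p != q
print(kl(p, p.copy()))  # 0.0 when P = Q
```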
(b)
Proof:
$$
\begin{aligned}
D_{\mathrm{KL}}(P(X, Y) \| Q(X, Y)) &= \sum_{x} \sum_{y} P(x, y) \log \frac{P(x, y)}{Q(x, y)} \\
&= \sum_{x} \sum_{y} P(x) P(y \mid x) \log \frac{P(x) P(y \mid x)}{Q(x) Q(y \mid x)} \\
&= \sum_{x} \sum_{y} P(x) P(y \mid x)\left(\log \frac{P(x)}{Q(x)} + \log \frac{P(y \mid x)}{Q(y \mid x)}\right) \\
&= \sum_{x} \sum_{y} P(x) P(y \mid x) \log \frac{P(x)}{Q(x)} + \sum_{x} \sum_{y} P(x) P(y \mid x) \log \frac{P(y \mid x)}{Q(y \mid x)} \\
&= \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \sum_{y} P(y \mid x) + \sum_{x} P(x) \sum_{y} P(y \mid x) \log \frac{P(y \mid x)}{Q(y \mid x)} \\
&= \sum_{x} P(x) \log \frac{P(x)}{Q(x)} + \sum_{x} P(x)\left(\sum_{y} P(y \mid x) \log \frac{P(y \mid x)}{Q(y \mid x)}\right) \\
&= D_{\mathrm{KL}}(P(X) \| Q(X)) + D_{\mathrm{KL}}(P(Y \mid X) \| Q(Y \mid X))
\end{aligned}
$$
(c)
Proof:
$$
\begin{aligned}
\arg \min _{\theta} D_{\mathrm{KL}}\left(\hat{P} \| P_{\theta}\right) &= \arg \min _{\theta} \sum_{x \in \mathcal{X}} \hat{P}(x) \log \frac{\hat{P}(x)}{P_{\theta}(x)} \\
&= \arg \min _{\theta} \sum_{x \in \mathcal{X}} \hat{P}(x) \log \hat{P}(x) - \sum_{x \in \mathcal{X}} \hat{P}(x) \log P_{\theta}(x) \\
&= \arg \max _{\theta} \sum_{x \in \mathcal{X}} \hat{P}(x) \log P_{\theta}(x) \\
&= \arg \max _{\theta} \sum_{x \in \mathcal{X}}\left(\frac{1}{m} \sum_{i=1}^{m} 1\left\{x^{(i)}=x\right\}\right) \log P_{\theta}(x) \\
&= \arg \max _{\theta} \sum_{i=1}^{m} \log P_{\theta}\left(x^{(i)}\right)
\end{aligned}
$$
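A small numeric sketch of this equivalence (my own example, a Bernoulli model with simulated data): over a grid of $\theta$, the value minimizing $D_{\mathrm{KL}}(\hat{P} \| P_\theta)$ coincides with the value maximizing the log-likelihood.

```python
import numpy as np

# Simulated Bernoulli data and its empirical distribution P_hat over {0, 1}.
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=200)
p_hat = np.array([np.mean(x == 0), np.mean(x == 1)])

# Evaluate both objectives on a grid of theta values.
thetas = np.linspace(0.01, 0.99, 99)
kl = [np.sum(p_hat * np.log(p_hat / np.array([1 - t, t]))) for t in thetas]
ll = [np.sum(np.log(np.where(x == 1, t, 1 - t))) for t in thetas]

# The KL minimizer and the log-likelihood maximizer are the same grid point.
assert thetas[np.argmin(kl)] == thetas[np.argmax(ll)]
```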
3. KL divergence, Fisher Information, and the Natural Gradient
(a)
Proof:
$$
\begin{aligned}
\nabla_{\theta} \log p(y ; \theta) &= \frac{\nabla_{\theta} p(y ; \theta)}{p(y ; \theta)} \\
\mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] &= \mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{\nabla_{\theta} p(y ; \theta)}{p(y ; \theta)}\right] \\
&= \int_{-\infty}^{\infty} p(y ; \theta) \frac{\nabla_{\theta} p(y ; \theta)}{p(y ; \theta)} \, dy \\
&= \int_{-\infty}^{\infty} \nabla_{\theta} p(y ; \theta) \, dy \\
&= \nabla_{\theta} \int_{-\infty}^{\infty} p(y ; \theta) \, dy \\
&= \nabla_{\theta} 1 \\
&= 0
\end{aligned}
$$
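For a concrete check (my own example, a Bernoulli$(\theta)$ model where the expectation can be computed exactly), the score indeed has mean zero under $y \sim p(y;\theta)$:

```python
# Score of a Bernoulli(theta) model: d/dtheta log p(y; theta).
theta = 0.3
score = lambda y: y / theta - (1 - y) / (1 - theta)

# Exact expectation over y in {0, 1}.
mean_score = (1 - theta) * score(0) + theta * score(1)
print(mean_score)  # 0.0 (up to floating-point error)
```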
(b)
Proof:
$$
\begin{aligned}
\operatorname{Cov}[X] &= E\left[(X-E[X])(X-E[X])^{T}\right] = E\left[X X^{T}\right] \quad \text{when } E[X]=0 \\
\mathcal{I}(\theta) &= \operatorname{Cov}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] \\
&= \mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right) \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)^{T}\right|_{\theta^{\prime}=\theta}\right]
\end{aligned}
$$
Here we used the result from (a) that the score function has mean zero.
(c)
Proof:
$$
\frac{\partial \log p(y ; \theta)}{\partial \theta_{i}} = \frac{1}{p(y ; \theta)} \frac{\partial p(y ; \theta)}{\partial \theta_{i}}
$$

$$
\begin{aligned}
\mathcal{I}(\theta)_{ij} &= \mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right) \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)^{T}\right|_{\theta^{\prime}=\theta}\right]_{ij} \\
&= \mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{\partial \log p(y ; \theta)}{\partial \theta_{i}} \frac{\partial \log p(y ; \theta)}{\partial \theta_{j}}\right] \\
&= \mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{1}{(p(y ; \theta))^{2}} \frac{\partial p(y ; \theta)}{\partial \theta_{i}} \frac{\partial p(y ; \theta)}{\partial \theta_{j}}\right]
\end{aligned}
$$

Differentiating the log-density a second time gives

$$
\frac{\partial^{2} \log p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}} = -\frac{1}{(p(y ; \theta))^{2}} \frac{\partial p(y ; \theta)}{\partial \theta_{i}} \frac{\partial p(y ; \theta)}{\partial \theta_{j}} + \frac{1}{p(y ; \theta)} \frac{\partial^{2} p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}}
$$

so that

$$
\begin{aligned}
\mathbb{E}_{y \sim p(y ; \theta)}\left[-\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right]_{ij}
&= \mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{1}{(p(y ; \theta))^{2}} \frac{\partial p(y ; \theta)}{\partial \theta_{i}} \frac{\partial p(y ; \theta)}{\partial \theta_{j}} - \frac{1}{p(y ; \theta)} \frac{\partial^{2} p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}}\right] \\
&= \mathcal{I}(\theta)_{ij} - \int_{-\infty}^{\infty} p(y ; \theta) \frac{1}{p(y ; \theta)} \frac{\partial^{2} p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}} \, dy \\
&= \mathcal{I}(\theta)_{ij} - \frac{\partial^{2}}{\partial \theta_{i} \partial \theta_{j}} \int_{-\infty}^{\infty} p(y ; \theta) \, dy \\
&= \mathcal{I}(\theta)_{ij}
\end{aligned}
$$

Therefore:

$$
\mathbb{E}_{y \sim p(y ; \theta)}\left[-\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] = \mathcal{I}(\theta)
$$
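For a concrete check of (b) and (c) (my own example, a scalar Bernoulli$(\theta)$ model so the expectations are exact), the Fisher information computed as $\mathbb{E}[\text{score}^2]$ and as $-\mathbb{E}[\text{Hessian of } \log p]$ agree, and both equal the known closed form $1/(\theta(1-\theta))$:

```python
theta = 0.3
score   = lambda y: y / theta - (1 - y) / (1 - theta)          # d log p / d theta
hessian = lambda y: -y / theta**2 - (1 - y) / (1 - theta)**2   # d^2 log p / d theta^2

# Exact expectation over y in {0, 1}.
E = lambda f: (1 - theta) * f(0) + theta * f(1)

print(E(lambda y: score(y) ** 2))   # Fisher info via the outer-product form
print(E(lambda y: -hessian(y)))     # Fisher info via the negative expected Hessian
print(1 / (theta * (1 - theta)))    # closed form; all three agree
```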
(d)
Proof:
$$
\begin{aligned}
\log p(y ; \tilde{\theta}) & \approx \log p(y ; \theta)+\left.(\tilde{\theta}-\theta)^{T} \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}+\frac{1}{2}(\tilde{\theta}-\theta)^{T}\left(\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right)(\tilde{\theta}-\theta) \\
&=\log p(y ; \theta)+\left.d^{T} \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}+\frac{1}{2} d^{T}\left(\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right) d
\end{aligned}
$$
Taking expectations over $y \sim p(y;\theta)$, the first-order term vanishes by part (a), and by part (c) the expected Hessian equals $-\mathcal{I}(\theta)$:

$$
\begin{aligned}
\mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \tilde{\theta})] &= \mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \theta)]+\frac{1}{2} d^{T} \mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] d \\
&= \mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \theta)]-\frac{1}{2} d^{T} \mathcal{I}(\theta) d
\end{aligned}
$$

$$
\begin{aligned}
D_{\mathrm{KL}}\left(p_{\theta} \| p_{\theta+d}\right) &= D_{\mathrm{KL}}\left(p_{\theta} \| p_{\tilde{\theta}}\right) \\
&= \mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \theta)]-\mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \tilde{\theta})] \\
&\approx \frac{1}{2} d^{T} \mathcal{I}(\theta) d
\end{aligned}
$$
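A quick numeric check of this approximation (my own example, a Bernoulli$(\theta)$ model with a small perturbation $d$, using the closed-form Fisher information $1/(\theta(1-\theta))$):

```python
import numpy as np

theta, d = 0.3, 1e-3

# Exact KL between Bernoulli(theta) and Bernoulli(theta + d).
kl = (theta * np.log(theta / (theta + d))
      + (1 - theta) * np.log((1 - theta) / (1 - theta - d)))

# Quadratic approximation (1/2) d^T I(theta) d.
quad = 0.5 * d**2 / (theta * (1 - theta))
print(kl, quad)  # the two values agree to leading order in d
```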
(e)
Step 1: approximate the objective and the constraint with Taylor expansions:
$$
\begin{aligned}
\ell(\theta+d) & \approx \ell(\theta)+\left.d^{T} \nabla_{\theta^{\prime}} \ell\left(\theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} \\
&=\log p(y ; \theta)+\left.d^{T} \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} \\
&=\log p(y ; \theta)+d^{T} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)}
\end{aligned}
$$

$$
D_{\mathrm{KL}}\left(p_{\theta} \| p_{\theta+d}\right) \approx \frac{1}{2} d^{T} \mathcal{I}(\theta) d
$$
Step 2: write the Lagrangian:
$$
\begin{aligned}
\mathcal{L}(d, \lambda) &= \ell(\theta+d)-\lambda\left[D_{\mathrm{KL}}\left(p_{\theta} \| p_{\theta+d}\right)-c\right] \\
& \approx \log p(y ; \theta)+d^{T} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)}-\lambda\left[\frac{1}{2} d^{T} \mathcal{I}(\theta) d-c\right]
\end{aligned}
$$
Step 3: set the derivatives of the Lagrangian to zero. The derivative with respect to $d$ gives:
$$
\begin{aligned}
\nabla_{d} \mathcal{L}(d, \lambda) & \approx \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)}-\lambda \mathcal{I}(\theta) d = 0 \\
\tilde{d} &= \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)}
\end{aligned}
$$
Although the exact value of $\lambda$ is not yet known, it is a positive real number, so this already gives the direction of the natural gradient. Next, use the stationarity condition with respect to $\lambda$ to solve for its value:
$$
\begin{aligned}
\nabla_{\lambda} \mathcal{L}(d, \lambda) & \approx c-\frac{1}{2} d^{T} \mathcal{I}(\theta) d \\
&=c-\frac{1}{2} \cdot \frac{1}{\lambda} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T}}{p(y ; \theta)} \mathcal{I}(\theta)^{-1} \cdot \mathcal{I}(\theta) \cdot \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} \\
&=c-\left.\left.\frac{1}{2 \lambda^{2}(p(y ; \theta))^{2}} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T} \mathcal{I}(\theta)^{-1} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} \\
&=0 \\
\lambda &= \sqrt{\left.\left.\frac{1}{2 c(p(y ; \theta))^{2}} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T} \mathcal{I}(\theta)^{-1} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}
\end{aligned}
$$
The natural gradient $d^{*}$ is therefore:
$$
\begin{aligned}
d^{*} &=\sqrt{\frac{2 c(p(y ; \theta))^{2}}{\left.\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T} \mathcal{I}(\theta)^{-1} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}} \; \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} \\
&=\left.\sqrt{\frac{2 c}{\left.\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T} \mathcal{I}(\theta)^{-1} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}} \; \mathcal{I}(\theta)^{-1} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}
\end{aligned}
$$
(f)
From the previous part:
$$
\tilde{d}=\frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} = \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} = \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \left.\nabla_{\theta^{\prime}} \ell(\theta^{\prime})\right|_{\theta^{\prime}=\theta}
$$
Also note that:
$$
\begin{aligned}
\mathcal{I}(\theta) &=\mathbb{E}_{y \sim p(y ; \theta)}\left[-\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] \\
&=\mathbb{E}_{y \sim p(y ; \theta)}\left[-\nabla_{\theta}^{2} \ell(\theta)\right] \\
&=-\mathbb{E}_{y \sim p(y ; \theta)}[H]
\end{aligned}
$$
Therefore:
$$
\begin{aligned}
\theta &:= \theta+\tilde{d} \\
&=\theta+\frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \nabla_{\theta} \ell(\theta) \\
&=\theta-\frac{1}{\lambda} \mathbb{E}_{y \sim p(y ; \theta)}[H]^{-1} \nabla_{\theta} \ell(\theta)
\end{aligned}
$$
For Newton's method, the update rule is:
$$
\theta := \theta-H^{-1} \nabla_{\theta} \ell(\theta)
$$
In both cases the update direction is the inverse of a Hessian (the observed Hessian for Newton's method, the negated expected Hessian for the natural gradient) multiplied by the gradient of the objective with respect to the parameters.
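As a rough numeric illustration of this correspondence (my own sketch: a scalar Bernoulli model with simulated data, and the step factor $1/\lambda$ taken to be 1), both directions move $\theta$ toward the MLE, the empirical mean of $y$; they differ only in whether the curvature is measured by the observed Hessian or by the Fisher information.

```python
import numpy as np

# Simulated Bernoulli data and a current parameter guess.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.8, size=100)
theta = 0.4

grad = np.sum(y / theta - (1 - y) / (1 - theta))          # gradient of the log-likelihood
hess = np.sum(-y / theta**2 - (1 - y) / (1 - theta)**2)   # observed Hessian
fisher = len(y) / (theta * (1 - theta))                   # m times the per-sample Fisher info

newton_step = theta - grad / hess        # theta := theta - H^{-1} grad
natural_step = theta + grad / fisher     # theta := theta + I(theta)^{-1} grad (1/lambda = 1)
print(np.mean(y), newton_step, natural_step)  # both steps move toward the empirical mean
```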




