1. A Simple Neural Network
(a)
First, write out the forward pass:
$$
\begin{aligned}
z^{[1]} &= W^{[1]} x + W_{0}^{[1]} \\
h &= \sigma\left(z^{[1]}\right) \\
z^{[2]} &= W^{[2]} h + W_{0}^{[2]} \\
o &= \sigma\left(z^{[2]}\right)
\end{aligned}
$$
The loss function is:
$$
\ell = \frac{1}{m} \sum_{i=1}^{m}\left(o^{(i)}-y^{(i)}\right)^{2} = \frac{1}{m} \sum_{i=1}^{m} J^{(i)}
$$
For a single example, apply the chain rule: first compute the derivative with respect to $W^{[2]}$, then reuse those intermediate results to compute the derivative with respect to $W^{[1]}$:
$$
\frac{\partial J}{\partial W^{[1]}}
= \frac{\partial J}{\partial o} \frac{\partial o}{\partial z^{[2]}} \frac{\partial z^{[2]}}{\partial h} \frac{\partial h}{\partial z^{[1]}} \frac{\partial z^{[1]}}{\partial W^{[1]}}
= 2(o-y)\, o(1-o)\, x^{T} h \cdot (1 - h) \cdot W^{[2]}
$$
其中"
⋅
\cdot
⋅"表示element-wise乘法。
Then $w^{[1]}_{1,2}$ is the $(1,2)$ entry of the $2 \times 3$ matrix represented by the RHS. Adding the superscript $(i)$ to denote the $i$-th example:
$$
\frac{\partial J}{\partial w^{[1]}_{1,2}} = 2\left(o^{(i)}-y^{(i)}\right) \cdot o^{(i)}\left(1-o^{(i)}\right) \cdot w_{2}^{[2]} \cdot h_{2}^{(i)}\left(1-h_{2}^{(i)}\right) \cdot x_{1}^{(i)}
$$
where
$$
h_{2} = \sigma\left(w_{1,2}^{[1]} x_{1} + w_{2,2}^{[1]} x_{2} + w_{0,2}^{[1]}\right)
$$
Therefore the derivative of $\ell$ is:
$$
\frac{\partial \ell}{\partial w^{[1]}_{1,2}} = \frac{2}{m} \sum_{i=1}^{m}\left(o^{(i)}-y^{(i)}\right) \cdot o^{(i)}\left(1-o^{(i)}\right) \cdot w_{2}^{[2]} \cdot h_{2}^{(i)}\left(1-h_{2}^{(i)}\right) \cdot x_{1}^{(i)}
$$
The update rule for $w^{[1]}_{1,2}$ is therefore:
$$
w_{1,2}^{[1]} := w_{1,2}^{[1]} - \alpha \frac{2}{m} \sum_{i=1}^{m}\left(o^{(i)}-y^{(i)}\right) \cdot o^{(i)}\left(1-o^{(i)}\right) \cdot w_{2}^{[2]} \cdot h_{2}^{(i)}\left(1-h_{2}^{(i)}\right) \cdot x_{1}^{(i)}
$$
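As a quick sanity check of the derivative above (not part of the original solution), the sketch below builds a small 2-3-1 sigmoid network with squared loss, evaluates the closed-form expression for $\partial J / \partial w^{[1]}_{1,2}$ on one example, and compares it against a finite-difference estimate. The network shape, the random weights, and the example are illustrative assumptions; `W1[1, 0]` holds $w^{[1]}_{1,2}$ under 0-based indexing.

```python
import numpy as np

# A minimal sketch (not the assignment's starter code): 2 inputs, 3 sigmoid hidden
# units, 1 sigmoid output, squared loss. Used to check d J / d w^{[1]}_{1,2}.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # hidden layer weights / bias
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)   # output layer weights / bias
x, y = np.array([0.3, -0.7]), 1.0                       # one (assumed) example

def loss(W1_):
    h = sigmoid(W1_ @ x + b1)
    o = sigmoid(W2 @ h + b2)[0]
    return (o - y) ** 2

# Analytic gradient for w^{[1]}_{1,2} (input x_1 -> hidden unit 2), stored at W1[1, 0].
h = sigmoid(W1 @ x + b1)
o = sigmoid(W2 @ h + b2)[0]
analytic = 2 * (o - y) * o * (1 - o) * W2[0, 1] * h[1] * (1 - h[1]) * x[0]

# Central finite-difference estimate of the same entry.
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[1, 0] += eps
W1m[1, 0] -= eps
numeric = (loss(W1p) - loss(W1m)) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely
```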
(b)
Yes, it is possible. The three hidden neurons can be viewed as three independent linear classifiers, each representing a hyperplane that lies along one of the three triangle edges separating the 0 and 1 classes in the scatter plot. Passing an example through a neuron's linear weights yields a positive or negative value indicating on which side of the hyperplane the point lies, and the step function turns this into a binary indicator. The output layer's linear combination then checks whether the three neurons place the point on the "1" (or "0") side of their respective hyperplanes; if so, the combination is positive (otherwise non-positive), and a final step function assigns the class. 100% accuracy is achievable because each boundary is linearly separable. One such set of weights follows:
```python
def optimal_step_weights():
    """Return the optimal weights for the neural network with a step activation function.

    This function will not be graded if there are no optimal weights.
    See the PDF for instructions on what each weight represents.

    The hidden layer weights are notated by [1] on the problem set and
    the output layer weights are notated by [2].

    This function should return a dict with elements for each weight, see example_weights above.
    """
    w = example_weights()

    # *** START CODE HERE ***
    # Hyperplane x1 = 0.5
    w['hidden_layer_0_1'] = 0.5
    w['hidden_layer_1_1'] = -1
    w['hidden_layer_2_1'] = 0
    # Hyperplane x2 = 0.5
    w['hidden_layer_0_2'] = 0.5
    w['hidden_layer_1_2'] = 0
    w['hidden_layer_2_2'] = -1
    # Hyperplane x1 + x2 = 4
    w['hidden_layer_0_3'] = -4
    w['hidden_layer_1_3'] = 1
    w['hidden_layer_2_3'] = 1
    # Points for which all three hidden units output 0 are labeled class 0
    w['output_layer_0'] = -0.5
    w['output_layer_1'] = 1
    w['output_layer_2'] = 1
    w['output_layer_3'] = 1
    # *** END CODE HERE ***

    return w
```
```python
def example_weights():
    """This is an example function that returns weights.

    Use this function as a template for optimal_step_weights and optimal_sigmoid_weights.
    You do not need to modify this class for this assignment.
    """
    w = {}

    w['hidden_layer_0_1'] = 0
    w['hidden_layer_1_1'] = 0
    w['hidden_layer_2_1'] = 0
    w['hidden_layer_0_2'] = 0
    w['hidden_layer_1_2'] = 0
    w['hidden_layer_2_2'] = 0
    w['hidden_layer_0_3'] = 0
    w['hidden_layer_1_3'] = 0
    w['hidden_layer_2_3'] = 0

    w['output_layer_0'] = 0
    w['output_layer_1'] = 0
    w['output_layer_2'] = 0
    w['output_layer_3'] = 0

    return w
```
```python
example_w = optimal_step_weights()
example_w
```
```
{'hidden_layer_0_1': 0.5,
 'hidden_layer_1_1': -1,
 'hidden_layer_2_1': 0,
 'hidden_layer_0_2': 0.5,
 'hidden_layer_1_2': 0,
 'hidden_layer_2_2': -1,
 'hidden_layer_0_3': -4,
 'hidden_layer_1_3': 1,
 'hidden_layer_2_3': 1,
 'output_layer_0': -0.5,
 'output_layer_1': 1,
 'output_layer_2': 1,
 'output_layer_3': 1}
```
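As a small illustrative check (not required by the assignment), the sketch below runs the step network with `example_w` on a few hand-picked points that are consistent with the hyperplanes chosen above: the first point lies inside the triangle $x_1 > 0.5$, $x_2 > 0.5$, $x_1 + x_2 < 4$ (class 0), the others lie outside (class 1). The test points are my own assumptions, not the actual dataset from the problem.

```python
import numpy as np

# Step activation: 1 if z >= 0 else 0.
step = lambda z: (z >= 0).astype(float)

def predict(x, w):
    # Hidden layer: one step unit per hyperplane.
    h = step(np.array([
        w['hidden_layer_0_1'] + w['hidden_layer_1_1'] * x[0] + w['hidden_layer_2_1'] * x[1],
        w['hidden_layer_0_2'] + w['hidden_layer_1_2'] * x[0] + w['hidden_layer_2_2'] * x[1],
        w['hidden_layer_0_3'] + w['hidden_layer_1_3'] * x[0] + w['hidden_layer_2_3'] * x[1],
    ]))
    # Output layer: fires if any hidden unit fires.
    o = step(w['output_layer_0']
             + w['output_layer_1'] * h[0]
             + w['output_layer_2'] * h[1]
             + w['output_layer_3'] * h[2])
    return int(o)

# Hand-picked points: the first lies inside the triangle, the rest outside.
for x, expected in [((2.0, 1.0), 0), ((0.2, 2.0), 1), ((1.0, 0.3), 1), ((3.0, 3.0), 1)]:
    assert predict(x, example_w) == expected
```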
(c)
No, it is not possible. This is covered in the lecture notes: without a nonlinear activation function at each layer, all layers collapse into a single linear operation. In this problem, if the hidden-layer activation is linear (the identity), then:
$$
\begin{aligned}
o &= \sigma\left(z^{[2]}\right) \\
&= \sigma\left(W^{[2]} h + W_{0}^{[2]}\right) \\
&= \sigma\left(W^{[2]}\left(W^{[1]} x + W_{0}^{[1]}\right) + W_{0}^{[2]}\right) \\
&= \sigma\left(W^{[2]} W^{[1]} x + W^{[2]} W_{0}^{[1]} + W_{0}^{[2]}\right) \\
&= \sigma\left(\tilde{W} x + \tilde{W}_{0}\right)
\end{aligned}
$$
where $\tilde{W} = W^{[2]} W^{[1]}$ and $\tilde{W}_{0} = W^{[2]} W_{0}^{[1]} + W_{0}^{[2]}$. This is equivalent to performing a single linear classification, and the scatter plot shows the dataset is not linearly separable, so the goal of 100% accuracy cannot be achieved.
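A short numeric illustration of this collapse (my own sketch, with arbitrary random weights): composing two affine layers with an identity activation gives exactly one affine map with $\tilde{W}$ and $\tilde{W}_0$ as defined above.

```python
import numpy as np

# With a linear (identity) hidden activation, the two-layer map equals one affine map.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)

W_tilde = W2 @ W1        # \tilde{W}   = W^{[2]} W^{[1]}
b_tilde = W2 @ b1 + b2   # \tilde{W}_0 = W^{[2]} W^{[1]}_0 + W^{[2]}_0

x = rng.normal(size=2)
two_layer = W2 @ (W1 @ x + b1) + b2
one_layer = W_tilde @ x + b_tilde
assert np.allclose(two_layer, one_layer)
```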
2. KL divergence and Maximum Likelihood
(a)
The key is to apply Jensen's inequality:
$$
\begin{aligned}
D_{\mathrm{KL}}(P \| Q) &= \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} \\
&= -\sum_{x \in \mathcal{X}} P(x) \log \frac{Q(x)}{P(x)} \\
&= E\left[-\log \frac{Q(x)}{P(x)}\right] \\
&\geq -\log E\left[\frac{Q(x)}{P(x)}\right] \\
&= -\log \left(\sum_{x \in \mathcal{X}} P(x) \frac{Q(x)}{P(x)}\right) \\
&= -\log \sum_{x \in \mathcal{X}} Q(x) \\
&= -\log 1 \\
&= 0
\end{aligned}
$$
For the equality case: if $P = Q$, then $D_{\mathrm{KL}}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log 1 = 0$.
Conversely, if $D_{\mathrm{KL}}(P \| Q) = 0$, then by the equality condition of Jensen's inequality the ratio $\frac{Q(x)}{P(x)}$ must be constant, so $\frac{Q(x)}{P(x)} = E\left[\frac{Q(x)}{P(x)}\right] = \sum_{x \in \mathcal{X}} P(x) \frac{Q(x)}{P(x)} = \sum_{x \in \mathcal{X}} Q(x) = 1$, i.e. $P = Q$.
Therefore $D_{\mathrm{KL}}(P \| Q) = 0$ if and only if $P = Q$.
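A tiny numeric illustration of the result (my own example, two distributions on an assumed 4-element support):

```python
import numpy as np

# KL divergence between two discrete distributions with full support.
def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])
print(kl(p, q))         # > 0 since p != q
print(kl(p, p.copy()))  # 0.0 when P = Q
```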
(b)
Proof:
$$
\begin{aligned}
D_{\mathrm{KL}}(P(X, Y) \| Q(X, Y)) &= \sum_{x} \sum_{y} P(x, y) \log \frac{P(x, y)}{Q(x, y)} \\
&= \sum_{x} \sum_{y} P(x) P(y \mid x) \log \frac{P(x) P(y \mid x)}{Q(x) Q(y \mid x)} \\
&= \sum_{x} \sum_{y} P(x) P(y \mid x)\left(\log \frac{P(x)}{Q(x)} + \log \frac{P(y \mid x)}{Q(y \mid x)}\right) \\
&= \sum_{x} \sum_{y} P(x) P(y \mid x) \log \frac{P(x)}{Q(x)} + \sum_{x} \sum_{y} P(x) P(y \mid x) \log \frac{P(y \mid x)}{Q(y \mid x)} \\
&= \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \sum_{y} P(y \mid x) + \sum_{x} P(x) \sum_{y} P(y \mid x) \log \frac{P(y \mid x)}{Q(y \mid x)} \\
&= \sum_{x} P(x) \log \frac{P(x)}{Q(x)} + \sum_{x} P(x)\left(\sum_{y} P(y \mid x) \log \frac{P(y \mid x)}{Q(y \mid x)}\right) \\
&= D_{\mathrm{KL}}(P(X) \| Q(X)) + D_{\mathrm{KL}}(P(Y \mid X) \| Q(Y \mid X))
\end{aligned}
$$
(c)
Proof:
$$
\begin{aligned}
\arg \min _{\theta} D_{\mathrm{KL}}\left(\hat{P} \| P_{\theta}\right) &= \arg \min _{\theta} \sum_{x \in \mathcal{X}} \hat{P}(x) \log \frac{\hat{P}(x)}{P_{\theta}(x)} \\
&= \arg \min _{\theta} \sum_{x \in \mathcal{X}} \hat{P}(x) \log \hat{P}(x) - \sum_{x \in \mathcal{X}} \hat{P}(x) \log P_{\theta}(x) \\
&= \arg \max _{\theta} \sum_{x \in \mathcal{X}} \hat{P}(x) \log P_{\theta}(x) \\
&= \arg \max _{\theta} \sum_{x \in \mathcal{X}}\left(\frac{1}{m} \sum_{i=1}^{m} 1\left\{x^{(i)}=x\right\}\right) \log P_{\theta}(x) \\
&= \arg \max _{\theta} \sum_{i=1}^{m} \log P_{\theta}\left(x^{(i)}\right)
\end{aligned}
$$
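A small numeric sketch of this equivalence (my own example, a Bernoulli model with simulated data): over a grid of $\theta$, the value minimizing $D_{\mathrm{KL}}(\hat{P} \| P_\theta)$ coincides with the value maximizing the log-likelihood.

```python
import numpy as np

# Simulated Bernoulli data and its empirical distribution P_hat over {0, 1}.
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=200)
p_hat = np.array([np.mean(x == 0), np.mean(x == 1)])

# Evaluate both objectives on a grid of theta values.
thetas = np.linspace(0.01, 0.99, 99)
kl = [np.sum(p_hat * np.log(p_hat / np.array([1 - t, t]))) for t in thetas]
ll = [np.sum(np.log(np.where(x == 1, t, 1 - t))) for t in thetas]

# The KL minimizer and the log-likelihood maximizer are the same grid point.
assert thetas[np.argmin(kl)] == thetas[np.argmax(ll)]
```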
3. KL divergence, Fisher Information, and the Natural Gradient
(a)
Proof:
$$
\begin{aligned}
\nabla_{\theta} \log p(y ; \theta) &= \frac{\nabla_{\theta} p(y ; \theta)}{p(y ; \theta)} \\
\mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] &= \mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{\nabla_{\theta} p(y ; \theta)}{p(y ; \theta)}\right] \\
&= \int_{-\infty}^{\infty} p(y ; \theta) \frac{\nabla_{\theta} p(y ; \theta)}{p(y ; \theta)} \, dy \\
&= \int_{-\infty}^{\infty} \nabla_{\theta} p(y ; \theta) \, dy \\
&= \nabla_{\theta} \int_{-\infty}^{\infty} p(y ; \theta) \, dy \\
&= \nabla_{\theta} 1 \\
&= 0
\end{aligned}
$$
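For a concrete check (my own example, a Bernoulli$(\theta)$ model where the expectation can be computed exactly), the score indeed has mean zero under $y \sim p(y;\theta)$:

```python
# Score of a Bernoulli(theta) model: d/dtheta log p(y; theta).
theta = 0.3
score = lambda y: y / theta - (1 - y) / (1 - theta)

# Exact expectation over y in {0, 1}.
mean_score = (1 - theta) * score(0) + theta * score(1)
print(mean_score)  # 0.0 (up to floating-point error)
```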
(b)
Proof:
$$
\begin{aligned}
\operatorname{Cov}[X] &= E\left[(X-E[X])(X-E[X])^{T}\right] = E\left[X X^{T}\right] \quad \text{when } E[X]=0 \\
\mathcal{I}(\theta) &= \operatorname{Cov}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] \\
&= \mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right) \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)^{T}\right|_{\theta^{\prime}=\theta}\right]
\end{aligned}
$$
Here we used the result from (a) that the score function has mean zero.
(c)
Proof:
$$
\frac{\partial \log p(y ; \theta)}{\partial \theta_{i}} = \frac{1}{p(y ; \theta)} \frac{\partial p(y ; \theta)}{\partial \theta_{i}}
$$

$$
\begin{aligned}
\mathcal{I}(\theta)_{ij} &= \mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right) \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)^{T}\right|_{\theta^{\prime}=\theta}\right]_{ij} \\
&= \mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{\partial \log p(y ; \theta)}{\partial \theta_{i}} \frac{\partial \log p(y ; \theta)}{\partial \theta_{j}}\right] \\
&= \mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{1}{(p(y ; \theta))^{2}} \frac{\partial p(y ; \theta)}{\partial \theta_{i}} \frac{\partial p(y ; \theta)}{\partial \theta_{j}}\right]
\end{aligned}
$$

Differentiating the log-density a second time gives

$$
\frac{\partial^{2} \log p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}} = -\frac{1}{(p(y ; \theta))^{2}} \frac{\partial p(y ; \theta)}{\partial \theta_{i}} \frac{\partial p(y ; \theta)}{\partial \theta_{j}} + \frac{1}{p(y ; \theta)} \frac{\partial^{2} p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}}
$$

so that

$$
\begin{aligned}
\mathbb{E}_{y \sim p(y ; \theta)}\left[-\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right]_{ij}
&= \mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{1}{(p(y ; \theta))^{2}} \frac{\partial p(y ; \theta)}{\partial \theta_{i}} \frac{\partial p(y ; \theta)}{\partial \theta_{j}} - \frac{1}{p(y ; \theta)} \frac{\partial^{2} p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}}\right] \\
&= \mathcal{I}(\theta)_{ij} - \int_{-\infty}^{\infty} p(y ; \theta) \frac{1}{p(y ; \theta)} \frac{\partial^{2} p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}} \, dy \\
&= \mathcal{I}(\theta)_{ij} - \frac{\partial^{2}}{\partial \theta_{i} \partial \theta_{j}} \int_{-\infty}^{\infty} p(y ; \theta) \, dy \\
&= \mathcal{I}(\theta)_{ij}
\end{aligned}
$$

Therefore:

$$
\mathbb{E}_{y \sim p(y ; \theta)}\left[-\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] = \mathcal{I}(\theta)
$$
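For a concrete check of (b) and (c) (my own example, a scalar Bernoulli$(\theta)$ model so the expectations are exact), the Fisher information computed as $\mathbb{E}[\text{score}^2]$ and as $-\mathbb{E}[\text{Hessian of } \log p]$ agree, and both equal the known closed form $1/(\theta(1-\theta))$:

```python
theta = 0.3
score   = lambda y: y / theta - (1 - y) / (1 - theta)          # d log p / d theta
hessian = lambda y: -y / theta**2 - (1 - y) / (1 - theta)**2   # d^2 log p / d theta^2

# Exact expectation over y in {0, 1}.
E = lambda f: (1 - theta) * f(0) + theta * f(1)

print(E(lambda y: score(y) ** 2))   # Fisher info via the outer-product form
print(E(lambda y: -hessian(y)))     # Fisher info via the negative expected Hessian
print(1 / (theta * (1 - theta)))    # closed form; all three agree
```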
(d)
Proof:
$$
\begin{aligned}
\log p(y ; \tilde{\theta}) & \approx \log p(y ; \theta)+\left.(\tilde{\theta}-\theta)^{T} \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}+\frac{1}{2}(\tilde{\theta}-\theta)^{T}\left(\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right)(\tilde{\theta}-\theta) \\
&=\log p(y ; \theta)+\left.d^{T} \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}+\frac{1}{2} d^{T}\left(\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right) d
\end{aligned}
$$
Taking expectations over $y \sim p(y;\theta)$, the first-order term vanishes by part (a), and by part (c) the expected Hessian equals $-\mathcal{I}(\theta)$:

$$
\begin{aligned}
\mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \tilde{\theta})] &= \mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \theta)]+\frac{1}{2} d^{T} \mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] d \\
&= \mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \theta)]-\frac{1}{2} d^{T} \mathcal{I}(\theta) d
\end{aligned}
$$

$$
\begin{aligned}
D_{\mathrm{KL}}\left(p_{\theta} \| p_{\theta+d}\right) &= D_{\mathrm{KL}}\left(p_{\theta} \| p_{\tilde{\theta}}\right) \\
&= \mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \theta)]-\mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \tilde{\theta})] \\
&\approx \frac{1}{2} d^{T} \mathcal{I}(\theta) d
\end{aligned}
$$
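A quick numeric check of this approximation (my own example, a Bernoulli$(\theta)$ model with a small perturbation $d$, using the closed-form Fisher information $1/(\theta(1-\theta))$):

```python
import numpy as np

theta, d = 0.3, 1e-3

# Exact KL between Bernoulli(theta) and Bernoulli(theta + d).
kl = (theta * np.log(theta / (theta + d))
      + (1 - theta) * np.log((1 - theta) / (1 - theta - d)))

# Quadratic approximation (1/2) d^T I(theta) d.
quad = 0.5 * d**2 / (theta * (1 - theta))
print(kl, quad)  # the two values agree to leading order in d
```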
(e)
Step 1: approximate the objective and the constraint with Taylor expansions:
$$
\begin{aligned}
\ell(\theta+d) & \approx \ell(\theta)+\left.d^{T} \nabla_{\theta^{\prime}} \ell\left(\theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} \\
&=\log p(y ; \theta)+\left.d^{T} \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} \\
&=\log p(y ; \theta)+d^{T} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)}
\end{aligned}
$$

$$
D_{\mathrm{KL}}\left(p_{\theta} \| p_{\theta+d}\right) \approx \frac{1}{2} d^{T} \mathcal{I}(\theta) d
$$
Step 2: write the Lagrangian:
$$
\begin{aligned}
\mathcal{L}(d, \lambda) &= \ell(\theta+d)-\lambda\left[D_{\mathrm{KL}}\left(p_{\theta} \| p_{\theta+d}\right)-c\right] \\
& \approx \log p(y ; \theta)+d^{T} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)}-\lambda\left[\frac{1}{2} d^{T} \mathcal{I}(\theta) d-c\right]
\end{aligned}
$$
Step 3: set the derivatives of the Lagrangian to zero. The derivative with respect to $d$ gives:
$$
\begin{aligned}
\nabla_{d} \mathcal{L}(d, \lambda) & \approx \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)}-\lambda \mathcal{I}(\theta) d = 0 \\
\tilde{d} &= \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)}
\end{aligned}
$$
Although the exact value of $\lambda$ is not yet known, it is a positive real number, so this already gives the direction of the natural gradient. Next, use the stationarity condition with respect to $\lambda$ to solve for its value:
$$
\begin{aligned}
\nabla_{\lambda} \mathcal{L}(d, \lambda) & \approx c-\frac{1}{2} d^{T} \mathcal{I}(\theta) d \\
&=c-\frac{1}{2} \cdot \frac{1}{\lambda} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T}}{p(y ; \theta)} \mathcal{I}(\theta)^{-1} \cdot \mathcal{I}(\theta) \cdot \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} \\
&=c-\left.\left.\frac{1}{2 \lambda^{2}(p(y ; \theta))^{2}} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T} \mathcal{I}(\theta)^{-1} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} \\
&=0 \\
\lambda &= \sqrt{\left.\left.\frac{1}{2 c(p(y ; \theta))^{2}} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T} \mathcal{I}(\theta)^{-1} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}
\end{aligned}
$$
The natural gradient $d^{*}$ is therefore:
$$
\begin{aligned}
d^{*} &=\sqrt{\frac{2 c(p(y ; \theta))^{2}}{\left.\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T} \mathcal{I}(\theta)^{-1} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}} \; \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} \\
&=\left.\sqrt{\frac{2 c}{\left.\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T} \mathcal{I}(\theta)^{-1} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}} \; \mathcal{I}(\theta)^{-1} \nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}
\end{aligned}
$$
(f)
From the previous part:
$$
\tilde{d}=\frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} = \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} = \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \left.\nabla_{\theta^{\prime}} \ell(\theta^{\prime})\right|_{\theta^{\prime}=\theta}
$$
Also note that:
$$
\begin{aligned}
\mathcal{I}(\theta) &=\mathbb{E}_{y \sim p(y ; \theta)}\left[-\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] \\
&=\mathbb{E}_{y \sim p(y ; \theta)}\left[-\nabla_{\theta}^{2} \ell(\theta)\right] \\
&=-\mathbb{E}_{y \sim p(y ; \theta)}[H]
\end{aligned}
$$
Therefore:
$$
\begin{aligned}
\theta &:= \theta+\tilde{d} \\
&=\theta+\frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \nabla_{\theta} \ell(\theta) \\
&=\theta-\frac{1}{\lambda} \mathbb{E}_{y \sim p(y ; \theta)}[H]^{-1} \nabla_{\theta} \ell(\theta)
\end{aligned}
$$
For Newton's method, the update rule is:
$$
\theta := \theta-H^{-1} \nabla_{\theta} \ell(\theta)
$$
In both cases the update direction is the inverse of a Hessian (the observed Hessian for Newton's method, the negated expected Hessian for the natural gradient) multiplied by the gradient of the objective with respect to the parameters.
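As a rough numeric illustration of this correspondence (my own sketch: a scalar Bernoulli model with simulated data, and the step factor $1/\lambda$ taken to be 1), both directions move $\theta$ toward the MLE, the empirical mean of $y$; they differ only in whether the curvature is measured by the observed Hessian or by the Fisher information.

```python
import numpy as np

# Simulated Bernoulli data and a current parameter guess.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.8, size=100)
theta = 0.4

grad = np.sum(y / theta - (1 - y) / (1 - theta))          # gradient of the log-likelihood
hess = np.sum(-y / theta**2 - (1 - y) / (1 - theta)**2)   # observed Hessian
fisher = len(y) / (theta * (1 - theta))                   # m times the per-sample Fisher info

newton_step = theta - grad / hess        # theta := theta - H^{-1} grad
natural_step = theta + grad / fisher     # theta := theta + I(theta)^{-1} grad (1/lambda = 1)
print(np.mean(y), newton_step, natural_step)  # both steps move toward the empirical mean
```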




