7.1 Perceptron
7.1.1 Definition of the perceptron
Definition: Suppose the input space is $\mathcal{X} \subseteq R^{n}$ and the output space is $\mathcal{Y}=\{1,0\}$. An input $\boldsymbol{x} \in \mathcal{X}$ is the feature vector of an instance and corresponds to a point in the input space; an output $y \in \mathcal{Y}$ is the class of the instance. The following function from the input space to the output space
$$f(\boldsymbol{x})=\operatorname{sgn}\left(\boldsymbol{w}^{T} \boldsymbol{x}+b\right)$$
is called a perceptron. Here $\boldsymbol{w}$ and $b$ are the model parameters of the perceptron, and $\operatorname{sgn}$ is the step function, that is,
$$\operatorname{sgn}(z)= \begin{cases}1, & z \geqslant 0 \\ 0, & z<0\end{cases}$$
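The definition above can be sketched in a few lines of plain Python. The function names `step` and `perceptron`, and the particular weights (chosen so that the model computes logical AND), are assumptions made for illustration and do not come from the text.

```python
def step(z):
    """Step function sgn(z): 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def perceptron(w, b, x):
    """Evaluate f(x) = sgn(w^T x + b) for list-valued w and x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return step(z)


# Hypothetical parameters that realise logical AND on {0, 1}^2.
w, b = [1.0, 1.0], -1.5
outputs = [perceptron(w, b, x) for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
print(outputs)
```

Only the input `[1, 1]` gives a non-negative value of `w^T x + b`, so it is the only one mapped to class 1.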
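7.1.2 Geometric interpretation of the perceptron
The linear equation $\boldsymbol{w}^{T} \boldsymbol{x}+b=0$ corresponds to a hyperplane $S$ in the feature space (input space) $R^{n}$, where $\boldsymbol{w}$ is the normal vector of the hyperplane and $b$ is its intercept. This hyperplane divides the feature space into two parts, and the points (feature vectors) on the two sides are classified as positive and negative respectively. The hyperplane $S$ is therefore called the separating hyperplane, as shown in the figure.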
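7.1.3 Learning strategy
Learning strategy: assume the training set is linearly separable. The goal of perceptron learning is to find a hyperplane that separates the positive and negative training instances completely and correctly. To find such a hyperplane $S$, i.e. to determine the model parameters $\boldsymbol{w}$ and $b$, we need a learning strategy: define a loss function and minimize it. A natural choice of loss is the total number of misclassified points. However, that loss is not a continuously differentiable function of $\boldsymbol{w}$ and $b$ and is hard to optimize, so the perceptron instead uses the total distance from the misclassified points to the hyperplane as its loss.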
The distance from a point $\boldsymbol{x}_{0}$ in the input space $R^{n}$ to the hyperplane $S$ is
$$\frac{\left|\boldsymbol{w}^{T} \boldsymbol{x}_{0}+b\right|}{\|\boldsymbol{w}\|}$$
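where $\|\boldsymbol{w}\|$ denotes the $L_2$ norm, i.e. the length, of the vector $\boldsymbol{w}$. If $b$ is treated as a dummy node, i.e. absorbed into $\boldsymbol{w}$ by writing $\hat{\boldsymbol{w}}=(\boldsymbol{w}; b)$ and $\hat{\boldsymbol{x}}_{0}=(\boldsymbol{x}_{0}; 1)$, this becomes
$$\frac{\left|\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{0}\right|}{\|\hat{\boldsymbol{w}}\|}$$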
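As a small numerical illustration of the point-to-hyperplane distance $|\boldsymbol{w}^{T}\boldsymbol{x}_{0}+b| / \|\boldsymbol{w}\|$ (the form before $b$ is absorbed), here is a sketch in plain Python. The helper name `distance_to_hyperplane` and the example hyperplane are assumptions made for illustration.

```python
import math


def distance_to_hyperplane(w, b, x0):
    """Distance |w^T x0 + b| / ||w|| from x0 to the hyperplane w^T x + b = 0."""
    numerator = abs(sum(wi * xi for wi, xi in zip(w, x0)) + b)
    norm_w = math.sqrt(sum(wi * wi for wi in w))  # L2 norm of w
    return numerator / norm_w


# Hyperplane 3*x1 + 4*x2 - 5 = 0 in R^2, measured from the origin.
d = distance_to_hyperplane([3.0, 4.0], -5.0, [0.0, 0.0])
print(d)
```

Here the numerator is $|-5| = 5$ and $\|\boldsymbol{w}\| = 5$, so the origin lies at distance 1 from the hyperplane.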
Let $M$ be the set of misclassified points. The total distance from all misclassified points to the hyperplane $S$ is then
$$\sum_{\hat{\boldsymbol{x}}_{i} \in M} \frac{\left|\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}\right|}{\|\hat{\boldsymbol{w}}\|}$$
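Moreover, every misclassified point $\hat{\boldsymbol{x}}_{i} \in M$ satisfies
$$\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i} \geqslant 0$$
where $\hat{y}_{i}$ is the current output of the perceptron (equality can occur only for a point lying exactly on the hyperplane). Since $\hat{y}_{i}-y_{i}=\pm 1$ at a misclassified point, the left-hand side equals $\left|\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}\right|$, so the total distance from all misclassified points to the hyperplane $S$ can be rewritten as
$$\sum_{\hat{\boldsymbol{x}}_{i} \in M} \frac{\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}}{\|\hat{\boldsymbol{w}}\|}$$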
Dropping the factor $\dfrac{1}{\|\hat{\boldsymbol{w}}\|}$ gives the loss function of perceptron learning
$$L(\hat{\boldsymbol{w}})=\sum_{\hat{\boldsymbol{x}}_{i} \in M}\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}$$
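Clearly, the loss function $L(\hat{\boldsymbol{w}})$ is non-negative. If there are no misclassified points, its value is 0. Moreover, the fewer the misclassified points and the closer they lie to the hyperplane, the smaller the loss. For a single sample point, the loss is a linear function of the parameter $\hat{\boldsymbol{w}}$ when the point is misclassified and 0 when it is correctly classified. Hence, for a given training set, the loss function $L(\hat{\boldsymbol{w}})$ is continuous in $\hat{\boldsymbol{w}}$ and differentiable wherever the misclassified set $M$ does not change.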
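The loss $L(\hat{\boldsymbol{w}})$ can be evaluated directly from its definition. The sketch below assumes the bias has been absorbed, so every $\hat{\boldsymbol{x}}_i$ carries a trailing constant 1. The function names and the three toy samples are hypothetical.

```python
def predict(w_hat, x_hat):
    """Perceptron output with the bias absorbed: sgn(w_hat^T x_hat)."""
    return 1 if sum(w * x for w, x in zip(w_hat, x_hat)) >= 0 else 0


def perceptron_loss(w_hat, data):
    """L(w_hat) = sum over misclassified points of (y_hat_i - y_i) * w_hat^T x_hat_i."""
    loss = 0.0
    for x_hat, y in data:
        y_hat = predict(w_hat, x_hat)
        # Correctly classified points have y_hat == y and contribute 0.
        loss += (y_hat - y) * sum(w * x for w, x in zip(w_hat, x_hat))
    return loss


# Toy samples (an assumption for illustration); each x_hat ends with a constant 1.
data = [([2.0, 1.0, 1.0], 1), ([-1.0, -1.0, 1.0], 0), ([1.0, 3.0, 1.0], 0)]
w_hat = [1.0, 0.0, 0.0]
loss_value = perceptron_loss(w_hat, data)
print(loss_value)
```

With this `w_hat` only the third sample is misclassified (score 1, label 0), so it alone contributes to the sum. Summing over all samples is equivalent to summing over $M$ because the factor $\hat{y}_i - y_i$ vanishes for correct predictions.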
7.1.4 Algorithm
The perceptron learning algorithm solves the following optimization problem. Given a training set
$$T=\left\{\left(\hat{\boldsymbol{x}}_{1}, y_{1}\right),\left(\hat{\boldsymbol{x}}_{2}, y_{2}\right), \cdots,\left(\hat{\boldsymbol{x}}_{N}, y_{N}\right)\right\}$$
where $\hat{\boldsymbol{x}}_{i} \in R^{n+1}$ and $y_{i} \in\{1,0\}$, find the parameter $\hat{\boldsymbol{w}}$ that solves the loss minimization problem
$$\min_{\hat{\boldsymbol{w}}} L(\hat{\boldsymbol{w}})=\sum_{\hat{\boldsymbol{x}}_{i} \in M}\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}$$
where $M$ is the set of misclassified points.
The perceptron learning algorithm is driven by misclassification and uses stochastic gradient descent. First, an arbitrary initial hyperplane $\hat{\boldsymbol{w}}_{0}^{T} \hat{\boldsymbol{x}}=0$ is chosen, and the loss function $L(\hat{\boldsymbol{w}})$ is then minimized step by step with gradient descent. A step does not descend along the gradient of all misclassified points in $M$ at once; instead, one misclassified point is picked at random and the gradient step is taken for that point alone. The gradient of the loss function is
$$\begin{aligned} \nabla L(\hat{\boldsymbol{w}})=\frac{\partial L(\hat{\boldsymbol{w}})}{\partial \hat{\boldsymbol{w}}} &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[\sum_{\hat{\boldsymbol{x}}_{i} \in M}\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}\right] \\ &=\sum_{\hat{\boldsymbol{x}}_{i} \in M}\left[\left(\hat{y}_{i}-y_{i}\right) \frac{\partial}{\partial \hat{\boldsymbol{w}}}\left(\hat{\boldsymbol{w}}^{T} \hat{\boldsymbol{x}}_{i}\right)\right] \end{aligned}$$
By the matrix differentiation formula $\dfrac{\partial \boldsymbol{x}^{T} \boldsymbol{a}}{\partial \boldsymbol{x}}=\boldsymbol{a}$, we obtain
$$\nabla L(\hat{\boldsymbol{w}})=\frac{\partial L(\hat{\boldsymbol{w}})}{\partial \hat{\boldsymbol{w}}}=\sum_{\hat{\boldsymbol{x}}_{i} \in M}\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{x}}_{i}$$
Picking one misclassified point $\hat{\boldsymbol{x}}_{i}$ at random and taking a gradient step on it gives the update rule for the parameter $\hat{\boldsymbol{w}}$:
$$\begin{gathered} \hat{\boldsymbol{w}} \leftarrow \hat{\boldsymbol{w}}+\Delta \hat{\boldsymbol{w}} \\ \Delta \hat{\boldsymbol{w}}=-\eta \nabla L(\hat{\boldsymbol{w}}) \\ \hat{\boldsymbol{w}} \leftarrow \hat{\boldsymbol{w}}-\eta \nabla L(\hat{\boldsymbol{w}}) \\ \hat{\boldsymbol{w}} \leftarrow \hat{\boldsymbol{w}}-\eta\left(\hat{y}_{i}-y_{i}\right) \hat{\boldsymbol{x}}_{i}=\hat{\boldsymbol{w}}+\eta\left(y_{i}-\hat{y}_{i}\right) \hat{\boldsymbol{x}}_{i}\\ \Delta \hat{\boldsymbol{w}}=\eta\left(y_{i}-\hat{y}_{i}\right) \hat{\boldsymbol{x}}_{i} \end{gathered}$$
This is Eq. (5.2) of 西瓜书 (the "Watermelon Book"), where $\eta \in(0,1)$ is called the learning rate.
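The update rule $\hat{\boldsymbol{w}} \leftarrow \hat{\boldsymbol{w}}+\eta\left(y_{i}-\hat{y}_{i}\right) \hat{\boldsymbol{x}}_{i}$ can be turned into a short training loop. What follows is a minimal sketch rather than a reference implementation. The names `train_perceptron`, `max_steps` and `seed`, the zero initialisation, and the logical-AND data set are assumptions chosen for illustration.

```python
import random


def predict(w_hat, x_hat):
    """Perceptron output with the bias absorbed: sgn(w_hat^T x_hat)."""
    return 1 if sum(w * x for w, x in zip(w_hat, x_hat)) >= 0 else 0


def train_perceptron(data, eta=0.5, max_steps=1000, seed=0):
    """Stochastic gradient descent on randomly chosen misclassified points."""
    rng = random.Random(seed)
    w_hat = [0.0] * len(data[0][0])  # an arbitrary initial hyperplane
    for _ in range(max_steps):
        misclassified = [(x, y) for x, y in data if predict(w_hat, x) != y]
        if not misclassified:
            break  # M is empty, so the loss is 0
        x_hat, y = rng.choice(misclassified)
        y_hat = predict(w_hat, x_hat)
        # w_hat <- w_hat + eta * (y_i - y_hat_i) * x_hat_i
        w_hat = [w + eta * (y - y_hat) * x for w, x in zip(w_hat, x_hat)]
    return w_hat


# Logical AND, with the bias absorbed as a trailing constant input of 1.
and_data = [([0, 0, 1], 0), ([0, 1, 1], 0), ([1, 0, 1], 0), ([1, 1, 1], 1)]
w_hat = train_perceptron(and_data)
predictions = [predict(w_hat, x) for x, _ in and_data]
print(w_hat, predictions)
```

Because logical AND is linearly separable, the loop stops once no misclassified point remains. On data that is not linearly separable it would simply run until `max_steps` is exhausted.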
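7.2 Neural network model structure
7.2.1 Model structure
The structure of a single-hidden-layer feedforward network is shown in the figure below: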
7.2.2 The BP algorithm
Standard BP algorithm:
Given a training sample (the $k$-th one) $\left(\boldsymbol{x}_{k}, \boldsymbol{y}_{k}\right)$, suppose the model output is $\hat{\boldsymbol{y}}_{k}=\left(\hat{y}_{1}^{k}, \hat{y}_{2}^{k}, \ldots, \hat{y}_{l}^{k}\right)$. The mean squared error is then
$$E_{k}=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}$$
This is 西瓜书 Eq. (5.4).
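To make $E_k$ concrete, the sketch below runs a forward pass through a single-hidden-layer network and evaluates the error. It assumes the network form used in the derivation later in this section: $b_h=f(\alpha_h-\gamma_h)$ and $\hat{y}_j^k=f(\beta_j-\theta_j)$ with sigmoid $f$. The function names, the list-of-lists layout `v[i][h]` and `w[h][j]`, and the all-zero parameters are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, v, gamma, w, theta):
    """Forward pass of a d-q-l network with weights v[i][h], w[h][j] and thresholds gamma, theta."""
    d, q, l = len(x), len(gamma), len(theta)
    b = [sigmoid(sum(v[i][h] * x[i] for i in range(d)) - gamma[h]) for h in range(q)]
    y_hat = [sigmoid(sum(w[h][j] * b[h] for h in range(q)) - theta[j]) for j in range(l)]
    return b, y_hat


def squared_error(y_hat, y):
    """E_k = 1/2 * sum_j (y_hat_j - y_j)^2."""
    return 0.5 * sum((yh - yj) ** 2 for yh, yj in zip(y_hat, y))


# With every weight and threshold at zero, each unit outputs sigmoid(0) = 0.5.
x, y = [1.0, 2.0], [1.0, 0.0]
v = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]     # d = 2 inputs, q = 3 hidden units
w = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # q = 3 hidden units, l = 2 outputs
b, y_hat = forward(x, v, [0.0] * 3, w, [0.0] * 2)
E_k = squared_error(y_hat, y)
print(y_hat, E_k)
```

Both outputs equal 0.5 here, so each squared difference is 0.25 and $E_k = \tfrac{1}{2}(0.25+0.25)=0.25$.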
If the model parameters are updated by gradient descent, the update rules for the individual parameters are
$$\begin{aligned} w_{h j} & \leftarrow w_{h j}+\Delta w_{h j}=w_{h j}-\eta \frac{\partial E_{k}}{\partial w_{h j}} \\ \theta_{j} & \leftarrow \theta_{j}+\Delta \theta_{j}=\theta_{j}-\eta \frac{\partial E_{k}}{\partial \theta_{j}} \\ v_{i h} & \leftarrow v_{i h}+\Delta v_{i h}=v_{i h}-\eta \frac{\partial E_{k}}{\partial v_{ih}} \\ \gamma_{h} & \leftarrow \gamma_{h}+\Delta \gamma_{h}=\gamma_{h}-\eta \frac{\partial E_{k}}{\partial \gamma_{h}} \end{aligned}$$
Here $v_{ih}$ is the weight from input unit $i$ to hidden unit $h$, $\gamma_{h}$ is the threshold of hidden unit $h$, $w_{hj}$ is the weight from hidden unit $h$ to output unit $j$, and $\theta_{j}$ is the threshold of output unit $j$. The derivative with respect to each parameter is worked out in turn below.
With respect to $w_{hj}$
The chain of functional dependencies between $E_{k}$ and $w_{hj}$ is
$$\begin{aligned} E_{k}&=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}\\ \hat{y}_{j}^{k}&=f\left(\beta_{j}-\theta_{j}\right)\\ \beta_{j}&=\sum_{h=1}^{q} w_{h j} b_{h} \end{aligned}$$
where $f$ is the sigmoid function, which satisfies $f^{\prime}(x)=f(x)(1-f(x))$. By the chain rule,
$$\frac{\partial E_{k}}{\partial w_{h j}}=\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial w_{h j}}$$
The three factors are
$$\begin{aligned} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} &=\frac{\partial\left[\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}\right]}{\partial \hat{y}_{j}^{k}} \\ &=\frac{1}{2} \times 2 \times\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) \times 1 \\ &=\hat{y}_{j}^{k}-y_{j}^{k} \end{aligned}$$
$$\begin{aligned} \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} &=\frac{\partial\left[f\left(\beta_{j}-\theta_{j}\right)\right]}{\partial \beta_{j}} \\ &=f^{\prime}\left(\beta_{j}-\theta_{j}\right) \times 1\\ &=f\left(\beta_{j}-\theta_{j}\right) \times\left[1-f\left(\beta_{j}-\theta_{j}\right)\right] \\ &=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right) \end{aligned}$$
$$\begin{aligned} \frac{\partial \beta_{j}}{\partial w_{h j}} &=\frac{\partial\left(\sum_{h=1}^{q} w_{h j} b_{h}\right)}{\partial w_{h j}} \\ &=b_{h} \end{aligned}$$
Let
$$g_{j}=-\dfrac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \dfrac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}}=-\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) \cdot \hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)\left(y_{j}^{k}-\hat{y}_{j}^{k}\right)$$
which is 西瓜书 Eq. (5.10). Therefore
$$\begin{aligned} \Delta w_{h j} &=-\eta \frac{\partial E_{k}}{\partial w_{h j}} \\ &=-\eta \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial w_{h j}}\\ &=\eta g_{j} b_{h} \end{aligned}$$
This is 西瓜书 Eq. (5.11).
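One way to gain confidence in Eq. (5.11) is to compare $\Delta w_{hj}=\eta g_j b_h$ with $-\eta$ times a finite-difference estimate of $\partial E_k / \partial w_{hj}$. The sketch below does this for the output layer only. The hidden outputs `b`, the weights, thresholds, targets and the helper name `output_error` are hypothetical values and names chosen for the check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output_error(b, w, theta, y):
    """Return (E_k, y_hat) for fixed hidden outputs b, weights w[h][j] and thresholds theta[j]."""
    q, l = len(b), len(theta)
    y_hat = [sigmoid(sum(w[h][j] * b[h] for h in range(q)) - theta[j]) for j in range(l)]
    return 0.5 * sum((yh - yj) ** 2 for yh, yj in zip(y_hat, y)), y_hat


b = [0.2, 0.7]                   # hypothetical hidden-layer outputs
w = [[0.5, -0.3], [0.1, 0.8]]    # w[h][j]
theta = [0.1, -0.2]
y = [1.0, 0.0]
eta = 0.1

_, y_hat = output_error(b, w, theta, y)
g = [yh * (1 - yh) * (yj - yh) for yh, yj in zip(y_hat, y)]          # Eq. (5.10)
delta_w = [[eta * g[j] * b[h] for j in range(2)] for h in range(2)]  # Eq. (5.11)

# Central finite difference of E_k with respect to w[0][1].
eps = 1e-6
w_plus = [row[:] for row in w]
w_plus[0][1] += eps
w_minus = [row[:] for row in w]
w_minus[0][1] -= eps
numeric = (output_error(b, w_plus, theta, y)[0]
           - output_error(b, w_minus, theta, y)[0]) / (2 * eps)
print(delta_w[0][1], -eta * numeric)
```

The two printed numbers should agree up to the finite-difference error, which suggests the analytic update matches the gradient of $E_k$.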
With respect to $\theta_{j}$
The chain of functional dependencies between $E_{k}$ and $\theta_{j}$ is
$$\begin{aligned} E_{k}&=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}\\ \hat{y}_{j}^{k}&=f\left(\beta_{j}-\theta_{j}\right) \end{aligned}$$
Therefore
$$\begin{aligned} \frac{\partial E_{k}}{\partial \theta_{j}} &=\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \theta_{j}} \\ &=\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \theta_{j}} \\ &=\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) \cdot \frac{\partial\left[f\left(\beta_{j}-\theta_{j}\right)\right]}{\partial \theta_{j}} \\ &=\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) \cdot f^{\prime}\left(\beta_{j}-\theta_{j}\right) \times(-1) \\ &=\left(y_{j}^{k}-\hat{y}_{j}^{k}\right) \cdot f^{\prime}\left(\beta_{j}-\theta_{j}\right) \\ &=\left(y_{j}^{k}-\hat{y}_{j}^{k}\right) \hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right) \end{aligned}$$
Therefore
$$\begin{aligned} \Delta \theta_{j} &=-\eta \frac{\partial E_{k}}{\partial \theta_{j}} \\ &=-\eta\left(y_{j}^{k}-\hat{y}_{j}^{k}\right) \hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)\\ &=-\eta g_{j} \end{aligned}$$
This is 西瓜书 Eq. (5.12).
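The sign in Eq. (5.12) is easy to get wrong, so a one-unit numerical check may help. The sketch below treats a single output unit with a fixed input $\beta_j$. The numbers and the helper name `unit_error` are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unit_error(beta, theta, y):
    """E_k restricted to one output unit: 1/2 * (f(beta - theta) - y)^2."""
    return 0.5 * (sigmoid(beta - theta) - y) ** 2


beta, theta, y, eta = 0.4, 0.1, 1.0, 0.1   # hypothetical values
y_hat = sigmoid(beta - theta)
g = y_hat * (1 - y_hat) * (y - y_hat)      # Eq. (5.10)
delta_theta = -eta * g                     # Eq. (5.12)

# Central finite difference of the error with respect to theta.
eps = 1e-6
dE_dtheta = (unit_error(beta, theta + eps, y)
             - unit_error(beta, theta - eps, y)) / (2 * eps)
print(delta_theta, -eta * dE_dtheta)
```

Since the target exceeds the output here, $g_j>0$ and the threshold is lowered, which pushes $\hat{y}_j^k=f(\beta_j-\theta_j)$ up towards the target.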
With respect to $v_{ih}$
The chain of functional dependencies between $E_{k}$ and $v_{ih}$ is
$$\begin{aligned} E_{k}&=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}\\ \hat{y}_{j}^{k}&=f\left(\beta_{j}-\theta_{j}\right)\\ \beta_{j}&=\sum_{h=1}^{q} w_{h j} b_{h}\\ b_{h}&=f\left(\alpha_{h}-\gamma_{h}\right)\\ \alpha_{h}&=\sum_{i=1}^{d} v_{i h} x_{i} \end{aligned}$$
Because $b_{h}$ feeds into every output unit, the chain rule sums over all $l$ outputs:
$$\dfrac{\partial E_{k}}{\partial v_{i h}}=\sum_{j=1}^{l} \dfrac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \dfrac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \dfrac{\partial \beta_{j}}{\partial b_{h}} \cdot \dfrac{\partial b_{h}}{\partial \alpha_{h}} \cdot \dfrac{\partial \alpha_{h}}{\partial v_{i h}}$$
where
$$\begin{aligned} \frac{\partial \beta_{j}}{\partial b_{h}} &=\frac{\partial\left(\sum_{h=1}^{q} w_{h j} b_{h}\right)}{\partial b_{h}} \\ &=w_{h j} \end{aligned}$$
$$\begin{aligned} \frac{\partial b_{h}}{\partial \alpha_{h}} &=\frac{\partial\left[f\left(\alpha_{h}-\gamma_{h}\right)\right]}{\partial \alpha_{h}} \\ &=f^{\prime}\left(\alpha_{h}-\gamma_{h}\right) \times 1 \\ &=f\left(\alpha_{h}-\gamma_{h}\right) \times\left[1-f\left(\alpha_{h}-\gamma_{h}\right)\right] \\ &=b_{h}\left(1-b_{h}\right) \end{aligned}$$
$$\begin{aligned} \frac{\partial \alpha_{h}}{\partial v_{i h}} &=\frac{\partial\left(\sum_{i=1}^{d} v_{i h} x_{i}\right)}{\partial v_{i h}} \\ &=x_{i} \end{aligned}$$
Let
$$e_{h}=-\dfrac{\partial E_{k}}{\partial \alpha_{h}}=-\sum_{j=1}^{l} \dfrac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \dfrac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \dfrac{\partial \beta_{j}}{\partial b_{h}} \cdot \dfrac{\partial b_{h}}{\partial \alpha_{h}}=b_{h}\left(1-b_{h}\right) \sum_{j=1}^{l} w_{h j} g_{j}$$
which is 西瓜书 Eq. (5.15).
Therefore
$$\begin{aligned} \Delta v_{i h} &=-\eta \frac{\partial E_{k}}{\partial v_{i h}} \\ &=-\eta \sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \alpha_{h}} \cdot \frac{\partial \alpha_{h}}{\partial v_{i h}}\\ &=\eta e_{h} x_{i} \end{aligned}$$
This is 西瓜书 Eq. (5.13).
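The hidden-layer term $e_h$ and the update $\Delta v_{ih}=\eta e_h x_i$ can be checked against a finite-difference gradient in the same way. The following sketch uses a hypothetical 2-2-2 network. All numeric values and the helper names `forward` and `error` are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, v, gamma, w, theta):
    """Return hidden outputs b and network outputs y_hat."""
    d, q, l = len(x), len(gamma), len(theta)
    b = [sigmoid(sum(v[i][h] * x[i] for i in range(d)) - gamma[h]) for h in range(q)]
    y_hat = [sigmoid(sum(w[h][j] * b[h] for h in range(q)) - theta[j]) for j in range(l)]
    return b, y_hat


def error(x, y, v, gamma, w, theta):
    """E_k = 1/2 * sum_j (y_hat_j - y_j)^2 for one sample."""
    _, y_hat = forward(x, v, gamma, w, theta)
    return 0.5 * sum((yh - yj) ** 2 for yh, yj in zip(y_hat, y))


# Hypothetical 2-2-2 network and training sample.
x, y, eta = [1.0, -0.5], [1.0, 0.0], 0.1
v = [[0.3, -0.2], [0.4, 0.1]]    # v[i][h]
gamma = [0.05, -0.1]
w = [[0.5, -0.3], [0.1, 0.8]]    # w[h][j]
theta = [0.1, -0.2]

b, y_hat = forward(x, v, gamma, w, theta)
g = [yh * (1 - yh) * (yj - yh) for yh, yj in zip(y_hat, y)]                        # Eq. (5.10)
e = [b[h] * (1 - b[h]) * sum(w[h][j] * g[j] for j in range(2)) for h in range(2)]  # Eq. (5.15)
delta_v = [[eta * e[h] * x[i] for h in range(2)] for i in range(2)]                # Eq. (5.13)

# Central finite difference of E_k with respect to v[1][0].
eps = 1e-6
v_plus = [row[:] for row in v]
v_plus[1][0] += eps
v_minus = [row[:] for row in v]
v_minus[1][0] -= eps
numeric = (error(x, y, v_plus, gamma, w, theta)
           - error(x, y, v_minus, gamma, w, theta)) / (2 * eps)
print(delta_v[1][0], -eta * numeric)
```

Note that $e_h$ reuses the output-layer terms $g_j$, which is the sense in which the error is propagated backwards from the output layer to the hidden layer.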
With respect to $\gamma_{h}$
The chain of functional dependencies between $E_{k}$ and $\gamma_{h}$ is
$$\begin{aligned} E_{k}&=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2}\\ \hat{y}_{j}^{k}&=f\left(\beta_{j}-\theta_{j}\right)\\ \beta_{j}&=\sum_{h=1}^{q} w_{h j} b_{h}\\ b_{h}&=f\left(\alpha_{h}-\gamma_{h}\right) \end{aligned}$$
Therefore
$$\begin{aligned} \frac{\partial E_{k}}{\partial \gamma_{h}} &=\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \gamma_{h}} \\ &=\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot \frac{\partial\left[f\left(\alpha_{h}-\gamma_{h}\right)\right]}{\partial \gamma_{h}} \\ &=\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot f^{\prime}\left(\alpha_{h}-\gamma_{h}\right) \cdot(-1) \\ &=\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot f\left(\alpha_{h}-\gamma_{h}\right) \times\left[1-f\left(\alpha_{h}-\gamma_{h}\right)\right] \cdot(-1) \\ &=\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} \cdot b_{h}\left(1-b_{h}\right) \cdot(-1)\\ &=\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot w_{h j} \cdot b_{h}\left(1-b_{h}\right) \cdot(-1) \\ &=\sum_{j=1}^{l} g_{j} \cdot w_{h j} \cdot b_{h}\left(1-b_{h}\right) \\ &=e_{h} \end{aligned}$$
Therefore
$$\begin{aligned} \Delta \gamma_{h} &=-\eta \frac{\partial E_{k}}{\partial \gamma_{h}} \\ &=-\eta e_{h} \end{aligned}$$
This is 西瓜书 Eq. (5.14).
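Putting the four update rules together gives one step of standard BP on a single sample. The sketch below is a minimal illustration under stated assumptions, not a tuned implementation. The function name `bp_step`, the in-place list updates, the 2-2-2 network, the learning rate 0.5 and the 200 repetitions are all choices made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bp_step(x, y, v, gamma, w, theta, eta):
    """Apply one standard-BP update in place and return E_k measured before the update."""
    d, q, l = len(x), len(gamma), len(theta)
    # Forward pass.
    b = [sigmoid(sum(v[i][h] * x[i] for i in range(d)) - gamma[h]) for h in range(q)]
    y_hat = [sigmoid(sum(w[h][j] * b[h] for h in range(q)) - theta[j]) for j in range(l)]
    # Backward pass: both gradient terms are computed before any parameter changes.
    g = [y_hat[j] * (1 - y_hat[j]) * (y[j] - y_hat[j]) for j in range(l)]               # Eq. (5.10)
    e = [b[h] * (1 - b[h]) * sum(w[h][j] * g[j] for j in range(l)) for h in range(q)]   # Eq. (5.15)
    for h in range(q):
        for j in range(l):
            w[h][j] += eta * g[j] * b[h]   # Eq. (5.11)
    for j in range(l):
        theta[j] -= eta * g[j]             # Eq. (5.12)
    for i in range(d):
        for h in range(q):
            v[i][h] += eta * e[h] * x[i]   # Eq. (5.13)
    for h in range(q):
        gamma[h] -= eta * e[h]             # Eq. (5.14)
    return 0.5 * sum((y_hat[j] - y[j]) ** 2 for j in range(l))


# Hypothetical 2-2-2 network trained repeatedly on one sample.
x, y = [1.0, -0.5], [1.0, 0.0]
v = [[0.3, -0.2], [0.4, 0.1]]
gamma = [0.05, -0.1]
w = [[0.5, -0.3], [0.1, 0.8]]
theta = [0.1, -0.2]
errors = [bp_step(x, y, v, gamma, w, theta, eta=0.5) for _ in range(200)]
print(errors[0], errors[-1])
```

Repeating the step on the same sample should drive $E_k$ down, which is the behaviour the derivation predicts for a sufficiently small learning rate. Training on a full data set would loop this step over all samples.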