模型选择，过拟合和欠拟合

0. 环境介绍

环境使用 Kaggle 里免费建立的 Notebook

教程使用李沐老师的动手学深度学习网站和视频讲解

小技巧：当遇到函数看不懂的时候可以按 Shift+Tab 查看函数详解。

1. 选择模型

1.1 训练误差和泛化误差

训练误差：模型在训练数据上的误差
泛化误差：模型在新数据集上的误差
例子：根据模拟考试成绩来预测未来考试成绩
- 在过去的考试中表现很好（训练误差）不代表未来考试一定会好（泛化误差）。
- 学生 A 通过背书在模拟考试中拿到很好成绩（过拟合）。
- 学生 B 知道答案后面的原因。
- 真实考试中学生 B 的成绩很大概率上要比 A 好。

1.2 验证数据集和测试数据集

验证数据集合：一个用来评估模型好坏的数据集
- 例如拿出 50% 的训练数据
- 不要跟训练数据混在一起（常犯错误）
测试数据集：只用一次的数据集，例如：
- 未来的考试
- 我出价的房子的实际成交价
- 用在 kaggle 私有排行榜中的数据集

1.3 K-折交叉验证

在没有足够多数据时使用（这是常态）
算法：
- 将训练数据分割成 $K$ 块
- $f o r$ $i = 1, . . ., K$
  - 使用第 $i$ 块作为验证数据集，其余的作为训练数据集
- 报告 $K$ 个验证集误差的平均
常用 $K = 5$ 或 $10$ 、

2. 过拟合和欠拟合

在这里插入图片描述

2.1 模型容量

拟合各种函数的能力
低容量的模型难以拟合训练数据
高容量的模型可以记住所有的训练数据

模型容量的影响：
在这里插入图片描述

2.2 估计模型容量

难以在不同种类算法之间比较
- 例如树模型和神经网络
给定一个模型种类，将有两个主要因素
- 参数的个数
- 参数值的选择范围

2.3 $V C$ 维

统计学习理论的一个核心思想
对于一个分类模型， $V C$ 等于一个最大数据集的大小，不管如何给定标号（label），都存在一个模型来对它进行完美分类。

线性分类器的 $V C$ 维：

$2$ 维输入的感知机， $V C 维 = 3$
- 能够分类任何 $3$ 个点，但不是 $4$ 个。
支持 $N$ 维输入的感知机的 $V C$ 维是 $N + 1$
一些多层感知机的 VC 维 $O(N\log_{2}{N})$

2.4 数据复杂度

多个重要因素
- 样本个数
- 每个样本的元素个数
- 时间、空间结构
- 多样性

3. 通过多项式拟合来交互探索上述概念

3.0 导入模块

# !pip install -U d2l
import math
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l

3.1 生成数据集

给定 $x$ ，我们将使用以下三阶多项式来生成训练和测试数据的标签：
$3.4\frac{x^2}{2!} + 5.6 \frac{x^3}{3!} + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, 0.1^2).$

max_degree = 20  # 多项式的最大阶数
n_train, n_test = 100, 100  # 训练和测试数据集大小
true_w = np.zeros(max_degree)  # 分配大量的空间
true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])

features = np.random.normal(size=(n_train + n_test, 1))
np.random.shuffle(features)
poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))

for i in range(max_degree):
    # 避免过大的梯度值或损失值
    poly_features[:, i] /= math.gamma(i + 1)  # gamma(n)=(n-1)!
# labels的维度:(n_train+n_test,)
labels = np.dot(poly_features, true_w)
# 噪音
labels += np.random.normal(scale=0.1, size=labels.shape)

将 numpy array 转换为 tensor：

true_w, features, poly_features, labels = [torch.tensor(x, dtype=
    torch.float32) for x in [true_w, features, poly_features, labels]]

features[:2], poly_features[:2, :], labels[:2]

在这里插入图片描述
从生成的数据集中查看一下前 2 个样本，第一个值是与偏置相对应的常量特征。

3.2 定义模型

3.2.1 定义计算损失

def evaluate_loss(net, data_iter, loss):  #@save
    """评估给定数据集上模型的损失"""
    metric = d2l.Accumulator(2)  # 损失的总和,样本数量
    for X, y in data_iter:
        out = net(X)
        y = y.reshape(out.shape)
        l = loss(out, y)
        metric.add(l.sum(), l.numel())
    return metric[0] / metric[1]

最后输出结果是平均损失。

3.2.2 定义训练函数

def train(train_features, test_features, train_labels, test_labels,
          num_epochs=400):
    # 均方差损失函数
    loss = nn.MSELoss(reduction='none')
    # train_features.shape 为 (400，20)
    input_shape = train_features.shape[-1]
    # 不设置偏置，因为我们已经在多项式中实现了它
    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))
    batch_size = min(10, train_labels.shape[0])
    train_iter = d2l.load_array((train_features, train_labels.reshape(-1,1)),
                                batch_size)
    test_iter = d2l.load_array((test_features, test_labels.reshape(-1,1)),
                               batch_size, is_train=False)
    trainer = torch.optim.SGD(net.parameters(), lr=0.01)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
                            legend=['train', 'test'])
    for epoch in range(num_epochs):
        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
        if epoch == 0 or (epoch + 1) % 20 == 0:
            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
                                     evaluate_loss(net, test_iter, loss)))
    print('weight:', net[0].weight.data.numpy())

3.3 三阶多项式函数拟合（正常）

# 从多项式特征中选择前4个维度，即1,x,x^2/2!,x^3/3!
train(poly_features[:n_train, :4], poly_features[n_train:, :4],
      labels[:n_train], labels[n_train:])

在这里插入图片描述

3.3 线性函数拟合（欠拟合）

# 从多项式特征中选择前2个维度，即1和x
train(poly_features[:n_train, :2], poly_features[n_train:, :2],
      labels[:n_train], labels[n_train:])

在这里插入图片描述

3.4 高阶多项式函数拟合（过拟合）

# 从多项式特征中选取所有维度
train(poly_features[:n_train, :], poly_features[n_train:, :],
      labels[:n_train], labels[n_train:], num_epochs=1500)

在这里插入图片描述

Pytorch 模型选择，过拟合和欠拟合

模型选择，过拟合和欠拟合

0. 环境介绍

1. 选择模型

1.1 训练误差和泛化误差

1.2 验证数据集和测试数据集

1.3 K-折交叉验证

2. 过拟合和欠拟合

2.1 模型容量

2.2 估计模型容量

2.3 $V C$ 维

2.4 数据复杂度

3. 通过多项式拟合来交互探索上述概念

3.0 导入模块

3.1 生成数据集

3.2 定义模型

3.2.1 定义计算损失

3.2.2 定义训练函数

3.3 三阶多项式函数拟合（正常）

3.3 线性函数拟合（欠拟合）

3.4 高阶多项式函数拟合（过拟合）

4. 小结

Pytorch 模型选择，过拟合和欠拟合

模型选择，过拟合和欠拟合

0. 环境介绍

1. 选择模型

1.1 训练误差和泛化误差

1.2 验证数据集和测试数据集

1.3 K-折交叉验证

2. 过拟合和欠拟合

2.1 模型容量

2.2 估计模型容量

2.3 V C VC VC 维

2.4 数据复杂度

3. 通过多项式拟合来交互探索上述概念

3.0 导入模块

3.1 生成数据集

3.2 定义模型

3.2.1 定义计算损失

3.2.2 定义训练函数

3.3 三阶多项式函数拟合（正常）

3.3 线性函数拟合（欠拟合）

3.4 高阶多项式函数拟合（过拟合）

4. 小结

2.3 $V C$ 维