Study notes, for reference only; corrections welcome.
PS: This blog was written in a mixed Chinese-English format.
Nonlinear Regression Models
Support Vector Machines
SVMs are a class of powerful, highly flexible modeling techniques.
For regression, we follow Smola (1996) and Drucker et al. (1997) and motivate this technique in the framework of robust regression, where we seek to minimize the effect of outliers on the regression equations.
Also, there are several flavors of support vector regression; we focus on one particular technique called $\epsilon$-insensitive regression.
Recall that linear regression seeks to find parameter estimates that minimize SSE. One drawback of minimizing SSE is that the parameter estimates can be influenced by just one observation that falls far from the overall trend in the data.
To mitigate this problem, SVMs let the user set a threshold: samples whose residuals fall within the threshold do not contribute to the regression fit, while samples whose absolute residuals exceed the threshold contribute in linear proportion to the size of the residual.
There are several consequences to this approach.
First, since squared residuals are not used, large outliers have a limited effect on the regression equation.
Second, samples that the model fits well (the residuals are small) have no effect on the regression equation.
In fact, if the threshold is set to a relatively large value, then the outliers are the only points that define the regression line!
This is somewhat counterintuitive: the poorly predicted points define the line. However, this approach has been shown to be very effective in defining the model.
To estimate the model parameters, the SVM uses the loss function shown above (residual on the horizontal axis, contribution on the vertical axis) together with a penalty term. The SVM regression coefficients minimize

$$\text{Cost}\sum_{i=1}^{n} L_{\epsilon}(y_i - \hat{y}_i) + \sum_{j=1}^{P} \beta_j^2$$

where $L_{\epsilon}(\cdot)$ is the $\epsilon$-insensitive function and Cost is a cost penalty, set by the user, that penalizes large residuals.
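As a concrete illustration, here is a minimal numpy sketch of the $\epsilon$-insensitive loss (the function name and the value of $\epsilon$ are illustrative):

```python
import numpy as np

def eps_insensitive_loss(residuals, epsilon=0.1):
    """Epsilon-insensitive loss: zero inside the threshold,
    linear in |residual| - epsilon outside it."""
    r = np.abs(residuals)
    return np.where(r <= epsilon, 0.0, r - epsilon)

# Residuals inside +/- epsilon contribute nothing; larger ones grow linearly.
print(eps_insensitive_loss(np.array([-0.05, 0.0, 0.08, 0.5, -2.0]), epsilon=0.1))
# -> [0.  0.  0.  0.4 1.9]
```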
Recall that the simple linear regression model predicted new samples using linear combinations of the data and parameters. For a new sample, $u$, the prediction equation is

$$\hat{y} = \beta_0 + \beta_1 u_1 + \cdots + \beta_P u_P$$
The linear support vector machine prediction function is very similar. The parameter estimates can be written as functions of a set of unknown parameters ($\alpha_i$) and the training set data points, so that

$$\hat{y} = \beta_0 + \sum_{i=1}^{n} \alpha_i \left( \sum_{j=1}^{P} x_{ij} u_j \right)$$
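A minimal Python sketch of this prediction equation, assuming the $\alpha_i$ and $\beta_0$ have already been estimated:

```python
import numpy as np

def svm_linear_predict(u, X_train, alpha, beta0):
    """y_hat = beta0 + sum_i alpha_i * (x_i . u), where x_i are the
    training set rows; alpha and beta0 are assumed already estimated."""
    return beta0 + np.sum(alpha * (X_train @ u))  # X_train @ u: all dot products x_i . u
```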
There are several aspects of this equation worth pointing out.
First, there are as many $\alpha$ parameters as there are data points. From the standpoint of classical regression modeling, this model would be considered overparameterized; typically, it is better to estimate fewer parameters than data points.
However, the use of the cost value effectively regularizes the model and helps alleviate this problem.
Second, the individual training set data points (the $x_{ij}$) are required for new predictions. When the training set is large, this makes the prediction equation less compact than those of other techniques. However, for some percentage of the training set samples, the $\alpha_i$ parameters will be exactly zero, indicating that they have no impact on the prediction equation. The data points associated with an $\alpha_i$ parameter of zero are the training set samples that are within $\pm\epsilon$ of the regression line (i.e., within the "funnel" or "tube" around the regression line). As a consequence, only the subset of training set data points with $\alpha_i \neq 0$ is needed for prediction.
Since the regression line is determined using these samples, they are called the support vectors, as they support the regression line.
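To see this in practice, here is a short scikit-learn sketch (the data and parameter values are illustrative); SVR exposes the samples with nonzero $\alpha_i$ via its support_ attribute:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.3, size=100)

# epsilon controls the width of the "tube"; points inside it get alpha_i = 0.
model = SVR(kernel="linear", C=1.0, epsilon=0.2).fit(X, y)
print(f"{len(model.support_)} of {len(X)} training points are support vectors")
```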
New samples enter the prediction function as sums of cross products with the training set points; in matrix algebra, this corresponds to a dot product ($x_i'u$). This is an important characteristic, because the regression equation can be rewritten in the more general form

$$\hat{y} = \beta_0 + \sum_{i=1}^{n} \alpha_i K(x_i, u)$$
where $K(\cdot)$ is called the kernel function. When the predictors enter the model linearly, the kernel function reduces to a simple sum of cross products:

$$K(x_i, u) = \sum_{j=1}^{P} x_{ij} u_j = x_i'u$$
However, other types of kernel functions can be used to generalize the regression model and encompass nonlinear functions of the predictors:

$$\text{polynomial} = (\phi(x'u) + 1)^{degree}$$
$$\text{radial basis function} = \exp(-\sigma\|x - u\|^2)$$
$$\text{hyperbolic tangent} = \tanh(\phi(x'u) + 1)$$

where $\phi$ and $\sigma$ are scale parameters. Since these functions of the predictors lead to nonlinear models, this generalization is often called the "kernel trick."
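As an illustration, here is a minimal sketch of the kernelized prediction equation with the radial basis function, again assuming the $\alpha_i$ and $\beta_0$ are given:

```python
import numpy as np

def rbf_kernel(x, u, sigma=1.0):
    """Radial basis function kernel: exp(-sigma * ||x - u||^2)."""
    return np.exp(-sigma * np.sum((x - u) ** 2))

def svm_kernel_predict(u, X_train, alpha, beta0, sigma=1.0):
    """General form: beta0 + sum_i alpha_i * K(x_i, u)."""
    return beta0 + sum(a * rbf_kernel(x, u, sigma) for a, x in zip(alpha, X_train))
```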
Which kernel function should be used? This depends on the problem. When the regression line is truly linear, the linear kernel function will be a better choice.
Note that some of the kernel functions have extra parameters. For example, the polynomial degree in the polynomial kernel must be specified. Similarly, the radial basis function has a parameter ($\sigma$) that controls the scale. These parameters, along with the cost value, constitute the tuning parameters for the model.
In the case of the radial basis function, there is a possible computational shortcut to estimating the kernel parameter. Caputo et al. (2002) suggested that $\sigma$ can be estimated using combinations of the training set points to calculate the distribution of $\|x - x'\|^2$, then using the 10th and 90th percentiles as a range for $\sigma$. Instead of tuning this parameter over a grid of candidate values, we can use the midpoint of these two percentiles.
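A minimal sketch of this shortcut (the function name is illustrative, and the code follows the description above literally; note that since the RBF kernel uses $\exp(-\sigma\|x-u\|^2)$, some implementations work with the inverse of these squared distances instead):

```python
import numpy as np
from scipy.spatial.distance import pdist

def estimate_sigma(X, low=10, high=90):
    # All pairwise squared distances ||x - x'||^2 over the training points.
    sq_dists = pdist(X, metric="sqeuclidean")
    lo, hi = np.percentile(sq_dists, [low, high])
    # Use the midpoint of the 10th/90th percentiles rather than tuning sigma.
    return 0.5 * (lo + hi)
```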
The cost parameter is the main tool for adjusting the complexity of the model.
When the cost is large, the model becomes very flexible, since the effect of errors is amplified. When the cost is small, the model will "stiffen" and become less likely to over-fit (but more likely to underfit), because the contribution of the squared parameters is proportionally large in the modified error function.
One could also tune the model over the size of the funnel ($\epsilon$). However, there is a relationship between $\epsilon$ and the cost parameter. In our experience, we have found that the cost parameter provides more flexibility for tuning the model, so we suggest fixing a value for $\epsilon$ and tuning over the other kernel parameters.
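As an illustration with scikit-learn (where gamma plays the role of $\sigma$, and X, y stand for the training predictors and outcome), one might fix $\epsilon$ and tune over the cost and kernel parameter:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Fix epsilon; tune only the cost and the RBF scale parameter over a grid.
grid = GridSearchCV(
    SVR(kernel="rbf", epsilon=0.1),
    param_grid={"C": [0.25, 1, 4, 16, 64], "gamma": [0.01, 0.1, 1.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
# grid.fit(X, y)  # X, y: training data (assumed available)
```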
Since the predictors enter the model as sums of cross products, differences in the predictor scales can affect the model. Therefore, we recommend centering and scaling the predictors prior to building an SVM model.
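For example, with scikit-learn this can be done inside a pipeline, so the same centering and scaling is applied to new samples at prediction time (parameter values are illustrative):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Center and scale the predictors, then fit the SVM, as one estimator.
svm_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
# svm_model.fit(X_train, y_train); svm_model.predict(X_new)
```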