Machine Learning in R: the mlr Package (with a Simple Worked Example)

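I have recently been writing Ensemble Learning code in RStudio and tried two packages, mlr and caret, so here are some notes. (Development after 2019 moved to mlr3; mlr itself is no longer updated.)

mlr concentrates on providing a machine learning interface (it can also train custom ML models). Compared with caret, using mlr feels closer to Python's sklearn, which makes it friendlier to Python users.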

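0 mlr Basics

As in Python, machine learning with the R package mlr relies mainly on four steps:

generate tasks
A task describes the dataset being used; different subclasses specify the type of problem (regression, classification, and so on).
generate learner
The command that constructs a learner. It covers the mainstream machine learning problem types, such as classification, regression, survival analysis, and clustering.
train model
Fit the learner to a dataset by calling train() with the task and the learner created in the previous two steps.
prediction
Evaluate the trained model on validation data.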
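
The four steps above can be sketched end to end as follows. This is only a minimal illustration under assumptions that are not part of the original notes: it uses the built-in iris data and the classif.rpart learner as stand-ins, and a simple random 70/30 split.

```r
library(mlr)

# 1. Task: describe the data and the target (iris is a stand-in dataset)
task <- makeClassifTask(id = "iris", data = iris, target = "Species")

# 2. Learner: a classification tree from the rpart package (stand-in model)
lrn <- makeLearner("classif.rpart", predict.type = "response")

# 3. Train on a random 70% of the rows
set.seed(42)
n <- getTaskSize(task)
train_idx <- sample(n, size = round(0.7 * n))
mod <- mlr::train(lrn, task, subset = train_idx)

# 4. Predict on the remaining rows and measure accuracy
test_idx <- setdiff(seq_len(n), train_idx)
pred <- predict(mod, task = task, subset = test_idx)
performance(pred, measures = acc)
```

Each object is passed explicitly to the next step, which is what makes the workflow feel similar to sklearn.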

Hyperparameter tuning and ROC analysis can also be done with mlr; see the Advanced Tuning and ROC Analysis and Performance Curves sections of the Advanced module.

Reference & Index from mlr-org website:

1 Task

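Task() / make<TaskType>(): Constructing a task

A task describes the dataset used for training, for example the target variable (outcome variable) in a classification problem.
The base class is Task. In practice a task is created through one of the subclasses derived from it, which cover regression, classification, and other problem types. The constructor has the form make<TaskType>(), for example makeClassifTask().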

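Task type	Problem type
RegrTask()	Regression
ClassifTask()	Binary and multiclass classification (class-dependent costs can also be handled)
ClusterTask()	Cluster analysis
SurvTask()	Survival analysis
MultilabelTask()	Multilabel classification
CostSensTask()	Cost-sensitive classification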
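
To make the table more concrete, here is a small sketch that constructs three of the other task types. The datasets (mtcars, iris, and the hand-made ml.data frame) are illustrative assumptions and have nothing to do with the bank data used in these notes.

```r
library(mlr)

# Regression: the target is a numeric column
regr.task <- makeRegrTask(id = "mtcars", data = mtcars, target = "mpg")

# Clustering: there is no target at all, only features
cluster.task <- makeClusterTask(id = "iris.features", data = iris[, 1:4])

# Multilabel: one logical column per label (toy data made up for this sketch)
ml.data <- data.frame(x1 = c(1.2, 0.7, 3.1, 2.4),
                      x2 = c(0.3, 1.8, 2.2, 0.9),
                      label.a = c(TRUE, FALSE, TRUE, FALSE),
                      label.b = c(FALSE, FALSE, TRUE, TRUE))
multilabel.task <- makeMultilabelTask(id = "toy", data = ml.data,
                                      target = c("label.a", "label.b"))

# Each task records its own type
getTaskType(regr.task)
getTaskType(cluster.task)
getTaskType(multilabel.task)
```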

Example:

###load packages
library(mlr)
library(caret)

###load data
setwd("D:/R_Code/data")
df <- read.csv('bank.csv',sep = ';',stringsAsFactors = TRUE)
df <- df[,-12] #drop the 12th column

###Class weights: give observations of different classes different weights, implemented here with a custom function
Weights <- function(x,n){
  weights <- ifelse(x[['y']] == 'yes',n,1) #n for the 'yes' class, 1 otherwise
  return(weights) #returns a numeric vector with one weight per row
}

weights_of_classes <- Weights(df,5)
###The adaboost model in mlr does not support weights, so they are not used when creating the task

###Tasks
#Classification
classif.task = makeClassifTask(id = 'Bank',
                               data = df,
                               target = 'y', #in a classification task, the target variable must be a factor
                               positive = 'yes') #the positive class; if not declared, the first factor level of the target variable is used
                              
classif.task

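The weights computed above are class-level weights: every observation of a class gets the same value. If the misclassification costs instead depend on the individual example (example-dependent misclassification costs), consider Cost-Sensitive Classification.

Print the task created with the settings above (or use getTaskDesc(<task_name>)):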

> classif.task
Supervised task: Bank
Type: classif
Target: y
Observations: 4521
Features:
   numerics     factors     ordered functionals 
          7          10           0           0 
Missings: FALSE
Has weights: TRUE
Has blocking: FALSE
Has coordinates: FALSE
Classes: 2
  no  yes 
4000  521 
Positive class: yes
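
For reference, class weights of the kind computed by Weights() can be attached to a task through the weights argument of makeClassifTask(). The sketch below is an assumption-laden illustration rather than the code used in these notes: it builds a two-class stand-in problem from iris, and it checks the weights property on classif.rpart, not on the adaboost learner.

```r
library(mlr)

# Two-class stand-in data: is the flower a virginica or not?
toy <- iris
toy$y <- factor(ifelse(toy$Species == "virginica", "yes", "no"))
toy$Species <- NULL

# Weight 5 for the "yes" class, 1 otherwise (same idea as Weights(df, 5))
w <- ifelse(toy$y == "yes", 5, 1)

weighted.task <- makeClassifTask(id = "toy", data = toy, target = "y",
                                 weights = w, positive = "yes")

# The task description records whether weights are attached
getTaskDesc(weighted.task)$has.weights

# Whether a learner can use weights is one of its properties
hasLearnerProperties(makeLearner("classif.rpart"), "weights")
```

Checking the property first is a cheap way to find out whether weights stored in a task will actually be used by a given learner.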

subsetTask()

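In addition, subsetTask() creates a sub-task for the training set without rebuilding the task from scratch:

#train_index holds the row indices of the training observations
train_task <-  subsetTask(classif.task, subset = as.numeric(train_index))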
train_task

#results:
> train_task

Supervised task: Bank
Type: classif
Target: y
Observations: 3165
Features:
   numerics     factors     ordered functionals 
          7          10           0           0 
Missings: FALSE
Has weights: TRUE
Has blocking: FALSE
Has coordinates: FALSE
Classes: 2
  no  yes 
2800  365 
Positive class: yes
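
The notes do not show how train_index was built. The class counts in the output (2800 of 4000 "no", 365 of 521 "yes") are consistent with a roughly 70% split drawn within each class, so one plausible reconstruction is a stratified sample. The sketch below is an assumption, shown on iris with base R sampling; the name idx_by_class is hypothetical.

```r
library(mlr)

task <- makeClassifTask(id = "iris", data = iris, target = "Species")

# Group the row indices by class, then draw 70% from each group
set.seed(123)
idx_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_index <- unlist(
  lapply(idx_by_class, function(i) sample(i, size = round(0.7 * length(i)))),
  use.names = FALSE
)
test_index <- setdiff(seq_len(nrow(iris)), train_index)

# Sub-tasks share the task description but contain only the selected rows
train_task <- subsetTask(task, subset = train_index)
test_task <- subsetTask(task, subset = test_index)

getTaskSize(train_task)
getTaskSize(test_task)
```

Sampling within each class keeps the class proportions of the training set close to those of the full data, which matters for an imbalanced target like the one in the bank data.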

2 Learner

makeLearner(): Constructing a learner

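The unified command for constructing a learner is makeLearner(). Its arguments set the model class and the prediction type (class labels or probabilities). The models already integrated in mlr that can be called directly are given in the list of integrated learners on the mlr website, or by calling listLearners().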

classif.lrn = makeLearner("classif.ada", #model: adaboost
                          predict.type = "response", #predict class labels
                          par.vals = list(iter  = 100,cp = 0.01), #hyperparameter settings
                          fix.factors.prediction = TRUE)

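getParamSet(): available hyperparameters

getParamSet(<learner_name>) lists the hyperparameters that can be adjusted, i.e. the ones that can be set through the par.vals argument of makeLearner(). Hyperparameters can also be passed directly through ... in makeLearner(); according to the mlr documentation, values given in ... take precedence over those in par.vals, so it is best to use one mechanism or the other and not both.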

> getParamSet(classif.lrn)
                   Type len         Def               Constr Req Tunable Trafo
loss           discrete   - exponential exponential,logistic   -    TRUE     -
type           discrete   -    discrete discrete,real,gentle   -    TRUE     -
iter            integer   -          50             1 to Inf   -    TRUE     -
nu              numeric   -         0.1             0 to Inf   -    TRUE     -
bag.frac        numeric   -         0.5               0 to 1   -    TRUE     -
model.coef      logical   -        TRUE                    -   -    TRUE     -
bag.shift       logical   -       FALSE                    -   -    TRUE     -
max.iter        integer   -          20             1 to Inf   -    TRUE     -
delta           numeric   -       1e-10             0 to Inf   -    TRUE     -
verbose         logical   -       FALSE                    -   -   FALSE     -
minsplit        integer   -          20             1 to Inf   -    TRUE     -
minbucket       integer   -           -             1 to Inf   -    TRUE     -
cp              numeric   -        0.01               0 to 1   -    TRUE     -
maxcompete      integer   -           4             0 to Inf   -    TRUE     -
maxsurrogate    integer   -           5             0 to Inf   -    TRUE     -
usesurrogate   discrete   -           2                0,1,2   -    TRUE     -
surrogatestyle discrete   -           0                  0,1   -    TRUE     -
maxdepth        integer   -          30              1 to 30   -    TRUE     -
xval            integer   -          10             0 to Inf   -   FALSE     -
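
Beyond printing the parameter table, the values currently set on a learner can be read and changed programmatically. The following is a small sketch, assuming classif.rpart as a stand-in learner (it needs only the rpart package) and arbitrary example values for cp and maxdepth.

```r
library(mlr)

lrn <- makeLearner("classif.rpart",
                   par.vals = list(cp = 0.01, maxdepth = 10L))

# Full table of hyperparameters known for this learner
getParamSet(lrn)

# Only the values that are currently set on the learner
getHyperPars(lrn)

# setHyperPars() returns a modified copy; lrn itself is left unchanged
lrn2 <- setHyperPars(lrn, cp = 0.05)
getHyperPars(lrn2)$cp
getHyperPars(lrn)$cp
```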

3 Train

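train() / mlr::train() fits a learner to a dataset.
When caret and mlr are both loaded, a bare train() call may fail because both packages export a function named train() and one masks the other; calling mlr::train() explicitly avoids this. The call signature is: train(learner, task, subset = NULL, weights = NULL)
train() returns an object of class WrappedModel (see makeWrappedModel()), which can be used to predict on new observations.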

###Training a Learner
> mod = mlr::train(classif.lrn,classif.task,subset = train_index)
> mod
Model for learner.id=classif.ada; learner.class=classif.ada
Trained on: task.id = Bank; obs = 3165; features = 15
Hyperparameters: xval=0,iter=100,cp=0.01,maxdepth=15

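The subset argument of train() specifies which observations are used for training. Here it is a numeric vector holding the indices of those observations in the original dataset.

Argument	Meaning	Type
learner	The learner. If a string is passed, the learner is created via makeLearner().	learner or string
task	The task.	Task object
subset	The observations used for training, i.e. their indices.	logical or index vector
weights	Case weights, passed as a vector.	numeric; must have the same length as subset. By default no weights are used, so all cases count equally. Weights already stored in the task are overridden by weights passed to train(), which take priority.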
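
As an illustration of the subset and weights arguments together, here is a minimal sketch. It assumes the iris data and the classif.rpart learner as stand-ins, since a learner must support weights for this to work; the weight of 5 is an arbitrary example value.

```r
library(mlr)

task <- makeClassifTask(id = "iris", data = iris, target = "Species")
lrn <- makeLearner("classif.rpart")

set.seed(1)
train_idx <- sample(getTaskSize(task), 100)

# One weight per training observation, in the same order as train_idx;
# here virginica rows count five times as much as the others
w <- ifelse(iris$Species[train_idx] == "virginica", 5, 1)

mod <- mlr::train(lrn, task, subset = train_idx, weights = w)

# The underlying fitted model is available through getLearnerModel()
class(getLearnerModel(mod))
```

Note that the weight vector is aligned with subset, not with the full dataset, which is why it is built from iris$Species[train_idx].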

4 Predict

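Use predict() to make predictions for the task created earlier. To predict on the validation set, pass the indices of its observations through the subset argument.
There are two ways to obtain predictions:

Either pass the Task() via the task argument
or pass a data.frame via the newdata argument.

The first way uses subset in predict() to specify which rows of the task to predict: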

###Prediction

#get test_index: the rows of df that are not in train_index
df$index <- seq_len(nrow(df))
df$test <- ifelse(df$index %in% train_index,NA,df$index)
test_index <- na.omit(df$test)

#predict in test set
task.pred = predict(mod, task = classif.task, subset = test_index)
task.pred

###result
> task.pred
Prediction: 1356 observations
predict.type: response
threshold: 
time: 0.56
   id truth response
1   1    no       no
3   3    no       no
4   4    no       no
6   6    no       no
9   9    no       no
10 10    no       no
... (#rows: 1356, #cols: 3)

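The first argument of predict() is the model returned by train(). In the returned Prediction object task.pred, task.pred$data has three columns (id, truth, and response), from which the confusion matrix, accuracy, error rate, and similar metrics can be derived.
A quick look at the confusion matrix:

> table(actual = task.pred$data[,2],predicted = task.pred$data[,3])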
      predicted
actual   no  yes
   no  1185   15
   yes  138   18

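Alternatively, specify newdata:

df <- df[,-c(17,18)] #drop the two helper columns (index, test) created above
head(df)

newdata.pred = predict(mod, newdata = df[test_index,])
newdata.pred

#compare the two approaches; the data returned with newdata has no id column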
table(actual = task.pred$data[,2],predicted = task.pred$data[,3])
table(actual = newdata.pred$data[,1],predicted = newdata.pred$data[,2])

###----------------------results---------------------------
> newdata.pred = predict(mod, newdata = df[test_index,])
> newdata.pred
Prediction: 1356 observations
predict.type: response
threshold: 
time: 0.49
   truth response
1     no       no
3     no       no
4     no       no
6     no       no
9     no       no
10    no       no
... (#rows: 1356, #cols: 2)
> table(actual = task.pred$data[,2],predicted = task.pred$data[,3])
      predicted
actual   no  yes
   no  1185   15
   yes  138   18
> table(actual = newdata.pred$data[,1],predicted = newdata.pred$data[,2])
      predicted
actual   no  yes
   no  1185   15
   yes  138   18
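
The confusion matrix above can be turned into summary metrics with a few lines of base R. This is a worked-arithmetic sketch: the counts are copied from the output above, and the variable names are only illustrative.

```r
# Confusion matrix copied from the output above
# (rows: actual class, columns: predicted class)
cm <- matrix(c(1185, 15,
                138, 18),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("no", "yes"),
                             predicted = c("no", "yes")))

accuracy  <- sum(diag(cm)) / sum(cm)              # (1185 + 18) / 1356
error     <- 1 - accuracy                         # misclassification rate
recall    <- cm["yes", "yes"] / sum(cm["yes", ])  # 18 / 156
precision <- cm["yes", "yes"] / sum(cm[, "yes"])  # 18 / 33

round(c(accuracy = accuracy, error = error,
        recall = recall, precision = precision), 3)
# accuracy 0.887, error 0.113, recall 0.115, precision 0.545
```

The accuracy of about 0.887 mostly reflects the majority "no" class: only 18 of the 156 actual "yes" cases are recovered. In mlr itself, performance(task.pred, measures = list(acc, mmce)) and calculateConfusionMatrix(task.pred) should report the same quantities directly from the Prediction object.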

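At this point the main steps of building a model are complete. Hyperparameter tuning comes next and can be done with Basic-Tuning or Advanced-Tuning; I will write that up another time.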