[Data Mining Project] Predicting New Airbnb Users' First Booking Destinations


Abstract

This article predicts the first booking destinations of new Airbnb users, walking through the complete pipeline from data exploration to feature engineering to model building.

In particular: 1. The data exploration part is based on pandas, using common functions such as head(), value_counts(), describe(), isnull(), and unique(), plus matplotlib plots, to understand and explore the data. 2. The feature engineering part extracts year/month/day, season, and weekday from dates, buckets ages into segments, computes differences between related features, groups records by user id to compute counts, means, and standard deviations of selected variables, and encodes the data with one hot encoding and label encoding. 3. The model building part is based on sklearn and xgboost, calling different models to make predictions: logistic regression (LogisticRegression), tree models (DecisionTree, RandomForest, AdaBoost, Bagging, ExtraTree, GradientBoosting), SVM models (SVM-rbf, SVM-poly, SVM-linear), and xgboost. By varying the model parameters and the amount of training data, we observe the NDCG scores to understand how different models, parameters, and data sizes affect the prediction results.

1. Background

About this dataset: in this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user's first booking destination will be. All the users in this dataset are from the USA.

There are 12 possible outcomes of the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU', 'NDF' (no destination found), and 'other'. Note that 'NDF' is different from 'other': 'other' means there was a booking, but to a country not included in the list, while 'NDF' means there was no booking at all.

2. Data description

The dataset contains six CSV files in total; the ones used here are:

  1. train_users_2.csv - the training set of users
  2. test_users.csv - the test set of users
  • id: user id
  • date_account_created: the date of account creation
  • timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
  • date_first_booking: date of first booking
  • gender
  • age
  • signup_method
  • signup_flow: the page a user came to sign up from
  • language: international language preference
  • affiliate_channel: what kind of paid marketing
  • affiliate_provider: where the marketing is, e.g. google, craigslist, other
  • first_affiliate_tracked: the first marketing interaction the user had before signing up
  • signup_app
  • first_device_type
  • first_browser
  • country_destination: the target variable you are to predict
  3. sessions.csv - web sessions log for users
  • user_id: to be joined with the column 'id' in the users table
  • action
  • action_type
  • action_detail
  • device_type
  • secs_elapsed: time spent on the action
  4. sample_submission.csv - correct format for submitting your predictions
  • Data download: Airbnb 新用户的民宿预定预测-数据集

3. Data exploration

  • Based on Jupyter Notebook and Python 3.

3.1 The train_users_2 and test_users files

Imports

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
%matplotlib inline
import datetime
import os
import seaborn as sns  # visualization
from datetime import date
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer
import pickle  # for persisting models
from sklearn.metrics import *
from sklearn.model_selection import *
```

Read the files

```python
train = pd.read_csv("train_users_2.csv")
test = pd.read_csv("test_users.csv")
```

Inspect the columns

```python
print('the columns name of training dataset:\n', train.columns)
print('the columns name of test dataset:\n', test.columns)
```

Analysis:

  1. the train file has one more column than the test file: country_destination (a quick check below confirms this)
  2. country_destination is the target variable we need to predict
  3. the exploration below focuses on the train file; the test file is similar
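A quick sanity check (a minimal sketch, assuming `train` and `test` are loaded as above) confirms the single extra column:

```python
# the set difference of the two column indexes should contain only the target
print(set(train.columns) - set(test.columns))  # {'country_destination'}
```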

Inspect the data info

```python
print(train.info())
```

Analysis:

  1. the train file has 213,451 rows and 16 features
  2. info() shows each feature's dtype and non-null count
  3. date_first_booking has many missing values and can be dropped during feature extraction

Feature analysis

1. date_account_created

1.1 View the first rows of date_account_created

```python
print(train.date_account_created.head())
```

1.2 Count the values of date_account_created

```python
print(train.date_account_created.value_counts().head())
print(train.date_account_created.value_counts().tail())
```

1.3 Describe date_account_created

```python
print(train.date_account_created.describe())
```

1.4 Observe user growth

```python
dac_train = train.date_account_created.value_counts()
dac_test = test.date_account_created.value_counts()
# convert the dates to datetime
dac_train_date = pd.to_datetime(train.date_account_created.value_counts().index)
dac_test_date = pd.to_datetime(test.date_account_created.value_counts().index)
# days elapsed since the first account was created
dac_train_day = dac_train_date - dac_train_date.min()
dac_test_day = dac_test_date - dac_train_date.min()
# matplotlib scatter plot
plt.scatter(dac_train_day.days, dac_train.values, color='r', label='train dataset')
plt.scatter(dac_test_day.days, dac_test.values, color='b', label='test dataset')

plt.title("Accounts created vs day")
plt.xlabel("Days")
plt.ylabel("Accounts created")
plt.legend(loc='upper left')
```

Analysis:

  1. x axis: days since the first account was created
  2. y axis: number of accounts created that day
  3. the number of new registrations grows sharply over time

2. timestamp_first_active

2.1 View the first rows

```python
print(train.timestamp_first_active.head())
```

2.2 Check for duplicate timestamps

```python
print(train.timestamp_first_active.value_counts().unique())
```

Output: [1]. Analysis: every timestamp occurs exactly once, i.e. timestamp_first_active contains no duplicate values.

2.3 Convert the timestamp to datetime and describe it

```python
tfa_train_dt = train.timestamp_first_active.astype(str).apply(
    lambda x: datetime.datetime(int(x[:4]), int(x[4:6]), int(x[6:8]),
                                int(x[8:10]), int(x[10:12]), int(x[12:])))
print(tfa_train_dt.describe())
```

3. date_first_booking

Describe the data:

```python
print(train.date_first_booking.describe())
print(test.date_first_booking.describe())
```

Analysis:

  1. date_first_booking has many missing values in the train file
  2. date_first_booking is entirely missing in the test file
  3. the date_first_booking feature can therefore be dropped

4. age

4.1 Value counts

```python
print(train.age.value_counts().head())
```

Analysis: user ages are concentrated around 30.

4.2 Bar-chart comparison

```python
# split the ages into 4 groups: missing values, too small, reasonable, too large
age_train = [train[train.age.isnull()].age.shape[0],
             train.query('age < 15').age.shape[0],
             train.query("age >= 15 & age <= 90").age.shape[0],
             train.query('age > 90').age.shape[0]]

age_test = [test[test.age.isnull()].age.shape[0],
            test.query('age < 15').age.shape[0],
            test.query("age >= 15 & age <= 90").age.shape[0],
            test.query('age > 90').age.shape[0]]

columns = ['Null', 'age < 15', '15 <= age <= 90', 'age > 90']

# plot (positional x/y arguments work on the seaborn versions of this era;
# newer versions require keyword arguments)
fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(10, 5))

sns.barplot(columns, age_train, ax=ax1)
sns.barplot(columns, age_test, ax=ax2)

ax1.set_title('training dataset')
ax2.set_title('test dataset')
ax1.set_ylabel('counts')
```

Analysis: abnormal ages are rare, but there is a fair number of missing values.

5. Other features

  • The remaining features in the train file have few distinct labels, so in feature engineering we can simply one-hot encode them.

A shared helper draws the train/test bar charts for each feature:

```python
def feature_barplot(feature, df_train=train, df_test=test, figsize=(10, 5), rot=90, saveimg=False):
    feat_train = df_train[feature].value_counts()
    feat_test = df_test[feature].value_counts()
    fig_feature, (axis1, axis2) = plt.subplots(1, 2, sharex=True, sharey=True, figsize=figsize)
    sns.barplot(feat_train.index.values, feat_train.values, ax=axis1)
    sns.barplot(feat_test.index.values, feat_test.values, ax=axis2)
    axis1.set_xticklabels(axis1.xaxis.get_majorticklabels(), rotation=rot)
    axis2.set_xticklabels(axis1.xaxis.get_majorticklabels(), rotation=rot)
    axis1.set_title(feature + ' of training dataset')
    axis2.set_title(feature + ' of test dataset')
    axis1.set_ylabel('Counts')
    plt.tight_layout()
    if saveimg:
        figname = feature + ".png"
        fig_feature.savefig(figname, dpi=75)
```

5.1 gender

```python
feature_barplot('gender', saveimg=True)
```

5.2 signup_method

```python
feature_barplot('signup_method')
```

5.3 signup_flow

```python
feature_barplot('signup_flow')
```

5.4 language

```python
feature_barplot('language')
```

5.5 affiliate_channel

```python
feature_barplot('affiliate_channel')
```

5.6 first_affiliate_tracked

```python
feature_barplot('first_affiliate_tracked')
```

5.7 signup_app

```python
feature_barplot('signup_app')
```

5.8 first_device_type

```python
feature_barplot('first_device_type')
```

5.9 first_browser

```python
feature_barplot('first_browser')
```

3.2 The sessions file

Load the data and view the first 10 rows:

```python
df_sessions = pd.read_csv('sessions.csv')
df_sessions.head(10)
```

Rename user_id to id:

```python
# this makes the later merge with the users data easier
df_sessions['id'] = df_sessions['user_id']
df_sessions = df_sessions.drop(['user_id'], axis=1)  # drop the original column
```

Check the shape:

```python
df_sessions.shape
```

Output: (10567737, 6). Analysis: the sessions file has 10,567,737 rows and 6 columns.

Count the missing values:

```python
df_sessions.isnull().sum()
```

Analysis: action, action_type, action_detail, and secs_elapsed all have many missing values.

Fill the missing values:

```python
df_sessions.action = df_sessions.action.fillna('NAN')
df_sessions.action_type = df_sessions.action_type.fillna('NAN')
df_sessions.action_detail = df_sessions.action_detail.fillna('NAN')
df_sessions.isnull().sum()
```

Analysis:

  1. after filling, the three action columns have no missing values left
  2. secs_elapsed is filled later

4. Feature extraction

  • With a basic understanding of the data, we move on to feature extraction.

4.1 Features from the sessions file

1. action

```python
df_sessions.action.head()
```

```python
df_sessions.action.value_counts().min()
```

Output: 1. Analysis: there are many distinct user actions, and the rarest occurs only once; we can therefore group infrequent actions into an OTHER category.

1.1 Map actions occurring fewer than 100 times to OTHER

```python
# action values with low frequency are changed to 'OTHER'
act_freq = 100  # frequency threshold
act = dict(zip(*np.unique(df_sessions.action, return_counts=True)))
df_sessions.action = df_sessions.action.apply(lambda x: 'OTHER' if act[x] < act_freq else x)
# np.unique(..., return_counts=True) returns the distinct action values and their counts;
# zip(*(values, counts)) pairs each value with its count, so act maps value -> count
```
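To see what this mapping builds, here is a self-contained toy sketch (hypothetical values, not from the dataset):

```python
import numpy as np
import pandas as pd

s = pd.Series(['show', 'show', 'search', 'lookup'])
act = dict(zip(*np.unique(s, return_counts=True)))
print(act)  # {'lookup': 1, 'search': 1, 'show': 2}
# with a threshold of 2, the rare values collapse into 'OTHER':
print(s.apply(lambda x: 'OTHER' if act[x] < 2 else x).tolist())
# ['show', 'show', 'OTHER', 'OTHER']
```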

2. Derive finer features from action, action_detail, action_type, device_type, and secs_elapsed

  • First group the session rows by user id; the loop below then builds, per user:
  • action: the total number of actions, the count of each action value, and the number of distinct values with their mean and standard deviation
  • action_detail: the count of each action_detail value, plus the number of distinct values with their mean and standard deviation
  • action_type: the count of each action_type value, the number of distinct values with their mean and standard deviation, and the total dwell time per type (log-transformed)
  • device_type: the count of each device_type value, plus the number of distinct values with their mean and standard deviation
  • secs_elapsed: missing values are filled with 0; then the sum, mean, standard deviation, and median (log-transformed), the ratio sum / number of actions, and a 15-bin histogram of the log-transformed values
```python
# rank-style index for each distinct value of the categorical session features
f_act = df_sessions.action.value_counts().argsort()
f_act_detail = df_sessions.action_detail.value_counts().argsort()
f_act_type = df_sessions.action_type.value_counts().argsort()
f_dev_type = df_sessions.device_type.value_counts().argsort()

# group by user id
dgr_sess = df_sessions.groupby(['id'])
# loop over dgr_sess to create all the features
samples = []
ln = len(dgr_sess)  # number of groups

for g in dgr_sess:  # iterate over the data of each id
    gr = g[1]  # DataFrame containing all rows for one id, e.g. 'zzywmcn0jv'

    l = []  # temporary list holding the features of this sample

    # the id, e.g. 'zzywmcn0jv'
    l.append(g[0])

    # number of total actions
    l.append(len(gr))

    # fill missing secs_elapsed with 0 and keep the raw values; used later
    sev = gr.secs_elapsed.fillna(0).values

    # action features:
    # how many times each action occurs, number of unique actions, mean and std
    c_act = [0] * len(f_act)
    for i, v in enumerate(gr.action.values):  # v is the action value of row i
        c_act[f_act[v]] += 1
    _, c_act_uqc = np.unique(gr.action.values, return_counts=True)
    c_act += [len(c_act_uqc), np.mean(c_act_uqc), np.std(c_act_uqc)]
    l = l + c_act

    # action_detail features:
    # how many times each value occurs, number of unique values, mean and std
    c_act_detail = [0] * len(f_act_detail)
    for i, v in enumerate(gr.action_detail.values):
        c_act_detail[f_act_detail[v]] += 1
    _, c_act_det_uqc = np.unique(gr.action_detail.values, return_counts=True)
    c_act_detail += [len(c_act_det_uqc), np.mean(c_act_det_uqc), np.std(c_act_det_uqc)]
    l = l + c_act_detail

    # action_type features (click etc.):
    # how many times each value occurs, number of unique values, mean and std,
    # plus the log of the total secs_elapsed per value
    l_act_type = [0] * len(f_act_type)
    c_act_type = [0] * len(f_act_type)
    for i, v in enumerate(gr.action_type.values):
        l_act_type[f_act_type[v]] += sev[i]  # total dwell time per action type
        c_act_type[f_act_type[v]] += 1
    l_act_type = np.log(1 + np.array(l_act_type)).tolist()  # dwell times vary widely, so log-scale
    _, c_act_type_uqc = np.unique(gr.action_type.values, return_counts=True)
    c_act_type += [len(c_act_type_uqc), np.mean(c_act_type_uqc), np.std(c_act_type_uqc)]
    l = l + c_act_type + l_act_type

    # device_type features:
    # how many times each value occurs, number of unique values, mean and std
    c_dev_type = [0] * len(f_dev_type)
    for i, v in enumerate(gr.device_type.values):
        c_dev_type[f_dev_type[v]] += 1
    c_dev_type.append(len(np.unique(gr.device_type.values)))
    _, c_dev_type_uqc = np.unique(gr.device_type.values, return_counts=True)
    c_dev_type += [len(c_dev_type_uqc), np.mean(c_dev_type_uqc), np.std(c_dev_type_uqc)]
    l = l + c_dev_type

    # secs_elapsed features
    l_secs = [0] * 5
    l_log = [0] * 15
    if len(sev) > 0:
        # simple statistics of the secs_elapsed values
        l_secs[0] = np.log(1 + np.sum(sev))
        l_secs[1] = np.log(1 + np.mean(sev))
        l_secs[2] = np.log(1 + np.std(sev))
        l_secs[3] = np.log(1 + np.median(sev))
        l_secs[4] = l_secs[0] / float(l[1])  # total / number of actions

        # group the values into 15 intervals and count the values per interval
        log_sev = np.log(1 + sev).astype(int)
        # np.bincount(): count occurrences of each value in an array of non-negative ints
        l_log = np.bincount(log_sev, minlength=15).tolist()
    l = l + l_secs + l_log

    # the list l now holds the feature values of one sample
    samples.append(l)

# prepare the objects
samples = np.array(samples)
samp_ar = samples[:, 1:].astype(np.float16)  # all features except the id
samp_id = samples[:, 0]                      # the id, in the first column

# build a DataFrame of the extracted features
col_names = []
for i in range(len(samples[0]) - 1):  # minus 1 because of the id column
    col_names.append('c_' + str(i))
df_agg_sess = pd.DataFrame(samp_ar, columns=col_names)
df_agg_sess['id'] = samp_id
df_agg_sess.index = df_agg_sess.id  # use the id as the index
df_agg_sess.head()
```

Analysis: after feature extraction, the sessions data has grown from 6 columns to 458 features.

4.2 Features from the train and test files

Record the number of train rows and store the target variable:

  • labels stores the target variable country_destination

```python
train = pd.read_csv("train_users_2.csv")
test = pd.read_csv("test_users.csv")
# number of rows in train, used later to split train and test apart again
train_row = train.shape[0]

# the label we need to predict
labels = train['country_destination'].values
```

Drop date_first_booking, and country_destination from train:

  • data exploration showed date_first_booking is mostly missing in train and entirely missing in test, so it is dropped
  • country_destination is dropped so the models can predict it; the predictions are then compared against the stored labels to judge model quality

```python
train.drop(['country_destination', 'date_first_booking'], axis=1, inplace=True)
test.drop(['date_first_booking'], axis=1, inplace=True)
```

Concatenate the train and test files:

  • so the same feature-extraction steps can be applied to both

```python
# concatenate test and train
df = pd.concat([train, test], axis=0, ignore_index=True)
```

1. timestamp_first_active

1.1 Convert to datetime

```python
tfa = df.timestamp_first_active.astype(str).apply(
    lambda x: datetime.datetime(int(x[:4]), int(x[4:6]), int(x[6:8]),
                                int(x[8:10]), int(x[10:12]), int(x[12:])))
```

1.2 Extract features: year, month, day

```python
# create tfa_year, tfa_month, tfa_day features
df['tfa_year'] = np.array([x.year for x in tfa])
df['tfa_month'] = np.array([x.month for x in tfa])
df['tfa_day'] = np.array([x.day for x in tfa])
```

1.3 Extract feature: weekday

  • the result is one-hot encoded

```python
# isoweekday() returns the day of the week: Monday = 1, ..., Sunday = 7
df['tfa_wd'] = np.array([x.isoweekday() for x in tfa])
df_tfa_wd = pd.get_dummies(df.tfa_wd, prefix='tfa_wd')  # one hot encoding
df = pd.concat((df, df_tfa_wd), axis=1)  # append the encoded features
df.drop(['tfa_wd'], axis=1, inplace=True)  # drop the original, unencoded column
```

1.4 Extract feature: season

  • only the month matters for the season, so every date is mapped to the same reference year

```python
Y = 2000
seasons = [(0, (date(Y,  1,  1), date(Y,  3, 20))),  # winter
           (1, (date(Y,  3, 21), date(Y,  6, 20))),  # spring
           (2, (date(Y,  6, 21), date(Y,  9, 22))),  # summer
           (3, (date(Y,  9, 23), date(Y, 12, 20))),  # autumn
           (0, (date(Y, 12, 21), date(Y, 12, 31)))]  # winter

def get_season(dt):
    dt = dt.date()           # keep only the date part
    dt = dt.replace(year=Y)  # map the year to the reference year 2000
    return next(season for season, (start, end) in seasons if start <= dt <= end)

df['tfa_season'] = np.array([get_season(x) for x in tfa])
df_tfa_season = pd.get_dummies(df.tfa_season, prefix='tfa_season')  # one hot encoding
df = pd.concat((df, df_tfa_season), axis=1)
df.drop(['tfa_season'], axis=1, inplace=True)
```
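A quick sanity check of the helper (hypothetical dates):

```python
print(get_season(datetime.datetime(2014, 7, 1)))    # 2 (summer)
print(get_season(datetime.datetime(2014, 12, 25)))  # 0 (winter)
```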

2. date_account_created

2.1 Convert date_account_created to datetime

```python
dac = pd.to_datetime(df.date_account_created)
```

2.2 Extract features: year, month, day

```python
# create year, month, day features for dac
df['dac_year'] = np.array([x.year for x in dac])
df['dac_month'] = np.array([x.month for x in dac])
df['dac_day'] = np.array([x.day for x in dac])
```

2.3 Extract feature: weekday

```python
# create weekday features for dac
df['dac_wd'] = np.array([x.isoweekday() for x in dac])
df_dac_wd = pd.get_dummies(df.dac_wd, prefix='dac_wd')
df = pd.concat((df, df_dac_wd), axis=1)
df.drop(['dac_wd'], axis=1, inplace=True)
```

2.4 Extract feature: season

```python
# create season features for dac
df['dac_season'] = np.array([get_season(x) for x in dac])
df_dac_season = pd.get_dummies(df.dac_season, prefix='dac_season')
df = pd.concat((df, df_dac_season), axis=1)
df.drop(['dac_season'], axis=1, inplace=True)
```

2.5 Extract feature: the gap between date_account_created and timestamp_first_active

  • i.e. the time from a user's first activity on the Airbnb platform to their actual registration

```python
dt_span = dac.subtract(tfa).dt.days
```

  • the ten most frequent dt_span values:

```python
dt_span.value_counts().head(10)
```

Analysis: the values concentrate at -1; presumably a user who registers on the same day they first become active gets dt_span = -1.
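The -1 makes sense once you notice that date_account_created parses to midnight while timestamp_first_active carries a time of day, so a same-day signup gives a difference that is negative by a fraction of a day, which .dt.days floors to -1. A minimal sketch with hypothetical timestamps:

```python
import pandas as pd

dac = pd.to_datetime(pd.Series(['2010-06-28']))           # midnight
tfa = pd.to_datetime(pd.Series(['2010-06-28 21:46:00']))  # same day, later
print(dac.subtract(tfa).dt.days.iloc[0])  # -1
```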

  • From the gap, extract a categorical feature: one day, one month, one year, or other
  • i.e. whether the user took a day, a month, a year, or longer to go from first activity to registration

```python
# create a categorical feature:
# span == -1; -1 < span < 30; 30 <= span <= 365; span > 365
def get_span(dt):
    # dt is an integer number of days
    if dt == -1:
        return 'OneDay'
    elif (dt < 30) & (dt > -1):
        return 'OneMonth'
    elif (dt >= 30) & (dt <= 365):
        return 'OneYear'
    else:
        return 'other'

df['dt_span'] = np.array([get_span(x) for x in dt_span])
df_dt_span = pd.get_dummies(df.dt_span, prefix='dt_span')
df = pd.concat((df, df_dt_span), axis=1)
df.drop(['dt_span'], axis=1, inplace=True)
```

2.6 Drop the original columns

  • After extracting features from timestamp_first_active and date_account_created, drop the original columns from the feature list.

```python
df.drop(['date_account_created', 'timestamp_first_active'], axis=1, inplace=True)
```

3. age

```python
# get the ages
av = df.age.values
```

  • During exploration we saw most ages lie in the (15, 90) range, but some lie in (1900, 2000); those users presumably entered their birth year instead of their age, so we preprocess them first.

```python
# these are birth years instead of ages (estimate age as 2014 - value,
# since the data is from 2014)
av = np.where(np.logical_and(av < 2000, av > 1900), 2014 - av, av)
df['age'] = av
```

3.1 Bucket the ages

```python
age = df.age
age.fillna(-1, inplace=True)  # fill missing values with -1
div = 15

def get_age(age):
    # age is a float; map the continuous value to discrete buckets
    if age < 0:
        return 'NA'          # missing value
    elif (age < div):
        return div           # under 15 -> bucket 15
    elif (age <= div * 2):
        return div * 2       # (15, 30] -> bucket 30
    elif (age <= div * 3):
        return div * 3
    elif (age <= div * 4):
        return div * 4
    elif (age <= div * 5):
        return div * 5
    elif (age <= 110):
        return div * 6
    else:
        return 'Unphysical'  # implausible age
```

  • The bucketed age is one-hot encoded and added to the feature list as a new feature.

```python
df['age'] = np.array([get_age(x) for x in age])
df_age = pd.get_dummies(df.age, prefix='age')
df = pd.concat((df, df_age), axis=1)
df.drop(['age'], axis=1, inplace=True)
```

4. Other features

  • Exploration showed the remaining features have few distinct labels, so no further extraction is done; they are simply one-hot encoded.

```python
feat_toOHE = ['gender',
              'signup_method',
              'signup_flow',
              'language',
              'affiliate_channel',
              'affiliate_provider',
              'first_affiliate_tracked',
              'signup_app',
              'first_device_type',
              'first_browser']
# one-hot encode the remaining features (dummy_na adds a column for NaN)
for f in feat_toOHE:
    df_ohe = pd.get_dummies(df[f], prefix=f, dummy_na=True)
    df.drop([f], axis=1, inplace=True)
    df = pd.concat((df, df_ohe), axis=1)
```

4.3 Merge all extracted features

  • Merge the features extracted from sessions with those from the train and test files.

```python
# merge in the session features
df_all = pd.merge(df, df_agg_sess, how='left')
df_all = df_all.drop(['id'], axis=1)  # drop the id
df_all = df_all.fillna(-2)  # fill rows that have no session data

# add a column with the number of negative (filled) values per row, as an extra feature
df_all['all_null'] = np.array([sum(r < 0) for r in df_all.values])
```

5. Model building

5.1 Data preparation

1. Split the data back into train and test

  • train_row is the number of train rows recorded earlier

```python
Xtrain = df_all.iloc[:train_row, :]
Xtest = df_all.iloc[train_row:, :]
```

2. Save the extracted features to CSV

```python
Xtrain.to_csv("Airbnb_xtrain_v2.csv")
Xtest.to_csv("Airbnb_xtest_v2.csv")
# labels.tofile(): write the array to a file as text or binary (default)
labels.tofile("Airbnb_ytrain_v2.csv", sep='\n', format='%s')  # save the target variable
```

  • Read the feature files back:

```python
xtrain = pd.read_csv("Airbnb_xtrain_v2.csv", index_col=0)
ytrain = pd.read_csv("Airbnb_ytrain_v2.csv", header=None)
xtrain.head()
```

```python
ytrain.head()
```

Analysis: after feature extraction, xtrain has grown to 665 features, and ytrain holds the target variable of the training set.

3. Label-encode the target variable

```python
le = LabelEncoder()
# ytrain.values is a column vector; .ravel() would silence sklearn's shape warning
ytrain_le = le.fit_transform(ytrain.values)
```

  • Before label encoding: ['AU', 'CA', 'DE', 'ES', 'FR', 'GB', 'IT', 'NDF', 'NL', 'PT', 'US', 'other']
  • After label encoding: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
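For a submission file, the integer predictions would eventually be mapped back to country codes with the same encoder. A sketch (`y_prob` is a hypothetical predict_proba output of shape [n_samples, 12], not computed in this post):

```python
# top-5 destinations per user, highest probability first
top5 = np.argsort(y_prob, axis=1)[:, ::-1][:, :5]
countries = le.inverse_transform(top5.ravel()).reshape(top5.shape)
```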

4. Take 10% of the data for model training

  • this reduces the time spent training the models

```python
# take 10% of the data for faster training
n = int(xtrain.shape[0] * 0.1)
xtrain_new = xtrain.iloc[:n, :]  # training data
ytrain_new = ytrain_le[:n]       # training labels
```

5. Standard-scaling the dataset

  • Standardization of a dataset is a common requirement for many machine learning estimators: they may behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with zero mean and unit variance).

```python
X_scaler = StandardScaler()
xtrain_new = X_scaler.fit_transform(xtrain_new)
# note: the same fitted scaler should also be applied to Xtest before predicting
```

5.2 Scoring metric: NDCG

  • NDCG is a measure of ranking quality that takes the relevance of every element into account.
  • Since the target variable is not binary, we use NDCG rather than a binary metric to score and compare the models.
  • For ordinary binary targets, f1 score, precision, recall, and AUC would be the usual choices.
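Concretely, if the top-k predictions are ordered by decreasing score and rel_i ∈ {0, 1} marks whether position i is the true destination, the metric implemented below is

$$\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}$$

where IDCG@k is the DCG@k of the ideal ordering, so a correct first guess scores 1 while a correct fifth guess scores only 1/log2(6) ≈ 0.39.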
```python
from sklearn.metrics import make_scorer

def dcg_score(y_true, y_score, k=5):
    """
    y_true : array, shape = [n_samples]
        Ground truth (true relevance labels).
    y_score : array, shape = [n_samples, n_classes]
        Predicted scores.
    k : int
    """
    order = np.argsort(y_score)[::-1]    # sort the scores from high to low
    y_true = np.take(y_true, order[:k])  # keep the top k, i.e. [0, k)

    gain = 2 ** y_true - 1

    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gain / discounts)


def ndcg_score(ground_truth, predictions, k=5):
    """
    Parameters
    ----------
    ground_truth : array, shape = [n_samples]
        Ground truth (true labels represented as integers).
    predictions : array, shape = [n_samples, n_classes]
        Predicted probabilities.
    k : int
        Rank.
    """
    lb = LabelBinarizer()
    lb.fit(range(len(predictions) + 1))
    T = lb.transform(ground_truth)
    scores = []
    # iterate over each y_true and compute the DCG score
    for y_true, y_score in zip(T, predictions):
        actual = dcg_score(y_true, y_score, k)
        best = dcg_score(y_true, y_true, k)
        score = float(actual) / float(best)
        scores.append(score)

    return np.mean(scores)
```
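make_scorer is imported above but never used. If one wanted to plug ndcg_score straight into sklearn's cross-validation utilities, it could be wrapped like this (a sketch, untested against the rest of the pipeline):

```python
# needs_proba=True makes sklearn pass predicted probabilities as `predictions`;
# the extra k=5 keyword is forwarded on to ndcg_score
ndcg_scorer = make_scorer(ndcg_score, needs_proba=True, k=5)
# scores = cross_val_score(lr, xtrain_new, ytrain_new, cv=5, scoring=ndcg_scorer)
```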

6. Building the models

6.1 Logistic Regression

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

lr = LogisticRegression(C=1.0, penalty='l2', multi_class='ovr')
RANDOM_STATE = 2017  # random seed

# k-fold cross-validation
# (random_state only takes effect with shuffle=True; newer sklearn versions
# raise an error if it is passed without shuffle)
kf = KFold(n_splits=5, random_state=RANDOM_STATE)  # 5 folds
train_score = []
cv_score = []

# select a k for NDCG@k:
k_ndcg = 3

# kf.split: generate indices to split the data into training and test sets
for train_index, test_index in kf.split(xtrain_new, ytrain_new):
    # split the training data into a training fold and a validation fold
    X_train, X_test = xtrain_new[train_index, :], xtrain_new[test_index, :]
    y_train, y_test = ytrain_new[train_index], ytrain_new[test_index]

    lr.fit(X_train, y_train)

    y_pred = lr.predict_proba(X_test)
    train_ndcg_score = ndcg_score(y_train, lr.predict_proba(X_train), k=k_ndcg)
    cv_ndcg_score = ndcg_score(y_test, y_pred, k=k_ndcg)

    train_score.append(train_ndcg_score)
    cv_score.append(cv_ndcg_score)

print("\nThe training score is: {}".format(np.mean(train_score)))
print("\nThe cv score is: {}".format(np.mean(cv_score)))
```

Output:

The training score is: 0.7595244143892934
The cv score is: 0.7416926026958558

Learning curves for logistic regression

  • Observe how the logistic regression learning curve changes.

1. Vary the iteration parameter

```python
# set the iterations
iteration = [1, 5, 10, 15, 20, 50, 100]

kf = KFold(n_splits=3, random_state=RANDOM_STATE)

train_score = []
cv_score = []

# select a k:
k_ndcg = 5

for i, item in enumerate(iteration):

    lr = LogisticRegression(C=1.0, max_iter=item, tol=1e-5, solver='newton-cg', multi_class='ovr')
    train_score_iter = []
    cv_score_iter = []

    for train_index, test_index in kf.split(xtrain_new, ytrain_new):
        X_train, X_test = xtrain_new[train_index, :], xtrain_new[test_index, :]
        y_train, y_test = ytrain_new[train_index], ytrain_new[test_index]

        lr.fit(X_train, y_train)

        y_pred = lr.predict_proba(X_test)
        train_ndcg_score = ndcg_score(y_train, lr.predict_proba(X_train), k=k_ndcg)
        cv_ndcg_score = ndcg_score(y_test, y_pred, k=k_ndcg)

        train_score_iter.append(train_ndcg_score)
        cv_score_iter.append(cv_ndcg_score)

    train_score.append(np.mean(train_score_iter))
    cv_score.append(np.mean(cv_score_iter))

ymin = np.min(cv_score) - 0.05
ymax = np.max(train_score) + 0.05

plt.figure(figsize=(9, 4))
plt.plot(iteration, train_score, 'ro-', label='training')
plt.plot(iteration, cv_score, 'b*-', label='Cross-validation')
plt.xlabel("iterations")
plt.ylabel("Score")
plt.xlim(-5, np.max(iteration) + 10)
plt.ylim(ymin, ymax)
plt.plot(np.linspace(20, 20, 50), np.linspace(ymin, ymax, 50), 'g--')
plt.legend(loc='lower right', fontsize=12)
plt.title("Score vs iteration learning curve")

plt.tight_layout()
```

Analysis: the model's score rises as iteration increases and is essentially flat once iteration exceeds about 20.

2. Vary the sample size

```python
# changing the sample size
# set the iterations to the best value found above: max_iter = 20

perc = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1]

kf = KFold(n_splits=3, random_state=RANDOM_STATE)

train_score = []
cv_score = []

# select a k:
k_ndcg = 5

for i, item in enumerate(perc):

    lr = LogisticRegression(C=1.0, max_iter=20, tol=1e-6, solver='newton-cg', multi_class='ovr')
    train_score_iter = []
    cv_score_iter = []

    n = int(xtrain_new.shape[0] * item)
    xtrain_perc = xtrain_new[:n, :]
    ytrain_perc = ytrain_new[:n]

    for train_index, test_index in kf.split(xtrain_perc, ytrain_perc):

        X_train, X_test = xtrain_perc[train_index, :], xtrain_perc[test_index, :]
        y_train, y_test = ytrain_perc[train_index], ytrain_perc[test_index]

        print(X_train.shape, X_test.shape)

        lr.fit(X_train, y_train)

        y_pred = lr.predict_proba(X_test)
        train_ndcg_score = ndcg_score(y_train, lr.predict_proba(X_train), k=k_ndcg)
        cv_ndcg_score = ndcg_score(y_test, y_pred, k=k_ndcg)

        train_score_iter.append(train_ndcg_score)
        cv_score_iter.append(cv_ndcg_score)

    train_score.append(np.mean(train_score_iter))
    cv_score.append(np.mean(cv_score_iter))

ymin = np.min(cv_score) - 0.1
ymax = np.max(train_score) + 0.1

plt.figure(figsize=(9, 4))
plt.plot(np.array(perc) * 100, train_score, 'ro-', label='training')
plt.plot(np.array(perc) * 100, cv_score, 'bo-', label='Cross-validation')
plt.xlabel("Sample size (unit %)")
plt.ylabel("Score")
plt.xlim(-5, np.max(perc) * 100 + 10)
plt.ylim(ymin, ymax)

plt.legend(loc='lower right', fontsize=12)
plt.title("Score vs sample size learning curve")

plt.tight_layout()
```

Analysis: the cross-validation score (blue) keeps rising as the sample grows; since only 10% of the data was used for training, using the full dataset would likely improve the results.

6.2 Tree models

  • The models include DecisionTree, RandomForest, AdaBoost, Bagging, ExtraTree, and GradientBoosting.

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import *
from sklearn.svm import SVC, LinearSVC, NuSVC

LEARNING_RATE = 0.1
N_ESTIMATORS = 50
RANDOM_STATE = 2017
MAX_DEPTH = 9

# a dictionary of tree-based models
clf_tree = {
    'DTree': DecisionTreeClassifier(max_depth=MAX_DEPTH,
                                    random_state=RANDOM_STATE),

    'RF': RandomForestClassifier(n_estimators=N_ESTIMATORS,
                                 max_depth=MAX_DEPTH,
                                 random_state=RANDOM_STATE),

    'AdaBoost': AdaBoostClassifier(n_estimators=N_ESTIMATORS,
                                   learning_rate=LEARNING_RATE,
                                   random_state=RANDOM_STATE),

    'Bagging': BaggingClassifier(n_estimators=N_ESTIMATORS,
                                 random_state=RANDOM_STATE),

    'ExtraTree': ExtraTreesClassifier(max_depth=MAX_DEPTH,
                                      n_estimators=N_ESTIMATORS,
                                      random_state=RANDOM_STATE),

    'GraBoost': GradientBoostingClassifier(learning_rate=LEARNING_RATE,
                                           max_depth=MAX_DEPTH,
                                           n_estimators=N_ESTIMATORS,
                                           random_state=RANDOM_STATE)
}

train_score = []
cv_score = []

kf = KFold(n_splits=3, random_state=RANDOM_STATE)

k_ndcg = 5

for key in clf_tree.keys():

    clf = clf_tree.get(key)

    train_score_iter = []
    cv_score_iter = []

    for train_index, test_index in kf.split(xtrain_new, ytrain_new):

        X_train, X_test = xtrain_new[train_index, :], xtrain_new[test_index, :]
        y_train, y_test = ytrain_new[train_index], ytrain_new[test_index]

        clf.fit(X_train, y_train)

        y_pred = clf.predict_proba(X_test)
        train_ndcg_score = ndcg_score(y_train, clf.predict_proba(X_train), k=k_ndcg)
        cv_ndcg_score = ndcg_score(y_test, y_pred, k=k_ndcg)

        train_score_iter.append(train_ndcg_score)
        cv_score_iter.append(cv_ndcg_score)

    train_score.append(np.mean(train_score_iter))
    cv_score.append(np.mean(cv_score_iter))

train_score_tree = train_score
cv_score_tree = cv_score

ymin = np.min(cv_score) - 0.05
ymax = np.max(train_score) + 0.05

x_ticks = clf_tree.keys()

plt.figure(figsize=(8, 5))
plt.plot(range(len(x_ticks)), train_score_tree, 'ro-', label='training')
plt.plot(range(len(x_ticks)), cv_score_tree, 'bo-', label='Cross-validation')

plt.xticks(range(len(x_ticks)), x_ticks, rotation=45, fontsize=10)
plt.xlabel("Tree method", fontsize=12)
plt.ylabel("Score", fontsize=12)
plt.xlim(-0.5, 5.5)
plt.ylim(ymin, ymax)

plt.legend(loc='best', fontsize=12)
plt.title("Different tree methods")

plt.tight_layout()
```

6.3 SVM models

  • Depending on the kernel function, these include SVM-rbf, SVM-poly, SVM-linear, and so on.

```python
TOL = 1e-4
MAX_ITER = 1000

clf_svm = {

    'SVM-rbf': SVC(kernel='rbf',
                   max_iter=MAX_ITER,
                   tol=TOL, random_state=RANDOM_STATE,
                   decision_function_shape='ovr'),

    'SVM-poly': SVC(kernel='poly',
                    max_iter=MAX_ITER,
                    tol=TOL, random_state=RANDOM_STATE,
                    decision_function_shape='ovr'),

    'SVM-linear': SVC(kernel='linear',
                      max_iter=MAX_ITER,
                      tol=TOL,
                      random_state=RANDOM_STATE,
                      decision_function_shape='ovr'),

    'LinearSVC': LinearSVC(max_iter=MAX_ITER,
                           tol=TOL,
                           random_state=RANDOM_STATE,
                           multi_class='ovr')
}

train_score_svm = []
cv_score_svm = []

kf = KFold(n_splits=3, random_state=RANDOM_STATE)

k_ndcg = 5

for key in clf_svm.keys():

    clf = clf_svm.get(key)

    train_score_iter = []
    cv_score_iter = []

    for train_index, test_index in kf.split(xtrain_new, ytrain_new):

        X_train, X_test = xtrain_new[train_index, :], xtrain_new[test_index, :]
        y_train, y_test = ytrain_new[train_index], ytrain_new[test_index]

        clf.fit(X_train, y_train)

        # these SVMs expose decision_function rather than predict_proba
        y_pred = clf.decision_function(X_test)
        train_ndcg_score = ndcg_score(y_train, clf.decision_function(X_train), k=k_ndcg)
        cv_ndcg_score = ndcg_score(y_test, y_pred, k=k_ndcg)

        train_score_iter.append(train_ndcg_score)
        cv_score_iter.append(cv_ndcg_score)

    train_score_svm.append(np.mean(train_score_iter))
    cv_score_svm.append(np.mean(cv_score_iter))

ymin = np.min(cv_score_svm) - 0.05
ymax = np.max(train_score_svm) + 0.05

x_ticks = clf_svm.keys()

plt.figure(figsize=(8, 5))
plt.plot(range(len(x_ticks)), train_score_svm, 'ro-', label='training')
plt.plot(range(len(x_ticks)), cv_score_svm, 'bo-', label='Cross-validation')

plt.xticks(range(len(x_ticks)), x_ticks, rotation=45, fontsize=10)
plt.xlabel("SVM method", fontsize=12)
plt.ylabel("Score", fontsize=12)
plt.xlim(-0.5, 3.5)
plt.ylim(ymin, ymax)

plt.legend(loc='best', fontsize=12)
plt.title("Different SVM methods")

plt.tight_layout()
```

6.4 xgboost

  • A model frequently used in Kaggle competitions.

```python
import xgboost as xgb

# custom evaluation metric: NDCG@5 over the top-5 predicted classes
def customized_eval(preds, dtrain):
    labels = dtrain.get_label()
    top = []
    for i in range(preds.shape[0]):
        top.append(np.argsort(preds[i])[::-1][:5])
    mat = np.reshape(np.repeat(labels, np.shape(top)[1]) == np.array(top).ravel(),
                     np.array(top).shape).astype(int)
    score = np.mean(np.sum(mat / np.log2(np.arange(2, mat.shape[1] + 2)), axis=1))
    return 'ndcg5', score

# xgboost parameters
NUM_XGB = 200

params = {}
params['colsample_bytree'] = 0.6
params['max_depth'] = 6
params['subsample'] = 0.8
params['eta'] = 0.3
params['seed'] = RANDOM_STATE
params['num_class'] = 12
params['objective'] = 'multi:softprob'  # output probabilities instead of classes

train_score_iter = []
cv_score_iter = []

kf = KFold(n_splits=3, random_state=RANDOM_STATE)

k_ndcg = 5

for train_index, test_index in kf.split(xtrain_new, ytrain_new):

    X_train, X_test = xtrain_new[train_index, :], xtrain_new[test_index, :]
    y_train, y_test = ytrain_new[train_index], ytrain_new[test_index]

    train_xgb = xgb.DMatrix(X_train, label=y_train)
    test_xgb = xgb.DMatrix(X_test, label=y_test)

    watchlist = [(train_xgb, 'train'), (test_xgb, 'test')]

    bst = xgb.train(params,
                    train_xgb,
                    NUM_XGB,
                    watchlist,
                    feval=customized_eval,
                    verbose_eval=3,
                    early_stopping_rounds=5)

    y_pred = np.array(bst.predict(test_xgb))
    y_pred_train = np.array(bst.predict(train_xgb))
    train_ndcg_score = ndcg_score(y_train, y_pred_train, k=k_ndcg)
    cv_ndcg_score = ndcg_score(y_test, y_pred, k=k_ndcg)

    train_score_iter.append(train_ndcg_score)
    cv_score_iter.append(cv_ndcg_score)

train_score_xgb = np.mean(train_score_iter)
cv_score_xgb = np.mean(cv_score_iter)

print("\nThe training score is: {}".format(train_score_xgb))
print("The cv score is: {}\n".format(cv_score_xgb))
```

Output:

The training score is: 0.803445955699075
The cv score is: 0.7721491602424301

7. Model comparison

```python
# cv_score_lr and cv_score_xgb are assumed to hold the mean CV scores from
# sections 6.1 and 6.4 (e.g. cv_score_lr = np.mean(cv_score) in 6.1); the post
# never binds them under these names, so define them before running this cell
model_cvscore = np.hstack((cv_score_lr, cv_score_tree, cv_score_svm, cv_score_xgb))
model_name = np.array(['LinearReg', 'ExtraTree', 'DTree', 'RF', 'GraBoost', 'Bagging',
                       'AdaBoost', 'LinearSVC', 'SVM-linear', 'SVM-rbf', 'SVM-poly', 'Xgboost'])

fig = plt.figure(figsize=(8, 4))

sns.barplot(model_cvscore, model_name, palette="Blues_d")

plt.xticks(rotation=0, size=10)
plt.xlabel("CV score", fontsize=12)
plt.ylabel("Model", fontsize=12)
plt.title("Cross-validation score for different models")

plt.tight_layout()
```

8. Summary

  1. Understanding and exploring the data matters a great deal.
  2. Feature engineering can extract further, more informative features.
  3. There are many ways to evaluate a model; pick a metric suited to the task.
  4. Only 10% of the data was used for training here; training on the full dataset would likely improve the results.
  5. The underlying algorithms deserve deeper study, along with systematic parameter tuning.

