Handling Imbalanced Data with the UCI Census Income Dataset (Part 2)


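This is the second post in a series on handling imbalanced data. In the previous post we preprocessed and visualized the data, then trained and evaluated a first model, which gave us an overall picture of the dataset. During preprocessing, missing values and highly correlated variables received the most attention. Fixing those two problems did not remove the class imbalance itself, so we tried three resampling methods: upsampling, downsampling, and SMOTE. SMOTE, which took the longest to run, performed best and reached the highest accuracy, 0.896. Just as we were about to celebrate, the bad news arrived: specificity was only 0.254. In other words, the model predicts the minority class (people earning more than 50k) poorly, which is not a satisfying result. With a majority-to-minority ratio of 96:4, an accuracy of 0.896 is hardly surprising. To overcome this weakness, this post uses xgboost, an efficient and powerful framework, from R and ends up with much better results.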
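
To see why a high accuracy says little at a 96:4 class ratio, here is a minimal base R sketch on synthetic labels (the vectors and counts are made up for illustration, not taken from the census data). A model that always predicts the majority class already reaches 96% accuracy, while its specificity, with the majority class treated as the positive class as in the caret output later in this post, is zero.

```r
# Synthetic labels at a 96:4 ratio: 0 = income below 50k (majority), 1 = above 50k (minority)
truth <- c(rep(0, 960), rep(1, 40))
# A "model" that always predicts the majority class
pred <- rep(0, length(truth))

accuracy <- mean(pred == truth)
# Majority class 0 is the positive class, so specificity is the share of
# true minority cases (1) that are actually predicted as 1
specificity <- sum(pred == 1 & truth == 1) / sum(truth == 1)

cat("accuracy:", accuracy, " specificity:", specificity, "\n")
```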

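As an aside, I had budgeted 10 pomodoros for this task, but it took more than twice that, and I spent the entire May Day holiday in the lab. This small project taught me how hard it is to work alone with no one to ask: all I could do was keep googling, keep tearing code down and rewriting it, keep taking things in and putting things out...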

GitHub repository for this project: https://github.com/swordspoet/UCI_imbalanced_data

  • 1. First attempt with xgboost
    • xgboost model parameters: explanation and settings
  • 2. xgboost in mlr
    • Data preprocessing
    • Parameter tuning and model training
  • References

1. First attempt with xgboost

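To train a model that predicts both the majority and the minority class well, this post uses xgboost (eXtreme Gradient Boosting), created by Tianqi Chen of the University of Washington. Its main strengths are that it automatically uses multiple CPU threads for parallel computation and that its algorithmic refinements improve accuracy. As more and more teams have won Kaggle competitions with it, xgboost has become very popular in recent years.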

xgboost is not only available as a Python library; Tong He also built an R interface, so it can be used directly once the package is installed in R.

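As before, we start with preprocessing. The approach is to handle the numeric and categorical variables of the training and test sets separately: for numeric variables, drop those that are highly correlated; for categorical variables, drop those with too many missing values. The numeric and categorical parts are then combined again. Because R is demanding on memory, after preprocessing we also remove train, test, num_train, num_test, cat_train, and cat_test from the session to free memory; otherwise R may run out of it. On my desktop with 8 GB of RAM, CPU and memory were frequently maxed out during this experiment, and the session froze or crashed several times, so pay attention to memory management when processing data in R.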

# Load packages
library(caret)
library(data.table)
library(xgboost)
# Load the datasets; treat "", " ", "?" and "NA" as missing values
train <- fread("E:/R/imbalancedata/train.csv", na.strings = c("", " ", "?", "NA"))
test <- fread("E:/R/imbalancedata/test.csv", na.strings = c("", " ", "?", "NA"))
table(is.na(train)); table(is.na(test))
# Split train and test into categorical (factcols) and numeric (numcols) columns
factcols <- c(2:5, 7, 8:16, 20:29, 31:38, 40, 41)
numcols <- setdiff(1:40, factcols)
cat_train <- train[, factcols, with = FALSE]; cat_test <- test[, factcols, with = FALSE]
num_train <- train[, numcols, with = FALSE]; num_test <- test[, numcols, with = FALSE]
# Drop highly correlated numeric variables (cutoff 0.7)
ax <- findCorrelation(cor(num_train), cutoff = 0.7)
# Guard against an empty result: indexing with -integer(0) would drop every column
if (length(ax) > 0) {
  num_train <- num_train[, -ax, with = FALSE]; num_test <- num_test[, -ax, with = FALSE]
}
# Percentage of missing values in each categorical column
mvtr <- sapply(cat_train, function(x) sum(is.na(x)) / length(x) * 100)
mvte <- sapply(cat_test, function(x) sum(is.na(x)) / length(x) * 100)
# Keep only the columns with less than 5% missing values
cat_train <- subset(cat_train, select = mvtr < 5)
cat_test <- subset(cat_test, select = mvte < 5)
# Label the remaining missing values explicitly
cat_train[is.na(cat_train)] <- "Missing"; cat_test[is.na(cat_test)] <- "Missing"
# Combine the numeric and categorical data again
d_train <- cbind(num_train, cat_train); d_test <- cbind(num_test, cat_test)
# Remove intermediate objects to free memory
rm(train, test, num_train, num_test, cat_train, cat_test)
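
The missing-value filter can be tried on a toy table first. The sketch below uses a plain data frame instead of data.table, and the column names and values are made up for illustration; only the 5% threshold is taken from the code above.

```r
# Toy categorical data (hypothetical columns): a has no NA, b has 40% NA, c has 4% NA
toy <- data.frame(
  a = rep(c("x", "y"), length.out = 25),
  b = c(rep(NA, 10), rep("u", 15)),
  c = c(NA, rep("k", 24)),
  stringsAsFactors = FALSE
)

# Percentage of missing values per column
miss_pct <- sapply(toy, function(x) sum(is.na(x)) / length(x) * 100)

# Keep columns below the 5% threshold, then label any remaining NA as "Missing"
kept <- toy[, miss_pct < 5, drop = FALSE]
kept[is.na(kept)] <- "Missing"

print(miss_pct)
print(names(kept))
print(head(kept$c, 3))
```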

xgboost model parameters: explanation and settings

With preprocessing done, the next step is to build the classifier. Before running xgboost, three groups of parameters have to be set by hand: general parameters, booster parameters, and task parameters. The official xgboost documentation describes them in detail:

  • General parameters relate to which booster we are using to do boosting, commonly a tree or linear model
  • Booster parameters depend on which booster you have chosen
  • Learning task parameters decide on the learning scenario; for example, regression tasks may use different parameters than ranking tasks
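  1. Tree Booster parameters:
  • eta [default=0.3, alias: learning_rate]
    • Learning rate; helps prevent overfitting. The default is 0.3 and typical values lie in [0.01, 0.3]
    • After each boosting step, eta shrinks the newly obtained weights, which makes the boosting process more conservative
  • gamma [default=0, alias: min_split_loss]
    • Minimum loss reduction required to split a node further. It acts as a regularization term: the larger gamma is, the less prone the model is to overfitting
  • max_depth [default=6]
    • The larger max_depth is, the more complex the model and the more likely it is to overfit
    • There is no fixed rule for choosing max_depth
  • min_child_weight [default=1]
    • In classification, if the leaf node has a minimum sum of instance weight (calculated by second order partial derivative) lower than min_child_weight, the tree splitting stops.
  • subsample [default=1] [range: (0,1]]
    • It controls the fraction of samples (observations) supplied to a tree.
    • Typical values lie in (0.5, 0.8)
  • colsample_bytree [default=1] [range: (0,1]]
    • It controls the fraction of features (variables) supplied to a tree.
    • Typical values lie in (0.5, 0.9)
  2. Learning Task parameters:
  • objective [default=reg:linear]
    • objective specifies the task the model has to solve
    • reg:linear - linear regression (named reg:squarederror in newer xgboost releases)
    • binary:logistic - logistic regression for binary classification
    • multi:softmax - multiclass classification with the softmax objective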
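
To make min_child_weight less abstract: for the binary:logistic objective, the second-order derivative of the log loss with respect to the raw score is p(1 - p), where p is the predicted probability, so the "sum of instance weight" in a leaf is the sum of p(1 - p) over its observations. The base R sketch below uses made-up probabilities and a hypothetical helper name; it only illustrates the quantity and is not xgboost's actual implementation.

```r
# Second derivative (hessian) of the logistic loss for predicted probability p
logistic_hessian <- function(p) p * (1 - p)

# Hypothetical helper: does a leaf with these predictions satisfy min_child_weight?
leaf_is_allowed <- function(p, min_child_weight = 1) {
  sum(logistic_hessian(p)) >= min_child_weight
}

uncertain_leaf <- rep(0.5, 8)    # 8 observations, each contributes 0.25
confident_leaf <- rep(0.99, 8)   # 8 observations, each contributes 0.0099

cat("hessian sum, uncertain leaf:", sum(logistic_hessian(uncertain_leaf)), "\n")
cat("hessian sum, confident leaf:", sum(logistic_hessian(confident_leaf)), "\n")
cat("allowed:", leaf_is_allowed(uncertain_leaf), leaf_is_allowed(confident_leaf), "\n")
```

A leaf whose predictions are already confident carries little weight, which is why a larger min_child_weight makes the trees more conservative.
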
> tr_labels <- d_train$income_level
> ts_labels <- d_test$income_level
> new_tr <- model.matrix(~.+0,data = d_train[,-c("income_level"),with=F])
> new_ts <- model.matrix(~.+0,data = d_test[,-c("income_level"),with=F])
> tr_labels <- as.numeric(tr_labels)-1
> ts_labels <- as.numeric(ts_labels)-1
> dtrain <- xgb.DMatrix(data = new_tr,label = tr_labels)
> dtest <- xgb.DMatrix(data = new_ts,label= ts_labels)
> params <- list(booster = "gbtree",
+ objective = "binary:logistic",
+ eta=0.3, gamma=0, max_depth=6,
+ min_child_weight=1, subsample=1,
+ colsample_bytree=1)
> xgbcv <- xgb.cv( params = params,
+ data = dtrain, nrounds = 100,
+ nfold = 5, showsd = T,
+ stratified = T, print.every.n = 10,
+ early.stop.round = 20, maximize = F)
[1] train-error:0.049454+0.000231 test-error:0.050275+0.001244
Multiple eval metrics are present. Will use test_error for early stopping.
Will train until test_error hasn't improved in 20 rounds.
[11] train-error:0.045343+0.000408 test-error:0.046691+0.001152
[21] train-error:0.042996+0.000356 test-error:0.045323+0.001094
[31] train-error:0.041548+0.000311 test-error:0.044180+0.000912
[41] train-error:0.040261+0.000405 test-error:0.043739+0.000868
[51] train-error:0.039582+0.000303 test-error:0.043514+0.000995
[61] train-error:0.038914+0.000295 test-error:0.043308+0.000788
[71] train-error:0.038361+0.000195 test-error:0.043088+0.000858
[81] train-error:0.037948+0.000252 test-error:0.043003+0.000837
[91] train-error:0.037500+0.000189 test-error:0.042937+0.000921
[100] train-error:0.037144+0.000316 test-error:0.043063+0.001010
Warning messages:
1: 'print.every.n' is deprecated.
Use 'print_every_n' instead.
See help("Deprecated") and help("xgboost-deprecated").
2: 'early.stop.round' is deprecated.
Use 'early_stopping_rounds' instead.
See help("Deprecated") and help("xgboost-deprecated").
> xgb1 <- xgb.train (params = params,
+ data = dtrain, nrounds = 100,
+ watchlist = list(val=dtest,train=dtrain),
+ print.every.n = 10,
+ early.stop.round = 10,
+ maximize = F , eval_metric = "error")
[1] val-error:0.049758 train-error:0.049714
Multiple eval metrics are present. Will use train_error for early stopping.
Will train until train_error hasn't improved in 10 rounds.
[11] val-error:0.046511 train-error:0.045644
[21] val-error:0.044937 train-error:0.042993
[31] val-error:0.044396 train-error:0.041504
[41] val-error:0.043915 train-error:0.040777
[51] val-error:0.044205 train-error:0.039835
[61] val-error:0.044486 train-error:0.038888
[71] val-error:0.044917 train-error:0.038467
[81] val-error:0.045007 train-error:0.038166
[91] val-error:0.044797 train-error:0.037890
[100] val-error:0.044917 train-error:0.037665
Warning messages:
1: 'print.every.n' is deprecated.
Use 'print_every_n' instead.
See help("Deprecated") and help("xgboost-deprecated").
2: 'early.stop.round' is deprecated.
Use 'early_stopping_rounds' instead.
See help("Deprecated") and help("xgboost-deprecated").
> xgbpred <- predict (xgb1,dtest)
> xgbpred <- ifelse (xgbpred > 0.5,1,0)
> library(caret)
> confusionMatrix(xgbpred, ts_labels)
Confusion Matrix and Statistics
          Reference
Prediction     0     1
         0 92366  3271
         1  1210  2915
               Accuracy : 0.9551
                 95% CI : (0.9538, 0.9564)
    No Information Rate : 0.938
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.5427
 Mcnemar's Test P-Value : < 2.2e-16
            Sensitivity : 0.9871
            Specificity : 0.4712
         Pos Pred Value : 0.9658
         Neg Pred Value : 0.7067
             Prevalence : 0.9380
         Detection Rate : 0.9259
   Detection Prevalence : 0.9587
      Balanced Accuracy : 0.7291
       'Positive' Class : 0
> mat <- xgb.importance (feature_names = colnames(new_tr),model = xgb1)
> xgb.plot.importance (importance_matrix = mat[1:20])

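In fact, even the default parameters give surprisingly good accuracy, so xgboost's popularity in the Kaggle community is well earned. After training the model (about 4 minutes) and predicting, the confusion matrix from confusionMatrix(xgbpred, ts_labels) shows an accuracy of 95.51%, but that is not the point. Our goal is to improve predictions for the minority class, people earning more than 50k. Specificity reaches 47.12%, which is 11.7% higher than the earlier naive Bayes model. That is still far from perfect, but acceptable. On accuracy and on every other metric, xgboost beats the previous naive Bayes model across the board. Is there still room for improvement?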
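
The figures reported by confusionMatrix can be reproduced by hand from the four cell counts, which makes it clear which class each metric refers to. The sketch below copies the counts from the output above; the variable names are my own, and class 0 (income below 50k) is the positive class, as stated in the caret output.

```r
# Cell counts from the confusion matrix (rows = prediction, columns = reference)
tp <- 92366  # predicted 0, actual 0
fp <- 3271   # predicted 0, actual 1
fn <- 1210   # predicted 1, actual 0
tn <- 2915   # predicted 1, actual 1

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)   # share of actual 0 predicted as 0
specificity <- tn / (tn + fp)   # share of actual 1 (minority) predicted as 1
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced_accuracy), 4))
```

Specificity is the number to watch here: it measures how many of the high-income cases the model actually finds.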


2. xgboost in mlr

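In 2016 the mlr package (short for machine learning in R) came to wide attention among R users. mlr is an R package built specifically for machine-learning problems; before it, R had little that was comparable to scikit-learn as a unified toolkit. mlr provides its own framework for solving machine-learning problems in R, covering classification, clustering, regression, survival analysis, and more, and it also offers fairly complete solutions for related topics such as hyperparameter tuning, analysis of prediction results, and data preprocessing. If scikit-learn is one strong library among many in Python, mlr stands out on its own in R.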

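If this has made you curious about mlr, try it out in RStudio. mlr also has a dedicated tutorial site, Machine Learning in R: mlr Tutorial, where you can pick an example and run it. The project is hosted on GitHub. Because mlr is not yet widely adopted and its official documentation is still being improved, there are not many resources about it on Google; if you run into problems, the most effective approach is to open an issue in the project or search the existing issues for an answer.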

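Solving a machine-learning problem with mlr in R comes down to four steps:

  • Create a Task: load the dataset and create a task, which can be classification, regression, clustering, and so on
  • Make a Learner: build the model; this step involves setting and tuning parameters
  • Fit the model: train the learner on the task
  • Make predictions: predict on new data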
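
The four steps can be sketched on the built-in iris data. This is only an illustration under assumptions: it uses the classif.rpart learner rather than xgboost so that it runs quickly, it evaluates on the training data for brevity, and it is skipped when the mlr package is not installed.

```r
if (requireNamespace("mlr", quietly = TRUE)) {
  # 1. Create a task: a data set plus the name of the target column
  task <- mlr::makeClassifTask(data = iris, target = "Species")
  # 2. Make a learner (a decision tree here, chosen only to keep the example small)
  learner <- mlr::makeLearner("classif.rpart")
  # 3. Fit the model; the explicit mlr:: prefix avoids any clash with caret::train
  model <- mlr::train(learner, task)
  # 4. Make predictions and measure accuracy
  pred <- predict(model, task = task)
  print(mlr::performance(pred, measures = mlr::acc))
} else {
  message("mlr is not installed; skipping the example")
}
```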

In R, variables can be nominal, ordinal, or continuous. Categorical (nominal) and ordered categorical (ordinal) variables are called factors in R.

For more on R data structures, see this article.

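Note that mlr has requirements on the data format: its task functions do not accept character variables, so before creating a task you must convert every character variable to a factor. The xgboost author explains the reasoning as follows: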

All these things are possible pre-processors, which can be a model that wraps xgboost, when before doing train/predict, run the pre-processing and feed processed data to xgboost. So it is not hard.This is also reason why I do not explicit support factor in the tree construction algorithm. There could be many ways doing so, and in all the ways, having an algorithm optimized for sparse matrices is efficient for taking the processed data. Normal tree growing algorithm only support dense numerical features, and have to support one-hot encoding factor explicitly for computation efficiency reason.

xgboost in mlr does not seem to need much preprocessing. This is what the xgboost author said when replying to an issue:

“….. xgboost treat every input feature as numerical, with support for missing values and sparsity. The decision is at the user So if you want ordered variables, you can transform the variables into numerical levels(say age). Or if you prefer treat it as categorical variable, do one hot encoding.”

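In other words, xgboost treats every feature as numeric and also supports missing values and sparse data; which preprocessing to apply is up to the user. Unlike before, preprocessing this time consists only of converting character variables to factors and feeding them to mlr, which then goes straight to creating the task and making the learner. Simple and blunt.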
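
As a small illustration of the two options in the quote, the base R sketch below first converts character columns to factors and then encodes one categorical column both ways. The toy data frame and its column names are made up for this example.

```r
toy <- data.frame(
  age = c(25, 40, 33),
  education = c("high", "low", "mid"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor (what mlr expects before creating a task)
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], factor)

# Option 1: treat the variable as ordered and map its levels to integers
edu_levels <- factor(toy$education, levels = c("low", "mid", "high"))
edu_numeric <- as.integer(edu_levels)

# Option 2: treat it as categorical and one-hot encode it (no intercept, hence "+ 0")
edu_onehot <- model.matrix(~ education + 0, data = toy)

print(edu_numeric)
print(edu_onehot)
```

The one-hot variant is the same idea as the model.matrix(~.+0, ...) call used in the first section, applied to a single column.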

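Data preprocessing

Next we use the mlr package to improve the model's predictions further. As usual, we start by loading the packages and the data.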

# Load packages and data
> library(data.table)
data.table 1.9.8
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> library(xgboost)
Warning message:
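package ‘xgboost’ was built under R version 3.3.3
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
Attaching package: ‘caret’
> library(mlr)
Loading required package: ParamHelpers
Warning messages:
1: package ‘mlr’ was built under R version 3.3.3
2: package ‘ParamHelpers’ was built under R version 3.3.3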
The following object is masked from ‘package:caret’:
train
> train <- fread("E:/R/imbalancedata/train.csv",na.string=c(""," ","?","NA",NA))
> test <- fread("E:/R/imbalancedata/test.csv",na.string=c(""," ","?","NA",NA))
> setDT(train)
> setDT(test)

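Pay attention to the order in which mlr and caret are loaded: caret should be loaded before mlr. Both packages export a function named train, and if caret is attached last, a call to train() resolves to caret's version instead of mlr's, which leads to the following error: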

Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors

Make sure the following message appears when mlr is loaded:

The following object is masked from ‘package:caret’:
train
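
The masking behaviour behind this error can be reproduced without either package. The sketch below attaches two stand-in environments, named fake_mlr and fake_caret purely for illustration, that each define a function called train; whichever is attached last wins, and an explicit lookup sidesteps the ambiguity. With the real packages, the equivalent of the explicit lookup is the namespace-qualified call mlr::train(...).

```r
# Two stand-ins for packages that both export a function called train (hypothetical)
attach(list(train = function(...) "mlr train"), name = "fake_mlr")
attach(list(train = function(...) "caret train"), name = "fake_caret")

# The environment attached last sits earlier on the search path and masks the other
masked_call <- train()

# Looking the function up in a specific environment avoids the ambiguity
explicit_call <- get("train", pos = "fake_mlr")()

print(masked_call)
print(explicit_call)

# Clean up the search path
detach("fake_caret")
detach("fake_mlr")
```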

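Parameter tuning and model training

Speed has always been an issue when training models in R, one reason being that R runs on a single thread by default. Since version 2.14, however, R ships with the parallel package, which strengthens its parallel computing abilities. parallel makes it easy to run computations on a cluster, and it also works on a single machine with multiple CPU cores. My machine has a 4-core i5-6600K CPU and 8 GB of RAM; even on this mid-range configuration it takes about an hour at full load to find the best parameters.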
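
As a rough sketch of what the parallel package offers on a single multi-core machine, the example below spreads a toy computation over two worker processes. The worker count and the toy function are arbitrary choices for illustration; the tuning run later in this post starts its workers through parallelMap instead.

```r
library(parallel)

# Number of cores R can see on this machine
n_cores <- detectCores()
cat("detected cores:", n_cores, "\n")

# Start a small socket cluster (2 workers, an arbitrary choice for the demo)
cl <- makeCluster(2)

# Square each number on the workers; parSapply mirrors sapply
squares <- parSapply(cl, 1:8, function(x) x^2)

# Always shut the workers down when finished
stopCluster(cl)

print(squares)
```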

> char_cols <- colnames(train)[sapply(train,is.character)]
> for(i in char_cols) set(train,j=i,value = factor(train[[i]]))
> for(i in char_cols) set(test,j=i,value = factor(test[[i]]))
> train_task <- makeClassifTask(data = train, target = "income_level")
Warning in makeTask(type = type, data = data, weights = weights, blocking = blocking, :
Provided data is not a pure data.frame but from class data.table, hence it will be converted.
> test_task <- makeClassifTask(data = test, target = "income_level")
Warning in makeTask(type = type, data = data, weights = weights, blocking = blocking, :
Provided data is not a pure data.frame but from class data.table, hence it will be converted.
> train_task <- createDummyFeatures(obj = train_task)
> test_task <- createDummyFeatures(obj = test_task)
> set.seed(2002)
> xgb_learner <- makeLearner("classif.xgboost",predict.type = "response")
> xgb_learner$par.vals <- list(
+ objective = "binary:logistic",
+ eval_metric = "error",
+ nrounds = 150,
+ print.every.n = 50)
> xg_ps <- makeParamSet(
+ makeIntegerParam("max_depth",lower=3,upper=10),
+ makeNumericParam("lambda",lower=0.05,upper=0.5),
+ makeNumericParam("eta", lower = 0.01, upper = 0.5),
+ makeNumericParam("subsample", lower = 0.50, upper = 1),
+ makeNumericParam("min_child_weight",lower=2,upper=10),
+ makeNumericParam("colsample_bytree",lower = 0.50,upper = 0.80))
> rancontrol <- makeTuneControlRandom(maxit = 5L) #do 5 iterations
> set_cv <- makeResampleDesc("CV",iters = 5L,stratify = TRUE)
> library(parallel)
> library(parallelMap)
> parallelStartSocket(cpus = detectCores())
Starting parallelization in mode=socket with cpus=4.
> xgb_tune <- tuneParams(learner = xgb_learner, task = train_task, resampling = set_cv,
+ measures = list(acc,tpr,tnr,fpr,fp,fn), par.set = xg_ps,
+ control = rancontrol)
[Tune] Started tuning learner classif.xgboost for parameter set:
                    Type len Def      Constr Req Tunable Trafo
max_depth        integer   -   -     3 to 10   -    TRUE     -
lambda           numeric   -   - 0.05 to 0.5   -    TRUE     -
eta              numeric   -   - 0.01 to 0.5   -    TRUE     -
subsample        numeric   -   -    0.5 to 1   -    TRUE     -
min_child_weight numeric   -   -     2 to 10   -    TRUE     -
colsample_bytree numeric   -   -  0.5 to 0.8   -    TRUE     -
With control class: TuneControlRandom
Imputation value: -0
Imputation value: -0
Imputation value: -0
Imputation value: 1
Imputation value: Inf
Imputation value: Inf
Exporting objects to slaves for mode socket: .mlr.slave.options
Mapping in parallel: mode = socket; cpus = 4; elements = 5.
[Tune] Result: max_depth=5; lambda=0.171; eta=0.295; subsample=0.855; min_child_weight=5.54; colsample_bytree=0.735 : acc.test.mean=0.958,tpr.test.mean=0.989,tnr.test.mean=0.482,fpr.test.mean=0.518,fp.test.mean=1.28e+03,fn.test.mean= 413
> xgb_tune$y
acc.test.mean tpr.test.mean tnr.test.mean fpr.test.mean fp.test.mean fn.test.mean
0.9575036 0.9889762 0.4818275 0.5181725 1283.2000000 412.6000000
> xgb_tune$x
$max_depth
[1] 5
$lambda
[1] 0.1711398
$eta
[1] 0.295421
$subsample
[1] 0.8545802
$min_child_weight
[1] 5.541689
$colsample_bytree
[1] 0.7345529
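
Two ingredients of the tuning call above are easy to picture in base R: stratified cross-validation, which keeps the share of the minority class the same in every fold, and random search, which draws each candidate configuration at random from the given ranges. The sketch below is a simplified stand-in written for this post, not mlr's implementation; the helper name and the synthetic labels are assumptions.

```r
set.seed(2002)

# Synthetic imbalanced labels: 940 majority and 60 minority cases
y <- factor(c(rep("-50000", 940), rep("+50000", 60)))

# Hypothetical helper: assign fold ids 1..k separately within each class
stratified_folds <- function(y, k = 5) {
  folds <- integer(length(y))
  for (cls in levels(y)) {
    idx <- which(y == cls)
    folds[idx] <- sample(rep_len(1:k, length(idx)))
  }
  folds
}

folds <- stratified_folds(y, k = 5)
# Every fold holds 188 majority and 12 minority cases
print(table(folds, y))

# Random search: draw 5 candidate configurations from the tuning ranges
candidates <- data.frame(
  max_depth = sample(3:10, 5, replace = TRUE),
  eta = runif(5, min = 0.01, max = 0.5),
  subsample = runif(5, min = 0.5, max = 1)
)
print(candidates)
```

With only 5 random draws (maxit = 5L in the log), the search covers the ranges very thinly, so the reported optimum is best read as "the best of five guesses".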

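xgb_tune$x shows the best parameters found by tuning. We set them on a new learner, xgb_new, and train it, at which point the error mentioned earlier appears: unique() applies only to vectors (the code in the GitHub repository has already been corrected). At first I had no idea where the problem was. The execution log below shows me repeatedly reassigning the learner, retraining, and recreating the tasks (I had previously seen R fail on the first run of a piece of code and succeed on the second). After a dozen or so attempts I found an issue about this error message on GitHub: caret and mlr had been loaded in the wrong order. I then removed both packages with detach("package:caret") and detach("package:mlr"), loaded caret first and mlr second, reassigned the learner, and training finally succeeded.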

> xgb_new <- setHyperPars(learner = xgb_learner, par.vals = xgb_tune$x)
> xgb_model <- train(learner = xgb_new, task = train_task)
Error in train.default(learner = xgb_new, task = train_task) :
argument "y" is missing, with no default
> xgb_new <- setHyperPars(learner = xgb_learner, par.vals = xgb_tune$x)
> xgb_model <- train(xgb_new, train_task)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
> rm(xgb_new)
> xgb_new <- setHyperPars(learner = xgb_learner, par.vals = xgb_tune$x)
> xgb_model <- train(xgb_new, train_task)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
> xgb_tune$y
acc.test.mean tpr.test.mean tnr.test.mean fpr.test.mean fp.test.mean fn.test.mean
0.9575036 0.9889762 0.4818275 0.5181725 1283.2000000 412.6000000
> xgb_new <- setHyperPars(xgb_learner, par.vals = xgb_tune$x)
> xgb_model <- train(learner = xgb_new, task = train_task)
Error in train.default(learner = xgb_new, task = train_task) :
argument "y" is missing, with no default
> xgb_model <- train(xgb_new, train_task)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
> xgb_tune$y
acc.test.mean tpr.test.mean tnr.test.mean fpr.test.mean fp.test.mean fn.test.mean
0.9575036 0.9889762 0.4818275 0.5181725 1283.2000000 412.6000000
> xgb_new <- setHyperPars(learner = xgb_learner, par.vals = xgb_tune$x)
> xgmodel <- train(xgb_new, train_task)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
> xgb_tune
Tune result:
Op. pars: max_depth=5; lambda=0.171; eta=0.295; subsample=0.855; min_child_weight=5.54; colsample_bytree=0.735
acc.test.mean=0.958,tpr.test.mean=0.989,tnr.test.mean=0.482,fpr.test.mean=0.518,fp.test.mean=1.28e+03,fn.test.mean= 413
> xgb_new <- setHyperPars(learner = xgb_learner, par.vals = xgb_tune$x)
> xgmodel <- train(xgb_new, train_task)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
> xgb_new <- setHyperPars(makeLearner("classif.xgboost"), par.vals = xgb_tune$x)
> xgmodel <- train(xgb_new, train_task)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
> xgb_new
Learner classif.xgboost from package xgboost
Type: classif
Name: eXtreme Gradient Boosting; Short name: xgboost
Class: classif.xgboost
Properties: twoclass,multiclass,numerics,prob,weights,missings,featimp
Predict-Type: response
Hyperparameters: nrounds=1,verbose=0,max_depth=5,lambda=0.171,eta=0.295,subsample=0.855,min_child_weight=5.54,colsample_bytree=0.735
> xgmodel = train(xgb_new, train_task)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
> xgb_learner <- makeLearner("classif.xgboost",predict.type = "response")
> xgb_learner$par.vals <- list(
+ objective = "binary:logistic",
+ eval_metric = "error",
+ nrounds = 150,
+ print.every.n = 50)
> xg_ps <- makeParamSet(
+ makeIntegerParam("max_depth",lower=3,upper=10),
+ makeNumericParam("lambda",lower=0.05,upper=0.5),
+ makeNumericParam("eta", lower = 0.01, upper = 0.5),
+ makeNumericParam("subsample", lower = 0.50, upper = 1),
+ makeNumericParam("min_child_weight",lower=2,upper=10),
+ makeNumericParam("colsample_bytree",lower = 0.50,upper = 0.80))
> xgmodel = train(xgb_new, train_task)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
> train_task <- makeClassifTask(data = train, target = "income_level")
Warning in makeTask(type = type, data = data, weights = weights, blocking = blocking, :
Provided data is not a pure data.frame but from class data.table, hence it will be converted.
> test_task <- makeClassifTask(data = test, target = "income_level")
Warning in makeTask(type = type, data = data, weights = weights, blocking = blocking, :
Provided data is not a pure data.frame but from class data.table, hence it will be converted.
> train_task <- createDummyFeatures(obj = train_task)
> test_task <- createDummyFeatures(obj = test_task)
> xgmodel = train(xgb_new, train_task)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
> xgb_new <- setHyperPars(xgb_learner, par.vals = xgb_tune$x)
> xgmodel = train(xgb_new, train_task)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
> library(caret)
> xgmodel = train(xgb_new, train_task)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
> xgmodel = caret::train(xgb_new, train_task)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
> library(caret)
> library(data.table)
> library(mlr)
> library(xgboost)
> xgmodel = train(xgb_new, train_task)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
> (packages())
Error: could not find function "packages"
> (.packages())
[1] "randomForest" "parallelMap" "parallel" "caret" "ggplot2" "lattice" "xgboost"
[8] "mlr" "ParamHelpers" "data.table" "stats" "graphics" "grDevices" "utils"
[15] "datasets" "methods" "base"
> detach("package:caret")
> detach("package:mlr")
> (.packages())
[1] "randomForest" "parallelMap" "parallel" "ggplot2" "lattice" "xgboost" "ParamHelpers"
[8] "data.table" "stats" "graphics" "grDevices" "utils" "datasets" "methods"
[15] "base"
> library(caret)
> library(mlr)
Attaching package: ‘mlr’
The following object is masked from ‘package:caret’:
train
Warning message:
package ‘mlr’ was built under R version 3.3.3
> xgmodel = train(xgb_new, train_task)
[1] train-error:0.050866
[51] train-error:0.041344
[101] train-error:0.039279
[150] train-error:0.037895
Warning message:
'print.every.n' is deprecated.
Use 'print_every_n' instead.
See help("Deprecated") and help("xgboost-deprecated").

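After all that effort the model was finally trained, victory seemed close, and we could move on to prediction. Then, out of nowhere, comparing the test set's income_level with xgb_prediction threw another error: The data contain levels not found in the data. Inspecting xgb_prediction and test$income_level shows that the two use different labels: the predictions in xgb_prediction take the values -50000 and +50000, while the target variable test$income_level has the levels -50000 and 50000+., and factors with different labels cannot be compared.

The fix is simple: run test[, income_level := ifelse(income_level == "-50000", "-50000", "+50000")] to replace 50000+. with +50000.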
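
The same mismatch and its fix can be reproduced on a few made-up values in base R. The sketch relabels the test-set values so that both factors share the levels -50000 and +50000, after which they can be cross-tabulated; the vectors are invented for illustration.

```r
# Made-up predictions and true labels with mismatched level names
prediction <- factor(c("-50000", "+50000", "-50000", "-50000"),
                     levels = c("-50000", "+50000"))
truth_raw <- c("-50000", "50000+.", "-50000", "50000+.")

# Before the fix, the true labels contain a value the predictions do not know
print(setdiff(unique(truth_raw), levels(prediction)))

# Recode "50000+." to "+50000" and use the same level order as the predictions
truth <- factor(ifelse(truth_raw == "-50000", "-50000", "+50000"),
                levels = levels(prediction))

# With identical levels the two factors can be compared
print(identical(levels(truth), levels(prediction)))
print(table(prediction = prediction, truth = truth))
```

Note that ifelse returns a character vector, so the result is wrapped in factor() again before it is compared with the predictions.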

> predict_xgb <- predict(xgmodel, test_task)
> xgb_prediction <- predict_xgb$data$response
> confusionMatrix(test$income_level, xgb_prediction)
Error in confusionMatrix.default(test$income_level, xgb_prediction) :
The data contain levels not found in the data.
> xgb_prediction
[1] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[16] -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000
[31] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000
[46] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[61] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[76] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[91] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[106] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000
[121] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[136] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[151] -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000
[166] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[181] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[196] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[211] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[226] +50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[241] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[256] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000
[271] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[286] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000
[301] -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[316] -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[331] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[346] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000
[361] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[376] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 +50000 -50000 -50000 -50000 -50000
[391] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[406] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[421] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[436] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[451] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[466] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[481] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[496] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[511] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[526] -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[541] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[556] -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000
[571] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[586] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000
[601] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[616] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[631] -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[646] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[661] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000
[676] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[691] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[706] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000
[721] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[736] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[751] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000
[766] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[781] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[796] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000
[811] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[826] -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[841] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[856] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[871] -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000
[886] -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000
[901] +50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[916] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000
[931] +50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[946] -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[961] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[976] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[991] -50000 -50000 -50000 -50000 -50000 +50000 -50000 -50000 -50000 -50000
[ reached getOption("max.print") -- omitted 98762 entries ]
Levels: -50000 +50000
> confusionMatrix(test$income_level, xgb_prediction)
Error in confusionMatrix.default(test$income_level, xgb_prediction) :
The data contain levels not found in the data.
> confusionMatrix(xgb_prediction$data$response,xgb_prediction$data$truth)
Error in xgb_prediction$data : $ operator is invalid for atomic vectors
> xg_confused <- confusionMatrix(test$income_level,xgb_prediction)
Error in confusionMatrix.default(test$income_level, xgb_prediction) :
The data contain levels not found in the data.
> test$income_level
[1] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[14] -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000
[27] -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 50000+.
[40] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[53] -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000
[66] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[79] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[92] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000
[105] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[118] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000
[131] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[144] -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 50000+. -50000
[157] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[170] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[183] -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[196] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000
[209] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[222] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000
[235] -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[248] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[261] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[274] -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[287] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000
[300] -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000
[313] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000
[326] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[339] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[352] -50000 -50000 50000+. 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[365] -50000 -50000 -50000 50000+. -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000
[378] -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000
[391] 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[404] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[417] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[430] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[443] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[456] -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000
[469] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[482] -50000 -50000 50000+. -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000
[495] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[508] -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000
[521] -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+.
[534] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[547] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000
[560] -50000 -50000 50000+. -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000
[573] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[586] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[599] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000
[612] -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000
[625] -50000 -50000 50000+. -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000
[638] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[651] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000
[664] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000
[677] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[690] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[703] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000
[716] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000
[729] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[742] -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[755] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000
[768] -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. 50000+. -50000 -50000 -50000 -50000
[781] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000
[794] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+.
[807] -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000
[820] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[833] -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000
[846] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[859] -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+.
[872] -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000
[885] -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[898] -50000 -50000 50000+. 50000+. -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000
[911] -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000
[924] 50000+. -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 50000+. -50000 -50000 -50000
[937] -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000
[950] 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[963] -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000
[976] -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000
[989] -50000 -50000 -50000 -50000 -50000 -50000 -50000 50000+. -50000 -50000 -50000 -50000
[ reached getOption("max.print") -- omitted 98762 entries ]
Levels: -50000 50000+.
> test[,income_level:= ifelse(income_level == "-50000","-50000","+50000")]
> test$income_level
[1] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[12] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000"
[23] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[34] "+50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[45] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[56] "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[67] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[78] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[89] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[100] "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[111] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[122] "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[133] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[144] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000"
[155] "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[166] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[177] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000"
[188] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[199] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000"
[210] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[221] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[232] "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000"
[243] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[254] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[265] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[276] "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[287] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[298] "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000"
[309] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[320] "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[331] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[342] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[353] "-50000" "+50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[364] "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000"
[375] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000"
[386] "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[397] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[408] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[419] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[430] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[441] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[452] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000"
[463] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[474] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000"
[485] "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[496] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[507] "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[518] "-50000" "+50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[529] "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[540] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[551] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000"
[562] "+50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[573] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[584] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[595] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[606] "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[617] "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000"
[628] "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[639] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[650] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[661] "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[672] "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[683] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[694] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[705] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000"
[716] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000"
[727] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[738] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000"
[749] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[760] "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[771] "-50000" "-50000" "-50000" "-50000" "+50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[782] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000"
[793] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[804] "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000"
[815] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[826] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[837] "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[848] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[859] "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[870] "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000"
[881] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000"
[892] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "+50000" "-50000"
[903] "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000"
[914] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "+50000"
[925] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "+50000" "-50000" "-50000"
[936] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000"
[947] "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[958] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000"
[969] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "-50000"
[980] "-50000" "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000"
[991] "-50000" "-50000" "-50000" "-50000" "-50000" "+50000" "-50000" "-50000" "-50000" "-50000"
[ reached getOption("max.print") -- omitted 98762 entries ]
> xg_confused <- confusionMatrix(test$income_level,xgb_prediction)
> xg_confused
Confusion Matrix and Statistics

          Reference
Prediction -50000 +50000
    -50000  92699    877
    +50000   3433   2753

               Accuracy : 0.9568
                 95% CI : (0.9555, 0.9581)
    No Information Rate : 0.9636
    P-Value [Acc > NIR] : 1

                  Kappa : 0.5398
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.9643
            Specificity : 0.7584
         Pos Pred Value : 0.9906
         Neg Pred Value : 0.4450
             Prevalence : 0.9636
         Detection Rate : 0.9292
   Detection Prevalence : 0.9380
      Balanced Accuracy : 0.8613

       'Positive' Class : -50000

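Looking at the confusion matrix, the model reaches an accuracy of 95.68%, and the reported specificity, the figure for the minority +50000 class, is 75.84%, which at first glance is a very good result. One caveat applies when reading these numbers: caret's confusionMatrix() expects the predictions as its first argument and the true labels as its second, whereas the call above passes test$income_level first. The rows labelled "Prediction" therefore hold the true labels. Accuracy is unaffected by the swap, but 75.84% (2753 / 3630) is really the share of predicted +50000 cases that are correct, while the share of true +50000 cases the model recovers is 2753 / 6186 = 44.50%, the value printed as Neg Pred Value. With that, my tinkering with the UCI census data comes to a pause for now; if time allows, I will keep writing up what I learn from this dataset and from xgboost.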
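The per-class figures can be re-derived by hand from the four counts in the matrix. The sketch below uses base R only and is not part of the original analysis. It assumes, from the order of the arguments in the confusionMatrix() call, that the rows hold the true labels from test$income_level and the columns hold the xgboost predictions; the variable names (counts, recall, precision) are illustrative.

```r
# Counts copied from the confusion matrix printed above.
# Assumption: rows are the true labels (test$income_level, passed first),
# columns are the xgboost predictions (passed second).
counts <- matrix(c(92699,  877,
                    3433, 2753),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(truth     = c("-50000", "+50000"),
                                 predicted = c("-50000", "+50000")))

# Overall accuracy: correctly classified cases over all cases.
accuracy <- sum(diag(counts)) / sum(counts)

# Recall: share of each true class that the model labels correctly.
recall <- diag(counts) / rowSums(counts)

# Precision: share of each predicted class that is actually correct.
precision <- diag(counts) / colSums(counts)

print(round(accuracy, 4))
print(round(recall, 4))
print(round(precision, 4))
```

Under this reading, recall for the +50000 class is 2753 / 6186 and precision is 2753 / 3630, which correspond to the values caret prints as Neg Pred Value and Specificity. Calling confusionMatrix() with the predictions first and the true labels second would make caret's labels line up with these definitions directly.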

(End)

References

  1. Discussion of xgboost not accepting character variables: Factors #95
  2. Discussion of the "unique() applies only to vectors" error: Error in unique.default(x, nmax = nmax) : unique() applies only to vectors #1407; Error in makeParamSet when tuning hyperparameters of nnet #1418
  3. Introductory mlr tutorial: Machine Learning in R: mlr Tutorial
  4. Get Started with XGBoost in R
  5. Beginner's tutorial on xgboost and parameter tuning in R: Beginners Tutorial on XGBoost and Parameter Tuning in R
  6. Zhihu column: The mlr package, a powerful R package dedicated to machine learning
  7. mlr-tutorial: Imbalanced Classification Problems