R Code Using the caret Package for Various Boosting Models
The "caret" package brings many machine-learning methods together under a single interface, and overfitting can be controlled through trainControl. "caret" supports a variety of boosting techniques; here the example data are analyzed with the most widely used ones: AdaBoost, Gradient Boosting, and XGBoost. The example data set, "Universal Bank_Main", contains information on the customers of Universal Bank (source: Data Mining for Business Intelligence, Shmueli et al. 2010). There are 2,500 observations and 13 variables, and the target is Personal.Loan.
pacman::p_load("data.table", "dplyr", "caret")
UB <- fread(paste(getwd(), "Universal Bank_Main.csv", sep="/")) %>% # Load the data
  data.frame() %>% # Convert to data frame
  mutate(Personal.Loan = ifelse(Personal.Loan==1, "yes","no")) %>% # Recode target as character for classification
  select(-1) # Remove the ID variable
cols <- c("Family", "Education", "Personal.Loan", "Securities.Account",
          "CD.Account", "Online", "CreditCard")
UB <- UB %>%
  mutate_at(cols, as.factor) # Convert to factor (categorical) variables
glimpse(UB) # Data structure
Rows: 2,500
Columns: 13
$ Age <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6~
$ Experience <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5~
$ Income <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,~
$ ZIP.Code <int> 91107, 90089, 94720, 94112, 91330, 92121,~
$ Family <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,~
$ CCAvg <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0~
$ Education <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,~
$ Mortgage <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0~
$ Personal.Loan <fct> no, no, no, no, no, no, no, no, no, yes, ~
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
$ CD.Account <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Online <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,~
$ CreditCard <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,~
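The training data UB.trd and test data UB.ted used below are not created in the code shown above. A minimal sketch of one way to build them, assuming a stratified 70/30 split with caret::createDataPartition (the seed and split ratio are assumptions; they are only consistent with the 1,751 training and 749 test observations reported later, not taken from the original post):
pacman::p_load("caret")
set.seed(100) # Assumed seed; the original split code is not shown
trd.ind <- createDataPartition(UB$Personal.Loan, p = 0.7, list = FALSE) # Stratified sampling on the target
UB.trd <- UB[trd.ind, ] # Training data (about 1,751 rows)
UB.ted <- UB[-trd.ind, ] # Test data (about 749 rows)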
To find the optimal hyperparameters for AdaBoost, one of the most widely used boosting methods, a random search was performed first.
fitControl <- trainControl(method = "cv", number = 5, search = "random") # 5-fold cross-validation
set.seed(100) # Fix seed for cross-validation
caret.rd.ada <- train(Personal.Loan~., data = UB.trd,
                      method = "AdaBoost.M1", trControl = fitControl,
                      tuneLength = 5) # tuneLength (number of candidate parameter sets to evaluate)
caret.rd.ada
AdaBoost.M1
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
coeflearn maxdepth mfinal Accuracy Kappa
Breiman 7 70 0.9817338 0.8991055
Freund 2 86 0.9800130 0.8871317
Zhu 4 89 0.9817322 0.8972157
Zhu 6 23 0.9828702 0.9035620
Zhu 23 78 0.9823036 0.9001572
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were mfinal = 23, maxdepth =
6 and coeflearn = Zhu.
mfinal : number of boosting iterations
maxdepth : maximum depth of each tree
coeflearn : weight-update coefficient ("Breiman", "Freund", or "Zhu")
plot(caret.rd.ada) # Accuracy
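For reference, these are the tuning parameters of the adabag package, which caret calls for method = "AdaBoost.M1". A hypothetical sketch (not part of the original workflow) of fitting the same model directly with adabag, using the best values from the random search above:
pacman::p_load("adabag", "rpart")
ada.direct <- boosting(Personal.Loan ~ ., data = UB.trd, # Hypothetical direct call to adabag
                       mfinal = 23, # Number of boosting iterations
                       coeflearn = "Zhu", # Weight-update rule (SAMME)
                       control = rpart.control(maxdepth = 6)) # Maximum depth of each tree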
Accuracy is highest when maxdepth = 6, mfinal = 23, and coeflearn = "Zhu". Based on mfinal = 23, a range of candidate values is then supplied and the optimal hyperparameters are found with a grid search.
fitControl <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
customGrid <- expand.grid(mfinal = seq(22, 24, by = 1), # Search around the best parameter from the random search
                          maxdepth = 1, # Fix the maximum depth to 1 to build stumps
                          coeflearn = "Breiman") # Fix to the most commonly used "Breiman"
set.seed(100) # Fix seed for cross-validation
caret.gd.ada <- train(Personal.Loan~., data = UB.trd, method = "AdaBoost.M1",
trControl = fitControl, tuneGrid = customGrid)
caret.gd.ada
AdaBoost.M1
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
mfinal Accuracy Kappa
22 0.9371884 0.5745333
23 0.9469011 0.6593923
24 0.9457468 0.6272606
Tuning parameter 'maxdepth' was held constant at a value of 1
Tuning parameter 'coeflearn' was held constant at a value of Breiman
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were mfinal = 23, maxdepth =
1 and coeflearn = Breiman.
plot(caret.gd.ada) # Accuracy
Accuracy is highest when maxdepth = 1, mfinal = 23, and coeflearn = "Breiman".
# Variable importance
adaImp <- varImp(caret.gd.ada, scale = FALSE)
plot(adaImp)
caret.gd.ada$finalModel$weights # Weight (amount of information) assigned to each tree
[1] 1.08325539 0.89957656 0.63275242 0.75226832 0.48449782 0.33250481
[7] 0.16625240 0.24771483 0.17284836 0.18903596 0.19079791 0.21296646
[13] 0.10648323 0.14983887 0.31141150 0.30563367 0.21895195 0.05529500
[19] 0.13715590 0.18066928 0.12458244 0.07265042 0.23648112
# Predict the classes of the test data with the fitted model
caret.gd.ada.pred <- predict(caret.gd.ada, newdata = UB.ted) # predict(fitted AdaBoost model, test data)
confusionMatrix(caret.gd.ada.pred, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics
Reference
Prediction no yes
no 668 32
yes 5 44
Accuracy : 0.9506
95% CI : (0.9325, 0.965)
No Information Rate : 0.8985
P-Value [Acc > NIR] : 1.712e-07
Kappa : 0.6784
Mcnemar's Test P-Value : 1.917e-05
Sensitivity : 0.57895
Specificity : 0.99257
Pos Pred Value : 0.89796
Neg Pred Value : 0.95429
Prevalence : 0.10147
Detection Rate : 0.05874
Detection Prevalence : 0.06542
Balanced Accuracy : 0.78576
'Positive' Class : yes
pacman::p_load("pROC")
test.ada.prob <- predict(caret.gd.ada, newdata = UB.ted, type = "prob") # Predicted class probabilities on the test data from the model fitted to the training data
test.ada.prob <- test.ada.prob[,2] # Predicted probability of "yes"
ac <- UB.ted$Personal.Loan # Actual class
pp <- as.numeric(test.ada.prob) # Predicted probability of "yes"
ada.roc <- roc(ac, pp, plot = T, col = "red") # roc(actual class, predicted probability)
auc <- round(auc(ada.roc), 3) # AUC
legend("bottomright",legend = auc, bty = "n")
detach(package:pROC)
pacman::p_load("Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(pp, ac, plot = "ROC") # ROC(predicted probability, actual class)
detach(package:Epi)
pacman::p_load("ROCR")
ada.pred <- prediction(pp, ac) # prediction(predicted probability, actual class)
ada.perf <- performance(ada.pred, "tpr", "fpr") # performance(prediction object, "sensitivity", "1 - specificity")
plot(ada.perf, col = "red") # ROC Curve
abline(0,1, col = "black")
perf.auc <- performance(ada.pred, "auc") # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright",legend = auc,bty = "n")
ada.lift <- performance(ada.pred,"lift", "rpp") # Lift chart
plot(ada.lift, colorize = T, lwd = 2)
detach(package:ROCR)
pacman::p_load("lift")
ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes", 1, 0) # Convert the actual class to numeric
plotLift(pp, ac.numeric, cumulative = T, n.buckets = 24) # plotLift(predicted probability, actual class)
TopDecileLift(pp, ac.numeric) # Print the top-decile (top 10%) lift
[1] 5.913
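A top-decile lift of 5.913 means that, among the 10% of test customers with the highest predicted probability of "yes", the observed rate of loan acceptors is about 5.913 × 0.10147 ≈ 0.60, roughly six times the overall prevalence of 0.10147 reported in the confusion matrix above.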
detach(package:lift)
To find the optimal hyperparameters for Gradient Boosting, one of the most widely used boosting methods, a random search was performed first.
fitControl <- trainControl(method = "cv", number = 5, search = "random") # 5-fold cross-validation
set.seed(100) # Fix seed for cross-validation
caret.rd.gbm <- train(Personal.Loan~., data = UB.trd,
                      method = "gbm", trControl = fitControl,
                      tuneLength = 5, verbose=F) # tuneLength (number of candidate parameter sets to evaluate)
caret.rd.gbm
Stochastic Gradient Boosting
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
shrinkage interaction.depth n.minobsinnode n.trees Accuracy
0.1235627 6 22 282 0.9777306
0.2151574 6 16 2343 0.9817273
0.2163256 4 7 2419 0.9811559
0.4017440 7 15 2762 0.9703004
0.4144840 7 23 4062 0.9651608
Kappa
0.8749167
0.8980256
0.8951350
0.8432471
0.8159151
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were n.trees =
2343, interaction.depth = 6, shrinkage = 0.2151574
and n.minobsinnode = 16.
n.trees : number of boosting iterations
interaction.depth : maximum depth of each tree
shrinkage : learning rate
n.minobsinnode : minimum number of observations in a terminal node
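For reference, these are the tuning parameters of the gbm package, which caret calls for method = "gbm". A hypothetical sketch (not part of the original workflow) of fitting the same model directly with gbm, using the best values from the random search above; gbm's "bernoulli" loss expects a 0/1 numeric response, so the factor target is recoded first:
pacman::p_load("gbm", "dplyr")
UB.trd.gbm <- UB.trd %>%
  mutate(Personal.Loan = ifelse(Personal.Loan == "yes", 1, 0)) # Recode target to 0/1 for gbm
gbm.direct <- gbm(Personal.Loan ~ ., data = UB.trd.gbm, # Hypothetical direct call to gbm
                  distribution = "bernoulli",
                  n.trees = 2343, # Number of boosting iterations
                  interaction.depth = 6, # Maximum depth of each tree
                  shrinkage = 0.2151574, # Learning rate
                  n.minobsinnode = 16) # Minimum observations in a terminal node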
Accuracy is highest when n.trees = 2343, interaction.depth = 6, shrinkage = 0.2151574, and n.minobsinnode = 16. Based on these values, a range of candidate values is then supplied and the optimal hyperparameters are found with a grid search.
fitControl <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
customGrid <- expand.grid(n.trees = seq(2343, 2344, by = 1), # Search around the best parameters from the random search
                          interaction.depth = seq(6, 7, by = 1),
                          shrinkage = seq(0.21, 0.22, by = 0.01),
                          n.minobsinnode = seq(16, 17, by = 1))
set.seed(100) # Fix seed for cross-validation
caret.gd.gbm <- train(Personal.Loan~., data = UB.trd, method = "gbm",
trControl = fitControl, tuneGrid = customGrid, verbose=F)
caret.gd.gbm
Stochastic Gradient Boosting
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
shrinkage interaction.depth n.minobsinnode n.trees Accuracy
0.21 6 16 2343 0.9811543
0.21 6 16 2344 0.9811543
0.21 6 17 2343 0.9805861
0.21 6 17 2344 0.9805861
0.21 7 16 2343 0.9828702
0.21 7 16 2344 0.9828702
0.21 7 17 2343 0.9805845
0.21 7 17 2344 0.9805845
0.22 6 16 2343 0.9817273
0.22 6 16 2344 0.9817273
0.22 6 17 2343 0.9817289
0.22 6 17 2344 0.9817289
0.22 7 16 2343 0.9811575
0.22 7 16 2344 0.9817289
0.22 7 17 2343 0.9817257
0.22 7 17 2344 0.9817257
Kappa
0.8954993
0.8954993
0.8924529
0.8924529
0.9042869
0.9042869
0.8928468
0.8928468
0.8991119
0.8991119
0.8983321
0.8983321
0.8959374
0.8992264
0.8996204
0.8996204
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were n.trees =
2343, interaction.depth = 7, shrinkage = 0.21 and n.minobsinnode
= 16.
Accuracy is highest when n.trees = 2343, interaction.depth = 7, shrinkage = 0.21, and n.minobsinnode = 16; it increased slightly compared with the random-search result.
# Final model
caret.gd.gbm$finalModel
A gradient boosted model with bernoulli loss function.
2343 iterations were performed.
There were 15 predictors of which 15 had non-zero influence.
summary(caret.gd.gbm$finalModel, cBars = 10, las=2) # cBars: number of top variables to display
var rel.inf
Income Income 34.69765109
CCAvg CCAvg 16.09264392
Education2 Education2 13.99098001
Education3 Education3 11.81194074
Family3 Family3 7.77047223
Family4 Family4 7.70439846
CD.Account1 CD.Account1 3.15419520
ZIP.Code ZIP.Code 1.48241098
Age Age 0.95032101
Mortgage Mortgage 0.80482155
Experience Experience 0.55690568
Online1 Online1 0.48191241
CreditCard1 CreditCard1 0.28222266
Family2 Family2 0.14886186
Securities.Account1 Securities.Account1 0.07026221
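Note that 15 predictors are listed here rather than the original 12: train() was called with the formula interface, so the factor predictors are expanded into dummy variables (for example, Family becomes Family2, Family3, and Family4, and Education becomes Education2 and Education3), giving 15 columns in the model matrix.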
# Predict the classes of the test data with the fitted model
caret.gd.gbm.pred <- predict(caret.gd.gbm, newdata = UB.ted) # predict(fitted gbm model, test data)
confusionMatrix(caret.gd.gbm.pred, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics
Reference
Prediction no yes
no 669 12
yes 4 64
Accuracy : 0.9786
95% CI : (0.9655, 0.9877)
No Information Rate : 0.8985
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8771
Mcnemar's Test P-Value : 0.08012
Sensitivity : 0.84211
Specificity : 0.99406
Pos Pred Value : 0.94118
Neg Pred Value : 0.98238
Prevalence : 0.10147
Detection Rate : 0.08545
Detection Prevalence : 0.09079
Balanced Accuracy : 0.91808
'Positive' Class : yes
pacman::p_load("pROC")
test.gbm.prob <- predict(caret.gd.gbm, newdata = UB.ted, type = "prob") # Predicted class probabilities on the test data from the model fitted to the training data
test.gbm.prob <- test.gbm.prob[,2] # Predicted probability of "yes"
ac <- UB.ted$Personal.Loan # Actual class
pp <- as.numeric(test.gbm.prob) # Predicted probability of "yes"
gbm.roc <- roc(ac, pp, plot = T, col = "red") # roc(actual class, predicted probability)
auc <- round(auc(gbm.roc), 3) # AUC
legend("bottomright",legend = auc, bty = "n")
detach(package:pROC)
pacman::p_load("Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(pp, ac, plot = "ROC") # ROC(predicted probability, actual class)
detach(package:Epi)
pacman::p_load("ROCR")
gbm.pred <- prediction(pp, ac) # prediction(predicted probability, actual class)
gbm.perf <- performance(gbm.pred, "tpr", "fpr") # performance(prediction object, "sensitivity", "1 - specificity")
plot(gbm.perf, col="red") # ROC Curve
abline(0,1, col="black")
perf.auc <- performance(gbm.pred, "auc") # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright",legend = auc,bty = "n")
gbm.lift <- performance(gbm.pred,"lift", "rpp") # Lift chart
plot(gbm.lift, colorize = T, lwd = 2)
detach(package:ROCR)
pacman::p_load("lift")
ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes", 1, 0) # Convert the actual class to numeric
plotLift(pp, ac.numeric, cumulative = T, n.buckets = 24) # plotLift(predicted probability, actual class)
TopDecileLift(pp, ac.numeric) # Print the top-decile (top 10%) lift
[1] 8.804
detach(package:lift)
To find the optimal hyperparameters for XGBoost, an extension of Gradient Boosting, a random search was performed first.
fitControl <- trainControl(method = "cv", number = 5, search = "random") # 5-fold cross-validation
set.seed(100) # Fix seed for cross-validation
caret.rd.xgb <- train(Personal.Loan~., data = UB.trd,
                      method = "xgbTree", trControl = fitControl,
                      tuneLength = 5, # tuneLength (number of candidate parameter sets to evaluate)
                      lambda = 0) # Regularization parameter
caret.rd.xgb
eXtreme Gradient Boosting
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
eta max_depth gamma colsample_bytree min_child_weight
0.1235627 6 7.108038 0.6081206 18
0.2151574 6 5.383487 0.6527814 3
0.2163256 4 7.489722 0.5196387 3
0.3219509 6 1.714202 0.4953224 15
0.4144840 7 4.201015 0.4110895 19
subsample nrounds Accuracy Kappa
0.7220431 358 0.9371803 0.5812145
0.9921731 624 0.9811608 0.8921203
0.3477167 985 0.9697354 0.8184426
0.8988404 919 0.9623118 0.7734267
0.4979954 718 0.8989109 0.3717268
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were nrounds = 624, max_depth
= 6, eta = 0.2151574, gamma = 5.383487, colsample_bytree
= 0.6527814, min_child_weight = 3 and subsample = 0.9921731.
nrounds : number of boosting iterations
max_depth : maximum depth of each tree
eta : learning rate
gamma : minimum loss reduction required to make a split; the larger the value, the harder it is for a split to occur
colsample_bytree : proportion of predictors sampled when building each tree
min_child_weight : minimum sum of observation weights required in a leaf node
subsample : proportion of the data used to build each tree; 1 means the entire data set is used
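For reference, these are the tuning parameters of the xgboost package, which caret calls for method = "xgbTree". A hypothetical sketch (not part of the original workflow) of fitting the same model directly, using the best values from the random search above and assuming the classic xgboost() wrapper interface; xgboost requires a numeric predictor matrix and a 0/1 label:
pacman::p_load("xgboost")
X.trd <- model.matrix(Personal.Loan ~ . - 1, data = UB.trd) # Dummy-coded numeric predictor matrix
y.trd <- ifelse(UB.trd$Personal.Loan == "yes", 1, 0) # 0/1 label
xgb.direct <- xgboost(data = X.trd, label = y.trd, # Hypothetical direct call to xgboost
                      nrounds = 624, # Number of boosting iterations
                      params = list(objective = "binary:logistic",
                                    eta = 0.2151574, # Learning rate
                                    max_depth = 6,
                                    gamma = 5.383487,
                                    colsample_bytree = 0.6527814,
                                    min_child_weight = 3,
                                    subsample = 0.9921731),
                      verbose = 0)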
Accuracy is highest when nrounds = 624, max_depth = 6, eta = 0.2151574, gamma = 5.383487, colsample_bytree = 0.6527814, min_child_weight = 3, and subsample = 0.9921731. Based on these values, a range of candidate values is then supplied and the optimal hyperparameters are found with a grid search.
fitControl <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
customGrid <- expand.grid(nrounds = seq(624, 625, by = 1), # Search around the best parameters from the random search
                          max_depth = seq(6, 7, by = 1),
eta = seq(0.2, 0.3, by = 0.1),
gamma = seq(5.3, 5.4, by = 0.1),
colsample_bytree = seq(0.6, 0.7, by = 0.1),
min_child_weight = seq(3, 4, by = 1),
subsample = seq(0.9, 1, by = 0.1))
set.seed(100) # Fix seed for cross-validation
caret.gd.xgb <- train(Personal.Loan~., data = UB.trd, method = "xgbTree",
trControl = fitControl, tuneGrid = customGrid,
lambda = 0) # Regularization Parameter
caret.gd.xgb
eXtreme Gradient Boosting
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
eta max_depth gamma colsample_bytree min_child_weight
0.2 6 5.3 0.6 3
0.2 6 5.3 0.6 3
0.2 6 5.3 0.6 3
0.2 6 5.3 0.6 3
0.2 6 5.3 0.6 4
0.2 6 5.3 0.6 4
0.2 6 5.3 0.6 4
0.2 6 5.3 0.6 4
0.2 6 5.3 0.7 3
0.2 6 5.3 0.7 3
0.2 6 5.3 0.7 3
0.2 6 5.3 0.7 3
0.2 6 5.3 0.7 4
0.2 6 5.3 0.7 4
0.2 6 5.3 0.7 4
0.2 6 5.3 0.7 4
0.2 6 5.4 0.6 3
0.2 6 5.4 0.6 3
0.2 6 5.4 0.6 3
0.2 6 5.4 0.6 3
0.2 6 5.4 0.6 4
0.2 6 5.4 0.6 4
0.2 6 5.4 0.6 4
0.2 6 5.4 0.6 4
0.2 6 5.4 0.7 3
0.2 6 5.4 0.7 3
0.2 6 5.4 0.7 3
0.2 6 5.4 0.7 3
0.2 6 5.4 0.7 4
0.2 6 5.4 0.7 4
0.2 6 5.4 0.7 4
0.2 6 5.4 0.7 4
0.2 7 5.3 0.6 3
0.2 7 5.3 0.6 3
0.2 7 5.3 0.6 3
0.2 7 5.3 0.6 3
0.2 7 5.3 0.6 4
0.2 7 5.3 0.6 4
0.2 7 5.3 0.6 4
0.2 7 5.3 0.6 4
0.2 7 5.3 0.7 3
0.2 7 5.3 0.7 3
0.2 7 5.3 0.7 3
0.2 7 5.3 0.7 3
0.2 7 5.3 0.7 4
0.2 7 5.3 0.7 4
0.2 7 5.3 0.7 4
0.2 7 5.3 0.7 4
0.2 7 5.4 0.6 3
0.2 7 5.4 0.6 3
0.2 7 5.4 0.6 3
0.2 7 5.4 0.6 3
0.2 7 5.4 0.6 4
0.2 7 5.4 0.6 4
0.2 7 5.4 0.6 4
0.2 7 5.4 0.6 4
0.2 7 5.4 0.7 3
0.2 7 5.4 0.7 3
0.2 7 5.4 0.7 3
0.2 7 5.4 0.7 3
0.2 7 5.4 0.7 4
0.2 7 5.4 0.7 4
0.2 7 5.4 0.7 4
0.2 7 5.4 0.7 4
0.3 6 5.3 0.6 3
0.3 6 5.3 0.6 3
0.3 6 5.3 0.6 3
0.3 6 5.3 0.6 3
0.3 6 5.3 0.6 4
0.3 6 5.3 0.6 4
0.3 6 5.3 0.6 4
0.3 6 5.3 0.6 4
0.3 6 5.3 0.7 3
0.3 6 5.3 0.7 3
0.3 6 5.3 0.7 3
0.3 6 5.3 0.7 3
0.3 6 5.3 0.7 4
0.3 6 5.3 0.7 4
0.3 6 5.3 0.7 4
0.3 6 5.3 0.7 4
0.3 6 5.4 0.6 3
0.3 6 5.4 0.6 3
0.3 6 5.4 0.6 3
0.3 6 5.4 0.6 3
0.3 6 5.4 0.6 4
0.3 6 5.4 0.6 4
0.3 6 5.4 0.6 4
0.3 6 5.4 0.6 4
0.3 6 5.4 0.7 3
0.3 6 5.4 0.7 3
0.3 6 5.4 0.7 3
0.3 6 5.4 0.7 3
0.3 6 5.4 0.7 4
0.3 6 5.4 0.7 4
0.3 6 5.4 0.7 4
0.3 6 5.4 0.7 4
0.3 7 5.3 0.6 3
0.3 7 5.3 0.6 3
0.3 7 5.3 0.6 3
0.3 7 5.3 0.6 3
0.3 7 5.3 0.6 4
0.3 7 5.3 0.6 4
0.3 7 5.3 0.6 4
0.3 7 5.3 0.6 4
0.3 7 5.3 0.7 3
0.3 7 5.3 0.7 3
0.3 7 5.3 0.7 3
0.3 7 5.3 0.7 3
0.3 7 5.3 0.7 4
0.3 7 5.3 0.7 4
0.3 7 5.3 0.7 4
0.3 7 5.3 0.7 4
0.3 7 5.4 0.6 3
0.3 7 5.4 0.6 3
0.3 7 5.4 0.6 3
0.3 7 5.4 0.6 3
0.3 7 5.4 0.6 4
0.3 7 5.4 0.6 4
0.3 7 5.4 0.6 4
0.3 7 5.4 0.6 4
0.3 7 5.4 0.7 3
0.3 7 5.4 0.7 3
0.3 7 5.4 0.7 3
0.3 7 5.4 0.7 3
0.3 7 5.4 0.7 4
0.3 7 5.4 0.7 4
0.3 7 5.4 0.7 4
0.3 7 5.4 0.7 4
subsample nrounds Accuracy Kappa
0.9 624 0.9811624 0.8919530
0.9 625 0.9811624 0.8919530
1.0 624 0.9817289 0.8955944
1.0 625 0.9817289 0.8955944
0.9 624 0.9823020 0.8989565
0.9 625 0.9817306 0.8959285
1.0 624 0.9800179 0.8850897
1.0 625 0.9800179 0.8850897
0.9 624 0.9777338 0.8725072
0.9 625 0.9777338 0.8725072
1.0 624 0.9805893 0.8893606
1.0 625 0.9805893 0.8893606
0.9 624 0.9811608 0.8928304
0.9 625 0.9811608 0.8928304
1.0 624 0.9794432 0.8828790
1.0 625 0.9794432 0.8828790
0.9 624 0.9783036 0.8770100
0.9 625 0.9777322 0.8734481
1.0 624 0.9783053 0.8739334
1.0 625 0.9783053 0.8739334
0.9 624 0.9783069 0.8742904
0.9 625 0.9783069 0.8742904
1.0 624 0.9817338 0.8950015
1.0 625 0.9817338 0.8950015
0.9 624 0.9811624 0.8920329
0.9 625 0.9811624 0.8920329
1.0 624 0.9811575 0.8925663
1.0 625 0.9811575 0.8925663
0.9 624 0.9783053 0.8751519
0.9 625 0.9783053 0.8751519
1.0 624 0.9805877 0.8894179
1.0 625 0.9805877 0.8894179
0.9 624 0.9800147 0.8873730
0.9 625 0.9800147 0.8873730
1.0 624 0.9783085 0.8731261
1.0 625 0.9783085 0.8731261
0.9 624 0.9794449 0.8823109
0.9 625 0.9794449 0.8823109
1.0 624 0.9771640 0.8682078
1.0 625 0.9771640 0.8682078
0.9 624 0.9811591 0.8931233
0.9 625 0.9811591 0.8931233
1.0 624 0.9823053 0.8980542
1.0 625 0.9823053 0.8980542
0.9 624 0.9817322 0.8951524
0.9 625 0.9817322 0.8951524
1.0 624 0.9811608 0.8918731
1.0 625 0.9811608 0.8918731
0.9 624 0.9800163 0.8867696
0.9 625 0.9800163 0.8867696
1.0 624 0.9811591 0.8923666
1.0 625 0.9811591 0.8923666
0.9 624 0.9805893 0.8887474
0.9 625 0.9805893 0.8887474
1.0 624 0.9765910 0.8641748
1.0 625 0.9765910 0.8641748
0.9 624 0.9794481 0.8820096
0.9 625 0.9800179 0.8857224
1.0 624 0.9794481 0.8812937
1.0 625 0.9794481 0.8812937
0.9 624 0.9783053 0.8758413
0.9 625 0.9783053 0.8758413
1.0 624 0.9800179 0.8846535
1.0 625 0.9800179 0.8846535
0.9 624 0.9794481 0.8812038
0.9 625 0.9794481 0.8812038
1.0 624 0.9805893 0.8887474
1.0 625 0.9805893 0.8887474
0.9 624 0.9805893 0.8886546
0.9 625 0.9805893 0.8886546
1.0 624 0.9794465 0.8817076
1.0 625 0.9794465 0.8817076
0.9 624 0.9794449 0.8830880
0.9 625 0.9794449 0.8830880
1.0 624 0.9805893 0.8891734
1.0 625 0.9805893 0.8891734
0.9 624 0.9788734 0.8788344
0.9 625 0.9788734 0.8788344
1.0 624 0.9828702 0.9022731
1.0 625 0.9828702 0.9022731
0.9 624 0.9811591 0.8919249
0.9 625 0.9805877 0.8888818
1.0 624 0.9783036 0.8771763
1.0 625 0.9783036 0.8771763
0.9 624 0.9811591 0.8920659
0.9 625 0.9817306 0.8956209
1.0 624 0.9771624 0.8674133
1.0 625 0.9771624 0.8674133
0.9 624 0.9828734 0.9009698
0.9 625 0.9828734 0.9009698
1.0 624 0.9794449 0.8822959
1.0 625 0.9794449 0.8822959
0.9 624 0.9788751 0.8782051
0.9 625 0.9788751 0.8782051
1.0 624 0.9777306 0.8712897
1.0 625 0.9777306 0.8712897
0.9 624 0.9783036 0.8781373
0.9 625 0.9783036 0.8781373
1.0 624 0.9748751 0.8544771
1.0 625 0.9748751 0.8544771
0.9 624 0.9794497 0.8818147
0.9 625 0.9800212 0.8854641
1.0 624 0.9771559 0.8694640
1.0 625 0.9771559 0.8694640
0.9 624 0.9794449 0.8834449
0.9 625 0.9794449 0.8834449
1.0 624 0.9828751 0.9024489
1.0 625 0.9828751 0.9024489
0.9 624 0.9788734 0.8780400
0.9 625 0.9783036 0.8744209
1.0 624 0.9788767 0.8767860
1.0 625 0.9788767 0.8767860
0.9 624 0.9805910 0.8890289
0.9 625 0.9800195 0.8859762
1.0 624 0.9783036 0.8757023
1.0 625 0.9783036 0.8757023
0.9 624 0.9811608 0.8926284
0.9 625 0.9811608 0.8926284
1.0 624 0.9777354 0.8710367
1.0 625 0.9777354 0.8710367
0.9 624 0.9811608 0.8926377
0.9 625 0.9811608 0.8926377
1.0 624 0.9777338 0.8704176
1.0 625 0.9777338 0.8704176
0.9 624 0.9817338 0.8954059
0.9 625 0.9811624 0.8920340
1.0 624 0.9800179 0.8852626
1.0 625 0.9800179 0.8852626
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were nrounds = 624, max_depth
= 7, eta = 0.3, gamma = 5.3, colsample_bytree =
0.7, min_child_weight = 3 and subsample = 1.
Accuracy is highest when nrounds = 624, max_depth = 7, eta = 0.3, gamma = 5.3, colsample_bytree = 0.7, min_child_weight = 3, and subsample = 1.
# Variable importance
xgbImp <- varImp(caret.gd.xgb, scale = FALSE)
plot(xgbImp)
# Predict the classes of the test data with the fitted model
caret.gd.xgb.pred <- predict(caret.gd.xgb, newdata = UB.ted) # predict(fitted xgboost model, test data)
confusionMatrix(caret.gd.xgb.pred, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics
Reference
Prediction no yes
no 670 12
yes 3 64
Accuracy : 0.98
95% CI : (0.9672, 0.9887)
No Information Rate : 0.8985
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8841
Mcnemar's Test P-Value : 0.03887
Sensitivity : 0.84211
Specificity : 0.99554
Pos Pred Value : 0.95522
Neg Pred Value : 0.98240
Prevalence : 0.10147
Detection Rate : 0.08545
Detection Prevalence : 0.08945
Balanced Accuracy : 0.91882
'Positive' Class : yes
pacman::p_load("pROC")
test.xgb.prob <- predict(caret.gd.xgb, newdata = UB.ted, type = "prob") # Predicted class probabilities on the test data from the model fitted to the training data
test.xgb.prob <- test.xgb.prob[,2] # Predicted probability of "yes"
ac <- UB.ted$Personal.Loan # Actual class
pp <- as.numeric(test.xgb.prob) # Predicted probability of "yes"
xgb.roc <- roc(ac, pp, plot = T, col = "red") # roc(actual class, predicted probability)
auc <- round(auc(xgb.roc), 3) # AUC
legend("bottomright",legend = auc, bty = "n")
detach(package:pROC)
pacman::p_load("Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(pp, ac, plot = "ROC") # ROC(predicted probability, actual class)
detach(package:Epi)
pacman::p_load("ROCR")
xgb.pred <- prediction(pp, ac) # prediction(predicted probability, actual class)
xgb.perf <- performance(xgb.pred, "tpr", "fpr") # performance(prediction object, "sensitivity", "1 - specificity")
plot(xgb.perf, col = "red") # ROC Curve
abline(0,1, col = "black")
perf.auc <- performance(xgb.pred, "auc") # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright",legend = auc,bty = "n")
xgb.lift <- performance(xgb.pred,"lift", "rpp") # Lift chart
plot(xgb.lift, colorize = T, lwd = 2)
detach(package:ROCR)
pacman::p_load("lift")
ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes", 1, 0) # Convert the actual class to numeric
plotLift(pp, ac.numeric, cumulative = T, n.buckets = 24) # plotLift(predicted probability, actual class)
TopDecileLift(pp, ac.numeric) # Print the top-decile (top 10%) lift
[1] 8.935
detach(package:lift)
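As a possible follow-up that is not part of the original post, the cross-validated performance of the three tuned models could be compared directly with caret's resamples(), since all three were fit with the same seed and the same 5-fold setup:
pacman::p_load("caret")
cv.results <- resamples(list(AdaBoost = caret.gd.ada, # Collect the resampling results of the tuned models
                             GBM = caret.gd.gbm,
                             XGBoost = caret.gd.xgb))
summary(cv.results) # Accuracy and Kappa across the 5 folds
bwplot(cv.results) # Box plots of the resampling distributions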