Boosting based on Caret

Machine Learning

R Code Using the Caret Package for Various Boosting Models

Yeongeun Jeon, Jeongwook Lee, Jung In Seo
11-01-2020

The package "caret" bundles many machine learning methods into a single interface, and overfitting can be guarded against with trainControl. "caret" supports a variety of boosting-based techniques; among them, the three most widely used ones, AdaBoost, Gradient Boosting, and XGBoost, are applied here to an example dataset. The example data, "Universal Bank_Main", describe the customers of Universal Bank (source: Data Mining for Business Intelligence, Shmueli et al. 2010). There are 2,500 observations and 13 variables, and the target is Personal.Loan.
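Every model below follows the same two-step caret pattern: trainControl() fixes the resampling scheme, and train() fits one method over candidate tuning parameters, keeping the combination with the best resampled accuracy. A minimal, self-contained sketch of that pattern (the built-in iris data and the rpart method are illustrative stand-ins, not part of the analysis):

pacman::p_load("caret", "rpart")

fitControl <- trainControl(method = "cv", number = 5)   # 5-Fold Cross-Validation
set.seed(10)
fit <- train(Species ~ ., data = iris, method = "rpart",
             trControl = fitControl, tuneLength = 3)    # 3 candidate values of cp
fit$bestTune                                            # Parameter value chosen by CV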



1. Loading the Data

pacman::p_load("data.table", "dplyr") 

UB   <- fread(paste(getwd(),"Universal Bank_Main.csv", sep="/")) %>%   # Load the data
  data.frame() %>%                                                     # Convert to data frame
  mutate(Personal.Loan = ifelse(Personal.Loan==1, "yes","no")) %>%     # Character target for classification
  select(-1)                                                           # Drop the ID variable

cols <- c("Family", "Education", "Personal.Loan", "Securities.Account", 
          "CD.Account", "Online", "CreditCard")

UB   <- UB %>% 
  mutate_at(cols, as.factor)                                          # Convert to factors
 
glimpse(UB)                                                           # Data structure
Rows: 2,500
Columns: 13
$ Age                <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6~
$ Experience         <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5~
$ Income             <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,~
$ ZIP.Code           <int> 91107, 90089, 94720, 94112, 91330, 92121,~
$ Family             <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,~
$ CCAvg              <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0~
$ Education          <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,~
$ Mortgage           <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0~
$ Personal.Loan      <fct> no, no, no, no, no, no, no, no, no, yes, ~
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
$ CD.Account         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Online             <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,~
$ CreditCard         <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,~

2. Splitting the Data

pacman::p_load("caret")
# Partition (Training Data : Test Data = 7:3)
y      <- UB$Personal.Loan                       # Target

set.seed(200)
ind    <- createDataPartition(y, p=0.7, list=F)  # Sample 70% of the data for training
UB.trd <- UB[ind,]                               # Training Data

UB.ted <- UB[-ind,]                              # Test Data
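Since createDataPartition() samples within each level of y, the split is stratified, so the class ratio of Personal.Loan should be (approximately) preserved in both parts. A quick check:

prop.table(table(UB$Personal.Loan))        # Full data
prop.table(table(UB.trd$Personal.Loan))    # Training data
prop.table(table(UB.ted$Personal.Loan))    # Test data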

3. AdaBoost

3-1. Finding the Optimal Parameters

To find the optimal parameters of AdaBoost, the most widely used boosting method, a "Random Search" is performed first.

fitControl <- trainControl(method = "cv", number = 5, search = "random")    # 5-Fold Cross-Validation
set.seed(100)                                                          # Fix the seed for cross-validation
caret.rd.ada <- train(Personal.Loan~., data = UB.trd,
                      method = "AdaBoost.M1", trControl = fitControl,   
                      tuneLength = 5)                                  # tuneLength (number of candidate parameter sets to explore)

caret.rd.ada
AdaBoost.M1 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  coeflearn  maxdepth  mfinal  Accuracy   Kappa    
  Breiman     7        70      0.9817338  0.8991055
  Freund      2        86      0.9800130  0.8871317
  Zhu         4        89      0.9817322  0.8972157
  Zhu         6        23      0.9828702  0.9035620
  Zhu        23        78      0.9823036  0.9001572

Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were mfinal = 23, maxdepth =
 6 and coeflearn = Zhu.
plot(caret.rd.ada)                            # Accuracy 

Next, a grid search is performed around the best parameters chosen by the random search.

fitControl <- trainControl(method = "cv", number = 5)         # 5-Fold Cross-Validation

customGrid <- expand.grid(mfinal    = seq(22, 24, by = 1),    # Search around the best parameters from the random search
                          maxdepth  = 1,                      # Fix the maximum depth at "1" to grow stumps
                          coeflearn = "Breiman")              # Fix coeflearn at the widely used "Breiman"
set.seed(100)  # Fix the seed for cross-validation
caret.gd.ada <- train(Personal.Loan~., data = UB.trd, method = "AdaBoost.M1",  
                      trControl = fitControl, tuneGrid = customGrid)    

caret.gd.ada
AdaBoost.M1 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  mfinal  Accuracy   Kappa    
  22      0.9371884  0.5745333
  23      0.9469011  0.6593923
  24      0.9457468  0.6272606

Tuning parameter 'maxdepth' was held constant at a value of 1

Tuning parameter 'coeflearn' was held constant at a value of Breiman
Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were mfinal = 23, maxdepth =
 1 and coeflearn = Breiman.
plot(caret.gd.ada)            # Accuracy

3-1-1. Variable Importance

# Variable importance
adaImp <- varImp(caret.gd.ada, scale = FALSE)
plot(adaImp)

3-1-2. Weights

caret.gd.ada$finalModel$weights             # Weight (amount of information) of each tree
 [1] 1.08325539 0.89957656 0.63275242 0.75226832 0.48449782 0.33250481
 [7] 0.16625240 0.24771483 0.17284836 0.18903596 0.19079791 0.21296646
[13] 0.10648323 0.14983887 0.31141150 0.30563367 0.21895195 0.05529500
[19] 0.13715590 0.18066928 0.12458244 0.07265042 0.23648112
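For coeflearn = "Breiman", the adabag engine behind method = "AdaBoost.M1" assigns each stump a weight alpha = 0.5 * ln((1 - err)/err), so stumps with lower weighted training error carry more information in the final weighted majority vote. A small sketch for reading the vector above:

w <- caret.gd.ada$finalModel$weights
round(w / sum(w), 3)       # Relative share of each stump in the vote
which.max(w)               # The most influential stump (the first one here)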

3-2. Model Evaluation

# Predict the classes of the test data with the fitted model
caret.gd.ada.pred <- predict(caret.gd.ada, newdata = UB.ted)       # predict(AdaBoost model, Test Data)

3-2-1. ConfusionMatrix

confusionMatrix(caret.gd.ada.pred, UB.ted$Personal.Loan, positive = "yes")  # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  668  32
       yes   5  44
                                         
               Accuracy : 0.9506         
                 95% CI : (0.9325, 0.965)
    No Information Rate : 0.8985         
    P-Value [Acc > NIR] : 1.712e-07      
                                         
                  Kappa : 0.6784         
                                         
 Mcnemar's Test P-Value : 1.917e-05      
                                         
            Sensitivity : 0.57895        
            Specificity : 0.99257        
         Pos Pred Value : 0.89796        
         Neg Pred Value : 0.95429        
             Prevalence : 0.10147        
         Detection Rate : 0.05874        
   Detection Prevalence : 0.06542        
      Balanced Accuracy : 0.78576        
                                         
       'Positive' Class : yes            
                                         

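The headline statistics can be verified by hand from the 2x2 table above; a quick sanity check:

TP <- 44; FN <- 32; FP <- 5; TN <- 668
c(Accuracy    = (TP + TN)/(TP + TN + FP + FN),    # 0.9506
  Sensitivity = TP/(TP + FN),                     # 0.5789 (recall of "yes")
  Specificity = TN/(TN + FP))                     # 0.9926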

3-2-2. ROC Curve

1) Package “pROC”

pacman::p_load("pROC")                          

test.ada.prob <- predict(caret.gd.ada, newdata = UB.ted, type = "prob")  # Predicted probability of each class for the test data, from the model fitted on the training data
test.ada.prob <- test.ada.prob[,2]                                       # Predicted probability of "yes"


ac           <- UB.ted$Personal.Loan                                     # Actual class

pp           <- as.numeric(test.ada.prob)                                # Predicted probability of "yes"


ada.roc     <- roc(ac, pp, plot = T, col = "red")                        # roc(actual class, predicted probability)

auc <- round(auc(ada.roc), 3)                                            # AUC 
legend("bottomright",legend = auc, bty = "n")
detach(package:pROC)


2) Package “Epi”

pacman::p_load("Epi")                        
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")

ROC(pp, ac, plot="ROC")       # ROC(predicted probability, actual class)                                  
detach(package:Epi)


3) Package “ROCR”

pacman::p_load("ROCR")                      

ada.pred <- prediction(pp, ac)                      # prediction(predicted probability, actual class)

ada.perf <- performance(ada.pred, "tpr", "fpr")     # performance(prediction object, "TPR (sensitivity)", "FPR (1 - specificity)")                        
plot(ada.perf, col = "red")                         # ROC Curve
abline(0,1, col = "black")


perf.auc <- performance(ada.pred, "auc")            # AUC        

auc <- attributes(perf.auc)$y.values                  
legend("bottomright",legend = auc,bty = "n") 


3-2-3. Lift Chart

1) Package “ROCR”

ada.lift <- performance(ada.pred,"lift", "rpp")      # Lift chart
plot(ada.lift, colorize = T, lwd = 2)      
detach(package:ROCR)


2) Package “lift”

pacman::p_load("lift")

ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes",1,0)     # Convert the actual class to numeric

plotLift(pp, ac.numeric, cumulative = T, n.buckets = 24)  # plotLift(predicted probability, actual class)
TopDecileLift(pp, ac.numeric)                             # Print the top-decile (top 10%) lift
[1] 5.913
detach(package:lift)
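TopDecileLift() compares the "yes" rate among the 10% of test cases with the highest predicted probabilities to the overall "yes" rate. Roughly, and ignoring the package's exact handling of ties, the computation is:

top10 <- order(pp, decreasing = TRUE)[1:ceiling(0.1 * length(pp))]
mean(ac.numeric[top10]) / mean(ac.numeric)       # Should be close to the 5.913 above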

4. Gradient Boosting

4-1. Finding the Optimal Parameters

To find the optimal parameters of Gradient Boosting, one of the most widely used boosting methods, a "Random Search" is performed first.

fitControl <- trainControl(method = "cv", number = 5, search = "random")    # 5-Fold Cross-Validation
set.seed(100)                                                  # Fix the seed for cross-validation
caret.rd.gbm <- train(Personal.Loan~., data = UB.trd,
                      method = "gbm", trControl = fitControl,  
                      tuneLength = 5, verbose=F)               # tuneLength (number of candidate parameter sets to explore)

caret.rd.gbm 
Stochastic Gradient Boosting 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  shrinkage  interaction.depth  n.minobsinnode  n.trees  Accuracy 
  0.1235627  6                  22               282     0.9777306
  0.2151574  6                  16              2343     0.9817273
  0.2163256  4                   7              2419     0.9811559
  0.4017440  7                  15              2762     0.9703004
  0.4144840  7                  23              4062     0.9651608
  Kappa    
  0.8749167
  0.8980256
  0.8951350
  0.8432471
  0.8159151

Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were n.trees =
 2343, interaction.depth = 6, shrinkage = 0.2151574
 and n.minobsinnode = 16.
Next, a grid search is performed around the best parameters chosen by the random search.

fitControl <- trainControl(method = "cv", number = 5)                       # 5-Fold Cross-Validation

customGrid <- expand.grid(n.trees           = seq(2343, 2344, by = 1),      # Search around the best parameters from the random search
                          interaction.depth = seq(6, 7, by = 1),
                          shrinkage         = seq(0.21, 0.22, by = 0.01),
                          n.minobsinnode    = seq(16, 17, by = 1))
set.seed(100)             # Fix the seed for cross-validation
caret.gd.gbm <- train(Personal.Loan~., data = UB.trd, method = "gbm",   
                      trControl = fitControl, tuneGrid = customGrid, verbose=F)    

caret.gd.gbm
Stochastic Gradient Boosting 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  shrinkage  interaction.depth  n.minobsinnode  n.trees  Accuracy 
  0.21       6                  16              2343     0.9811543
  0.21       6                  16              2344     0.9811543
  0.21       6                  17              2343     0.9805861
  0.21       6                  17              2344     0.9805861
  0.21       7                  16              2343     0.9828702
  0.21       7                  16              2344     0.9828702
  0.21       7                  17              2343     0.9805845
  0.21       7                  17              2344     0.9805845
  0.22       6                  16              2343     0.9817273
  0.22       6                  16              2344     0.9817273
  0.22       6                  17              2343     0.9817289
  0.22       6                  17              2344     0.9817289
  0.22       7                  16              2343     0.9811575
  0.22       7                  16              2344     0.9817289
  0.22       7                  17              2343     0.9817257
  0.22       7                  17              2344     0.9817257
  Kappa    
  0.8954993
  0.8954993
  0.8924529
  0.8924529
  0.9042869
  0.9042869
  0.8928468
  0.8928468
  0.8991119
  0.8991119
  0.8983321
  0.8983321
  0.8959374
  0.8992264
  0.8996204
  0.8996204

Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were n.trees =
 2343, interaction.depth = 7, shrinkage = 0.21 and n.minobsinnode
 = 16.
# Final model
caret.gd.gbm$finalModel                                                 
A gradient boosted model with bernoulli loss function.
2343 iterations were performed.
There were 15 predictors of which 15 had non-zero influence.

4-1-1. Variable Importance

summary(caret.gd.gbm$finalModel, cBars = 10, las=2)         # cBars: number of top variables to display  

                                    var     rel.inf
Income                           Income 34.69765109
CCAvg                             CCAvg 16.09264392
Education2                   Education2 13.99098001
Education3                   Education3 11.81194074
Family3                         Family3  7.77047223
Family4                         Family4  7.70439846
CD.Account1                 CD.Account1  3.15419520
ZIP.Code                       ZIP.Code  1.48241098
Age                                 Age  0.95032101
Mortgage                       Mortgage  0.80482155
Experience                   Experience  0.55690568
Online1                         Online1  0.48191241
CreditCard1                 CreditCard1  0.28222266
Family2                         Family2  0.14886186
Securities.Account1 Securities.Account1  0.07026221
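The rel.inf column is normalized to sum to 100, so the values read as percentages (Income alone accounts for roughly a third of the model's improvement). A quick check:

sum(summary(caret.gd.gbm$finalModel, plotit = FALSE)$rel.inf)    # Should be 100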

4-2. Model Evaluation

# Predict the classes of the test data with the fitted model
caret.gd.gbm.pred <- predict(caret.gd.gbm, newdata=UB.ted)       # predict(gbm model, Test Data)

4-2-1. ConfusionMatrix

confusionMatrix(caret.gd.gbm.pred, UB.ted$Personal.Loan, positive = "yes")  # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  669  12
       yes   4  64
                                          
               Accuracy : 0.9786          
                 95% CI : (0.9655, 0.9877)
    No Information Rate : 0.8985          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.8771          
                                          
 Mcnemar's Test P-Value : 0.08012         
                                          
            Sensitivity : 0.84211         
            Specificity : 0.99406         
         Pos Pred Value : 0.94118         
         Neg Pred Value : 0.98238         
             Prevalence : 0.10147         
         Detection Rate : 0.08545         
   Detection Prevalence : 0.09079         
      Balanced Accuracy : 0.91808         
                                          
       'Positive' Class : yes             
                                          


4-2-2. ROC Curve

1) Package “pROC”

pacman::p_load("pROC")                          

test.gbm.prob <- predict(caret.gd.gbm, newdata = UB.ted, type = "prob")  # Predicted probability of each class for the test data, from the model fitted on the training data
test.gbm.prob <- test.gbm.prob[,2]                                       # Predicted probability of "yes"


ac           <- UB.ted$Personal.Loan                                     # Actual class

pp           <- as.numeric(test.gbm.prob)                                # Predicted probability of "yes"


gbm.roc      <- roc(ac, pp, plot = T, col = "red")                       # roc(actual class, predicted probability)

auc          <- round(auc(gbm.roc), 3)                                   # AUC 
legend("bottomright",legend = auc, bty = "n")
detach(package:pROC)


2) Package “Epi”

pacman::p_load("Epi")                        
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")

ROC(pp, ac, plot="ROC")       # ROC(predicted probability, actual class)                                  
detach(package:Epi)


3) Package “ROCR”

pacman::p_load("ROCR")                                                  

gbm.pred <- prediction(pp, ac)                      # prediction(predicted probability, actual class)

gbm.perf <- performance(gbm.pred, "tpr", "fpr")     # performance(prediction object, "TPR (sensitivity)", "FPR (1 - specificity)")                        
plot(gbm.perf, col="red")                           # ROC Curve
abline(0,1, col="black")


perf.auc <- performance(gbm.pred, "auc")            # AUC        

auc <- attributes(perf.auc)$y.values                  
legend("bottomright",legend = auc,bty = "n") 


4-2-3. Lift Chart

1) Package “ROCR”

gbm.lift <- performance(gbm.pred,"lift", "rpp")      # Lift chart
plot(gbm.lift, colorize = T, lwd = 2)      
detach(package:ROCR)


2) Package “lift”

pacman::p_load("lift")

ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes",1,0)     # Convert the actual class to numeric

plotLift(pp, ac.numeric, cumulative = T, n.buckets = 24)  # plotLift(predicted probability, actual class)
TopDecileLift(pp, ac.numeric)                             # Print the top-decile (top 10%) lift
[1] 8.804
detach(package:lift)

5. XGBoost

5-1. Finding the Optimal Parameters

To find the optimal parameters of XGBoost, an extension of Gradient Boosting, a "Random Search" is performed first.

fitControl <- trainControl(method = "cv", number = 5, search = "random")    # 5-Fold Cross-Validation
set.seed(100)                                                      # Fix the seed for cross-validation
caret.rd.xgb <- train(Personal.Loan~., data = UB.trd,
                      method = "xgbTree", trControl = fitControl,   
                      tuneLength = 5,                              # tuneLength (number of candidate parameter sets to explore)
                      lambda = 0)                                  # Regularization Parameter

caret.rd.xgb 
eXtreme Gradient Boosting 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  eta        max_depth  gamma     colsample_bytree  min_child_weight
  0.1235627  6          7.108038  0.6081206         18              
  0.2151574  6          5.383487  0.6527814          3              
  0.2163256  4          7.489722  0.5196387          3              
  0.3219509  6          1.714202  0.4953224         15              
  0.4144840  7          4.201015  0.4110895         19              
  subsample  nrounds  Accuracy   Kappa    
  0.7220431  358      0.9371803  0.5812145
  0.9921731  624      0.9811608  0.8921203
  0.3477167  985      0.9697354  0.8184426
  0.8988404  919      0.9623118  0.7734267
  0.4979954  718      0.8989109  0.3717268

Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were nrounds = 624, max_depth
 = 6, eta = 0.2151574, gamma = 5.383487, colsample_bytree
 = 0.6527814, min_child_weight = 3 and subsample = 0.9921731.
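Rather than reading the winning combination off the printout, it can also be pulled from the train object directly:

caret.rd.xgb$bestTune        # Row of the tuning grid with the highest accuracy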
Next, a grid search is performed around the best parameters chosen by the random search.

fitControl <- trainControl(method = "cv", number = 5)                       # 5-Fold Cross-Validation

customGrid <- expand.grid(nrounds          = seq(624, 625, by = 1),         # Search around the best parameters from the random search
                          max_depth        = seq(6, 7, by = 1),
                          eta              = seq(0.2, 0.3, by = 0.1),
                          gamma            = seq(5.3, 5.4, by = 0.1),
                          colsample_bytree = seq(0.6, 0.7, by = 0.1),
                          min_child_weight = seq(3, 4, by = 1),
                          subsample        = seq(0.9, 1, by = 0.1))
set.seed(100)                                                              # Fix the seed for cross-validation
caret.gd.xgb <- train(Personal.Loan~., data = UB.trd, method = "xgbTree",  
                      trControl = fitControl, tuneGrid = customGrid,
                      lambda = 0)                                          # Regularization Parameter    

caret.gd.xgb
eXtreme Gradient Boosting 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  eta  max_depth  gamma  colsample_bytree  min_child_weight
  0.2  6          5.3    0.6               3               
  0.2  6          5.3    0.6               3               
  0.2  6          5.3    0.6               3               
  0.2  6          5.3    0.6               3               
  0.2  6          5.3    0.6               4               
  0.2  6          5.3    0.6               4               
  0.2  6          5.3    0.6               4               
  0.2  6          5.3    0.6               4               
  0.2  6          5.3    0.7               3               
  0.2  6          5.3    0.7               3               
  0.2  6          5.3    0.7               3               
  0.2  6          5.3    0.7               3               
  0.2  6          5.3    0.7               4               
  0.2  6          5.3    0.7               4               
  0.2  6          5.3    0.7               4               
  0.2  6          5.3    0.7               4               
  0.2  6          5.4    0.6               3               
  0.2  6          5.4    0.6               3               
  0.2  6          5.4    0.6               3               
  0.2  6          5.4    0.6               3               
  0.2  6          5.4    0.6               4               
  0.2  6          5.4    0.6               4               
  0.2  6          5.4    0.6               4               
  0.2  6          5.4    0.6               4               
  0.2  6          5.4    0.7               3               
  0.2  6          5.4    0.7               3               
  0.2  6          5.4    0.7               3               
  0.2  6          5.4    0.7               3               
  0.2  6          5.4    0.7               4               
  0.2  6          5.4    0.7               4               
  0.2  6          5.4    0.7               4               
  0.2  6          5.4    0.7               4               
  0.2  7          5.3    0.6               3               
  0.2  7          5.3    0.6               3               
  0.2  7          5.3    0.6               3               
  0.2  7          5.3    0.6               3               
  0.2  7          5.3    0.6               4               
  0.2  7          5.3    0.6               4               
  0.2  7          5.3    0.6               4               
  0.2  7          5.3    0.6               4               
  0.2  7          5.3    0.7               3               
  0.2  7          5.3    0.7               3               
  0.2  7          5.3    0.7               3               
  0.2  7          5.3    0.7               3               
  0.2  7          5.3    0.7               4               
  0.2  7          5.3    0.7               4               
  0.2  7          5.3    0.7               4               
  0.2  7          5.3    0.7               4               
  0.2  7          5.4    0.6               3               
  0.2  7          5.4    0.6               3               
  0.2  7          5.4    0.6               3               
  0.2  7          5.4    0.6               3               
  0.2  7          5.4    0.6               4               
  0.2  7          5.4    0.6               4               
  0.2  7          5.4    0.6               4               
  0.2  7          5.4    0.6               4               
  0.2  7          5.4    0.7               3               
  0.2  7          5.4    0.7               3               
  0.2  7          5.4    0.7               3               
  0.2  7          5.4    0.7               3               
  0.2  7          5.4    0.7               4               
  0.2  7          5.4    0.7               4               
  0.2  7          5.4    0.7               4               
  0.2  7          5.4    0.7               4               
  0.3  6          5.3    0.6               3               
  0.3  6          5.3    0.6               3               
  0.3  6          5.3    0.6               3               
  0.3  6          5.3    0.6               3               
  0.3  6          5.3    0.6               4               
  0.3  6          5.3    0.6               4               
  0.3  6          5.3    0.6               4               
  0.3  6          5.3    0.6               4               
  0.3  6          5.3    0.7               3               
  0.3  6          5.3    0.7               3               
  0.3  6          5.3    0.7               3               
  0.3  6          5.3    0.7               3               
  0.3  6          5.3    0.7               4               
  0.3  6          5.3    0.7               4               
  0.3  6          5.3    0.7               4               
  0.3  6          5.3    0.7               4               
  0.3  6          5.4    0.6               3               
  0.3  6          5.4    0.6               3               
  0.3  6          5.4    0.6               3               
  0.3  6          5.4    0.6               3               
  0.3  6          5.4    0.6               4               
  0.3  6          5.4    0.6               4               
  0.3  6          5.4    0.6               4               
  0.3  6          5.4    0.6               4               
  0.3  6          5.4    0.7               3               
  0.3  6          5.4    0.7               3               
  0.3  6          5.4    0.7               3               
  0.3  6          5.4    0.7               3               
  0.3  6          5.4    0.7               4               
  0.3  6          5.4    0.7               4               
  0.3  6          5.4    0.7               4               
  0.3  6          5.4    0.7               4               
  0.3  7          5.3    0.6               3               
  0.3  7          5.3    0.6               3               
  0.3  7          5.3    0.6               3               
  0.3  7          5.3    0.6               3               
  0.3  7          5.3    0.6               4               
  0.3  7          5.3    0.6               4               
  0.3  7          5.3    0.6               4               
  0.3  7          5.3    0.6               4               
  0.3  7          5.3    0.7               3               
  0.3  7          5.3    0.7               3               
  0.3  7          5.3    0.7               3               
  0.3  7          5.3    0.7               3               
  0.3  7          5.3    0.7               4               
  0.3  7          5.3    0.7               4               
  0.3  7          5.3    0.7               4               
  0.3  7          5.3    0.7               4               
  0.3  7          5.4    0.6               3               
  0.3  7          5.4    0.6               3               
  0.3  7          5.4    0.6               3               
  0.3  7          5.4    0.6               3               
  0.3  7          5.4    0.6               4               
  0.3  7          5.4    0.6               4               
  0.3  7          5.4    0.6               4               
  0.3  7          5.4    0.6               4               
  0.3  7          5.4    0.7               3               
  0.3  7          5.4    0.7               3               
  0.3  7          5.4    0.7               3               
  0.3  7          5.4    0.7               3               
  0.3  7          5.4    0.7               4               
  0.3  7          5.4    0.7               4               
  0.3  7          5.4    0.7               4               
  0.3  7          5.4    0.7               4               
  subsample  nrounds  Accuracy   Kappa    
  0.9        624      0.9811624  0.8919530
  0.9        625      0.9811624  0.8919530
  1.0        624      0.9817289  0.8955944
  1.0        625      0.9817289  0.8955944
  0.9        624      0.9823020  0.8989565
  0.9        625      0.9817306  0.8959285
  1.0        624      0.9800179  0.8850897
  1.0        625      0.9800179  0.8850897
  0.9        624      0.9777338  0.8725072
  0.9        625      0.9777338  0.8725072
  1.0        624      0.9805893  0.8893606
  1.0        625      0.9805893  0.8893606
  0.9        624      0.9811608  0.8928304
  0.9        625      0.9811608  0.8928304
  1.0        624      0.9794432  0.8828790
  1.0        625      0.9794432  0.8828790
  0.9        624      0.9783036  0.8770100
  0.9        625      0.9777322  0.8734481
  1.0        624      0.9783053  0.8739334
  1.0        625      0.9783053  0.8739334
  0.9        624      0.9783069  0.8742904
  0.9        625      0.9783069  0.8742904
  1.0        624      0.9817338  0.8950015
  1.0        625      0.9817338  0.8950015
  0.9        624      0.9811624  0.8920329
  0.9        625      0.9811624  0.8920329
  1.0        624      0.9811575  0.8925663
  1.0        625      0.9811575  0.8925663
  0.9        624      0.9783053  0.8751519
  0.9        625      0.9783053  0.8751519
  1.0        624      0.9805877  0.8894179
  1.0        625      0.9805877  0.8894179
  0.9        624      0.9800147  0.8873730
  0.9        625      0.9800147  0.8873730
  1.0        624      0.9783085  0.8731261
  1.0        625      0.9783085  0.8731261
  0.9        624      0.9794449  0.8823109
  0.9        625      0.9794449  0.8823109
  1.0        624      0.9771640  0.8682078
  1.0        625      0.9771640  0.8682078
  0.9        624      0.9811591  0.8931233
  0.9        625      0.9811591  0.8931233
  1.0        624      0.9823053  0.8980542
  1.0        625      0.9823053  0.8980542
  0.9        624      0.9817322  0.8951524
  0.9        625      0.9817322  0.8951524
  1.0        624      0.9811608  0.8918731
  1.0        625      0.9811608  0.8918731
  0.9        624      0.9800163  0.8867696
  0.9        625      0.9800163  0.8867696
  1.0        624      0.9811591  0.8923666
  1.0        625      0.9811591  0.8923666
  0.9        624      0.9805893  0.8887474
  0.9        625      0.9805893  0.8887474
  1.0        624      0.9765910  0.8641748
  1.0        625      0.9765910  0.8641748
  0.9        624      0.9794481  0.8820096
  0.9        625      0.9800179  0.8857224
  1.0        624      0.9794481  0.8812937
  1.0        625      0.9794481  0.8812937
  0.9        624      0.9783053  0.8758413
  0.9        625      0.9783053  0.8758413
  1.0        624      0.9800179  0.8846535
  1.0        625      0.9800179  0.8846535
  0.9        624      0.9794481  0.8812038
  0.9        625      0.9794481  0.8812038
  1.0        624      0.9805893  0.8887474
  1.0        625      0.9805893  0.8887474
  0.9        624      0.9805893  0.8886546
  0.9        625      0.9805893  0.8886546
  1.0        624      0.9794465  0.8817076
  1.0        625      0.9794465  0.8817076
  0.9        624      0.9794449  0.8830880
  0.9        625      0.9794449  0.8830880
  1.0        624      0.9805893  0.8891734
  1.0        625      0.9805893  0.8891734
  0.9        624      0.9788734  0.8788344
  0.9        625      0.9788734  0.8788344
  1.0        624      0.9828702  0.9022731
  1.0        625      0.9828702  0.9022731
  0.9        624      0.9811591  0.8919249
  0.9        625      0.9805877  0.8888818
  1.0        624      0.9783036  0.8771763
  1.0        625      0.9783036  0.8771763
  0.9        624      0.9811591  0.8920659
  0.9        625      0.9817306  0.8956209
  1.0        624      0.9771624  0.8674133
  1.0        625      0.9771624  0.8674133
  0.9        624      0.9828734  0.9009698
  0.9        625      0.9828734  0.9009698
  1.0        624      0.9794449  0.8822959
  1.0        625      0.9794449  0.8822959
  0.9        624      0.9788751  0.8782051
  0.9        625      0.9788751  0.8782051
  1.0        624      0.9777306  0.8712897
  1.0        625      0.9777306  0.8712897
  0.9        624      0.9783036  0.8781373
  0.9        625      0.9783036  0.8781373
  1.0        624      0.9748751  0.8544771
  1.0        625      0.9748751  0.8544771
  0.9        624      0.9794497  0.8818147
  0.9        625      0.9800212  0.8854641
  1.0        624      0.9771559  0.8694640
  1.0        625      0.9771559  0.8694640
  0.9        624      0.9794449  0.8834449
  0.9        625      0.9794449  0.8834449
  1.0        624      0.9828751  0.9024489
  1.0        625      0.9828751  0.9024489
  0.9        624      0.9788734  0.8780400
  0.9        625      0.9783036  0.8744209
  1.0        624      0.9788767  0.8767860
  1.0        625      0.9788767  0.8767860
  0.9        624      0.9805910  0.8890289
  0.9        625      0.9800195  0.8859762
  1.0        624      0.9783036  0.8757023
  1.0        625      0.9783036  0.8757023
  0.9        624      0.9811608  0.8926284
  0.9        625      0.9811608  0.8926284
  1.0        624      0.9777354  0.8710367
  1.0        625      0.9777354  0.8710367
  0.9        624      0.9811608  0.8926377
  0.9        625      0.9811608  0.8926377
  1.0        624      0.9777338  0.8704176
  1.0        625      0.9777338  0.8704176
  0.9        624      0.9817338  0.8954059
  0.9        625      0.9811624  0.8920340
  1.0        624      0.9800179  0.8852626
  1.0        625      0.9800179  0.8852626

Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were nrounds = 624, max_depth
 = 7, eta = 0.3, gamma = 5.3, colsample_bytree =
 0.7, min_child_weight = 3 and subsample = 1.

5-1-1. Variable Importance

xgbImp <- varImp(caret.gd.xgb, scale = FALSE)
plot(xgbImp)


5-2. Model Evaluation

# Predict the classes of the test data with the fitted model
caret.gd.xgb.pred <- predict(caret.gd.xgb, newdata=UB.ted)       # predict(xgboost model, Test Data)

5-2-1. ConfusionMatrix

confusionMatrix(caret.gd.xgb.pred, UB.ted$Personal.Loan, positive = "yes")  # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  670  12
       yes   3  64
                                          
               Accuracy : 0.98            
                 95% CI : (0.9672, 0.9887)
    No Information Rate : 0.8985          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.8841          
                                          
 Mcnemar's Test P-Value : 0.03887         
                                          
            Sensitivity : 0.84211         
            Specificity : 0.99554         
         Pos Pred Value : 0.95522         
         Neg Pred Value : 0.98240         
             Prevalence : 0.10147         
         Detection Rate : 0.08545         
   Detection Prevalence : 0.08945         
      Balanced Accuracy : 0.91882         
                                          
       'Positive' Class : yes             
                                          


5-2-2. ROC Curve

1) Package “pROC”

pacman::p_load("pROC")                          

test.xgb.prob <- predict(caret.gd.xgb, newdata = UB.ted, type = "prob")  # Predicted probability of each class for the test data, from the model fitted on the training data
test.xgb.prob <- test.xgb.prob[,2]                                       # Predicted probability of "yes"


ac           <- UB.ted$Personal.Loan                                     # Actual class

pp           <- as.numeric(test.xgb.prob)                                # Predicted probability of "yes"


xgb.roc      <- roc(ac, pp, plot = T, col = "red")                       # roc(actual class, predicted probability)

auc          <- round(auc(xgb.roc), 3)                                   # AUC 
legend("bottomright",legend = auc, bty = "n")
detach(package:pROC)


2) Package “Epi”

pacman::p_load("Epi")                        
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")

ROC(pp, ac, plot="ROC")       # ROC(predicted probability, actual class)                                  
detach(package:Epi)


3) Package “ROCR”

pacman::p_load("ROCR")                      

xgb.pred <- prediction(pp, ac)                      # prediction(predicted probability, actual class)

xgb.perf <- performance(xgb.pred, "tpr", "fpr")     # performance(prediction object, "TPR (sensitivity)", "FPR (1 - specificity)")                        
plot(xgb.perf, col = "red")                         # ROC Curve
abline(0,1, col = "black")


perf.auc <- performance(xgb.pred, "auc")            # AUC        

auc <- attributes(perf.auc)$y.values                  
legend("bottomright",legend = auc,bty = "n") 


5-2-3. Lift Chart

1) Package “ROCR”

xgb.lift <- performance(xgb.pred,"lift", "rpp")      # Lift chart
plot(xgb.lift, colorize = T, lwd = 2)      
detach(package:ROCR)


2) Package “lift”

pacman::p_load("lift")

ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes",1,0)     # Convert the actual class to numeric

plotLift(pp, ac.numeric, cumulative = T, n.buckets = 24)  # plotLift(predicted probability, actual class)
TopDecileLift(pp, ac.numeric)                             # Print the top-decile (top 10%) lift
[1] 8.935
detach(package:lift)

6. Model Comparison

plot(ada.roc, col="red")         # ROC Curve
par(new=TRUE)
plot(gbm.roc, col="green")       # ROC Curve
par(new=TRUE)
plot(xgb.roc, col="orange")      # ROC Curve

legend("bottomright", legend=c( "AdaBoost", "GBM", "XGBoost" ),
       col=c( "red", "green", "orange"), lty=c(1,1,1))
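Since the three curves nearly overlap, comparing the AUC values directly makes the ranking explicit (pROC was detached above, hence the explicit namespace):

c(AdaBoost = pROC::auc(ada.roc),
  GBM      = pROC::auc(gbm.roc),
  XGBoost  = pROC::auc(xgb.roc))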
