Boosting based on Caret

Machine Learning

R Code Using the Caret Package for Various Boosting Models

Yeongeun Jeon, Jeongwook Lee, Jung In Seo
11-01-2020

The package "caret" bundles many machine learning methods into a single interface, and overfitting can be guarded against with trainControl. "caret" supports a variety of boosting-based techniques; among them, the three most widely used ones, AdaBoost, Gradient Boosting, and XGBoost, are applied here to an example dataset. The example data, "Universal Bank_Main", describe the customers of Universal Bank (source: Data Mining for Business Intelligence, Shmueli et al. 2010). There are 2,500 observations and 13 variables, and the target is Personal.Loan.
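Every model below follows the same two-step caret pattern: trainControl() fixes the resampling scheme, and train() fits one method over candidate tuning parameters, keeping the combination with the best resampled accuracy. A minimal, self-contained sketch of that pattern (the built-in iris data and the rpart method are illustrative stand-ins, not part of the analysis):

pacman::p_load("caret", "rpart")

fitControl <- trainControl(method = "cv", number = 5)   # 5-Fold Cross-Validation
set.seed(10)
fit <- train(Species ~ ., data = iris, method = "rpart",
             trControl = fitControl, tuneLength = 3)    # 3 candidate values of cp
fit$bestTune                                            # Parameter value chosen by CV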



1. Loading the Data

pacman::p_load("data.table", "dplyr") 

UB   <- fread(paste(getwd(),"Universal Bank_Main.csv", sep="/")) %>%   # Load the data
  data.frame() %>%                                                     # Convert to data frame
  mutate(Personal.Loan = ifelse(Personal.Loan==1, "yes","no")) %>%     # Character target for classification
  select(-1)                                                           # Drop the ID variable

cols <- c("Family", "Education", "Personal.Loan", "Securities.Account", 
          "CD.Account", "Online", "CreditCard")

UB   <- UB %>% 
  mutate_at(cols, as.factor)                                          # Convert to factors
 
glimpse(UB)                                                           # Data structure
Rows: 2,500
Columns: 13
$ Age                <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6~
$ Experience         <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5~
$ Income             <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,~
$ ZIP.Code           <int> 91107, 90089, 94720, 94112, 91330, 92121,~
$ Family             <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,~
$ CCAvg              <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0~
$ Education          <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,~
$ Mortgage           <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0~
$ Personal.Loan      <fct> no, no, no, no, no, no, no, no, no, yes, ~
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
$ CD.Account         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Online             <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,~
$ CreditCard         <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,~

2. Splitting the Data

pacman::p_load("caret")
# Partition (Training Data : Test Data = 7:3)
y      <- UB$Personal.Loan                       # Target

set.seed(200)
ind    <- createDataPartition(y, p=0.7, list=F)  # Sample 70% of the data for training
UB.trd <- UB[ind,]                               # Training Data

UB.ted <- UB[-ind,]                              # Test Data
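Since createDataPartition() samples within each level of y, the split is stratified, so the class ratio of Personal.Loan should be (approximately) preserved in both parts. A quick check:

prop.table(table(UB$Personal.Loan))        # Full data
prop.table(table(UB.trd$Personal.Loan))    # Training data
prop.table(table(UB.ted$Personal.Loan))    # Test data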

3. AdaBoost

3-1. Finding the Optimal Parameters

To find the optimal parameters of AdaBoost, the most widely used boosting method, a "Random Search" is performed first.

fitControl <- trainControl(method = "cv", number = 5, search = "random")    # 5-Fold Cross-Validation
set.seed(100)                                                          # Fix the seed for cross-validation
caret.rd.ada <- train(Personal.Loan~., data = UB.trd,
                      method = "AdaBoost.M1", trControl = fitControl,   
                      tuneLength = 5)                                  # tuneLength (number of candidate parameter sets to explore)

caret.rd.ada
AdaBoost.M1 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  coeflearn  maxdepth  mfinal  Accuracy   Kappa    
  Breiman     7        70      0.9817338  0.8991055
  Freund      2        86      0.9800130  0.8871317
  Zhu         4        89      0.9817322  0.8972157
  Zhu         6        23      0.9828702  0.9035620
  Zhu        23        78      0.9823036  0.9001572

Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were mfinal = 23, maxdepth =
 6 and coeflearn = Zhu.
plot(caret.rd.ada)                            # Accuracy 

Next, a grid search is performed around the best parameters chosen by the random search.

fitControl <- trainControl(method = "cv", number = 5)         # 5-Fold Cross-Validation

customGrid <- expand.grid(mfinal    = seq(22, 24, by = 1),    # Search around the best parameters from the random search
                          maxdepth  = 1,                      # Fix the maximum depth at "1" to grow stumps
                          coeflearn = "Breiman")              # Fix coeflearn at the widely used "Breiman"
set.seed(100)  # Fix the seed for cross-validation
caret.gd.ada <- train(Personal.Loan~., data = UB.trd, method = "AdaBoost.M1",  
                      trControl = fitControl, tuneGrid = customGrid)    

caret.gd.ada
AdaBoost.M1 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  mfinal  Accuracy   Kappa    
  22      0.9371884  0.5745333
  23      0.9469011  0.6593923
  24      0.9457468  0.6272606

Tuning parameter 'maxdepth' was held constant at a value of 1

Tuning parameter 'coeflearn' was held constant at a value of Breiman
Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were mfinal = 23, maxdepth =
 1 and coeflearn = Breiman.
plot(caret.gd.ada)            # Accuracy

3-1-1. Variable Importance

# Variable importance
adaImp <- varImp(caret.gd.ada, scale = FALSE)
plot(adaImp)

3-1-2. Weights

caret.gd.ada$finalModel$weights             # Weight (amount of information) of each tree
 [1] 1.08325539 0.89957656 0.63275242 0.75226832 0.48449782 0.33250481
 [7] 0.16625240 0.24771483 0.17284836 0.18903596 0.19079791 0.21296646
[13] 0.10648323 0.14983887 0.31141150 0.30563367 0.21895195 0.05529500
[19] 0.13715590 0.18066928 0.12458244 0.07265042 0.23648112
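For coeflearn = "Breiman", the adabag engine behind method = "AdaBoost.M1" assigns each stump a weight alpha = 0.5 * ln((1 - err)/err), so stumps with lower weighted training error carry more information in the final weighted majority vote. A small sketch for reading the vector above:

w <- caret.gd.ada$finalModel$weights
round(w / sum(w), 3)       # Relative share of each stump in the vote
which.max(w)               # The most influential stump (the first one here)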

3-2. Model Evaluation

# Predict the classes of the test data with the fitted model
caret.gd.ada.pred <- predict(caret.gd.ada, newdata = UB.ted)       # predict(AdaBoost model, Test Data)

3-2-1. ConfusionMatrix

confusionMatrix(caret.gd.ada.pred, UB.ted$Personal.Loan, positive = "yes")  # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  668  32
       yes   5  44
                                         
               Accuracy : 0.9506         
                 95% CI : (0.9325, 0.965)
    No Information Rate : 0.8985         
    P-Value [Acc > NIR] : 1.712e-07      
                                         
                  Kappa : 0.6784         
                                         
 Mcnemar's Test P-Value : 1.917e-05      
                                         
            Sensitivity : 0.57895        
            Specificity : 0.99257        
         Pos Pred Value : 0.89796        
         Neg Pred Value : 0.95429        
             Prevalence : 0.10147        
         Detection Rate : 0.05874        
   Detection Prevalence : 0.06542        
      Balanced Accuracy : 0.78576        
                                         
       'Positive' Class : yes            
                                         

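The headline statistics can be verified by hand from the 2x2 table above; a quick sanity check:

TP <- 44; FN <- 32; FP <- 5; TN <- 668
c(Accuracy    = (TP + TN)/(TP + TN + FP + FN),    # 0.9506
  Sensitivity = TP/(TP + FN),                     # 0.5789 (recall of "yes")
  Specificity = TN/(TN + FP))                     # 0.9926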

3-2-2. ROC Curve

1) Package “pROC”

pacman::p_load("pROC")                          

test.ada.prob <- predict(caret.gd.ada, newdata = UB.ted, type = "prob")  # Predicted probability of each class for the test data, from the model fitted on the training data
test.ada.prob <- test.ada.prob[,2]                                       # Predicted probability of "yes"


ac           <- UB.ted$Personal.Loan                                     # Actual class

pp           <- as.numeric(test.ada.prob)                                # Predicted probability of "yes"


ada.roc     <- roc(ac, pp, plot = T, col = "red")                        # roc(actual class, predicted probability)

auc <- round(auc(ada.roc), 3)                                            # AUC 
legend("bottomright",legend = auc, bty = "n")
detach(package:pROC)


2) Package “Epi”

pacman::p_load("Epi")                        
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")

ROC(pp, ac, plot="ROC")       # ROC(predicted probability, actual class)                                  
detach(package:Epi)


3) Package “ROCR”

pacman::p_load("ROCR")                      

ada.pred <- prediction(pp, ac)                      # prediction(predicted probability, actual class)

ada.perf <- performance(ada.pred, "tpr", "fpr")     # performance(prediction object, "TPR (sensitivity)", "FPR (1 - specificity)")                        
plot(ada.perf, col = "red")                         # ROC Curve
abline(0,1, col = "black")


perf.auc <- performance(ada.pred, "auc")            # AUC        

auc <- attributes(perf.auc)$y.values                  
legend("bottomright",legend = auc,bty = "n") 


3-2-3. Lift Chart

1) Package “ROCR”

ada.lift <- performance(ada.pred,"lift", "rpp")      # Lift chart
plot(ada.lift, colorize = T, lwd = 2)      
detach(package:ROCR)


2) Package “lift”

pacman::p_load("lift")

ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes",1,0)     # Convert the actual class to numeric

plotLift(pp, ac.numeric, cumulative = T, n.buckets = 24)  # plotLift(predicted probability, actual class)
TopDecileLift(pp, ac.numeric)                             # Print the top-decile (top 10%) lift
[1] 5.913
detach(package:lift)
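TopDecileLift() compares the "yes" rate among the 10% of test cases with the highest predicted probabilities to the overall "yes" rate. Roughly, and ignoring the package's exact handling of ties, the computation is:

top10 <- order(pp, decreasing = TRUE)[1:ceiling(0.1 * length(pp))]
mean(ac.numeric[top10]) / mean(ac.numeric)       # Should be close to the 5.913 above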

4. Gradient Boosting

4-1. Finding the Optimal Parameters

To find the optimal parameters of Gradient Boosting, one of the most widely used boosting methods, a "Random Search" is performed first.

fitControl <- trainControl(method = "cv", number = 5, search = "random")    # 5-Fold Cross-Validation
set.seed(100)                                                  # Fix the seed for cross-validation
caret.rd.gbm <- train(Personal.Loan~., data = UB.trd,
                      method = "gbm", trControl = fitControl,  
                      tuneLength = 5, verbose=F)               # tuneLength (number of candidate parameter sets to explore)

caret.rd.gbm 
Stochastic Gradient Boosting 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  shrinkage  interaction.depth  n.minobsinnode  n.trees  Accuracy 
  0.1235627  6                  22               282     0.9777306
  0.2151574  6                  16              2343     0.9817273
  0.2163256  4                   7              2419     0.9811559
  0.4017440  7                  15              2762     0.9703004
  0.4144840  7                  23              4062     0.9651608
  Kappa    
  0.8749167
  0.8980256
  0.8951350
  0.8432471
  0.8159151

Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were n.trees =
 2343, interaction.depth = 6, shrinkage = 0.2151574
 and n.minobsinnode = 16.
Next, a grid search is performed around the best parameters chosen by the random search.

fitControl <- trainControl(method = "cv", number = 5)                       # 5-Fold Cross-Validation

customGrid <- expand.grid(n.trees           = seq(2343, 2344, by = 1),      # Search around the best parameters from the random search
                          interaction.depth = seq(6, 7, by = 1),
                          shrinkage         = seq(0.21, 0.22, by = 0.01),
                          n.minobsinnode    = seq(16, 17, by = 1))
set.seed(100)             # Fix the seed for cross-validation
caret.gd.gbm <- train(Personal.Loan~., data = UB.trd, method = "gbm",   
                      trControl = fitControl, tuneGrid = customGrid, verbose=F)    

caret.gd.gbm
Stochastic Gradient Boosting 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  shrinkage  interaction.depth  n.minobsinnode  n.trees  Accuracy 
  0.21       6                  16              2343     0.9811543
  0.21       6                  16              2344     0.9811543
  0.21       6                  17              2343     0.9805861
  0.21       6                  17              2344     0.9805861
  0.21       7                  16              2343     0.9828702
  0.21       7                  16              2344     0.9828702
  0.21       7                  17              2343     0.9805845
  0.21       7                  17              2344     0.9805845
  0.22       6                  16              2343     0.9817273
  0.22       6                  16              2344     0.9817273
  0.22       6                  17              2343     0.9817289
  0.22       6                  17              2344     0.9817289
  0.22       7                  16              2343     0.9811575
  0.22       7                  16              2344     0.9817289
  0.22       7                  17              2343     0.9817257
  0.22       7                  17              2344     0.9817257
  Kappa    
  0.8954993
  0.8954993
  0.8924529
  0.8924529
  0.9042869
  0.9042869
  0.8928468
  0.8928468
  0.8991119
  0.8991119
  0.8983321
  0.8983321
  0.8959374
  0.8992264
  0.8996204
  0.8996204

Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were n.trees =
 2343, interaction.depth = 7, shrinkage = 0.21 and n.minobsinnode
 = 16.
# Final model
caret.gd.gbm$finalModel                                                 
A gradient boosted model with bernoulli loss function.
2343 iterations were performed.
There were 15 predictors of which 15 had non-zero influence.

4-1-1. Variable Importance

summary(caret.gd.gbm$finalModel, cBars = 10, las=2)         # cBars: number of top variables to display  

                                    var     rel.inf
Income                           Income 34.69765109
CCAvg                             CCAvg 16.09264392
Education2                   Education2 13.99098001
Education3                   Education3 11.81194074
Family3                         Family3  7.77047223
Family4                         Family4  7.70439846
CD.Account1                 CD.Account1  3.15419520
ZIP.Code                       ZIP.Code  1.48241098
Age                                 Age  0.95032101
Mortgage                       Mortgage  0.80482155
Experience                   Experience  0.55690568
Online1                         Online1  0.48191241
CreditCard1                 CreditCard1  0.28222266
Family2                         Family2  0.14886186
Securities.Account1 Securities.Account1  0.07026221
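The rel.inf column is normalized to sum to 100, so the values read as percentages (Income alone accounts for roughly a third of the model's improvement). A quick check:

sum(summary(caret.gd.gbm$finalModel, plotit = FALSE)$rel.inf)    # Should be 100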

4-2. Model Evaluation

# Predict the classes of the test data with the fitted model
caret.gd.gbm.pred <- predict(caret.gd.gbm, newdata=UB.ted)       # predict(gbm model, Test Data)

4-2-1. ConfusionMatrix

confusionMatrix(caret.gd.gbm.pred, UB.ted$Personal.Loan, positive = "yes")  # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  669  12
       yes   4  64
                                          
               Accuracy : 0.9786          
                 95% CI : (0.9655, 0.9877)
    No Information Rate : 0.8985          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.8771          
                                          
 Mcnemar's Test P-Value : 0.08012         
                                          
            Sensitivity : 0.84211         
            Specificity : 0.99406         
         Pos Pred Value : 0.94118         
         Neg Pred Value : 0.98238         
             Prevalence : 0.10147         
         Detection Rate : 0.08545         
   Detection Prevalence : 0.09079         
      Balanced Accuracy : 0.91808         
                                          
       'Positive' Class : yes             
                                          


4-2-2. ROC Curve

1) Package “pROC”

pacman::p_load("pROC")                          

test.gbm.prob <- predict(caret.gd.gbm, newdata = UB.ted, type = "prob")  # Predicted probability of each class for the test data, from the model fitted on the training data
test.gbm.prob <- test.gbm.prob[,2]                                       # Predicted probability of "yes"


ac           <- UB.ted$Personal.Loan                                     # Actual class

pp           <- as.numeric(test.gbm.prob)                                # Predicted probability of "yes"


gbm.roc      <- roc(ac, pp, plot = T, col = "red")                       # roc(actual class, predicted probability)

auc          <- round(auc(gbm.roc), 3)                                   # AUC 
legend("bottomright",legend = auc, bty = "n")
detach(package:pROC)


2) Package “Epi”

pacman::p_load("Epi")                        
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")

ROC(pp, ac, plot="ROC")       # ROC(predicted probability, actual class)                                  
detach(package:Epi)


3) Package “ROCR”

pacman::p_load("ROCR")                                                  

gbm.pred <- prediction(pp, ac)                      # prediction(predicted probability, actual class)

gbm.perf <- performance(gbm.pred, "tpr", "fpr")     # performance(prediction object, "TPR (sensitivity)", "FPR (1 - specificity)")                        
plot(gbm.perf, col="red")                           # ROC Curve
abline(0,1, col="black")


perf.auc <- performance(gbm.pred, "auc")            # AUC        

auc <- attributes(perf.auc)$y.values                  
legend("bottomright",legend = auc,bty = "n") 


4-2-3. Lift Chart

1) Package “ROCR”

gbm.lift <- performance(gbm.pred,"lift", "rpp")      # Lift chart
plot(gbm.lift, colorize = T, lwd = 2)      
detach(package:ROCR)


2) Package “lift”

pacman::p_load("lift")

ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes",1,0)     # Convert the actual class to numeric

plotLift(pp, ac.numeric, cumulative = T, n.buckets = 24)  # plotLift(predicted probability, actual class)
TopDecileLift(pp, ac.numeric)                             # Print the top-decile (top 10%) lift
[1] 8.804
detach(package:lift)

5. XGBoost

5-1. Finding the Optimal Parameters

To find the optimal parameters of XGBoost, an extension of Gradient Boosting, a "Random Search" is performed first.

fitControl <- trainControl(method = "cv", number = 5, search = "random")    # 5-Fold Cross-Validation
set.seed(100)                                                      # Fix the seed for cross-validation
caret.rd.xgb <- train(Personal.Loan~., data = UB.trd,
                      method = "xgbTree", trControl = fitControl,   
                      tuneLength = 5,                              # tuneLength (number of candidate parameter sets to explore)
                      lambda = 0)                                  # Regularization Parameter

caret.rd.xgb 
eXtreme Gradient Boosting 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  eta        max_depth  gamma     colsample_bytree  min_child_weight
  0.1235627  6          7.108038  0.6081206         18              
  0.2151574  6          5.383487  0.6527814          3              
  0.2163256  4          7.489722  0.5196387          3              
  0.3219509  6          1.714202  0.4953224         15              
  0.4144840  7          4.201015  0.4110895         19              
  subsample  nrounds  Accuracy   Kappa    
  0.7220431  358      0.9371803  0.5812145
  0.9921731  624      0.9811608  0.8921203
  0.3477167  985      0.9697354  0.8184426
  0.8988404  919      0.9623118  0.7734267
  0.4979954  718      0.8989109  0.3717268

Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were nrounds = 624, max_depth
 = 6, eta = 0.2151574, gamma = 5.383487, colsample_bytree
 = 0.6527814, min_child_weight = 3 and subsample = 0.9921731.
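Rather than reading the winning combination off the printout, it can also be pulled from the train object directly:

caret.rd.xgb$bestTune        # Row of the tuning grid with the highest accuracy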
Next, a grid search is performed around the best parameters chosen by the random search.

fitControl <- trainControl(method = "cv", number = 5)                       # 5-Fold Cross-Validation

customGrid <- expand.grid(nrounds          = seq(624, 625, by = 1),         # Search around the best parameters from the random search
                          max_depth        = seq(6, 7, by = 1),
                          eta              = seq(0.2, 0.3, by = 0.1),
                          gamma            = seq(5.3, 5.4, by = 0.1),
                          colsample_bytree = seq(0.6, 0.7, by = 0.1),
                          min_child_weight = seq(3, 4, by = 1),
                          subsample        = seq(0.9, 1, by = 0.1))
set.seed(100)                                                              # Fix the seed for cross-validation
caret.gd.xgb <- train(Personal.Loan~., data = UB.trd, method = "xgbTree",  
                      trControl = fitControl, tuneGrid = customGrid,
                      lambda = 0)                                          # Regularization Parameter    

caret.gd.xgb
eXtreme Gradient Boosting 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  eta  max_depth  gamma  colsample_bytree  min_child_weight
  0.2  6          5.3    0.6               3               
  0.2  6          5.3    0.6               3               
  0.2  6          5.3    0.6               3               
  0.2  6          5.3    0.6               3               
  0.2  6          5.3    0.6               4               
  0.2  6          5.3    0.6               4               
  0.2  6          5.3    0.6               4               
  0.2  6          5.3    0.6               4               
  0.2  6          5.3    0.7               3               
  0.2  6          5.3    0.7               3               
  0.2  6          5.3    0.7               3               
  0.2  6          5.3    0.7               3               
  0.2  6          5.3    0.7               4               
  0.2  6          5.3    0.7               4               
  0.2  6          5.3    0.7               4               
  0.2  6          5.3    0.7               4               
  0.2  6          5.4    0.6               3               
  0.2  6          5.4    0.6               3               
  0.2  6          5.4    0.6               3               
  0.2  6          5.4    0.6               3               
  0.2  6          5.4    0.6               4               
  0.2  6          5.4    0.6               4               
  0.2  6          5.4    0.6               4               
  0.2  6          5.4    0.6               4               
  0.2  6          5.4    0.7               3               
  0.2  6          5.4    0.7               3               
  0.2  6          5.4    0.7               3               
  0.2  6          5.4    0.7               3               
  0.2  6          5.4    0.7               4               
  0.2  6          5.4    0.7               4               
  0.2  6          5.4    0.7               4               
  0.2  6          5.4    0.7               4               
  0.2  7          5.3    0.6               3               
  0.2  7          5.3    0.6               3               
  0.2  7          5.3    0.6               3               
  0.2  7          5.3    0.6               3               
  0.2  7          5.3    0.6               4               
  0.2  7          5.3    0.6               4               
  0.2  7          5.3    0.6               4               
  0.2  7          5.3    0.6               4               
  0.2  7          5.3    0.7               3               
  0.2  7          5.3    0.7               3               
  0.2  7          5.3    0.7               3               
  0.2  7          5.3    0.7               3               
  0.2  7          5.3    0.7               4               
  0.2  7          5.3    0.7               4               
  0.2  7          5.3    0.7               4               
  0.2  7          5.3    0.7               4               
  0.2  7          5.4    0.6               3               
  0.2  7          5.4    0.6               3               
  0.2  7          5.4    0.6               3               
  0.2  7          5.4    0.6               3               
  0.2  7          5.4    0.6               4               
  0.2  7          5.4    0.6               4               
  0.2  7          5.4    0.6               4               
  0.2  7          5.4    0.6               4               
  0.2  7          5.4    0.7               3               
  0.2  7          5.4    0.7               3               
  0.2  7          5.4    0.7               3               
  0.2  7          5.4    0.7               3               
  0.2  7          5.4    0.7               4               
  0.2  7          5.4    0.7               4               
  0.2  7          5.4    0.7               4               
  0.2  7          5.4    0.7               4               
  0.3  6          5.3    0.6               3               
  0.3  6          5.3    0.6               3               
  0.3  6          5.3    0.6               3               
  0.3  6          5.3    0.6               3               
  0.3  6          5.3    0.6               4               
  0.3  6          5.3    0.6               4               
  0.3  6          5.3    0.6               4               
  0.3  6          5.3    0.6               4               
  0.3  6          5.3    0.7               3               
  0.3  6          5.3    0.7               3               
  0.3  6          5.3    0.7               3               
  0.3  6          5.3    0.7               3               
  0.3  6          5.3    0.7               4               
  0.3  6          5.3    0.7               4               
  0.3  6          5.3    0.7               4               
  0.3  6          5.3    0.7               4               
  0.3  6          5.4    0.6               3               
  0.3  6          5.4    0.6               3               
  0.3  6          5.4    0.6               3               
  0.3  6          5.4    0.6               3               
  0.3  6          5.4    0.6               4               
  0.3  6          5.4    0.6               4               
  0.3  6          5.4    0.6               4               
  0.3  6          5.4    0.6               4               
  0.3  6          5.4    0.7               3               
  0.3  6          5.4    0.7               3               
  0.3  6          5.4    0.7               3               
  0.3  6          5.4    0.7               3               
  0.3  6          5.4    0.7               4               
  0.3  6          5.4    0.7               4               
  0.3  6          5.4    0.7               4               
  0.3  6          5.4    0.7               4               
  0.3  7          5.3    0.6               3               
  0.3  7          5.3    0.6               3               
  0.3  7          5.3    0.6               3               
  0.3  7          5.3    0.6               3               
  0.3  7          5.3    0.6               4               
  0.3  7          5.3    0.6               4               
  0.3  7          5.3    0.6               4               
  0.3  7          5.3    0.6               4               
  0.3  7          5.3    0.7               3               
  0.3  7          5.3    0.7               3               
  0.3  7          5.3    0.7               3               
  0.3  7          5.3    0.7               3               
  0.3  7          5.3    0.7               4               
  0.3  7          5.3    0.7               4               
  0.3  7          5.3    0.7               4               
  0.3  7          5.3    0.7               4               
  0.3  7          5.4    0.6               3               
  0.3  7          5.4    0.6               3               
  0.3  7          5.4    0.6               3               
  0.3  7          5.4    0.6               3               
  0.3  7          5.4    0.6               4               
  0.3  7          5.4    0.6               4               
  0.3  7          5.4    0.6               4               
  0.3  7          5.4    0.6               4               
  0.3  7          5.4    0.7               3               
  0.3  7          5.4    0.7               3               
  0.3  7          5.4    0.7               3               
  0.3  7          5.4    0.7               3               
  0.3  7          5.4    0.7               4               
  0.3  7          5.4    0.7               4               
  0.3  7          5.4    0.7               4               
  0.3  7          5.4    0.7               4               
  subsample  nrounds  Accuracy   Kappa    
  0.9        624      0.9811624  0.8919530
  0.9        625      0.9811624  0.8919530
  1.0        624      0.9817289  0.8955944
  1.0        625      0.9817289  0.8955944
  0.9        624      0.9823020  0.8989565
  0.9        625      0.9817306  0.8959285
  1.0        624      0.9800179  0.8850897
  1.0        625      0.9800179  0.8850897
  0.9        624      0.9777338  0.8725072
  0.9        625      0.9777338  0.8725072
  1.0        624      0.9805893  0.8893606
  1.0        625      0.9805893  0.8893606
  0.9        624      0.9811608  0.8928304
  0.9        625      0.9811608  0.8928304
  1.0        624      0.9794432  0.8828790
  1.0        625      0.9794432  0.8828790
  0.9        624      0.9783036  0.8770100
  0.9        625      0.9777322  0.8734481
  1.0        624      0.9783053  0.8739334
  1.0        625      0.9783053  0.8739334
  0.9        624      0.9783069  0.8742904
  0.9        625      0.9783069  0.8742904
  1.0        624      0.9817338  0.8950015
  1.0        625      0.9817338  0.8950015
  0.9        624      0.9811624  0.8920329
  0.9        625      0.9811624  0.8920329
  1.0        624      0.9811575  0.8925663
  1.0        625      0.9811575  0.8925663
  0.9        624      0.9783053  0.8751519
  0.9        625      0.9783053  0.8751519
  1.0        624      0.9805877  0.8894179
  1.0        625      0.9805877  0.8894179
  0.9        624      0.9800147  0.8873730
  0.9        625      0.9800147  0.8873730
  1.0        624      0.9783085  0.8731261
  1.0        625      0.9783085  0.8731261
  0.9        624      0.9794449  0.8823109
  0.9        625      0.9794449  0.8823109
  1.0        624      0.9771640  0.8682078
  1.0        625      0.9771640  0.8682078
  0.9        624      0.9811591  0.8931233
  0.9        625      0.9811591  0.8931233
  1.0        624      0.9823053  0.8980542
  1.0        625      0.9823053  0.8980542
  0.9        624      0.9817322  0.8951524
  0.9        625      0.9817322  0.8951524
  1.0        624      0.9811608  0.8918731
  1.0        625      0.9811608  0.8918731
  0.9        624      0.9800163  0.8867696
  0.9        625      0.9800163  0.8867696
  1.0        624      0.9811591  0.8923666
  1.0        625      0.9811591  0.8923666
  0.9        624      0.9805893  0.8887474
  0.9        625      0.9805893  0.8887474
  1.0        624      0.9765910  0.8641748
  1.0        625      0.9765910  0.8641748
  0.9        624      0.9794481  0.8820096
  0.9        625      0.9800179  0.8857224
  1.0        624      0.9794481  0.8812937
  1.0        625      0.9794481  0.8812937
  0.9        624      0.9783053  0.8758413
  0.9        625      0.9783053  0.8758413
  1.0        624      0.9800179  0.8846535
  1.0        625      0.9800179  0.8846535
  0.9        624      0.9794481  0.8812038
  0.9        625      0.9794481  0.8812038
  1.0        624      0.9805893  0.8887474
  1.0        625      0.9805893  0.8887474
  0.9        624      0.9805893  0.8886546
  0.9        625      0.9805893  0.8886546
  1.0        624      0.9794465  0.8817076
  1.0        625      0.9794465  0.8817076
  0.9        624      0.9794449  0.8830880
  0.9        625      0.9794449  0.8830880
  1.0        624      0.9805893  0.8891734
  1.0        625      0.9805893  0.8891734
  0.9        624      0.9788734  0.8788344
  0.9        625      0.9788734  0.8788344
  1.0        624      0.9828702  0.9022731
  1.0        625      0.9828702  0.9022731
  0.9        624      0.9811591  0.8919249
  0.9        625      0.9805877  0.8888818
  1.0        624      0.9783036  0.8771763
  1.0        625      0.9783036  0.8771763
  0.9        624      0.9811591  0.8920659
  0.9        625      0.9817306  0.8956209
  1.0        624      0.9771624  0.8674133
  1.0        625      0.9771624  0.8674133
  0.9        624      0.9828734  0.9009698
  0.9        625      0.9828734  0.9009698
  1.0        624      0.9794449  0.8822959
  1.0        625      0.9794449  0.8822959
  0.9        624      0.9788751  0.8782051
  0.9        625      0.9788751  0.8782051
  1.0        624      0.9777306  0.8712897
  1.0        625      0.9777306  0.8712897
  0.9        624      0.9783036  0.8781373
  0.9        625      0.9783036  0.8781373
  1.0        624      0.9748751  0.8544771
  1.0        625      0.9748751  0.8544771
  0.9        624      0.9794497  0.8818147
  0.9        625      0.9800212  0.8854641
  1.0        624      0.9771559  0.8694640
  1.0        625      0.9771559  0.8694640
  0.9        624      0.9794449  0.8834449
  0.9        625      0.9794449  0.8834449
  1.0        624      0.9828751  0.9024489
  1.0        625      0.9828751  0.9024489
  0.9        624      0.9788734  0.8780400
  0.9        625      0.9783036  0.8744209
  1.0        624      0.9788767  0.8767860
  1.0        625      0.9788767  0.8767860
  0.9        624      0.9805910  0.8890289
  0.9        625      0.9800195  0.8859762
  1.0        624      0.9783036  0.8757023
  1.0        625      0.9783036  0.8757023
  0.9        624      0.9811608  0.8926284
  0.9        625      0.9811608  0.8926284
  1.0        624      0.9777354  0.8710367
  1.0        625      0.9777354  0.8710367
  0.9        624      0.9811608  0.8926377
  0.9        625      0.9811608  0.8926377
  1.0        624      0.9777338  0.8704176
  1.0        625      0.9777338  0.8704176
  0.9        624      0.9817338  0.8954059
  0.9        625      0.9811624  0.8920340
  1.0        624      0.9800179  0.8852626
  1.0        625      0.9800179  0.8852626

Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were nrounds = 624, max_depth
 = 7, eta = 0.3, gamma = 5.3, colsample_bytree =
 0.7, min_child_weight = 3 and subsample = 1.

5-1-1. Variable Importance

xgbImp <- varImp(caret.gd.xgb, scale = FALSE)
plot(xgbImp)


5-2. Model Evaluation

# Predict the classes of the test data with the fitted model
caret.gd.xgb.pred <- predict(caret.gd.xgb, newdata=UB.ted)       # predict(xgboost model, Test Data)

5-2-1. ConfusionMatrix

confusionMatrix(caret.gd.xgb.pred, UB.ted$Personal.Loan, positive = "yes")  # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  670  12
       yes   3  64
                                          
               Accuracy : 0.98            
                 95% CI : (0.9672, 0.9887)
    No Information Rate : 0.8985          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.8841          
                                          
 Mcnemar's Test P-Value : 0.03887         
                                          
            Sensitivity : 0.84211         
            Specificity : 0.99554         
         Pos Pred Value : 0.95522         
         Neg Pred Value : 0.98240         
             Prevalence : 0.10147         
         Detection Rate : 0.08545         
   Detection Prevalence : 0.08945         
      Balanced Accuracy : 0.91882         
                                          
       'Positive' Class : yes             
                                          


5-2-2. ROC Curve

1) Package “pROC”

pacman::p_load("pROC")                          

test.xgb.prob <- predict(caret.gd.xgb, newdata = UB.ted, type = "prob")  # Predicted probability of each class for the test data, from the model fitted on the training data
test.xgb.prob <- test.xgb.prob[,2]                                       # Predicted probability of "yes"


ac           <- UB.ted$Personal.Loan                                     # Actual class

pp           <- as.numeric(test.xgb.prob)                                # Predicted probability of "yes"


xgb.roc      <- roc(ac, pp, plot = T, col = "red")                       # roc(actual class, predicted probability)

auc          <- round(auc(xgb.roc), 3)                                   # AUC 
legend("bottomright",legend = auc, bty = "n")
detach(package:pROC)


2) Package “Epi”

pacman::p_load("Epi")                        
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")

ROC(pp, ac, plot="ROC")       # ROC(predicted probability, actual class)                                  
detach(package:Epi)


3) Package “ROCR”

pacman::p_load("ROCR")                      

xgb.pred <- prediction(pp, ac)                      # prediction(predicted probability, actual class)

xgb.perf <- performance(xgb.pred, "tpr", "fpr")     # performance(prediction object, "TPR (sensitivity)", "FPR (1 - specificity)")                        
plot(xgb.perf, col = "red")                         # ROC Curve
abline(0,1, col = "black")


perf.auc <- performance(xgb.pred, "auc")            # AUC        

auc <- attributes(perf.auc)$y.values                  
legend("bottomright",legend = auc,bty = "n") 


5-2-3. Lift Chart

1) Package “ROCR”

xgb.lift <- performance(xgb.pred,"lift", "rpp")      # Lift chart
plot(xgb.lift, colorize = T, lwd = 2)      
detach(package:ROCR)


2) Package “lift”

pacman::p_load("lift")

ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes",1,0)     # Convert the actual class to numeric

plotLift(pp, ac.numeric, cumulative = T, n.buckets = 24)  # plotLift(predicted probability, actual class)
TopDecileLift(pp, ac.numeric)                             # Print the top-decile (top 10%) lift
[1] 8.935
detach(package:lift)

6. Model Comparison

plot(ada.roc, col="red")         # ROC Curve
par(new=TRUE)
plot(gbm.roc, col="green")       # ROC Curve
par(new=TRUE)
plot(xgb.roc, col="orange")      # ROC Curve

legend("bottomright", legend=c( "AdaBoost", "GBM", "XGBoost" ),
       col=c( "red", "green", "orange"), lty=c(1,1,1))
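Since the three curves nearly overlap, comparing the AUC values directly makes the ranking explicit (pROC was detached above, hence the explicit namespace):

c(AdaBoost = pROC::auc(ada.roc),
  GBM      = pROC::auc(gbm.roc),
  XGBoost  = pROC::auc(xgb.roc))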
