Bagging based on Caret

Machine Learning

R Code Using the caret Package for Random Forest (Bagging)

Yeongeun Jeon, Jeongwook Lee, Jung In Seo
10-16-2020

Package "caret"은 다양한 머신러닝 분석을 하나로 모은 패키지이며, trainControl을 이용하여 과적합을 방지할 수 있다. "caret"에서는 Bagging을 이용한 다양한 기법을 수행할 수 있으며, 그 중 가장 많이 쓰이는 Random Forest를 이용하여 예제 데이터를 분석한다. 예제 데이터는 “Universal Bank_Main”로 유니버셜 은행의 고객들에 대한 데이터(출처 : Data Mining for Business Intelligence, Shmueli et al. 2010)이다. 데이터는 총 2500개이며, 변수의 갯수는 13개이다. 여기서 TargetPerson.Loan이다.



1. Loading the Data

pacman::p_load("data.table", "dplyr") 

UB   <- fread(paste(getwd(),"Universal Bank_Main.csv", sep="/")) %>%   # 데이터 불러오기
  data.frame() %>%                                                     # Data frame 변환mutate(Personal.Loan = ifelse(Personal.Loan==1, "yes","no")) %>%     # Character for classification
  select(-1)                                                           # ID변수 제거cols <- c("Family", "Education", "Personal.Loan", "Securities.Account", 
          "CD.Account", "Online", "CreditCard")

UB   <- UB %>% 
  mutate_at(cols, as.factor)                                          # 범주형 변수 변환
 
glimpse(UB)                                                           # 데이터 구조
Rows: 2,500
Columns: 13
$ Age                <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6~
$ Experience         <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5~
$ Income             <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,~
$ ZIP.Code           <int> 91107, 90089, 94720, 94112, 91330, 92121,~
$ Family             <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,~
$ CCAvg              <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0~
$ Education          <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,~
$ Mortgage           <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0~
$ Personal.Loan      <fct> no, no, no, no, no, no, no, no, no, yes, ~
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
$ CD.Account         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Online             <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,~
$ CreditCard         <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,~

2. Data Partitioning

pacman::p_load("caret")
# Partition (Training Data : Test Data = 7:3)
y      <- UB$Personal.Loan                       # Target

set.seed(200)
ind    <- createDataPartition(y, p=0.7, list=F)  # Extract 70% of the data for the Training Data
UB.trd <- UB[ind,]                               # Training Data

UB.ted <- UB[-ind,]                              # Test Data

3. Random Forest

3-1. Finding the Optimal Parameter

To find the optimal parameter of Random Forest, the most widely used bagging technique, a "Random Search" was performed first.

fitControl <- trainControl(method = "cv", number = 5, search = "random")    # 5-Fold Cross Validation
set.seed(100)                                                        # Fix seed for cross-validation
caret.rd.rf <- train(Personal.Loan~., data = UB.trd, method = "rf",  
                     trControl = fitControl, tuneLength = 10,        # tuneLength (number of candidate parameter values to search) 
                     ntree = 500)                                    # Number of trees to grow

caret.rd.rf
Random Forest 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   3    0.9743036  0.8471829
   4    0.9794432  0.8817984
   6    0.9805877  0.8914686
   7    0.9811591  0.8942015
   9    0.9811591  0.8944264
  10    0.9811591  0.8944196
  12    0.9794449  0.8851846
  14    0.9788734  0.8828271

Accuracy was used to select the optimal model using the
 largest value.
The final value used for the model was mtry = 7.
plot(caret.rd.rf)         # Accuracy

Next, a "Grid Search" is performed around the best parameter found by the Random Search.

fitControl <- trainControl(method = "cv", number = 5)    # 5-Fold Cross Validation


customGrid <- expand.grid(mtry = seq(4, 10, by = 1))     # Search around the best parameter from the Random Search
set.seed(100)                                            # Fix seed for cross-validation
caret.gd.rf <- train(Personal.Loan~., data = UB.trd, method = "rf", 
                     trControl = fitControl, tuneGrid = customGrid,
                     ntree = 500)                        # Number of trees to grow

caret.gd.rf
Random Forest 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   4    0.9783036  0.8747493
   5    0.9805861  0.8910712
   6    0.9811591  0.8942015
   7    0.9805877  0.8914899
   8    0.9817289  0.8982840
   9    0.9817306  0.8981210
  10    0.9805877  0.8914542

Accuracy was used to select the optimal model using the
 largest value.
The final value used for the model was mtry = 9.
plot(caret.gd.rf)   # Accuracy

# Final model
caret.gd.rf$finalModel                      

Call:
 randomForest(x = x, y = y, ntree = 500, mtry = param$mtry) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 9

        OOB estimate of  error rate: 1.94%
Confusion matrix:
      no yes class.error
no  1559  12 0.007638447
yes   22 158 0.122222222

3-1-1. Variable Importance

rfImp <- varImp(caret.gd.rf, scale = FALSE)              # Variable importance of the final Random Forest model
plot(rfImp)                                              # Importance plot
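
The importance scores behind the plot can also be inspected directly. Below is a minimal sketch, assuming rfImp$importance is the data frame of importance values returned by varImp() (one row per predictor):

imp.df <- data.frame(Variable   = rownames(rfImp$importance),        # Predictor names
                     Importance = rfImp$importance[, 1])             # Importance score (scale = FALSE)
imp.df[order(-imp.df$Importance), ]                                   # Predictors sorted by importance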

3-1-2. OOB Error

head(caret.gd.rf$finalModel$err.rate)                    # OOB error rate by number of trees
            OOB         no       yes
[1,] 0.04128440 0.02030457 0.2380952
[2,] 0.03831418 0.02234043 0.1826923
[3,] 0.03048780 0.01939292 0.1349206
[4,] 0.03885481 0.02573808 0.1575342
[5,] 0.03121019 0.01770538 0.1518987
[6,] 0.02926829 0.01290761 0.1726190
# Plot for Error
pacman::p_load("ggplot2")

oob.error.data <- data.frame(Trees=rep(1:nrow(caret.gd.rf$finalModel$err.rate),times=3), 
                             Type=rep(c("OOB","No","Yes"), 
                                      each=nrow(caret.gd.rf$finalModel$err.rate)),
                             Error=c(caret.gd.rf$finalModel$err.rate[,"OOB"],
                                     caret.gd.rf$finalModel$err.rate[,"no"],
                                     caret.gd.rf$finalModel$err.rate[,"yes"]))


ggplot(data=oob.error.data, aes(x=Trees, y=Error)) + 
  geom_line(aes(color=Type)) + theme_bw()


3-2. Model Evaluation

# Predict the class of the Test Data with the fitted model
caret.gd.rf.pred <- predict(caret.gd.rf, newdata = UB.ted)   # predict(Random Forest model, Test Data) 

3-2-1. ConfusionMatrix

confusionMatrix(caret.gd.rf.pred, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  666  11
       yes   7  65
                                          
               Accuracy : 0.976           
                 95% CI : (0.9623, 0.9857)
    No Information Rate : 0.8985          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8651          
                                          
 Mcnemar's Test P-Value : 0.4795          
                                          
            Sensitivity : 0.85526         
            Specificity : 0.98960         
         Pos Pred Value : 0.90278         
         Neg Pred Value : 0.98375         
             Prevalence : 0.10147         
         Detection Rate : 0.08678         
   Detection Prevalence : 0.09613         
      Balanced Accuracy : 0.92243         
                                          
       'Positive' Class : yes             
                                          


3-2-2. ROC Curve

1) Package “pROC”

pacman::p_load("pROC")

test.rf.prob <- predict(caret.gd.rf, newdata = UB.ted, type = "prob")  # Predicted probability of each class for the Test Data from the model fitted to the Training Data
test.rf.prob <- test.rf.prob[,2]                                       # Predicted probability of "yes"


ac           <- UB.ted$Personal.Loan                                   # Actual class
pp           <- as.numeric(test.rf.prob)                               # Predicted probability of "yes"

rf.roc       <- roc(ac, pp, plot = T, col = "red")                     # roc(actual class, predicted probability)

auc          <- round(auc(rf.roc),3)
legend("bottomright",legend = auc, bty = "n")
detach(package:pROC)


2) Package “Epi”

pacman::p_load("devtools", "Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")

ROC(pp, ac, plot="ROC")    # ROC(예측 확률, 실제 클래스) / 최적의 cutoff value 예측 가능
detach(package:Epi)


3) Package “ROCR”

pacman::p_load("ROCR")                                                  

rf.pred <- prediction(test.rf.prob, UB.ted$Personal.Loan)    # prediction(predicted probability, actual class)     


rf.perf <- performance(rf.pred, "tpr", "fpr")                # performance(prediction object, "sensitivity (TPR)", "1-specificity (FPR)")                        
plot(rf.perf, col = "red")                                   # ROC Curve
abline(0,1, col = "black")

perf.auc        <- performance(rf.pred, "auc")               # AUC
auc             <- attributes(perf.auc)$y.values
legend("bottomright", legend = auc, bty = "n")


3-2-3. Lift Chart

1) Package “ROCR”

rf.perf       <- performance(rf.pred, "lift", "rpp")          # Lift Chart
plot(rf.perf, colorize = T, lwd = 2)  
detach(package:ROCR)


2) Package “lift”

pacman::p_load("lift")

ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes",1,0)                  # Convert the actual class to numeric

plotLift(test.rf.prob, ac.numeric, cumulative = T, n.buckets = 24)     # plotLift(predicted probability, actual class)
TopDecileLift(test.rf.prob, ac.numeric)                                 # Lift of the top 10%
[1] 8.541
detach(package:lift)
