Bagging

Machine Learning

R Code for Random Forest (Bagging)

Yeongeun Jeon , Jeongwook Lee , Jung In Seo
10-15-2020

Bagging is an algorithm that builds models independently on bootstrap samples, and its most representative technique is Random Forest. Here we apply Random Forest to an example dataset.
The example dataset, "Universal Bank_Main", contains data on customers of Universal Bank (source: Data Mining for Business Intelligence, Shmueli et al. 2010). It has 2,500 observations and 13 variables, and the target is Personal.Loan.
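The bagging idea described above can be sketched in a few lines of base R: draw bootstrap resamples, fit a model on each, and aggregate the predictions by majority vote. The sketch below is only a conceptual illustration with a deliberately trivial "model" (the majority class of each resample); Random Forest replaces it with a decision tree plus random feature selection at each split.

```r
# Conceptual sketch of bagging: bootstrap resampling + majority-vote aggregation.
# The "model" here is trivial on purpose; it is not the Random Forest procedure.
set.seed(1)
y <- factor(c(rep("no", 90), rep("yes", 10)))   # toy target

B <- 25                                         # number of bootstrap models
votes <- replicate(B, {
  idx  <- sample(length(y), replace = TRUE)     # bootstrap resample
  boot <- y[idx]                                # "training data" for this model
  names(which.max(table(boot)))                 # trivial model: majority class
})

final <- names(which.max(table(votes)))         # bagged prediction: majority vote
final
```

Each bootstrap model sees a slightly different sample, and the ensemble prediction is the class receiving the most votes.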



1. Loading the Data

pacman::p_load("data.table", "dplyr")     

UB   <- fread(paste(getwd(),"Universal Bank_Main.csv", sep="/")) %>%   # Load the data
  data.frame() %>%                                                     # Convert to data frame
  mutate(Personal.Loan = ifelse(Personal.Loan==1, "yes","no")) %>%     # Recode target as character for classification
  select(-1)                                                           # Drop the ID variable

cols <- c("Family", "Education", "Personal.Loan", "Securities.Account", 
          "CD.Account", "Online", "CreditCard")

UB   <- UB %>% 
  mutate_at(cols, as.factor)                                          # Convert to factors

glimpse(UB)                                                           # Data structure            
Rows: 2,500
Columns: 13
$ Age                <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6~
$ Experience         <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5~
$ Income             <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,~
$ ZIP.Code           <int> 91107, 90089, 94720, 94112, 91330, 92121,~
$ Family             <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,~
$ CCAvg              <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0~
$ Education          <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,~
$ Mortgage           <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0~
$ Personal.Loan      <fct> no, no, no, no, no, no, no, no, no, yes, ~
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
$ CD.Account         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Online             <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,~
$ CreditCard         <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,~

2. Data Partitioning

pacman::p_load("caret")
# Partition (Training Data : Test Data = 7:3)
y      <- UB$Personal.Loan                       # Target

set.seed(200)
ind    <- createDataPartition(y, p=0.7, list=F)  # Extract 70% for the training data
UB.trd <- UB[ind,]                               # Training Data

UB.ted <- UB[-ind,]                              # Test Data

detach(package:caret)
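createDataPartition performs a stratified split: it samples within each level of the target so that the class proportions are preserved in the training set. A base-R sketch of the same idea on toy data (this mimics the behavior; it is not caret's implementation):

```r
# Stratified 70/30 split in base R: sample 70% within each class separately,
# so that the class proportions of the target are preserved.
set.seed(200)
y   <- factor(c(rep("no", 80), rep("yes", 20)))          # toy target
idx <- unlist(lapply(split(seq_along(y), y), function(i)
  sample(i, size = floor(0.7 * length(i)))))             # 70% per class

train_y <- y[idx]
prop.table(table(train_y))                               # proportions match y
```

A plain random split on a rare positive class could, by chance, leave very few "yes" cases in the training data; stratification avoids this.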

3. Random Forest

Random Forest is the most widely used bagging technique. Packages that implement Random Forest include "randomForest" and "party"; for this example we use "randomForest". For more details, see here.

randomForest(formula, data, ntree, importance, mtry, ...)
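For classification, the conventional default for mtry (the number of candidate variables tried at each split) is the square root of the number of predictors. The example data has 13 variables, one of which is the target, so p = 12 and the effective mtry is 3, matching the "No. of variables tried at each split" line in the fitted model's output below. A quick check:

```r
# Default mtry for classification: sqrt(p), truncated to an integer,
# where p is the number of predictors (13 variables minus the target).
p    <- 12
mtry <- floor(sqrt(p))
mtry
# [1] 3
```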

3-1. Model Fitting

pacman::p_load("randomForest")

set.seed(100)
UB.rf <- randomForest(Personal.Loan~., data=UB.trd,
                      ntree=100, importance=T, mtry=sqrt(12))  # randomForest(formula, data, ntree, mtry=sqrt(p))
                                                              
UB.rf

Call:
 randomForest(formula = Personal.Loan ~ ., data = UB.trd, ntree = 100,      importance = T, mtry = sqrt(12)) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 3

        OOB estimate of  error rate: 1.71%
Confusion matrix:
      no yes class.error
no  1567   4 0.002546149
yes   26 154 0.144444444
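The OOB figures printed above can be reproduced directly from the OOB confusion matrix: the error rate is the number of misclassified observations over the total, and each class error is that row's misclassifications over its row total.

```r
# Reconstruct the OOB summary from the confusion matrix printed above.
conf <- matrix(c(1567,   4,
                   26, 154), nrow = 2, byrow = TRUE,
               dimnames = list(c("no", "yes"), c("no", "yes")))

oob.error <- (conf["no", "yes"] + conf["yes", "no"]) / sum(conf)
round(100 * oob.error, 2)                       # 1.71 (%)

class.error <- 1 - diag(conf) / rowSums(conf)
round(class.error, 9)                           # 0.002546149, 0.144444444
```

Note the asymmetry: the minority class "yes" has a much higher class error (14.4%) than "no" (0.25%), which the overall 1.71% OOB rate hides.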

3-1-1. Variable Importance

# Variable importance
varImpPlot(UB.rf)

3-1-2. OOB Error

head(UB.rf$err.rate)
            OOB         no       yes
[1,] 0.02435312 0.01507538 0.1166667
[2,] 0.04293893 0.02224576 0.2307692
[3,] 0.04861111 0.02577320 0.2500000
[4,] 0.04958678 0.02143951 0.3013699
[5,] 0.04501608 0.02214286 0.2516129
[6,] 0.04406365 0.02178353 0.2424242
# Plot for Error
pacman::p_load("ggplot2")
oob.error.data <- data.frame(Trees=rep(1:nrow(UB.rf$err.rate),times=3), 
                             Type=rep(c("OOB","No","Yes"), 
                                      each=nrow(UB.rf$err.rate)),
                             Error=c(UB.rf$err.rate[,"OOB"],
                                     UB.rf$err.rate[,"no"],
                                     UB.rf$err.rate[,"yes"]))



ggplot(data=oob.error.data, aes(x=Trees, y=Error)) + 
  geom_line(aes(color=Type)) + theme_bw()
detach(package:ggplot2)

3-2. Model Evaluation

# Predict on the test data with the fitted model
UB.pred.rf <- predict(UB.rf, newdata=UB.ted)       # predict(Random Forest model, Test Data)

Confusion Matrix

pacman::p_load("caret")

confusionMatrix(UB.pred.rf, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  671  13
       yes   2  63
                                          
               Accuracy : 0.98            
                 95% CI : (0.9672, 0.9887)
    No Information Rate : 0.8985          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8826          
                                          
 Mcnemar's Test P-Value : 0.009823        
                                          
            Sensitivity : 0.82895         
            Specificity : 0.99703         
         Pos Pred Value : 0.96923         
         Neg Pred Value : 0.98099         
             Prevalence : 0.10147         
         Detection Rate : 0.08411         
   Detection Prevalence : 0.08678         
      Balanced Accuracy : 0.91299         
                                          
       'Positive' Class : yes             
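The headline statistics above follow directly from the confusion matrix; recomputing them by hand is a useful sanity check. With rows as predictions and columns as the reference, and "yes" as the positive class:

```r
# Recompute key statistics from the test-set confusion matrix above.
TN <- 671; FN <- 13          # predicted "no":  actual no / actual yes
FP <- 2;   TP <- 63          # predicted "yes": actual no / actual yes
n  <- TN + FN + FP + TP

accuracy    <- (TP + TN) / n          # ~0.98
sensitivity <- TP / (TP + FN)         # recall of "yes"
specificity <- TN / (TN + FP)
round(c(accuracy, sensitivity, specificity), 5)
# [1] 0.97997 0.82895 0.99703
```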
                                          


ROC Curve

1) Package “pROC”

pacman::p_load("pROC")                          

ac     <- UB.ted$Personal.Loan                              # Actual class

pp     <- predict(UB.rf, newdata=UB.ted, type="prob")[,2]   # Predicted probability of "yes"

rf.roc <- roc(ac, pp, plot=T, col="red")                    # roc(actual class, predicted probability)

auc <- round(auc(rf.roc), 3)                                # AUC 
legend("bottomright",legend=auc, bty="n")
detach(package:pROC)
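Under the hood, an ROC curve is just the true positive rate plotted against the false positive rate as the classification threshold sweeps over the predicted probabilities, and the AUC is the area under that curve. A minimal base-R sketch on toy data (this is not pROC's implementation):

```r
# ROC by hand: sweep thresholds, compute TPR/FPR, integrate by trapezoids.
prob  <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)   # toy predicted probabilities
truth <- c(1,   1,   0,   1,   0,   0)     # toy actual classes (1 = "yes")

thr <- sort(unique(prob), decreasing = TRUE)
tpr <- sapply(thr, function(t) mean(prob[truth == 1] >= t))
fpr <- sapply(thr, function(t) mean(prob[truth == 0] >= t))

# Prepend the (0, 0) corner, then trapezoidal integration for the AUC.
fpr <- c(0, fpr); tpr <- c(0, tpr)
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

For these toy scores the AUC is 8/9: of the nine (positive, negative) pairs, eight have the positive scored higher, which is the rank interpretation of AUC.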


2) Package “Epi”

# install.packages("Epi")
pacman::p_load("Epi")                        
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")

ROC(pp, ac, plot="ROC")       # ROC(predicted probability, actual class)                                  
detach(package:Epi)


3) Package “ROCR”

pacman::p_load("ROCR")                      

rf.pred <- prediction(pp, ac)                       # prediction(predicted probability, actual class)
  
rf.perf <- performance(rf.pred, "tpr", "fpr")       # performance(prediction object, "sensitivity", "1-specificity")                        
plot(rf.perf, col="red")                            # ROC Curve
abline(0,1, col="black")


perf.auc <- performance(rf.pred, "auc")             # AUC        

auc <- attributes(perf.auc)$y.values                  
legend("bottomright",legend=auc,bty="n") 


Lift Chart

1) Package “ROCR”

rf.lift <- performance(rf.pred,"lift", "rpp")       # Lift chart
plot(rf.lift, colorize=T, lwd=2)      
detach(package:ROCR)


2) Package “lift”

pacman::p_load("lift")

ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes",1,0)     # Convert actual class to numeric

plotLift(pp, ac.numeric, cumulative = T, n.buckets =24)   # plotLift(predicted probability, actual class)
TopDecileLift(pp, ac.numeric)                             # Lift of the top 10%
[1] 9.067
detach(package:lift)
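Top-decile lift compares the response rate among the 10% of cases with the highest predicted probabilities to the overall response rate; the 9.067 above means the top-scored decile contains "yes" customers at roughly nine times the base rate. A toy computation of the same quantity (this is not the lift package's code):

```r
# Top-decile lift by hand: response rate in the top 10% of scores,
# divided by the overall response rate.
prob  <- seq(0.95, 0.05, length.out = 20)                 # toy scores, descending
truth <- c(1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0)       # 4 responders of 20

top  <- order(prob, decreasing = TRUE)[seq_len(ceiling(0.1 * length(prob)))]
lift <- mean(truth[top]) / mean(truth)                    # (2/2) / (4/20)
lift
# [1] 5
```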

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".