Bagging

Machine Learning

R Code for Random Forest (Bagging)

Yeongeun Jeon , Jeongwook Lee , Jung In Seo
10-15-2020

Bagging is an algorithm that builds models independently on bootstrap samples, and its most representative technique is Random Forest. Here we apply Random Forest to an example dataset.
The example dataset, "Universal Bank_Main", contains data on customers of Universal Bank (source: Data Mining for Business Intelligence, Shmueli et al. 2010). It has 2,500 observations and 13 variables, and the target is Personal.Loan.
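The bagging idea described above can be sketched in a few lines of base R: draw bootstrap resamples, fit a model on each, and aggregate the predictions by majority vote. The sketch below is only a conceptual illustration with a deliberately trivial "model" (the majority class of each resample); Random Forest replaces it with a decision tree plus random feature selection at each split.

```r
# Conceptual sketch of bagging: bootstrap resampling + majority-vote aggregation.
# The "model" here is trivial on purpose; it is not the Random Forest procedure.
set.seed(1)
y <- factor(c(rep("no", 90), rep("yes", 10)))   # toy target

B <- 25                                         # number of bootstrap models
votes <- replicate(B, {
  idx  <- sample(length(y), replace = TRUE)     # bootstrap resample
  boot <- y[idx]                                # "training data" for this model
  names(which.max(table(boot)))                 # trivial model: majority class
})

final <- names(which.max(table(votes)))         # bagged prediction: majority vote
final
```

Each bootstrap model sees a slightly different sample, and the ensemble prediction is the class receiving the most votes.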



1. Loading the Data

pacman::p_load("data.table", "dplyr")     

UB   <- fread(paste(getwd(),"Universal Bank_Main.csv", sep="/")) %>%   # Load the data
  data.frame() %>%                                                     # Convert to data frame
  mutate(Personal.Loan = ifelse(Personal.Loan==1, "yes","no")) %>%     # Recode target as character for classification
  select(-1)                                                           # Drop the ID variable

cols <- c("Family", "Education", "Personal.Loan", "Securities.Account", 
          "CD.Account", "Online", "CreditCard")

UB   <- UB %>% 
  mutate_at(cols, as.factor)                                          # Convert to factors

glimpse(UB)                                                           # Data structure            
Rows: 2,500
Columns: 13
$ Age                <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6~
$ Experience         <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5~
$ Income             <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,~
$ ZIP.Code           <int> 91107, 90089, 94720, 94112, 91330, 92121,~
$ Family             <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,~
$ CCAvg              <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0~
$ Education          <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,~
$ Mortgage           <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0~
$ Personal.Loan      <fct> no, no, no, no, no, no, no, no, no, yes, ~
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
$ CD.Account         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Online             <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,~
$ CreditCard         <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,~

2. Data Partitioning

pacman::p_load("caret")
# Partition (Training Data : Test Data = 7:3)
y      <- UB$Personal.Loan                       # Target

set.seed(200)
ind    <- createDataPartition(y, p=0.7, list=F)  # Extract 70% for the training data
UB.trd <- UB[ind,]                               # Training Data

UB.ted <- UB[-ind,]                              # Test Data

detach(package:caret)
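createDataPartition performs a stratified split: it samples within each level of the target so that the class proportions are preserved in the training set. A base-R sketch of the same idea on toy data (this mimics the behavior; it is not caret's implementation):

```r
# Stratified 70/30 split in base R: sample 70% within each class separately,
# so that the class proportions of the target are preserved.
set.seed(200)
y   <- factor(c(rep("no", 80), rep("yes", 20)))          # toy target
idx <- unlist(lapply(split(seq_along(y), y), function(i)
  sample(i, size = floor(0.7 * length(i)))))             # 70% per class

train_y <- y[idx]
prop.table(table(train_y))                               # proportions match y
```

A plain random split on a rare positive class could, by chance, leave very few "yes" cases in the training data; stratification avoids this.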

3. Random Forest

Random Forest is the most widely used bagging technique. Packages that implement Random Forest include "randomForest" and "party"; for this example we use "randomForest". For more details, see here.

randomForest(formula, data, ntree, importance, mtry, ...)
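For classification, the conventional default for mtry (the number of candidate variables tried at each split) is the square root of the number of predictors. The example data has 13 variables, one of which is the target, so p = 12 and the effective mtry is 3, matching the "No. of variables tried at each split" line in the fitted model's output below. A quick check:

```r
# Default mtry for classification: sqrt(p), truncated to an integer,
# where p is the number of predictors (13 variables minus the target).
p    <- 12
mtry <- floor(sqrt(p))
mtry
# [1] 3
```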

3-1. Model Fitting

pacman::p_load("randomForest")

set.seed(100)
UB.rf <- randomForest(Personal.Loan~., data=UB.trd,
                      ntree=100, importance=T, mtry=sqrt(12))  # randomForest(formula, data, ntree, mtry=sqrt(p))
                                                              
UB.rf

Call:
 randomForest(formula = Personal.Loan ~ ., data = UB.trd, ntree = 100,      importance = T, mtry = sqrt(12)) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 3

        OOB estimate of  error rate: 1.71%
Confusion matrix:
      no yes class.error
no  1567   4 0.002546149
yes   26 154 0.144444444
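The OOB figures printed above can be reproduced directly from the OOB confusion matrix: the error rate is the number of misclassified observations over the total, and each class error is that row's misclassifications over its row total.

```r
# Reconstruct the OOB summary from the confusion matrix printed above.
conf <- matrix(c(1567,   4,
                   26, 154), nrow = 2, byrow = TRUE,
               dimnames = list(c("no", "yes"), c("no", "yes")))

oob.error <- (conf["no", "yes"] + conf["yes", "no"]) / sum(conf)
round(100 * oob.error, 2)                       # 1.71 (%)

class.error <- 1 - diag(conf) / rowSums(conf)
round(class.error, 9)                           # 0.002546149, 0.144444444
```

Note the asymmetry: the minority class "yes" has a much higher class error (14.4%) than "no" (0.25%), which the overall 1.71% OOB rate hides.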

3-1-1. Variable Importance

# Variable importance
varImpPlot(UB.rf)

3-1-2. OOB Error

head(UB.rf$err.rate)
            OOB         no       yes
[1,] 0.02435312 0.01507538 0.1166667
[2,] 0.04293893 0.02224576 0.2307692
[3,] 0.04861111 0.02577320 0.2500000
[4,] 0.04958678 0.02143951 0.3013699
[5,] 0.04501608 0.02214286 0.2516129
[6,] 0.04406365 0.02178353 0.2424242
# Plot for Error
pacman::p_load("ggplot2")
oob.error.data <- data.frame(Trees=rep(1:nrow(UB.rf$err.rate),times=3), 
                             Type=rep(c("OOB","No","Yes"), 
                                      each=nrow(UB.rf$err.rate)),
                             Error=c(UB.rf$err.rate[,"OOB"],
                                     UB.rf$err.rate[,"no"],
                                     UB.rf$err.rate[,"yes"]))



ggplot(data=oob.error.data, aes(x=Trees, y=Error)) + 
  geom_line(aes(color=Type)) + theme_bw()
detach(package:ggplot2)

3-2. Model Evaluation

# Predict on the test data with the fitted model
UB.pred.rf <- predict(UB.rf, newdata=UB.ted)       # predict(Random Forest model, Test Data)

Confusion Matrix

pacman::p_load("caret")

confusionMatrix(UB.pred.rf, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  671  13
       yes   2  63
                                          
               Accuracy : 0.98            
                 95% CI : (0.9672, 0.9887)
    No Information Rate : 0.8985          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8826          
                                          
 Mcnemar's Test P-Value : 0.009823        
                                          
            Sensitivity : 0.82895         
            Specificity : 0.99703         
         Pos Pred Value : 0.96923         
         Neg Pred Value : 0.98099         
             Prevalence : 0.10147         
         Detection Rate : 0.08411         
   Detection Prevalence : 0.08678         
      Balanced Accuracy : 0.91299         
                                          
       'Positive' Class : yes             
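The headline statistics above follow directly from the confusion matrix; recomputing them by hand is a useful sanity check. With rows as predictions and columns as the reference, and "yes" as the positive class:

```r
# Recompute key statistics from the test-set confusion matrix above.
TN <- 671; FN <- 13          # predicted "no":  actual no / actual yes
FP <- 2;   TP <- 63          # predicted "yes": actual no / actual yes
n  <- TN + FN + FP + TP

accuracy    <- (TP + TN) / n          # ~0.98
sensitivity <- TP / (TP + FN)         # recall of "yes"
specificity <- TN / (TN + FP)
round(c(accuracy, sensitivity, specificity), 5)
# [1] 0.97997 0.82895 0.99703
```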
                                          


ROC Curve

1) Package “pROC”

pacman::p_load("pROC")                          

ac     <- UB.ted$Personal.Loan                              # Actual class

pp     <- predict(UB.rf, newdata=UB.ted, type="prob")[,2]   # Predicted probability of "yes"

rf.roc <- roc(ac, pp, plot=T, col="red")                    # roc(actual class, predicted probability)

auc <- round(auc(rf.roc), 3)                                # AUC 
legend("bottomright",legend=auc, bty="n")
detach(package:pROC)
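Under the hood, an ROC curve is just the true positive rate plotted against the false positive rate as the classification threshold sweeps over the predicted probabilities, and the AUC is the area under that curve. A minimal base-R sketch on toy data (this is not pROC's implementation):

```r
# ROC by hand: sweep thresholds, compute TPR/FPR, integrate by trapezoids.
prob  <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)   # toy predicted probabilities
truth <- c(1,   1,   0,   1,   0,   0)     # toy actual classes (1 = "yes")

thr <- sort(unique(prob), decreasing = TRUE)
tpr <- sapply(thr, function(t) mean(prob[truth == 1] >= t))
fpr <- sapply(thr, function(t) mean(prob[truth == 0] >= t))

# Prepend the (0, 0) corner, then trapezoidal integration for the AUC.
fpr <- c(0, fpr); tpr <- c(0, tpr)
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

For these toy scores the AUC is 8/9: of the nine (positive, negative) pairs, eight have the positive scored higher, which is the rank interpretation of AUC.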


2) Package “Epi”

# install.packages("Epi")
pacman::p_load("Epi")                        
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")

ROC(pp, ac, plot="ROC")       # ROC(predicted probability, actual class)                                  
detach(package:Epi)


3) Package “ROCR”

pacman::p_load("ROCR")                      

rf.pred <- prediction(pp, ac)                       # prediction(predicted probability, actual class)
  
rf.perf <- performance(rf.pred, "tpr", "fpr")       # performance(prediction object, "sensitivity", "1-specificity")                        
plot(rf.perf, col="red")                            # ROC Curve
abline(0,1, col="black")


perf.auc <- performance(rf.pred, "auc")             # AUC        

auc <- attributes(perf.auc)$y.values                  
legend("bottomright",legend=auc,bty="n") 


Lift Chart

1) Package “ROCR”

rf.lift <- performance(rf.pred,"lift", "rpp")       # Lift chart
plot(rf.lift, colorize=T, lwd=2)      
detach(package:ROCR)


2) Package “lift”

pacman::p_load("lift")

ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes",1,0)     # Convert actual class to numeric

plotLift(pp, ac.numeric, cumulative = T, n.buckets =24)   # plotLift(predicted probability, actual class)
TopDecileLift(pp, ac.numeric)                             # Lift of the top 10%
[1] 9.067
detach(package:lift)
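Top-decile lift compares the response rate among the 10% of cases with the highest predicted probabilities to the overall response rate; the 9.067 above means the top-scored decile contains "yes" customers at roughly nine times the base rate. A toy computation of the same quantity (this is not the lift package's code):

```r
# Top-decile lift by hand: response rate in the top 10% of scores,
# divided by the overall response rate.
prob  <- seq(0.95, 0.05, length.out = 20)                 # toy scores, descending
truth <- c(1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0)       # 4 responders of 20

top  <- order(prob, decreasing = TRUE)[seq_len(ceiling(0.1 * length(prob)))]
lift <- mean(truth[top]) / mean(truth)                    # (2/2) / (4/20)
lift
# [1] 5
```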

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".