Bagging based on Caret

Machine Learning

R Code Using the caret Package for Random Forest (Bagging)

Yeongeun Jeon, Jeongwook Lee, Jung In Seo
10-16-2020

Package "caret"은 다양한 머신러닝 분석을 하나로 모은 패키지이며, trainControl을 이용하여 과적합을 방지할 수 있다. "caret"에서는 Bagging을 이용한 다양한 기법을 수행할 수 있으며, 그 중 가장 많이 쓰이는 Random Forest를 이용하여 예제 데이터를 분석한다. 예제 데이터는 “Universal Bank_Main”로 유니버셜 은행의 고객들에 대한 데이터(출처 : Data Mining for Business Intelligence, Shmueli et al. 2010)이다. 데이터는 총 2500개이며, 변수의 갯수는 13개이다. 여기서 TargetPerson.Loan이다.



1. Loading the Data

pacman::p_load("data.table", "dplyr") 

UB   <- fread(paste(getwd(),"Universal Bank_Main.csv", sep="/")) %>%   # 데이터 불러오기
  data.frame() %>%                                                     # Data frame 변환mutate(Personal.Loan = ifelse(Personal.Loan==1, "yes","no")) %>%     # Character for classification
  select(-1)                                                           # ID변수 제거cols <- c("Family", "Education", "Personal.Loan", "Securities.Account", 
          "CD.Account", "Online", "CreditCard")

UB   <- UB %>% 
  mutate_at(cols, as.factor)                                          # 범주형 변수 변환
 
glimpse(UB)                                                           # 데이터 구조
Rows: 2,500
Columns: 13
$ Age                <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6~
$ Experience         <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5~
$ Income             <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,~
$ ZIP.Code           <int> 91107, 90089, 94720, 94112, 91330, 92121,~
$ Family             <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,~
$ CCAvg              <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0~
$ Education          <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,~
$ Mortgage           <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0~
$ Personal.Loan      <fct> no, no, no, no, no, no, no, no, no, yes, ~
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
$ CD.Account         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Online             <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,~
$ CreditCard         <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,~

2. Data Partitioning

pacman::p_load("caret")
# Partition (Training Data : Test Data = 7:3)
y      <- UB$Personal.Loan                       # Target

set.seed(200)
ind    <- createDataPartition(y, p=0.7, list=F)  # Extract 70% of the data for the Training Data
UB.trd <- UB[ind,]                               # Training Data

UB.ted <- UB[-ind,]                              # Test Data

3. Random Forest

3-1. Finding the Optimal Parameter

To find the optimal parameter of Random Forest, the most widely used bagging technique, a "Random Search" was performed first.

fitControl <- trainControl(method = "cv", number = 5, search = "random")    # 5-Fold Cross Validation
set.seed(100)                                                        # Fix seed for cross-validation
caret.rd.rf <- train(Personal.Loan~., data = UB.trd, method = "rf",  
                     trControl = fitControl, tuneLength = 10,        # tuneLength (number of candidate parameter values to search) 
                     ntree = 500)                                    # Number of trees to grow

caret.rd.rf
Random Forest 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   3    0.9743036  0.8471829
   4    0.9794432  0.8817984
   6    0.9805877  0.8914686
   7    0.9811591  0.8942015
   9    0.9811591  0.8944264
  10    0.9811591  0.8944196
  12    0.9794449  0.8851846
  14    0.9788734  0.8828271

Accuracy was used to select the optimal model using the
 largest value.
The final value used for the model was mtry = 7.
plot(caret.rd.rf)         # Accuracy

Next, a "Grid Search" is performed around the best parameter found by the Random Search.

fitControl <- trainControl(method = "cv", number = 5)    # 5-Fold Cross Validation


customGrid <- expand.grid(mtry = seq(4, 10, by = 1))     # Search around the best parameter from the Random Search
set.seed(100)                                            # Fix seed for cross-validation
caret.gd.rf <- train(Personal.Loan~., data = UB.trd, method = "rf", 
                     trControl = fitControl, tuneGrid = customGrid,
                     ntree = 500)                        # Number of trees to grow

caret.gd.rf
Random Forest 

1751 samples
  12 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   4    0.9783036  0.8747493
   5    0.9805861  0.8910712
   6    0.9811591  0.8942015
   7    0.9805877  0.8914899
   8    0.9817289  0.8982840
   9    0.9817306  0.8981210
  10    0.9805877  0.8914542

Accuracy was used to select the optimal model using the
 largest value.
The final value used for the model was mtry = 9.
plot(caret.gd.rf)   # Accuracy

# Final model
caret.gd.rf$finalModel                      

Call:
 randomForest(x = x, y = y, ntree = 500, mtry = param$mtry) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 9

        OOB estimate of  error rate: 1.94%
Confusion matrix:
      no yes class.error
no  1559  12 0.007638447
yes   22 158 0.122222222

3-1-1. Variable Importance

rfImp <- varImp(caret.gd.rf, scale = FALSE)              # Variable importance of the final Random Forest model
plot(rfImp)                                              # Importance plot
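
The importance scores behind the plot can also be inspected directly. Below is a minimal sketch, assuming rfImp$importance is the data frame of importance values returned by varImp() (one row per predictor):

imp.df <- data.frame(Variable   = rownames(rfImp$importance),        # Predictor names
                     Importance = rfImp$importance[, 1])             # Importance score (scale = FALSE)
imp.df[order(-imp.df$Importance), ]                                   # Predictors sorted by importance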

3-1-2. OOB Error

head(caret.gd.rf$finalModel$err.rate)                    # OOB error rate by number of trees
            OOB         no       yes
[1,] 0.04128440 0.02030457 0.2380952
[2,] 0.03831418 0.02234043 0.1826923
[3,] 0.03048780 0.01939292 0.1349206
[4,] 0.03885481 0.02573808 0.1575342
[5,] 0.03121019 0.01770538 0.1518987
[6,] 0.02926829 0.01290761 0.1726190
# Plot for Error
pacman::p_load("ggplot2")

oob.error.data <- data.frame(Trees=rep(1:nrow(caret.gd.rf$finalModel$err.rate),times=3), 
                             Type=rep(c("OOB","No","Yes"), 
                                      each=nrow(caret.gd.rf$finalModel$err.rate)),
                             Error=c(caret.gd.rf$finalModel$err.rate[,"OOB"],
                                     caret.gd.rf$finalModel$err.rate[,"no"],
                                     caret.gd.rf$finalModel$err.rate[,"yes"]))


ggplot(data=oob.error.data, aes(x=Trees, y=Error)) + 
  geom_line(aes(color=Type)) + theme_bw()


3-2. Model Evaluation

# Predict the class of the Test Data with the fitted model
caret.gd.rf.pred <- predict(caret.gd.rf, newdata = UB.ted)   # predict(Random Forest model, Test Data) 

3-2-1. ConfusionMatrix

confusionMatrix(caret.gd.rf.pred, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  666  11
       yes   7  65
                                          
               Accuracy : 0.976           
                 95% CI : (0.9623, 0.9857)
    No Information Rate : 0.8985          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8651          
                                          
 Mcnemar's Test P-Value : 0.4795          
                                          
            Sensitivity : 0.85526         
            Specificity : 0.98960         
         Pos Pred Value : 0.90278         
         Neg Pred Value : 0.98375         
             Prevalence : 0.10147         
         Detection Rate : 0.08678         
   Detection Prevalence : 0.09613         
      Balanced Accuracy : 0.92243         
                                          
       'Positive' Class : yes             
                                          


3-2-2. ROC Curve

1) Package “pROC”

pacman::p_load("pROC")

test.rf.prob <- predict(caret.gd.rf, newdata = UB.ted, type = "prob")  # Predicted probability of each class for the Test Data from the model fitted to the Training Data
test.rf.prob <- test.rf.prob[,2]                                       # Predicted probability of "yes"


ac           <- UB.ted$Personal.Loan                                   # Actual class
pp           <- as.numeric(test.rf.prob)                               # Predicted probability of "yes"

rf.roc       <- roc(ac, pp, plot = T, col = "red")                     # roc(actual class, predicted probability)

auc          <- round(auc(rf.roc),3)
legend("bottomright",legend = auc, bty = "n")
detach(package:pROC)


2) Package “Epi”

pacman::p_load("devtools", "Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")

ROC(pp, ac, plot="ROC")    # ROC(예측 확률, 실제 클래스) / 최적의 cutoff value 예측 가능
detach(package:Epi)


3) Package “ROCR”

pacman::p_load("ROCR")                                                  

rf.pred <- prediction(test.rf.prob, UB.ted$Personal.Loan)    # prediction(predicted probability, actual class)     


rf.perf <- performance(rf.pred, "tpr", "fpr")                # performance(prediction object, "sensitivity (TPR)", "1-specificity (FPR)")                        
plot(rf.perf, col = "red")                                   # ROC Curve
abline(0,1, col = "black")

perf.auc        <- performance(rf.pred, "auc")               # AUC
auc             <- attributes(perf.auc)$y.values
legend("bottomright", legend = auc, bty = "n")


3-2-3. Lift Chart

1) Package “ROCR”

rf.perf       <- performance(rf.pred, "lift", "rpp")          # Lift Chart
plot(rf.perf, colorize = T, lwd = 2)  
detach(package:ROCR)


2) Package “lift”

pacman::p_load("lift")

ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes",1,0)                  # Convert the actual class to numeric

plotLift(test.rf.prob, ac.numeric, cumulative = T, n.buckets = 24)     # plotLift(predicted probability, actual class)
TopDecileLift(test.rf.prob, ac.numeric)                                 # Lift of the top 10%
[1] 8.541
detach(package:lift)
