R Code Using the caret Package for Random Forest (Bagging)
The "caret" package gathers a wide range of machine learning methods under a single interface, and overfitting can be controlled through trainControl. "caret" supports several Bagging-based techniques; here the most widely used of them, Random Forest, is applied to an example dataset. The example data, "Universal Bank_Main", describe the customers of Universal Bank (source: Data Mining for Business Intelligence, Shmueli et al. 2010). There are 2,500 observations and 13 variables, and the target is Personal.Loan.
pacman::p_load("data.table", "dplyr")
UB <- fread(paste(getwd(), "Universal Bank_Main.csv", sep = "/")) %>% # Load the data
  data.frame() %>% # Convert to data frame
  mutate(Personal.Loan = ifelse(Personal.Loan == 1, "yes", "no")) %>% # Character labels for classification
  select(-1) # Remove the ID variable
cols <- c("Family", "Education", "Personal.Loan", "Securities.Account",
"CD.Account", "Online", "CreditCard")
UB <- UB %>%
  mutate_at(cols, as.factor) # Convert categorical variables to factors
glimpse(UB) # Data structure
Rows: 2,500
Columns: 13
$ Age <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6~
$ Experience <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5~
$ Income <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,~
$ ZIP.Code <int> 91107, 90089, 94720, 94112, 91330, 92121,~
$ Family <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,~
$ CCAvg <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0~
$ Education <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,~
$ Mortgage <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0~
$ Personal.Loan <fct> no, no, no, no, no, no, no, no, no, yes, ~
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
$ CD.Account <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Online <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,~
$ CreditCard <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,~
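Since Personal.Loan is the target, its class balance is worth a quick look before modeling. A minimal sketch (the roughly 90/10 split it reveals is consistent with the No Information Rate of 0.8985 reported further below):
table(UB$Personal.Loan) # frequency of "no" vs. "yes"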
To find the optimal hyperparameter for Random Forest, the most widely used Bagging technique, a "Random Search" was performed first.
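The model below is fit on UB.trd and later evaluated on UB.ted, neither of which is created in the snippets above. A minimal sketch of the assumed preparation, using a stratified 70/30 split (consistent with the 1,751 training samples reported below); the seed here is hypothetical:
pacman::p_load("caret") # needed for trainControl()/train() below
set.seed(200) # hypothetical seed; the original split is not shown
idx <- createDataPartition(UB$Personal.Loan, p = 0.7, list = FALSE) # stratified split index
UB.trd <- UB[idx, ]  # Training Data
UB.ted <- UB[-idx, ] # Test Data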
fitControl <- trainControl(method = "cv", number = 5, search = "random") # 5-Fold Cross-Validation
set.seed(100) # Fix the seed for Cross Validation
caret.rd.rf <- train(Personal.Loan ~ ., data = UB.trd, method = "rf",
                     trControl = fitControl, tuneLength = 10, # tuneLength (number of candidate parameter values to try)
                     ntree = 500) # Number of trees to grow
caret.rd.rf
Random Forest
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
mtry Accuracy Kappa
3 0.9743036 0.8471829
4 0.9794432 0.8817984
6 0.9805877 0.8914686
7 0.9811591 0.8942015
9 0.9811591 0.8944264
10 0.9811591 0.8944196
12 0.9794449 0.8851846
14 0.9788734 0.8828271
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was mtry = 7.
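The selected value can also be read programmatically from the fitted train object rather than off the printout (bestTune is a standard caret field):
caret.rd.rf$bestTune # best mtry found by the random search (mtry = 7 here)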
mtry: the number of predictors randomly sampled as split candidates at each node.
plot(caret.rd.rf) # Accuracy
Accuracy is highest at mtry = 7. Using mtry = 7 as the reference, a Grid Search over a range of candidate values is run to find the optimal parameter.
fitControl <- trainControl(method = "cv", number = 5) # 5-Fold Cross-Validation
customGrid <- expand.grid(mtry = seq(4, 10, by = 1)) # Search around the best parameter from the Random Search
set.seed(100) # Fix the seed for Cross Validation
caret.gd.rf <- train(Personal.Loan ~ ., data = UB.trd, method = "rf",
                     trControl = fitControl, tuneGrid = customGrid,
                     ntree = 500) # Number of trees to grow
caret.gd.rf
Random Forest
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
mtry Accuracy Kappa
4 0.9783036 0.8747493
5 0.9805861 0.8910712
6 0.9811591 0.8942015
7 0.9805877 0.8914899
8 0.9817289 0.8982840
9 0.9817306 0.8981210
10 0.9805877 0.8914542
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was mtry = 9.
plot(caret.gd.rf) # Accuracy
Accuracy is highest at mtry = 9, slightly above the mtry = 7 result from the Random Search.
# Final model
caret.gd.rf$finalModel
Call:
randomForest(x = x, y = y, ntree = 500, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 9
OOB estimate of error rate: 1.94%
Confusion matrix:
no yes class.error
no 1559 12 0.007638447
yes 22 158 0.122222222
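As a quick sanity check, the reported OOB error rate is simply the off-diagonal counts of the OOB confusion matrix divided by the number of training cases:
(12 + 22) / 1751 # misclassified OOB cases / training cases = 0.0194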
rfImp <- varImp(caret.gd.rf, scale = FALSE) # Variable importance (unscaled)
plot(rfImp) # Importance plot
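If a tabular view is preferred over the plot, the raw scores can be inspected directly. A minimal sketch, assuming the single importance column that varImp produces for an rf model is named "Overall":
imp <- rfImp$importance # data frame of importance scores
head(imp[order(-imp$Overall), , drop = FALSE]) # top predictors first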
head(caret.gd.rf$finalModel$err.rate) # OOB and per-class error by number of trees
OOB no yes
[1,] 0.04128440 0.02030457 0.2380952
[2,] 0.03831418 0.02234043 0.1826923
[3,] 0.03048780 0.01939292 0.1349206
[4,] 0.03885481 0.02573808 0.1575342
[5,] 0.03121019 0.01770538 0.1518987
[6,] 0.02926829 0.01290761 0.1726190
# Plot for Error
pacman::p_load("ggplot2")
oob.error.data <- data.frame(Trees=rep(1:nrow(caret.gd.rf$finalModel$err.rate),times=3),
Type=rep(c("OOB","No","Yes"),
each=nrow(caret.gd.rf$finalModel$err.rate)),
Error=c(caret.gd.rf$finalModel$err.rate[,"OOB"],
caret.gd.rf$finalModel$err.rate[,"no"],
caret.gd.rf$finalModel$err.rate[,"yes"]))
ggplot(data=oob.error.data, aes(x=Trees, y=Error)) +
geom_line(aes(color=Type)) + theme_bw()
# Predict classes of the Test Data with the fitted model
caret.gd.rf.pred <- predict(caret.gd.rf, newdata = UB.ted) # predict(Random Forest model, Test Data)
confusionMatrix(caret.gd.rf.pred, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics
Reference
Prediction no yes
no 666 11
yes 7 65
Accuracy : 0.976
95% CI : (0.9623, 0.9857)
No Information Rate : 0.8985
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8651
Mcnemar's Test P-Value : 0.4795
Sensitivity : 0.85526
Specificity : 0.98960
Pos Pred Value : 0.90278
Neg Pred Value : 0.98375
Prevalence : 0.10147
Detection Rate : 0.08678
Detection Prevalence : 0.09613
Balanced Accuracy : 0.92243
'Positive' Class : yes
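Individual statistics can also be pulled from the confusionMatrix object instead of being read off the printout. A minimal sketch using its standard overall and byClass fields:
cm <- confusionMatrix(caret.gd.rf.pred, UB.ted$Personal.Loan, positive = "yes")
cm$overall["Accuracy"] # 0.976
cm$byClass[c("Sensitivity", "Specificity")] # 0.855 / 0.990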
pacman::p_load("pROC")
test.rf.prob <- predict(caret.gd.rf, newdata = UB.ted, type = "prob") # Predicted class probabilities on the Test Data from the model fitted to the Training Data
test.rf.prob <- test.rf.prob[,2] # Predicted probability of "yes"
ac <- UB.ted$Personal.Loan # Actual class
pp <- as.numeric(test.rf.prob) # Predicted probability of "yes"
rf.roc <- roc(ac, pp, plot = T, col = "red") # roc(actual class, predicted probability)
auc <- round(auc(rf.roc),3)
legend("bottomright",legend = auc, bty = "n")
detach(package:pROC)
pacman::p_load("devtools", "Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(pp, ac, plot = "ROC") # ROC(predicted probability, actual class) / can also estimate the optimal cutoff value
detach(package:Epi)
pacman::p_load("ROCR")
rf.pred <- prediction(test.rf.prob, UB.ted$Personal.Loan) # prediction(predicted probability, actual class)
rf.perf <- performance(rf.pred, "tpr", "fpr") # performance(prediction object, "sensitivity", "1-specificity")
plot(rf.perf, col = "red") # ROC Curve
abline(0,1, col = "black")
perf.auc <- performance(rf.pred, "auc") # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright", legend = auc, bty = "n")
rf.perf <- performance(rf.pred, "lift", "rpp") # Lift Chart
plot(rf.perf, colorize = T, lwd = 2)
detach(package:ROCR)
pacman::p_load("lift")
ac.numeric <- ifelse(UB.ted$Personal.Loan == "yes", 1, 0) # Convert the actual class to numeric
plotLift(test.rf.prob, ac.numeric, cumulative = T, n.buckets = 24) # plotLift(predicted probability, actual class)
TopDecileLift(test.rf.prob, ac.numeric) # Lift of the top 10%
[1] 8.541
A top-decile lift of 8.54 means that the 10% of customers with the highest predicted probabilities contain about 8.5 times as many loan acceptors as a random 10% sample would.
detach(package:lift)