R Code Using Various Packages for Decision Trees
The example data used to practice decision tree models with the "rpart" and "C5.0" methods is "Universal Bank_Main", a dataset on customers of Universal Bank (source: Data Mining for Business Intelligence, Shmueli et al. 2010). It contains 2,500 observations and 13 variables. The target is
Personal.Loan
.
pacman::p_load("data.table",
"dplyr")
UB <- fread(paste(getwd(),"Universal Bank_Main.csv", sep="/")) %>% # 데이터 불러오기
data.frame() %>% # Data frame 변환환
select(-1) # ID변수 제거거
# Select the columns to convert to categorical variables
cols <- c("Family", "Education", "Personal.Loan", "Securities.Account",
          "CD.Account", "Online", "CreditCard")
UB <- UB %>%
  mutate_at(cols, as.factor)  # Convert to factor (categorical) variables
glimpse(UB)                   # Examine the data structure
Rows: 2,500
Columns: 13
$ Age <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6~
$ Experience <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5~
$ Income <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,~
$ ZIP.Code <int> 91107, 90089, 94720, 94112, 91330, 92121,~
$ Family <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,~
$ CCAvg <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0~
$ Education <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,~
$ Mortgage <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0~
$ Personal.Loan <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,~
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
$ CD.Account <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Online <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,~
$ CreditCard <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,~
The first package used to build a decision tree model is "rpart". "rpart" implements a slightly modified version of CART and organizes the analysis around the Complexity Parameter (CP). In addition, "rpart" can select the optimal CP value via cross-validation; the default is 10-fold cross-validation. It is also convenient for visualizing trees, thanks to highly readable plotting functions.
rpart(formula, data, method, ...)
formula
: a formula relating the target to the predictors, usually written as Target ~ predictors
data
: the data frame containing the variables in formula
method
: "class" if the target is categorical, "anova" if the target is numeric

pacman::p_load("rpart",                 # For decision trees
               "rattle", "rpart.plot")  # For fancyRpartPlot (highly readable tree plots)
set.seed(200)                       # Fix the seed for cross-validation
rControl <- rpart.control(xval=15)  # xval : number of cross-validations
UB.trd.rtree <- rpart(Personal.Loan~., data=UB.trd,
                      method="class", control = rControl)
summary(UB.trd.rtree)
Call:
rpart(formula = Personal.Loan ~ ., data = UB.trd, method = "class",
control = rControl)
n= 1751
CP nsplit rel error xerror xstd
1 0.32500000 0 1.0000000 1.0000000 0.07060066
2 0.15000000 2 0.3500000 0.4111111 0.04676994
3 0.01944444 3 0.2000000 0.1944444 0.03253697
4 0.01111111 5 0.1611111 0.2166667 0.03430588
5 0.01000000 7 0.1388889 0.2055556 0.03343418
Variable importance
Income Education Family CCAvg CD.Account Mortgage
29 28 20 13 5 3
ZIP.Code Age Experience
1 1 1
Node number 1: 1751 observations, complexity param=0.325
predicted class=0 expected loss=0.1027984 P(node) =1
class counts: 1571 180
probabilities: 0.897 0.103
left son=2 (1366 obs) right son=3 (385 obs)
Primary splits:
Income < 106.5 to the left, improve=94.968130, (0 missing)
CCAvg < 2.95 to the left, improve=69.046700, (0 missing)
CD.Account splits as LR, improve=41.226300, (0 missing)
Mortgage < 293.5 to the left, improve=14.467750, (0 missing)
Education splits as LRR, improve= 7.681315, (0 missing)
Surrogate splits:
CCAvg < 3.25 to the left, agree=0.868, adj=0.400, (0 split)
Mortgage < 339.5 to the left, agree=0.802, adj=0.101, (0 split)
CD.Account splits as LR, agree=0.786, adj=0.026, (0 split)
Node number 2: 1366 observations, complexity param=0.01111111
predicted class=0 expected loss=0.01537335 P(node) =0.7801256
class counts: 1345 21
probabilities: 0.985 0.015
left son=4 (1277 obs) right son=5 (89 obs)
Primary splits:
CCAvg < 2.95 to the left, improve=9.2644320, (0 missing)
Income < 98.5 to the left, improve=3.5382100, (0 missing)
CD.Account splits as LR, improve=1.1082890, (0 missing)
Mortgage < 220.5 to the left, improve=0.6886484, (0 missing)
Experience < 38.5 to the left, improve=0.2000360, (0 missing)
Node number 3: 385 observations, complexity param=0.325
predicted class=0 expected loss=0.412987 P(node) =0.2198744
class counts: 226 159
probabilities: 0.587 0.413
left son=6 (242 obs) right son=7 (143 obs)
Primary splits:
Education splits as LRR, improve=111.984200, (0 missing)
Family splits as LLRR, improve= 73.753990, (0 missing)
CD.Account splits as LR, improve= 27.998710, (0 missing)
Income < 156.5 to the left, improve= 11.193100, (0 missing)
CCAvg < 6.635 to the right, improve= 4.353303, (0 missing)
Surrogate splits:
Family splits as LLRR, agree=0.743, adj=0.308, (0 split)
CD.Account splits as LR, agree=0.683, adj=0.147, (0 split)
Income < 173.5 to the left, agree=0.642, adj=0.035, (0 split)
CCAvg < 8.85 to the left, agree=0.634, adj=0.014, (0 split)
ZIP.Code < 90021.5 to the right, agree=0.631, adj=0.007, (0 split)
Node number 4: 1277 observations
predicted class=0 expected loss=0 P(node) =0.7292975
class counts: 1277 0
probabilities: 1.000 0.000
Node number 5: 89 observations, complexity param=0.01111111
predicted class=0 expected loss=0.2359551 P(node) =0.0508281
class counts: 68 21
probabilities: 0.764 0.236
left son=10 (73 obs) right son=11 (16 obs)
Primary splits:
Income < 98.5 to the left, improve=5.904956, (0 missing)
CD.Account splits as LR, improve=4.645443, (0 missing)
CCAvg < 3.15 to the right, improve=3.739411, (0 missing)
Experience < 31.5 to the left, improve=2.313744, (0 missing)
Mortgage < 86.5 to the left, improve=2.146298, (0 missing)
Surrogate splits:
Age < 62.5 to the left, agree=0.843, adj=0.125, (0 split)
Experience < 38.5 to the left, agree=0.843, adj=0.125, (0 split)
ZIP.Code < 90061.5 to the right, agree=0.843, adj=0.125, (0 split)
Node number 6: 242 observations, complexity param=0.15
predicted class=0 expected loss=0.1198347 P(node) =0.1382067
class counts: 213 29
probabilities: 0.880 0.120
left son=12 (211 obs) right son=13 (31 obs)
Primary splits:
Family splits as LLRR, improve=47.3076500, (0 missing)
CD.Account splits as LR, improve= 8.0685060, (0 missing)
Mortgage < 279.5 to the left, improve= 1.5428930, (0 missing)
CCAvg < 6.635 to the right, improve= 1.2562530, (0 missing)
ZIP.Code < 95057 to the left, improve= 0.9027978, (0 missing)
Surrogate splits:
CD.Account splits as LR, agree=0.880, adj=0.065, (0 split)
Mortgage < 566 to the left, agree=0.876, adj=0.032, (0 split)
Node number 7: 143 observations, complexity param=0.01944444
predicted class=1 expected loss=0.09090909 P(node) =0.08166762
class counts: 13 130
probabilities: 0.091 0.909
left son=14 (25 obs) right son=15 (118 obs)
Primary splits:
Income < 116.5 to the left, improve=11.1563600, (0 missing)
CCAvg < 2.4 to the left, improve= 2.6908830, (0 missing)
Experience < 2.5 to the left, improve= 2.4008740, (0 missing)
Age < 29.5 to the left, improve= 2.2161600, (0 missing)
CD.Account splits as LR, improve= 0.9500891, (0 missing)
Node number 10: 73 observations
predicted class=0 expected loss=0.1506849 P(node) =0.04169046
class counts: 62 11
probabilities: 0.849 0.151
Node number 11: 16 observations
predicted class=1 expected loss=0.375 P(node) =0.009137636
class counts: 6 10
probabilities: 0.375 0.625
Node number 12: 211 observations
predicted class=0 expected loss=0 P(node) =0.1205026
class counts: 211 0
probabilities: 1.000 0.000
Node number 13: 31 observations
predicted class=1 expected loss=0.06451613 P(node) =0.01770417
class counts: 2 29
probabilities: 0.065 0.935
Node number 14: 25 observations, complexity param=0.01944444
predicted class=0 expected loss=0.48 P(node) =0.01427756
class counts: 13 12
probabilities: 0.520 0.480
left son=28 (13 obs) right son=29 (12 obs)
Primary splits:
CCAvg < 2.4 to the left, improve=3.3646150, (0 missing)
ZIP.Code < 93060 to the right, improve=2.4938890, (0 missing)
Age < 29.5 to the left, improve=1.2447060, (0 missing)
Experience < 7 to the left, improve=1.0800000, (0 missing)
Education splits as -RL, improve=0.9605195, (0 missing)
Surrogate splits:
ZIP.Code < 94014.5 to the right, agree=0.88, adj=0.750, (0 split)
Age < 38.5 to the left, agree=0.72, adj=0.417, (0 split)
Experience < 18.5 to the left, agree=0.72, adj=0.417, (0 split)
Education splits as -RL, agree=0.72, adj=0.417, (0 split)
Income < 113.5 to the left, agree=0.68, adj=0.333, (0 split)
Node number 15: 118 observations
predicted class=1 expected loss=0 P(node) =0.06739006
class counts: 0 118
probabilities: 0.000 1.000
Node number 28: 13 observations
predicted class=0 expected loss=0.2307692 P(node) =0.007424329
class counts: 10 3
probabilities: 0.769 0.231
Node number 29: 12 observations
predicted class=1 expected loss=0.25 P(node) =0.006853227
class counts: 3 9
probabilities: 0.250 0.750
First, here is an explanation of the terms in the first table of the output.
CP
: the Complexity Parameter, computed as the misclassification rate on the Training Data plus a penalty for the size of the tree. Because CP expresses the complexity of the tree, it controls the tree's size and makes it possible to choose the optimal size.
nsplit
: the number of splits
rel error
: the error of the fitted model on the data used to estimate it, expressed relative to the root node (so the first row equals 1); for regression trees this corresponds to \(1-R^2\)
xerror
: the cross-validation error
xstd
: the standard error of xerror
Variable importance lists the importance of each variable: here Income \(>\) Education \(>\) Family. Node number 1 describes the first node: it contains 1,751 observations, with predicted class = 0, cp = 0.325, and expected loss (impurity) = 0.1027984. Of the 1,751 observations, 1,571 belong to class "0" and 180 to class "1", with proportions 0.897 and 0.103, respectively. left son = 2 (1366 obs) means that the left child node is node 2 and contains 1,366 observations.
fancyRpartPlot(UB.trd.rtree)              # Highly readable tree plot
pacman::p_load("visNetwork","sparkline")  # For network-based plots
visTree(UB.trd.rtree)
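Before pruning, the cp table can also be inspected directly: printcp() prints just the table, and plotcp() plots the cross-validation error against cp, which makes the candidate pruning point easy to spot. Both are standard rpart helpers, applied here to the model fitted above.
printcp(UB.trd.rtree)  # Print only the cp table
plotcp(UB.trd.rtree)   # Plot cross-validation error against cp (with 1-SE bars)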
Pruning is performed to address overfitting. In rpart, the key step is finding the optimal cp value, which is simply the cp at which xerror is smallest.
table     <- UB.trd.rtree$cptable          # cp table
low.error <- which.min(table[ ,"xerror"])  # Position of the smallest value in the "xerror" column
cp.best   <- table[low.error, "CP"]        # cp in the "CP" column corresponding to low.error
UB.trd.prune.rtree <- prune(UB.trd.rtree, cp=cp.best)  # prune(tree model, optimal "CP")
UB.trd.prune.rtree$cptable                 # cp table of the final model
CP nsplit rel error xerror xstd
1 0.32500000 0 1.00 1.0000000 0.07060066
2 0.15000000 2 0.35 0.4111111 0.04676994
3 0.01944444 3 0.20 0.1944444 0.03253697
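For reference, a common alternative to taking the minimum xerror is the 1-SE rule: choose the smallest tree whose xerror lies within one standard error (xstd) of the minimum. This is not the rule used above, only a sketch based on the same cp table; in this particular table it picks the same tree as the minimum-xerror rule.
xerr   <- table[, "xerror"]                           # Cross-validation errors
thresh <- min(xerr) + table[which.min(xerr), "xstd"]  # Minimum xerror + one standard error
cp.1se <- table[which(xerr <= thresh)[1], "CP"]       # First (smallest) tree under the threshold
# prune(UB.trd.rtree, cp = cp.1se) would then give the 1-SE tree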
Pruning completes the final model, and its tree plot is shown below.
fancyRpartPlot(UB.trd.prune.rtree)  # Highly readable tree plot
visTree(UB.trd.prune.rtree)         # Network-based plot
# Predict the Test Data with the fitted model
test.rtree <- predict(UB.trd.prune.rtree, newdata=UB.ted, type="class")  # predict(tree model, Test Data)
pacman::p_load("caret")
CM <- confusionMatrix(test.rtree, UB.ted$Personal.Loan, positive="1")  # confusionMatrix(predicted class, actual class, positive="class of interest")
CM
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 656 12
1 17 64
Accuracy : 0.9613
95% CI : (0.9449, 0.9739)
No Information Rate : 0.8985
P-Value [Acc > NIR] : 1.228e-10
Kappa : 0.7937
Mcnemar's Test P-Value : 0.4576
Sensitivity : 0.84211
Specificity : 0.97474
Pos Pred Value : 0.79012
Neg Pred Value : 0.98204
Prevalence : 0.10147
Detection Rate : 0.08545
Detection Prevalence : 0.10814
Balanced Accuracy : 0.90842
'Positive' Class : 1
detach(package:caret)
pacman::p_load("pROC")
test.rtree.prob <- predict(UB.trd.prune.rtree, newdata=UB.ted)  # Predicted probabilities of each class on the Test Data, from the model fitted to the Training Data
test.rtree.prob <- test.rtree.prob[,2]  # Predicted probability of class "1"
ac  <- as.numeric(as.character(UB.ted$Personal.Loan))  # To convert a factor to numeric, convert to character first, then to numeric
rpp <- as.numeric(test.rtree.prob)                     # Predicted probability of class "1"
rtree.roc <- roc(ac, rpp, plot=T, col="red")           # roc(actual class, predicted probability)
auc <- round(auc(rtree.roc),3)
legend("bottomright", legend=auc, bty="n")
detach(package:pROC)
pacman::p_load("Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(rpp, ac, plot="ROC") # ROC(예측 확률, 실제 클래스) / 최적의 cutoff value 예측 가능
detach(package:Epi)
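The Epi plot marks the optimal cutoff graphically; to obtain it numerically, pROC::coords can return the Youden-optimal threshold from the roc object fitted above. This is a small supplementary sketch, not part of the original analysis.
pacman::p_load("pROC")
coords(rtree.roc, "best",  # "best" = Youden-optimal cutoff by default
       ret = c("threshold", "sensitivity", "specificity"))
detach(package:pROC)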
pacman::p_load("ROCR")
rtree.pred <- prediction(test.rtree.prob, UB.ted$Personal.Loan)  # prediction(predicted probability, actual class)
rtree.perf <- performance(rtree.pred, "tpr", "fpr")  # performance(, "sensitivity", "1-specificity")
plot(rtree.perf, col="blue")  # ROC curve
perf.auc <- performance(rtree.pred, "auc")  # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright", legend=auc, bty="n")
rtree.perf <- performance(rtree.pred, "lift","rpp")  # Lift chart
plot(rtree.perf, main="lift curve", colorize=T, lwd=2)
detach(package:ROCR)
pacman::p_load("lift")
plotLift(test.rtree.prob, UB.ted$Personal.Loan, cumulative = T, n.buckets = 24)  # plotLift(predicted probability, actual class)
TopDecileLift(test.rtree.prob, UB.ted$Personal.Loan)  # Lift of the top 10%
[1] 7.753
detach(package:lift)
The second package used to build a decision tree model is "C50". "C50" uses boosting, an ensemble technique in which trees are trained sequentially: each new tree puts more weight on the observations that the previous trees misclassified, so that later trees concentrate on the difficult cases.
C5.0(x, y, trials = 1, ...)
x
: the predictor variables
y
: the target
trials
: the number of boosting iterations

pacman::p_load(C50)
set.seed(200)  # Fix the seed; the C5.0 function depends on the random seed
tree <- C5.0(UB.trd[-9], UB.trd$Personal.Loan, trials=1)  # Column 9 (Personal.Loan) is dropped from the predictors
tree
Call:
C5.0.default(x = UB.trd[-9], y = UB.trd$Personal.Loan, trials = 1)
Classification Tree
Number of samples: 1751
Number of predictors: 12
Tree size: 8
Non-standard options: attempt to group attributes
Number of samples
: the number of rows in UB.trd
Number of predictors
: the number of predictor variables in UB.trd
Tree size
: the size of the tree

plot(tree)
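Variable importance for a C5.0 fit is available through C5imp(), a helper in the C50 package; shown here for the single-tree model fitted above.
C5imp(tree, metric = "usage")  # Percentage of training cases that pass through splits on each predictor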
"C5.0"
에서 가장 중요한 것은 최적의 부스팅 횟수를 찾는 것이다. 아래의 코드는 부스팅 횟수를 1~100회로 늘리면서 각 부스팅 횟수에 대해 정확도를 구한 후 정확도가 가장 높은 부스팅 횟수를 찾는 방법이다.
pacman::p_load("progress", # For progress_bar
"caret") # For confusionMatrix
pb <- progress_bar$new(total = 100) # for문의 진행상황 확인NAresults <- c()
ac <- UB.ted$Personal.Loan # 실제 클래스래스
for(i in 1:100){  # Boosting iterations 1 to 100
  pb$tick()
  set.seed(200)
  tree <- C5.0(UB.trd[-9], UB.trd$Personal.Loan, trials=i)  # Model for each number of boosting iterations
  pp.cm <- predict(tree, UB.ted, type="class")  # Predicted class
  CM <- confusionMatrix(as.factor(pp.cm), as.factor(ac), positive="1")  # confusionMatrix
  results[i] <- as.numeric(CM$overall[1])  # Extract the accuracy from the confusion matrix
}
which.max(results); results[which.max(results)]  # Position and value of the highest accuracy
[1] 10
[1] 0.9839786
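As an aside, the same search can be carried out with caret::train, which tunes trials (along with model and winnow) for method = "C5.0" by resampling. Note that this selects trials by cross-validation on the Training Data rather than by Test Data accuracy as in the loop above; a sketch, not the author's procedure.
pacman::p_load("caret")
set.seed(200)
c50.grid <- expand.grid(trials = 1:100, model = "tree", winnow = FALSE)  # Search over trials only
c50.tune <- train(Personal.Loan ~ ., data = UB.trd, method = "C5.0",
                  tuneGrid = c50.grid,
                  trControl = trainControl(method = "cv", number = 5))
c50.tune$bestTune  # Best number of boosting iterations under 5-fold CV
detach(package:caret)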
The final model is obtained by running the C5.0 function again with the number of boosting iterations that achieved the highest accuracy.
set.seed(200)
tree.10 <- C5.0(UB.trd[-9], UB.trd$Personal.Loan, trials=10)  # Refit with trials = 10, the best value found above
# Predict the Test Data with the fitted model
test.10 <- predict(tree.10, newdata=UB.ted, type="class")  # predict(tree model, Test Data)
pacman::p_load("caret")
CM <- confusionMatrix(test.10, UB.ted$Personal.Loan, positive="1")  # confusionMatrix(predicted class, actual class, positive="class of interest")
CM
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 669 8
1 4 68
Accuracy : 0.984
95% CI : (0.9722, 0.9917)
No Information Rate : 0.8985
P-Value [Acc > NIR] : <2e-16
Kappa : 0.91
Mcnemar's Test P-Value : 0.3865
Sensitivity : 0.89474
Specificity : 0.99406
Pos Pred Value : 0.94444
Neg Pred Value : 0.98818
Prevalence : 0.10147
Detection Rate : 0.09079
Detection Prevalence : 0.09613
Balanced Accuracy : 0.94440
'Positive' Class : 1
detach(package:caret)
pacman::p_load("pROC")
test.10.prob <- predict(tree.10, newdata=UB.ted, type="prob")  # Predicted probabilities of each class on the Test Data, from the model fitted to the Training Data
test.10.prob <- test.10.prob[,2]  # Predicted probability of class "1"
ac  <- as.numeric(as.character(UB.ted$Personal.Loan))  # To convert a factor to numeric, convert to character first, then to numeric
cpp <- as.numeric(test.10.prob)  # Predicted probability of class "1"
ctree.roc <- roc(ac, cpp, plot=T, col="red")  # roc(actual class, predicted probability)
auc <- round(auc(ctree.roc),3)
legend("bottomright", legend=auc, bty="n")
detach(package:pROC)
pacman::p_load("Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(cpp, ac, plot="ROC")  # ROC(predicted probability, actual class); can also estimate the optimal cutoff value
detach(package:Epi)
pacman::p_load("ROCR")
ctree.pred <- prediction(test.10.prob, UB.ted$Personal.Loan)  # prediction(predicted probability, actual class)
ctree.perf <- performance(ctree.pred, "tpr", "fpr")  # performance(, "sensitivity", "1-specificity")
plot(ctree.perf, col="blue")  # ROC curve
perf.auc <- performance(ctree.pred, "auc")  # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright", legend=auc, bty="n")
ctree.perf <- performance(ctree.pred, "lift","rpp")  # Lift chart
plot(ctree.perf, main="lift curve", colorize=T, lwd=2)
pacman::p_load("lift")
plotLift(test.10.prob, UB.ted$Personal.Loan, cumulative = T, n.buckets = 24)  # plotLift(predicted probability, actual class)
TopDecileLift(test.10.prob, UB.ted$Personal.Loan)  # Lift of the top 10%
[1] 8.935
detach(package:lift)
pacman::p_load("ROCR")
rtree.pred <- prediction(test.rtree.prob, UB.ted$Personal.Loan)  # prediction(predicted probability, actual class)
rtree.perf <- performance(rtree.pred, "tpr", "fpr")  # performance(, "sensitivity", "1-specificity")
plot(rtree.perf, col="blue")  # ROC curve for rpart
par(new=TRUE)
ctree.perf <- performance(ctree.pred, "tpr", "fpr")  # performance(, "sensitivity", "1-specificity")
plot(ctree.perf, col="red")  # ROC curve for C50
legend("bottomright", legend=c("rpart","C50"), col=c("blue", "red"), lty=c(1,1))
detach(package:ROCR)