R Code Using Various Packages for Decision Trees
The example data used to practice decision tree models with the "rpart" and "C5.0" methods is "Universal Bank_Main", a dataset on customers of Universal Bank (source: Data Mining for Business Intelligence, Shmueli et al. 2010). It contains 2,500 observations and 13 variables. The target is
Personal.Loan
.
pacman::p_load("data.table",
"dplyr")
UB <- fread(paste(getwd(),"Universal Bank_Main.csv", sep="/")) %>% # 데이터 불러오기
data.frame() %>% # Data frame 변환환
select(-1) # ID변수 제거거
# Select the columns to convert to categorical variables
cols <- c("Family", "Education", "Personal.Loan", "Securities.Account",
          "CD.Account", "Online", "CreditCard")
UB <- UB %>%
  mutate_at(cols, as.factor)  # Convert to factor (categorical) variables
glimpse(UB)                   # Examine the data structure
Rows: 2,500
Columns: 13
$ Age <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6~
$ Experience <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5~
$ Income <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,~
$ ZIP.Code <int> 91107, 90089, 94720, 94112, 91330, 92121,~
$ Family <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,~
$ CCAvg <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0~
$ Education <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,~
$ Mortgage <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0~
$ Personal.Loan <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,~
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
$ CD.Account <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Online <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,~
$ CreditCard <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,~
The first package used to build a decision tree model is "rpart". "rpart" implements a slightly modified version of CART and organizes the analysis around the Complexity Parameter (CP). In addition, "rpart" can select the optimal CP value via cross-validation; the default is 10-fold cross-validation. It is also convenient for visualizing trees, thanks to highly readable plotting functions.
rpart(formula, data, method, ...)
formula
: a formula relating the target to the predictors, usually written as Target ~ predictors
data
: the data frame containing the variables in formula
method
: "class" if the target is categorical, "anova" if the target is numeric

pacman::p_load("rpart",                 # For decision trees
               "rattle", "rpart.plot")  # For fancyRpartPlot (highly readable tree plots)
set.seed(200)                       # Fix the seed for cross-validation
rControl <- rpart.control(xval=15)  # xval : number of cross-validations
UB.trd.rtree <- rpart(Personal.Loan~., data=UB.trd,
                      method="class", control = rControl)
summary(UB.trd.rtree)
Call:
rpart(formula = Personal.Loan ~ ., data = UB.trd, method = "class",
control = rControl)
n= 1751
CP nsplit rel error xerror xstd
1 0.32500000 0 1.0000000 1.0000000 0.07060066
2 0.15000000 2 0.3500000 0.4111111 0.04676994
3 0.01944444 3 0.2000000 0.1944444 0.03253697
4 0.01111111 5 0.1611111 0.2166667 0.03430588
5 0.01000000 7 0.1388889 0.2055556 0.03343418
Variable importance
Income Education Family CCAvg CD.Account Mortgage
29 28 20 13 5 3
ZIP.Code Age Experience
1 1 1
Node number 1: 1751 observations, complexity param=0.325
predicted class=0 expected loss=0.1027984 P(node) =1
class counts: 1571 180
probabilities: 0.897 0.103
left son=2 (1366 obs) right son=3 (385 obs)
Primary splits:
Income < 106.5 to the left, improve=94.968130, (0 missing)
CCAvg < 2.95 to the left, improve=69.046700, (0 missing)
CD.Account splits as LR, improve=41.226300, (0 missing)
Mortgage < 293.5 to the left, improve=14.467750, (0 missing)
Education splits as LRR, improve= 7.681315, (0 missing)
Surrogate splits:
CCAvg < 3.25 to the left, agree=0.868, adj=0.400, (0 split)
Mortgage < 339.5 to the left, agree=0.802, adj=0.101, (0 split)
CD.Account splits as LR, agree=0.786, adj=0.026, (0 split)
Node number 2: 1366 observations, complexity param=0.01111111
predicted class=0 expected loss=0.01537335 P(node) =0.7801256
class counts: 1345 21
probabilities: 0.985 0.015
left son=4 (1277 obs) right son=5 (89 obs)
Primary splits:
CCAvg < 2.95 to the left, improve=9.2644320, (0 missing)
Income < 98.5 to the left, improve=3.5382100, (0 missing)
CD.Account splits as LR, improve=1.1082890, (0 missing)
Mortgage < 220.5 to the left, improve=0.6886484, (0 missing)
Experience < 38.5 to the left, improve=0.2000360, (0 missing)
Node number 3: 385 observations, complexity param=0.325
predicted class=0 expected loss=0.412987 P(node) =0.2198744
class counts: 226 159
probabilities: 0.587 0.413
left son=6 (242 obs) right son=7 (143 obs)
Primary splits:
Education splits as LRR, improve=111.984200, (0 missing)
Family splits as LLRR, improve= 73.753990, (0 missing)
CD.Account splits as LR, improve= 27.998710, (0 missing)
Income < 156.5 to the left, improve= 11.193100, (0 missing)
CCAvg < 6.635 to the right, improve= 4.353303, (0 missing)
Surrogate splits:
Family splits as LLRR, agree=0.743, adj=0.308, (0 split)
CD.Account splits as LR, agree=0.683, adj=0.147, (0 split)
Income < 173.5 to the left, agree=0.642, adj=0.035, (0 split)
CCAvg < 8.85 to the left, agree=0.634, adj=0.014, (0 split)
ZIP.Code < 90021.5 to the right, agree=0.631, adj=0.007, (0 split)
Node number 4: 1277 observations
predicted class=0 expected loss=0 P(node) =0.7292975
class counts: 1277 0
probabilities: 1.000 0.000
Node number 5: 89 observations, complexity param=0.01111111
predicted class=0 expected loss=0.2359551 P(node) =0.0508281
class counts: 68 21
probabilities: 0.764 0.236
left son=10 (73 obs) right son=11 (16 obs)
Primary splits:
Income < 98.5 to the left, improve=5.904956, (0 missing)
CD.Account splits as LR, improve=4.645443, (0 missing)
CCAvg < 3.15 to the right, improve=3.739411, (0 missing)
Experience < 31.5 to the left, improve=2.313744, (0 missing)
Mortgage < 86.5 to the left, improve=2.146298, (0 missing)
Surrogate splits:
Age < 62.5 to the left, agree=0.843, adj=0.125, (0 split)
Experience < 38.5 to the left, agree=0.843, adj=0.125, (0 split)
ZIP.Code < 90061.5 to the right, agree=0.843, adj=0.125, (0 split)
Node number 6: 242 observations, complexity param=0.15
predicted class=0 expected loss=0.1198347 P(node) =0.1382067
class counts: 213 29
probabilities: 0.880 0.120
left son=12 (211 obs) right son=13 (31 obs)
Primary splits:
Family splits as LLRR, improve=47.3076500, (0 missing)
CD.Account splits as LR, improve= 8.0685060, (0 missing)
Mortgage < 279.5 to the left, improve= 1.5428930, (0 missing)
CCAvg < 6.635 to the right, improve= 1.2562530, (0 missing)
ZIP.Code < 95057 to the left, improve= 0.9027978, (0 missing)
Surrogate splits:
CD.Account splits as LR, agree=0.880, adj=0.065, (0 split)
Mortgage < 566 to the left, agree=0.876, adj=0.032, (0 split)
Node number 7: 143 observations, complexity param=0.01944444
predicted class=1 expected loss=0.09090909 P(node) =0.08166762
class counts: 13 130
probabilities: 0.091 0.909
left son=14 (25 obs) right son=15 (118 obs)
Primary splits:
Income < 116.5 to the left, improve=11.1563600, (0 missing)
CCAvg < 2.4 to the left, improve= 2.6908830, (0 missing)
Experience < 2.5 to the left, improve= 2.4008740, (0 missing)
Age < 29.5 to the left, improve= 2.2161600, (0 missing)
CD.Account splits as LR, improve= 0.9500891, (0 missing)
Node number 10: 73 observations
predicted class=0 expected loss=0.1506849 P(node) =0.04169046
class counts: 62 11
probabilities: 0.849 0.151
Node number 11: 16 observations
predicted class=1 expected loss=0.375 P(node) =0.009137636
class counts: 6 10
probabilities: 0.375 0.625
Node number 12: 211 observations
predicted class=0 expected loss=0 P(node) =0.1205026
class counts: 211 0
probabilities: 1.000 0.000
Node number 13: 31 observations
predicted class=1 expected loss=0.06451613 P(node) =0.01770417
class counts: 2 29
probabilities: 0.065 0.935
Node number 14: 25 observations, complexity param=0.01944444
predicted class=0 expected loss=0.48 P(node) =0.01427756
class counts: 13 12
probabilities: 0.520 0.480
left son=28 (13 obs) right son=29 (12 obs)
Primary splits:
CCAvg < 2.4 to the left, improve=3.3646150, (0 missing)
ZIP.Code < 93060 to the right, improve=2.4938890, (0 missing)
Age < 29.5 to the left, improve=1.2447060, (0 missing)
Experience < 7 to the left, improve=1.0800000, (0 missing)
Education splits as -RL, improve=0.9605195, (0 missing)
Surrogate splits:
ZIP.Code < 94014.5 to the right, agree=0.88, adj=0.750, (0 split)
Age < 38.5 to the left, agree=0.72, adj=0.417, (0 split)
Experience < 18.5 to the left, agree=0.72, adj=0.417, (0 split)
Education splits as -RL, agree=0.72, adj=0.417, (0 split)
Income < 113.5 to the left, agree=0.68, adj=0.333, (0 split)
Node number 15: 118 observations
predicted class=1 expected loss=0 P(node) =0.06739006
class counts: 0 118
probabilities: 0.000 1.000
Node number 28: 13 observations
predicted class=0 expected loss=0.2307692 P(node) =0.007424329
class counts: 10 3
probabilities: 0.769 0.231
Node number 29: 12 observations
predicted class=1 expected loss=0.25 P(node) =0.006853227
class counts: 3 9
probabilities: 0.250 0.750
First, here is an explanation of the terms in the first table of the output.
CP
: the Complexity Parameter, computed as the misclassification rate on the Training Data plus a penalty for the size of the tree. Because CP expresses the complexity of the tree, it controls the tree's size and makes it possible to choose the optimal size.
nsplit
: the number of splits
rel error
: the error of the fitted model on the data used to estimate it, expressed relative to the root node (so the first row equals 1); for regression trees this corresponds to \(1-R^2\)
xerror
: the cross-validation error
xstd
: the standard error of xerror
Variable importance lists the importance of each variable: here Income \(>\) Education \(>\) Family. Node number 1 describes the first node: it contains 1,751 observations, with predicted class = 0, cp = 0.325, and expected loss (impurity) = 0.1027984. Of the 1,751 observations, 1,571 belong to class "0" and 180 to class "1", with proportions 0.897 and 0.103, respectively. left son = 2 (1366 obs) means that the left child node is node 2 and contains 1,366 observations.
fancyRpartPlot(UB.trd.rtree)              # Highly readable tree plot
pacman::p_load("visNetwork","sparkline")  # For network-based plots
visTree(UB.trd.rtree)
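Before pruning, the cp table can also be inspected directly: printcp() prints just the table, and plotcp() plots the cross-validation error against cp, which makes the candidate pruning point easy to spot. Both are standard rpart helpers, applied here to the model fitted above.
printcp(UB.trd.rtree)  # Print only the cp table
plotcp(UB.trd.rtree)   # Plot cross-validation error against cp (with 1-SE bars)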
Pruning is performed to address overfitting. In rpart, the key step is finding the optimal cp value, which is simply the cp at which xerror is smallest.
table     <- UB.trd.rtree$cptable          # cp table
low.error <- which.min(table[ ,"xerror"])  # Position of the smallest value in the "xerror" column
cp.best   <- table[low.error, "CP"]        # cp in the "CP" column corresponding to low.error
UB.trd.prune.rtree <- prune(UB.trd.rtree, cp=cp.best)  # prune(tree model, optimal "CP")
UB.trd.prune.rtree$cptable                 # cp table of the final model
CP nsplit rel error xerror xstd
1 0.32500000 0 1.00 1.0000000 0.07060066
2 0.15000000 2 0.35 0.4111111 0.04676994
3 0.01944444 3 0.20 0.1944444 0.03253697
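For reference, a common alternative to taking the minimum xerror is the 1-SE rule: choose the smallest tree whose xerror lies within one standard error (xstd) of the minimum. This is not the rule used above, only a sketch based on the same cp table; in this particular table it picks the same tree as the minimum-xerror rule.
xerr   <- table[, "xerror"]                           # Cross-validation errors
thresh <- min(xerr) + table[which.min(xerr), "xstd"]  # Minimum xerror + one standard error
cp.1se <- table[which(xerr <= thresh)[1], "CP"]       # First (smallest) tree under the threshold
# prune(UB.trd.rtree, cp = cp.1se) would then give the 1-SE tree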
Pruning completes the final model, and its tree plot is shown below.
fancyRpartPlot(UB.trd.prune.rtree)  # Highly readable tree plot
visTree(UB.trd.prune.rtree)         # Network-based plot
# Predict the Test Data with the fitted model
test.rtree <- predict(UB.trd.prune.rtree, newdata=UB.ted, type="class")  # predict(tree model, Test Data)
pacman::p_load("caret")
CM <- confusionMatrix(test.rtree, UB.ted$Personal.Loan, positive="1")  # confusionMatrix(predicted class, actual class, positive="class of interest")
CM
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 656 12
1 17 64
Accuracy : 0.9613
95% CI : (0.9449, 0.9739)
No Information Rate : 0.8985
P-Value [Acc > NIR] : 1.228e-10
Kappa : 0.7937
Mcnemar's Test P-Value : 0.4576
Sensitivity : 0.84211
Specificity : 0.97474
Pos Pred Value : 0.79012
Neg Pred Value : 0.98204
Prevalence : 0.10147
Detection Rate : 0.08545
Detection Prevalence : 0.10814
Balanced Accuracy : 0.90842
'Positive' Class : 1
detach(package:caret)
pacman::p_load("pROC")
test.rtree.prob <- predict(UB.trd.prune.rtree, newdata=UB.ted)  # Predicted probabilities of each class on the Test Data, from the model fitted to the Training Data
test.rtree.prob <- test.rtree.prob[,2]  # Predicted probability of class "1"
ac  <- as.numeric(as.character(UB.ted$Personal.Loan))  # To convert a factor to numeric, convert to character first, then to numeric
rpp <- as.numeric(test.rtree.prob)                     # Predicted probability of class "1"
rtree.roc <- roc(ac, rpp, plot=T, col="red")           # roc(actual class, predicted probability)
auc <- round(auc(rtree.roc),3)
legend("bottomright", legend=auc, bty="n")
detach(package:pROC)
pacman::p_load("Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(rpp, ac, plot="ROC") # ROC(예측 확률, 실제 클래스) / 최적의 cutoff value 예측 가능
detach(package:Epi)
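The Epi plot marks the optimal cutoff graphically; to obtain it numerically, pROC::coords can return the Youden-optimal threshold from the roc object fitted above. This is a small supplementary sketch, not part of the original analysis.
pacman::p_load("pROC")
coords(rtree.roc, "best",  # "best" = Youden-optimal cutoff by default
       ret = c("threshold", "sensitivity", "specificity"))
detach(package:pROC)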
pacman::p_load("ROCR")
rtree.pred <- prediction(test.rtree.prob, UB.ted$Personal.Loan)  # prediction(predicted probability, actual class)
rtree.perf <- performance(rtree.pred, "tpr", "fpr")  # performance(, "sensitivity", "1-specificity")
plot(rtree.perf, col="blue")  # ROC curve
perf.auc <- performance(rtree.pred, "auc")  # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright", legend=auc, bty="n")
rtree.perf <- performance(rtree.pred, "lift","rpp")  # Lift chart
plot(rtree.perf, main="lift curve", colorize=T, lwd=2)
detach(package:ROCR)
pacman::p_load("lift")
plotLift(test.rtree.prob, UB.ted$Personal.Loan, cumulative = T, n.buckets = 24)  # plotLift(predicted probability, actual class)
TopDecileLift(test.rtree.prob, UB.ted$Personal.Loan)  # Lift of the top 10%
[1] 7.753
detach(package:lift)
The second package used to build a decision tree model is "C50". "C50" uses boosting, an ensemble technique in which trees are trained sequentially: each new tree puts more weight on the observations that the previous trees misclassified, so that later trees concentrate on the difficult cases.
C5.0(x, y, trials = 1, ...)
x
: the predictor variables
y
: the target
trials
: the number of boosting iterations

pacman::p_load(C50)
set.seed(200)  # Fix the seed; the C5.0 function depends on the random seed
tree <- C5.0(UB.trd[-9], UB.trd$Personal.Loan, trials=1)  # Column 9 (Personal.Loan) is dropped from the predictors
tree
Call:
C5.0.default(x = UB.trd[-9], y = UB.trd$Personal.Loan, trials = 1)
Classification Tree
Number of samples: 1751
Number of predictors: 12
Tree size: 8
Non-standard options: attempt to group attributes
Number of samples
: the number of rows in UB.trd
Number of predictors
: the number of predictor variables in UB.trd
Tree size
: the size of the tree

plot(tree)
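Variable importance for a C5.0 fit is available through C5imp(), a helper in the C50 package; shown here for the single-tree model fitted above.
C5imp(tree, metric = "usage")  # Percentage of training cases that pass through splits on each predictor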
"C5.0"
에서 가장 중요한 것은 최적의 부스팅 횟수를 찾는 것이다. 아래의 코드는 부스팅 횟수를 1~100회로 늘리면서 각 부스팅 횟수에 대해 정확도를 구한 후 정확도가 가장 높은 부스팅 횟수를 찾는 방법이다.
pacman::p_load("progress", # For progress_bar
"caret") # For confusionMatrix
pb <- progress_bar$new(total = 100) # for문의 진행상황 확인NAresults <- c()
ac <- UB.ted$Personal.Loan # 실제 클래스래스
for(i in 1:100){  # Boosting iterations 1 to 100
  pb$tick()
  set.seed(200)
  tree <- C5.0(UB.trd[-9], UB.trd$Personal.Loan, trials=i)  # Model for each number of boosting iterations
  pp.cm <- predict(tree, UB.ted, type="class")  # Predicted class
  CM <- confusionMatrix(as.factor(pp.cm), as.factor(ac), positive="1")  # confusionMatrix
  results[i] <- as.numeric(CM$overall[1])  # Extract the accuracy from the confusion matrix
}
which.max(results); results[which.max(results)]  # Position and value of the highest accuracy
[1] 10
[1] 0.9839786
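As an aside, the same search can be carried out with caret::train, which tunes trials (along with model and winnow) for method = "C5.0" by resampling. Note that this selects trials by cross-validation on the Training Data rather than by Test Data accuracy as in the loop above; a sketch, not the author's procedure.
pacman::p_load("caret")
set.seed(200)
c50.grid <- expand.grid(trials = 1:100, model = "tree", winnow = FALSE)  # Search over trials only
c50.tune <- train(Personal.Loan ~ ., data = UB.trd, method = "C5.0",
                  tuneGrid = c50.grid,
                  trControl = trainControl(method = "cv", number = 5))
c50.tune$bestTune  # Best number of boosting iterations under 5-fold CV
detach(package:caret)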
The final model is obtained by running the C5.0 function again with the number of boosting iterations that achieved the highest accuracy.
set.seed(200)
tree.10 <- C5.0(UB.trd[-9], UB.trd$Personal.Loan, trials=10)  # Refit with trials = 10, the best value found above
# Predict the Test Data with the fitted model
test.10 <- predict(tree.10, newdata=UB.ted, type="class")  # predict(tree model, Test Data)
pacman::p_load("caret")
CM <- confusionMatrix(test.10, UB.ted$Personal.Loan, positive="1")  # confusionMatrix(predicted class, actual class, positive="class of interest")
CM
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 669 8
1 4 68
Accuracy : 0.984
95% CI : (0.9722, 0.9917)
No Information Rate : 0.8985
P-Value [Acc > NIR] : <2e-16
Kappa : 0.91
Mcnemar's Test P-Value : 0.3865
Sensitivity : 0.89474
Specificity : 0.99406
Pos Pred Value : 0.94444
Neg Pred Value : 0.98818
Prevalence : 0.10147
Detection Rate : 0.09079
Detection Prevalence : 0.09613
Balanced Accuracy : 0.94440
'Positive' Class : 1
detach(package:caret)
pacman::p_load("pROC")
test.10.prob <- predict(tree.10, newdata=UB.ted, type="prob")  # Predicted probabilities of each class on the Test Data, from the model fitted to the Training Data
test.10.prob <- test.10.prob[,2]  # Predicted probability of class "1"
ac  <- as.numeric(as.character(UB.ted$Personal.Loan))  # To convert a factor to numeric, convert to character first, then to numeric
cpp <- as.numeric(test.10.prob)  # Predicted probability of class "1"
ctree.roc <- roc(ac, cpp, plot=T, col="red")  # roc(actual class, predicted probability)
auc <- round(auc(ctree.roc),3)
legend("bottomright", legend=auc, bty="n")
detach(package:pROC)
pacman::p_load("Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(cpp, ac, plot="ROC")  # ROC(predicted probability, actual class); can also estimate the optimal cutoff value
detach(package:Epi)
pacman::p_load("ROCR")
ctree.pred <- prediction(test.10.prob, UB.ted$Personal.Loan)  # prediction(predicted probability, actual class)
ctree.perf <- performance(ctree.pred, "tpr", "fpr")  # performance(, "sensitivity", "1-specificity")
plot(ctree.perf, col="blue")  # ROC curve
perf.auc <- performance(ctree.pred, "auc")  # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright", legend=auc, bty="n")
ctree.perf <- performance(ctree.pred, "lift","rpp")  # Lift chart
plot(ctree.perf, main="lift curve", colorize=T, lwd=2)
pacman::p_load("lift")
plotLift(test.10.prob, UB.ted$Personal.Loan, cumulative = T, n.buckets = 24)  # plotLift(predicted probability, actual class)
TopDecileLift(test.10.prob, UB.ted$Personal.Loan)  # Lift of the top 10%
[1] 8.935
detach(package:lift)
pacman::p_load("ROCR")
rtree.pred <- prediction(test.rtree.prob, UB.ted$Personal.Loan)  # prediction(predicted probability, actual class)
rtree.perf <- performance(rtree.pred, "tpr", "fpr")  # performance(, "sensitivity", "1-specificity")
plot(rtree.perf, col="blue")  # ROC curve for rpart
par(new=TRUE)
ctree.perf <- performance(ctree.pred, "tpr", "fpr")  # performance(, "sensitivity", "1-specificity")
plot(ctree.perf, col="red")  # ROC curve for C50
legend("bottomright", legend=c("rpart","C50"), col=c("blue", "red"), lty=c(1,1))
detach(package:ROCR)