Naive Bayes Classification using Package mlr

Data Mining

Description for Naive Bayes Classification using Package mlr

Yeongeun Jeon, Jung In Seo
2023-07-18

Advantages of Naive Bayes

- Simple and fast to train, even on small datasets.
- Handles missing values and categorical predictors naturally.

Disadvantages of Naive Bayes

- Assumes conditional independence of the predictors given the class, which rarely holds exactly.
- Predicted probabilities tend to be poorly calibrated, often pushed toward 0 or 1.


1. Loading Packages

# Install mlr Package
# install.packages("mlr", dependencies = TRUE) # could take several minutes

# Package Loading
pacman::p_load("mlr", 
               "tidyverse")

2. Loading Data

data(HouseVotes84, package = "mlbench")   # Data Loading

votesTib <- HouseVotes84 %>%
  as_tibble                               # Convert to tibble

votesTib
# A tibble: 435 × 17
   Class   V1    V2    V3    V4    V5    V6    V7    V8    V9    V10  
   <fct>   <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
 1 republ… n     y     n     y     y     y     n     n     n     y    
 2 republ… n     y     n     y     y     y     n     n     n     n    
 3 democr… <NA>  y     y     <NA>  y     y     n     n     n     n    
 4 democr… n     y     y     n     <NA>  y     n     n     n     n    
 5 democr… y     y     y     n     y     y     n     n     n     n    
 6 democr… n     y     y     n     y     y     n     n     n     n    
 7 democr… n     y     n     y     y     y     n     n     n     n    
 8 republ… n     y     n     y     y     y     n     n     n     n    
 9 republ… n     y     n     y     y     y     n     n     n     n    
10 democr… y     y     y     n     n     n     y     y     y     n    
# ℹ 425 more rows
# ℹ 6 more variables: V11 <fct>, V12 <fct>, V13 <fct>, V14 <fct>,
#   V15 <fct>, V16 <fct>
glimpse(votesTib)                         # Check the data structure
Rows: 435
Columns: 17
$ Class <fct> republican, republican, democrat, democrat, democrat, …
$ V1    <fct> n, n, NA, n, y, n, n, n, n, y, n, n, n, y, n, n, y, y,…
$ V2    <fct> y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, n, NA,…
$ V3    <fct> n, n, y, y, y, y, n, n, n, y, n, n, y, y, n, n, y, y, …
$ V4    <fct> y, y, NA, n, n, n, y, y, y, n, y, y, n, n, y, y, n, n,…
$ V5    <fct> y, y, y, NA, y, y, y, y, y, n, y, y, n, n, y, y, n, n,…
$ V6    <fct> y, y, y, y, y, y, y, y, y, n, n, y, n, y, y, y, y, n, …
$ V7    <fct> n, n, n, n, n, n, n, n, n, y, n, n, y, y, n, n, n, y, …
$ V8    <fct> n, n, n, n, n, n, n, n, n, y, n, n, y, y, n, n, y, y, …
$ V9    <fct> n, n, n, n, n, n, n, n, n, y, n, n, y, NA, n, n, NA, y…
$ V10   <fct> y, n, n, n, n, n, n, n, n, n, n, n, n, y, n, y, y, n, …
$ V11   <fct> NA, n, y, y, y, n, n, n, n, n, NA, y, n, y, n, n, y, n…
$ V12   <fct> y, y, n, n, NA, n, n, n, y, n, NA, NA, n, NA, y, y, y,…
$ V13   <fct> y, y, y, y, y, y, NA, y, y, n, y, y, y, n, NA, y, NA, …
$ V14   <fct> y, y, y, n, y, y, y, y, y, n, y, y, n, n, NA, NA, n, n…
$ V15   <fct> n, n, n, n, y, y, y, NA, n, NA, n, NA, NA, y, n, n, n,…
$ V16   <fct> y, NA, n, y, y, y, y, y, y, NA, n, NA, NA, NA, NA, NA,…
summary(votesTib)                         # Summarize the data
        Class        V1         V2         V3         V4     
 democrat  :267   n   :236   n   :192   n   :171   n   :247  
 republican:168   y   :187   y   :195   y   :253   y   :177  
                  NA's: 12   NA's: 48   NA's: 11   NA's: 11  
    V5         V6         V7         V8         V9        V10     
 n   :208   n   :152   n   :182   n   :178   n   :206   n   :212  
 y   :212   y   :272   y   :239   y   :242   y   :207   y   :216  
 NA's: 15   NA's: 11   NA's: 14   NA's: 15   NA's: 22   NA's:  7  
   V11        V12        V13        V14        V15        V16     
 n   :264   n   :233   n   :201   n   :170   n   :233   n   : 62  
 y   :150   y   :171   y   :209   y   :248   y   :174   y   :269  
 NA's: 21   NA's: 31   NA's: 25   NA's: 17   NA's: 28   NA's:104  

3. Exploring Variable Characteristics

map_dbl(votesTib, ~ sum(is.na(.)))        # Count missing values per column
Class    V1    V2    V3    V4    V5    V6    V7    V8    V9   V10 
    0    12    48    11    11    15    11    14    15    22     7 
  V11   V12   V13   V14   V15   V16 
   21    31    25    17    28   104 
votesUntidy <- gather(votesTib, "Variable", "Value", -Class)      # Reshape to long format (gather is superseded by pivot_longer)

ggplot(votesUntidy, aes(Class, fill = Value)) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_bar(position = "fill") +
  theme_bw()


4. Partitioning the Data

# Partition (Training Dataset : Test Dataset = 7:3)
pacman::p_load("caret")                                         # For createDataPartition

y <- votesTib$Class                                             # Target

set.seed(200)
ind       <- createDataPartition(y, p = 0.7, list = TRUE)       # Split 7:3 using the returned indices
votes.trd <- votesTib[ind$Resample1,]                           # Training Dataset
votes.ted <- votesTib[-ind$Resample1,]                          # Test Dataset

detach(package:caret)
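Because createDataPartition samples with stratification on the target, the split should preserve the class balance. A minimal base-R check (an illustration, not part of the original workflow):

```r
# Compare class proportions between the training and test sets;
# createDataPartition stratifies on the target, so they should be close
prop.table(table(votes.trd$Class))
prop.table(table(votes.ted$Class))
```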

5. Modeling



5-1. Define Task

Problem type                            Function
Regression                              makeRegrTask
Binary or multiclass classification     makeClassifTask
Survival analysis                       makeSurvTask
Cluster analysis                        makeClusterTask
Multilabel classification               makeMultilabelTask
Cost-sensitive classification           makeCostSensTask
# Naive Bayes : Binary classification problem
votesTask <- makeClassifTask(data = votes.trd,        # Training Dataset
                             target = "Class")        # Target

votesTask
Supervised task: votes.trd
Type: classif
Target: Class
Observations: 305
Features:
   numerics     factors     ordered functionals 
          0          16           0           0 
Missings: TRUE
Has weights: FALSE
Has blocking: FALSE
Has coordinates: FALSE
Classes: 2
  democrat republican 
       187        118 
Positive class: democrat

Caution! In classification problems, the target must be categorical. In particular, for binary classification, the function makeClassifTask treats the first "level" of the target as the class of interest. If the second "level" is the class of interest, it must be specified via the argument positive.
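For example, if "republican" were the class of interest, the task could be defined as follows (a sketch; votesTask above uses the default, so "democrat" is the positive class there):

```r
# Specify the second level as the positive class via the "positive" argument
votesTask.rep <- makeClassifTask(data = votes.trd,
                                 target = "Class",
                                 positive = "republican")
```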


5-2. Define Learner

# Check Hyperparameter Set
getParamSet("classif.naiveBayes")              
           Type len Def   Constr Req Tunable Trafo
laplace numeric   -   0 0 to Inf   -    TRUE     -

Result! The hyperparameters of a particular machine-learning algorithm can be checked with the function getParamSet.

# Define Naive Bayes Learner 
bayes <- makeLearner(cl = "classif.naiveBayes",
                     predict.type = "prob")                      # Generate predicted probabilities and classes

bayes
Learner classif.naiveBayes from package e1071
Type: classif
Name: Naive Bayes; Short name: nbayes
Class: classif.naiveBayes
Properties: twoclass,multiclass,missings,numerics,factors,prob
Predict-Type: prob
Hyperparameters: 

Caution! Specifying the option predict.type = "prob" in the function makeLearner produces predicted probabilities together with predicted classes. However, this option cannot always be used, because some algorithms, such as "classif.knn", cannot generate predicted probabilities.
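Whether a learner supports probability predictions can be checked from its properties before requesting predict.type = "prob" (a minimal sketch):

```r
# "prob" appears among the properties of learners that can output probabilities
"prob" %in% getLearnerProperties("classif.naiveBayes")

# listLearners(votesTask, properties = "prob") would list all applicable
# learners for the task that support probability predictions
```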


5-3. Train Model

bayesModel <- train(bayes, votesTask)                   # train(Defined Learner in 5-2, Defined Task in 5-1)

bayesModel
Model for learner.id=classif.naiveBayes; learner.class=classif.naiveBayes
Trained on: task.id = votes.trd; obs = 305; features = 16
Hyperparameters: 
getLearnerModel(bayesModel)                             # Extract Trained Model

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
  democrat republican 
 0.6131148  0.3868852 

Conditional probabilities:
            V1
Y                    n         y
  democrat   0.4055556 0.5944444
  republican 0.8189655 0.1810345

            V2
Y                    n         y
  democrat   0.4969697 0.5030303
  republican 0.5188679 0.4811321

            V3
Y                    n         y
  democrat   0.1318681 0.8681319
  republican 0.8448276 0.1551724

            V4
Y                      n           y
  democrat   0.945054945 0.054945055
  republican 0.008547009 0.991452991

            V5
Y                     n          y
  democrat   0.80225989 0.19774011
  republican 0.06034483 0.93965517

            V6
Y                    n         y
  democrat   0.5500000 0.4500000
  republican 0.1196581 0.8803419

            V7
Y                    n         y
  democrat   0.2166667 0.7833333
  republican 0.7631579 0.2368421

            V8
Y                    n         y
  democrat   0.1881720 0.8118280
  republican 0.8392857 0.1607143

            V9
Y                    n         y
  democrat   0.2285714 0.7714286
  republican 0.8706897 0.1293103

            V10
Y                    n         y
  democrat   0.5163043 0.4836957
  republican 0.4310345 0.5689655

            V11
Y                    n         y
  democrat   0.4943820 0.5056180
  republican 0.8918919 0.1081081

            V12
Y                    n         y
  democrat   0.8505747 0.1494253
  republican 0.1388889 0.8611111

            V13
Y                    n         y
  democrat   0.7329545 0.2670455
  republican 0.1339286 0.8660714

            V14
Y                     n          y
  democrat   0.65921788 0.34078212
  republican 0.01769912 0.98230088

            V15
Y                    n         y
  democrat   0.3743017 0.6256983
  republican 0.9266055 0.0733945

            V16
Y                     n          y
  democrat   0.07407407 0.92592593
  republican 0.33009709 0.66990291
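The tables above are all Naive Bayes needs to score a new observation: the posterior is proportional to the a-priori probability times the product of the per-variable conditional probabilities. As a hand-worked sketch, using only V1 and V4 from the output above (purely for illustration; the real model multiplies over all 16 variables):

```r
# Unnormalized posterior for a voter with V1 = "y" and V4 = "y",
# using the a-priori and conditional probabilities printed above
p.dem <- 0.6131148 * 0.5944444 * 0.054945055   # P(dem) * P(V1=y|dem) * P(V4=y|dem)
p.rep <- 0.3868852 * 0.1810345 * 0.991452991   # P(rep) * P(V1=y|rep) * P(V4=y|rep)

p.rep / (p.dem + p.rep)                        # posterior probability of "republican"
```

Here "republican" wins despite its lower prior, because P(V4 = y) is far more likely under that class.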

5-4. Predict

# Generate predicted probabilities and classes
bayesPred <- predict(bayesModel, newdata = votes.ted)    # predict(Trained Model, Test Dataset)

bayesPred      
Prediction: 130 observations
predict.type: prob
threshold: democrat=0.50,republican=0.50
time: 0.03
       truth prob.democrat prob.republican   response
1   democrat  8.238201e-01    1.761799e-01   democrat
2   democrat  1.000000e+00    5.886008e-11   democrat
3 republican  7.645217e-06    9.999924e-01 republican
4   democrat  1.000000e+00    5.920266e-13   democrat
5   democrat  1.000000e+00    3.198378e-11   democrat
6   democrat  1.000000e+00    5.920266e-13   democrat
... (#rows: 130, #cols: 4)
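The printout shows the default 0.50/0.50 decision threshold. mlr can re-threshold an existing probability prediction without refitting the model; for instance, to demand stronger evidence before predicting "republican" (a sketch):

```r
# Predict "republican" only when its probability is at least 0.8
bayesPred.80 <- setThreshold(bayesPred,
                             threshold = c(democrat = 0.2, republican = 0.8))
```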
getPredictionProbabilities(bayesPred,                    # Extract predicted probabilities for Test Dataset
                           cl = c("democrat", "republican"))   
        democrat   republican
1   8.238201e-01 1.761799e-01
2   1.000000e+00 5.886008e-11
3   7.645217e-06 9.999924e-01
4   1.000000e+00 5.920266e-13
5   1.000000e+00 3.198378e-11
6   1.000000e+00 5.920266e-13
7   1.000000e+00 8.182869e-13
8   1.011552e-08 1.000000e+00
9   1.000000e+00 9.488841e-10
10  9.999997e-01 2.572045e-07
11  1.000000e+00 5.920266e-13
12  3.915526e-07 9.999996e-01
13  1.000000e+00 5.886008e-11
14  6.707387e-08 9.999999e-01
15  8.534968e-08 9.999999e-01
16  6.230573e-08 9.999999e-01
17  4.507789e-08 1.000000e+00
18  1.000000e+00 4.471440e-11
19  1.000000e+00 1.457115e-10
20  8.628902e-06 9.999914e-01
21  1.000000e+00 9.269518e-11
22  1.079530e-04 9.998920e-01
23  2.647942e-02 9.735206e-01
24  9.555196e-03 9.904448e-01
25  9.675164e-09 1.000000e+00
26  9.266834e-09 1.000000e+00
27  9.998391e-01 1.608586e-04
28  8.599169e-03 9.914008e-01
29  1.671787e-08 1.000000e+00
30  9.998701e-01 1.299446e-04
31  2.139943e-01 7.860057e-01
32  1.000000e+00 7.218537e-12
33  2.180828e-03 9.978192e-01
34  5.959340e-08 9.999999e-01
35  1.254382e-05 9.999875e-01
36  5.959340e-08 9.999999e-01
37  3.403310e-04 9.996597e-01
38  9.999996e-01 4.192667e-07
39  8.778807e-08 9.999999e-01
40  1.000000e+00 9.041840e-10
41  1.000000e+00 1.028385e-08
42  1.425265e-08 1.000000e+00
43  3.556542e-06 9.999964e-01
44  5.959340e-08 9.999999e-01
45  1.000000e+00 2.530841e-11
46  1.000000e+00 7.682795e-12
47  1.000000e+00 4.705050e-11
48  1.000000e+00 3.615572e-11
49  1.000000e+00 2.742456e-10
50  5.818545e-08 9.999999e-01
51  9.999998e-01 2.017958e-07
52  1.000000e+00 3.881701e-09
53  1.013079e-06 9.999990e-01
54  4.603322e-06 9.999954e-01
55  1.000000e+00 7.298744e-09
56  2.876249e-04 9.997124e-01
57  1.000000e+00 8.932280e-13
58  1.000000e+00 4.136830e-09
59  1.000000e+00 4.648504e-11
60  8.778807e-08 9.999999e-01
61  6.230573e-08 9.999999e-01
62  9.999853e-01 1.472976e-05
63  1.000000e+00 5.803491e-11
64  8.013953e-01 1.986047e-01
65  1.425265e-08 1.000000e+00
66  1.000000e+00 1.150764e-10
67  1.000000e+00 1.621414e-10
68  1.000000e+00 7.536611e-12
69  1.013079e-06 9.999990e-01
70  1.000000e+00 7.536611e-12
71  1.000000e+00 1.150764e-10
72  1.000000e+00 4.995224e-12
73  1.000000e+00 3.484213e-09
74  3.438848e-01 6.561152e-01
75  2.755583e-07 9.999997e-01
76  1.000000e+00 4.628407e-10
77  1.000000e+00 5.429128e-11
78  9.999944e-01 5.610058e-06
79  9.999999e-01 8.906709e-08
80  9.930752e-01 6.924833e-03
81  9.999372e-01 6.282123e-05
82  1.638864e-05 9.999836e-01
83  1.305687e-08 1.000000e+00
84  9.266834e-09 1.000000e+00
85  9.999997e-01 2.658417e-07
86  4.086203e-03 9.959138e-01
87  4.516398e-01 5.483602e-01
88  1.000000e+00 2.753713e-10
89  1.000000e+00 4.997368e-11
90  9.998034e-01 1.965639e-04
91  9.947667e-01 5.233313e-03
92  9.847895e-01 1.521048e-02
93  1.202567e-07 9.999999e-01
94  6.667664e-04 9.993332e-01
95  1.000000e+00 7.218537e-12
96  1.000000e+00 8.932280e-13
97  9.999992e-01 8.014432e-07
98  1.000000e+00 5.452702e-12
99  1.000000e+00 8.248300e-10
100 7.304956e-07 9.999993e-01
101 8.547786e-06 9.999915e-01
102 9.999988e-01 1.171998e-06
103 7.843716e-02 9.215628e-01
104 2.490118e-04 9.997510e-01
105 5.707832e-08 9.999999e-01
106 1.842205e-02 9.815780e-01
107 1.305687e-08 1.000000e+00
108 9.945640e-01 5.436017e-03
109 1.000000e+00 6.035988e-10
110 1.000000e+00 1.488419e-08
111 1.000000e+00 9.536544e-09
112 9.999999e-01 6.490770e-08
113 1.251855e-01 8.748145e-01
114 8.042272e-08 9.999999e-01
115 1.253937e-04 9.998746e-01
116 9.999999e-01 1.165042e-07
117 3.248577e-02 9.675142e-01
118 9.999877e-01 1.226982e-05
119 1.000000e+00 1.069932e-12
120 4.911480e-06 9.999951e-01
121 9.800988e-01 1.990121e-02
122 9.999984e-01 1.580283e-06
123 3.280626e-08 1.000000e+00
124 9.999990e-01 1.016716e-06
125 1.000000e+00 1.727983e-10
126 1.458032e-05 9.999854e-01
127 1.206210e-01 8.793790e-01
128 9.999991e-01 8.994777e-07
129 1.000000e+00 1.493769e-08
130 1.000000e+00 2.399307e-09
getPredictionResponse(bayesPred)                         # Extract predicted classes for Test Dataset
  [1] democrat   democrat   republican democrat   democrat  
  [6] democrat   democrat   republican democrat   democrat  
 [11] democrat   republican democrat   republican republican
 [16] republican republican democrat   democrat   republican
 [21] democrat   republican republican republican republican
 [26] republican democrat   republican republican democrat  
 [31] republican democrat   republican republican republican
 [36] republican republican democrat   republican democrat  
 [41] democrat   republican republican republican democrat  
 [46] democrat   democrat   democrat   democrat   republican
 [51] democrat   democrat   republican republican democrat  
 [56] republican democrat   democrat   democrat   republican
 [61] republican democrat   democrat   democrat   republican
 [66] democrat   democrat   democrat   republican democrat  
 [71] democrat   democrat   democrat   republican republican
 [76] democrat   democrat   democrat   democrat   democrat  
 [81] democrat   republican republican republican democrat  
 [86] republican republican democrat   democrat   democrat  
 [91] democrat   democrat   republican republican democrat  
 [96] democrat   democrat   democrat   democrat   republican
[101] republican democrat   republican republican republican
[106] republican republican democrat   democrat   democrat  
[111] democrat   democrat   republican republican republican
[116] democrat   republican democrat   democrat   republican
[121] democrat   democrat   republican democrat   democrat  
[126] republican republican democrat   democrat   democrat  
Levels: democrat republican

5-5. Evaluate

# ConfusionMatrix
calculateConfusionMatrix(bayesPred,                     # Result of the function predict
                         relative = TRUE)               # Also print relative frequencies
Relative confusion matrix (normalized by row/column):
            predicted
true         democrat  republican -err.-   
  democrat   0.89/0.97 0.11/0.16  0.11     
  republican 0.04/0.03 0.96/0.84  0.04     
  -err.-          0.03      0.16  0.08     


Absolute confusion matrix:
            predicted
true         democrat republican -err.-
  democrat         71          9      9
  republican        2         48      2
  -err.-            2          9     11
# Performance Measure
performance(bayesPred,                                  # Result of the function predict
            measures = list(mmce, acc))                 # mmce : mean misclassification error, acc : accuracy
      mmce        acc 
0.08461538 0.91538462 

Caution! The performance measures that can be used with the argument measures of the function performance are listed here.
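The measures applicable to the task at hand can also be listed from within R (a sketch using mlr's listMeasures):

```r
# Measures that can be computed for this classification task
listMeasures(votesTask)
```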

# ROC Curve
df <- generateThreshVsPerfData(bayesPred, measures = list(fpr, tpr))

plotROCCurves(df) +
  theme_bw()
# AUC
performance(bayesPred,                                   # Result of the function predict
            auc) 
    auc 
0.97975 
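A single 7:3 split yields only one performance estimate. A more stable alternative, sketched here with mlr's resampling functions, is stratified 10-fold cross-validation of the same learner on the training task:

```r
# Stratified 10-fold cross-validation of the Naive Bayes learner
kFold    <- makeResampleDesc("CV", iters = 10, stratify = TRUE)
cvResult <- resample(learner = bayes, task = votesTask,
                     resampling = kFold,
                     measures = list(mmce, acc))
cvResult$aggr        # mmce and acc aggregated over the 10 folds
```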

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".