Naive Bayes Classification Using Package mlr
Advantages of Naive Bayes
- Training and prediction are fast, even with many predictors.
- Categorical predictors and missing values are handled naturally.
- The fitted model is easy to interpret: one prior per class and one conditional probability table per predictor.
Disadvantages of Naive Bayes
- It assumes the predictors are conditionally independent given the class, which rarely holds exactly.
- A predictor level never observed for a class receives probability zero unless Laplace smoothing is applied.
# Install mlr Package
# install.packages("mlr", dependencies = TRUE) # could take several minutes
# Package Loading
pacman::p_load("mlr",
"tidyverse")
"mlbench"
에서 제공하는 HouseVotes84
데이터셋이다.Class
: 투표한 의원의 정당 이름으로 “democrat”와 “republican”로 구분되어 있다.data(HouseVotes84, package = "mlbench") # Data Loading
votesTib <- HouseVotes84 %>%
as_tibble() # Convert to a tibble
votesTib
# A tibble: 435 × 17
Class V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 republ… n y n y y y n n n y
2 republ… n y n y y y n n n n
3 democr… <NA> y y <NA> y y n n n n
4 democr… n y y n <NA> y n n n n
5 democr… y y y n y y n n n n
6 democr… n y y n y y n n n n
7 democr… n y n y y y n n n n
8 republ… n y n y y y n n n n
9 republ… n y n y y y n n n n
10 democr… y y y n n n y y y n
# ℹ 425 more rows
# ℹ 6 more variables: V11 <fct>, V12 <fct>, V13 <fct>, V14 <fct>,
# V15 <fct>, V16 <fct>
glimpse(votesTib) # Check the structure of the data
Rows: 435
Columns: 17
$ Class <fct> republican, republican, democrat, democrat, democrat, …
$ V1 <fct> n, n, NA, n, y, n, n, n, n, y, n, n, n, y, n, n, y, y,…
$ V2 <fct> y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, n, NA,…
$ V3 <fct> n, n, y, y, y, y, n, n, n, y, n, n, y, y, n, n, y, y, …
$ V4 <fct> y, y, NA, n, n, n, y, y, y, n, y, y, n, n, y, y, n, n,…
$ V5 <fct> y, y, y, NA, y, y, y, y, y, n, y, y, n, n, y, y, n, n,…
$ V6 <fct> y, y, y, y, y, y, y, y, y, n, n, y, n, y, y, y, y, n, …
$ V7 <fct> n, n, n, n, n, n, n, n, n, y, n, n, y, y, n, n, n, y, …
$ V8 <fct> n, n, n, n, n, n, n, n, n, y, n, n, y, y, n, n, y, y, …
$ V9 <fct> n, n, n, n, n, n, n, n, n, y, n, n, y, NA, n, n, NA, y…
$ V10 <fct> y, n, n, n, n, n, n, n, n, n, n, n, n, y, n, y, y, n, …
$ V11 <fct> NA, n, y, y, y, n, n, n, n, n, NA, y, n, y, n, n, y, n…
$ V12 <fct> y, y, n, n, NA, n, n, n, y, n, NA, NA, n, NA, y, y, y,…
$ V13 <fct> y, y, y, y, y, y, NA, y, y, n, y, y, y, n, NA, y, NA, …
$ V14 <fct> y, y, y, n, y, y, y, y, y, n, y, y, n, n, NA, NA, n, n…
$ V15 <fct> n, n, n, n, y, y, y, NA, n, NA, n, NA, NA, y, n, n, n,…
$ V16 <fct> y, NA, n, y, y, y, y, y, y, NA, n, NA, NA, NA, NA, NA,…
summary(votesTib) # Summarize the data
Class V1 V2 V3 V4
democrat :267 n :236 n :192 n :171 n :247
republican:168 y :187 y :195 y :253 y :177
NA's: 12 NA's: 48 NA's: 11 NA's: 11
V5 V6 V7 V8 V9 V10
n :208 n :152 n :182 n :178 n :206 n :212
y :212 y :272 y :239 y :242 y :207 y :216
NA's: 15 NA's: 11 NA's: 14 NA's: 15 NA's: 22 NA's: 7
V11 V12 V13 V14 V15 V16
n :264 n :233 n :201 n :170 n :233 n : 62
y :150 y :171 y :209 y :248 y :174 y :269
NA's: 21 NA's: 31 NA's: 25 NA's: 17 NA's: 28 NA's:104
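The table of per-variable NA counts shown next appears without the code that produced it; assuming votesTib is the tibble created above, a short sketch that reproduces such counts is:

```r
# Count the number of missing values in each column of votesTib
colSums(is.na(votesTib))
```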
Number of missing values (NA) per variable:
Class V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
0 12 48 11 11 15 11 14 15 22 7
V11 V12 V13 V14 V15 V16
21 31 25 17 28 104
votesUntidy <- gather(votesTib, "Variable", "Value", -Class) # gather() is superseded by pivot_longer()
ggplot(votesUntidy, aes(Class, fill = Value)) +
facet_wrap(~ Variable, scales = "free_y") +
geom_bar(position = "fill") +
theme_bw()
# Partition (Training Dataset : Test Dataset = 7:3)
pacman::p_load("caret") # For createDataPartition
y <- votesTib$Class # Target
set.seed(200)
ind <- createDataPartition(y, p = 0.7, list = TRUE) # Stratified 7:3 split using indices
votes.trd <- votesTib[ind$Resample1,] # Training Dataset
votes.ted <- votesTib[-ind$Resample1,] # Test Dataset
detach(package:caret)
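createDataPartition samples within each level of y, so the 7:3 split roughly preserves the democrat/republican ratio. A quick check (not part of the original code):

```r
# Class proportions should be similar across the full, training, and test sets
prop.table(table(votesTib$Class))  # overall
prop.table(table(votes.trd$Class)) # training
prop.table(table(votes.ted$Class)) # test
```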
| Problem type | Function |
|---|---|
| Regression | makeRegrTask |
| Binary or multiclass classification | makeClassifTask |
| Survival analysis | makeSurvTask |
| Clustering | makeClusterTask |
| Multilabel classification | makeMultilabelTask |
| Cost-sensitive classification | makeCostSensTask |
# Naive Bayes : a binary classification problem
votesTask <- makeClassifTask(data = votes.trd, # Training Dataset
target = "Class") # Target
votesTask
Supervised task: votes.trd
Type: classif
Target: Class
Observations: 305
Features:
numerics factors ordered functionals
0 16 0 0
Missings: TRUE
Has weights: FALSE
Has blocking: FALSE
Has coordinates: FALSE
Classes: 2
democrat republican
187 118
Positive class: democrat
Caution!
In a classification problem the target must be categorical (a factor). In particular, for binary classification, makeClassifTask takes the first "level" of the target as the class of interest. If the second "level" is the class of interest, it must be passed to the positive argument.
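For example, if "republican" were the class of interest instead, the task could be defined as follows (a sketch; the analysis below keeps the default positive class "democrat"):

```r
# Hypothetical task with "republican" as the positive class
votesTask2 <- makeClassifTask(data = votes.trd,
                              target = "Class",
                              positive = "republican")
```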
A learner is defined with makeLearner. Its first argument cl takes the machine-learning algorithm to use, written in the form problem type.name of the R function implementing the algorithm:
"regr.name of the R function implementing the algorithm"
"classif.name of the R function implementing the algorithm"
"surv.name of the R function implementing the algorithm"
"cluster.name of the R function implementing the algorithm"
"multilabel.name of the R function implementing the algorithm"
The algorithms available in "mlr" can be found here.
# Check Hyperparameter Set
getParamSet("classif.naiveBayes")
Type len Def Constr Req Tunable Trafo
laplace numeric - 0 0 to Inf - TRUE -
Result!
The hyperparameters of a given machine-learning algorithm can be listed with the function getParamSet.
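The laplace hyperparameter listed above applies Laplace (additive) smoothing, which keeps a level that never occurs within a class from receiving a conditional probability of exactly zero. A sketch of fixing it when the learner is defined (the learner used below keeps the default laplace = 0):

```r
# Naive Bayes learner with Laplace smoothing set to 1
bayesLaplace <- makeLearner(cl = "classif.naiveBayes",
                            predict.type = "prob",
                            par.vals = list(laplace = 1))
```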
# Define Naive Bayes Learner
bayes <- makeLearner(cl = "classif.naiveBayes",
predict.type = "prob") # Generate predicted probabilities together with classes
bayes
Learner classif.naiveBayes from package e1071
Type: classif
Name: Naive Bayes; Short name: nbayes
Class: classif.naiveBayes
Properties: twoclass,multiclass,missings,numerics,factors,prob
Predict-Type: prob
Hyperparameters:
Caution!
Specifying the option predict.type = "prob" in makeLearner returns predicted probabilities together with the predicted class. However, this option cannot always be used, because some algorithms, such as "classif.knn", cannot produce predicted probabilities.
bayesModel <- train(bayes, votesTask) # train(Defined Learner in 5-2, Defined Task in 5-1)
bayesModel
Model for learner.id=classif.naiveBayes; learner.class=classif.naiveBayes
Trained on: task.id = votes.trd; obs = 305; features = 16
Hyperparameters:
getLearnerModel(bayesModel) # Extract Trained Model
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
democrat republican
0.6131148 0.3868852
Conditional probabilities:
V1
Y n y
democrat 0.4055556 0.5944444
republican 0.8189655 0.1810345
V2
Y n y
democrat 0.4969697 0.5030303
republican 0.5188679 0.4811321
V3
Y n y
democrat 0.1318681 0.8681319
republican 0.8448276 0.1551724
V4
Y n y
democrat 0.945054945 0.054945055
republican 0.008547009 0.991452991
V5
Y n y
democrat 0.80225989 0.19774011
republican 0.06034483 0.93965517
V6
Y n y
democrat 0.5500000 0.4500000
republican 0.1196581 0.8803419
V7
Y n y
democrat 0.2166667 0.7833333
republican 0.7631579 0.2368421
V8
Y n y
democrat 0.1881720 0.8118280
republican 0.8392857 0.1607143
V9
Y n y
democrat 0.2285714 0.7714286
republican 0.8706897 0.1293103
V10
Y n y
democrat 0.5163043 0.4836957
republican 0.4310345 0.5689655
V11
Y n y
democrat 0.4943820 0.5056180
republican 0.8918919 0.1081081
V12
Y n y
democrat 0.8505747 0.1494253
republican 0.1388889 0.8611111
V13
Y n y
democrat 0.7329545 0.2670455
republican 0.1339286 0.8660714
V14
Y n y
democrat 0.65921788 0.34078212
republican 0.01769912 0.98230088
V15
Y n y
democrat 0.3743017 0.6256983
republican 0.9266055 0.0733945
V16
Y n y
democrat 0.07407407 0.92592593
republican 0.33009709 0.66990291
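These per-class tables are all the classifier needs: for a new observation, the posterior is proportional to the a-priori probability times the product of the conditional probabilities of its observed values. A worked sketch using only V4 from the tables above (values rounded):

```latex
P(\text{class} \mid \mathbf{x}) \;\propto\; P(\text{class}) \prod_{j} P(x_j \mid \text{class})

% A member voting "n" on V4:
P(\text{democrat}) \, P(V4 = n \mid \text{democrat}) = 0.6131 \times 0.9451 \approx 0.5794
P(\text{republican}) \, P(V4 = n \mid \text{republican}) = 0.3869 \times 0.0085 \approx 0.0033
% Normalizing: P(\text{democrat} \mid V4 = n) \approx 0.5794 / (0.5794 + 0.0033) \approx 0.994
```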
# Generate predicted probabilities and classes
bayesPred <- predict(bayesModel, newdata = votes.ted) # predict(Trained Model, Test Dataset)
bayesPred
Prediction: 130 observations
predict.type: prob
threshold: democrat=0.50,republican=0.50
time: 0.03
truth prob.democrat prob.republican response
1 democrat 8.238201e-01 1.761799e-01 democrat
2 democrat 1.000000e+00 5.886008e-11 democrat
3 republican 7.645217e-06 9.999924e-01 republican
4 democrat 1.000000e+00 5.920266e-13 democrat
5 democrat 1.000000e+00 3.198378e-11 democrat
6 democrat 1.000000e+00 5.920266e-13 democrat
... (#rows: 130, #cols: 4)
getPredictionProbabilities(bayesPred, # Extract predicted probabilities for Test Dataset
cl = c("democrat", "republican"))
democrat republican
1 8.238201e-01 1.761799e-01
2 1.000000e+00 5.886008e-11
3 7.645217e-06 9.999924e-01
4 1.000000e+00 5.920266e-13
5 1.000000e+00 3.198378e-11
6 1.000000e+00 5.920266e-13
7 1.000000e+00 8.182869e-13
8 1.011552e-08 1.000000e+00
9 1.000000e+00 9.488841e-10
10 9.999997e-01 2.572045e-07
11 1.000000e+00 5.920266e-13
12 3.915526e-07 9.999996e-01
13 1.000000e+00 5.886008e-11
14 6.707387e-08 9.999999e-01
15 8.534968e-08 9.999999e-01
16 6.230573e-08 9.999999e-01
17 4.507789e-08 1.000000e+00
18 1.000000e+00 4.471440e-11
19 1.000000e+00 1.457115e-10
20 8.628902e-06 9.999914e-01
21 1.000000e+00 9.269518e-11
22 1.079530e-04 9.998920e-01
23 2.647942e-02 9.735206e-01
24 9.555196e-03 9.904448e-01
25 9.675164e-09 1.000000e+00
26 9.266834e-09 1.000000e+00
27 9.998391e-01 1.608586e-04
28 8.599169e-03 9.914008e-01
29 1.671787e-08 1.000000e+00
30 9.998701e-01 1.299446e-04
31 2.139943e-01 7.860057e-01
32 1.000000e+00 7.218537e-12
33 2.180828e-03 9.978192e-01
34 5.959340e-08 9.999999e-01
35 1.254382e-05 9.999875e-01
36 5.959340e-08 9.999999e-01
37 3.403310e-04 9.996597e-01
38 9.999996e-01 4.192667e-07
39 8.778807e-08 9.999999e-01
40 1.000000e+00 9.041840e-10
41 1.000000e+00 1.028385e-08
42 1.425265e-08 1.000000e+00
43 3.556542e-06 9.999964e-01
44 5.959340e-08 9.999999e-01
45 1.000000e+00 2.530841e-11
46 1.000000e+00 7.682795e-12
47 1.000000e+00 4.705050e-11
48 1.000000e+00 3.615572e-11
49 1.000000e+00 2.742456e-10
50 5.818545e-08 9.999999e-01
51 9.999998e-01 2.017958e-07
52 1.000000e+00 3.881701e-09
53 1.013079e-06 9.999990e-01
54 4.603322e-06 9.999954e-01
55 1.000000e+00 7.298744e-09
56 2.876249e-04 9.997124e-01
57 1.000000e+00 8.932280e-13
58 1.000000e+00 4.136830e-09
59 1.000000e+00 4.648504e-11
60 8.778807e-08 9.999999e-01
61 6.230573e-08 9.999999e-01
62 9.999853e-01 1.472976e-05
63 1.000000e+00 5.803491e-11
64 8.013953e-01 1.986047e-01
65 1.425265e-08 1.000000e+00
66 1.000000e+00 1.150764e-10
67 1.000000e+00 1.621414e-10
68 1.000000e+00 7.536611e-12
69 1.013079e-06 9.999990e-01
70 1.000000e+00 7.536611e-12
71 1.000000e+00 1.150764e-10
72 1.000000e+00 4.995224e-12
73 1.000000e+00 3.484213e-09
74 3.438848e-01 6.561152e-01
75 2.755583e-07 9.999997e-01
76 1.000000e+00 4.628407e-10
77 1.000000e+00 5.429128e-11
78 9.999944e-01 5.610058e-06
79 9.999999e-01 8.906709e-08
80 9.930752e-01 6.924833e-03
81 9.999372e-01 6.282123e-05
82 1.638864e-05 9.999836e-01
83 1.305687e-08 1.000000e+00
84 9.266834e-09 1.000000e+00
85 9.999997e-01 2.658417e-07
86 4.086203e-03 9.959138e-01
87 4.516398e-01 5.483602e-01
88 1.000000e+00 2.753713e-10
89 1.000000e+00 4.997368e-11
90 9.998034e-01 1.965639e-04
91 9.947667e-01 5.233313e-03
92 9.847895e-01 1.521048e-02
93 1.202567e-07 9.999999e-01
94 6.667664e-04 9.993332e-01
95 1.000000e+00 7.218537e-12
96 1.000000e+00 8.932280e-13
97 9.999992e-01 8.014432e-07
98 1.000000e+00 5.452702e-12
99 1.000000e+00 8.248300e-10
100 7.304956e-07 9.999993e-01
101 8.547786e-06 9.999915e-01
102 9.999988e-01 1.171998e-06
103 7.843716e-02 9.215628e-01
104 2.490118e-04 9.997510e-01
105 5.707832e-08 9.999999e-01
106 1.842205e-02 9.815780e-01
107 1.305687e-08 1.000000e+00
108 9.945640e-01 5.436017e-03
109 1.000000e+00 6.035988e-10
110 1.000000e+00 1.488419e-08
111 1.000000e+00 9.536544e-09
112 9.999999e-01 6.490770e-08
113 1.251855e-01 8.748145e-01
114 8.042272e-08 9.999999e-01
115 1.253937e-04 9.998746e-01
116 9.999999e-01 1.165042e-07
117 3.248577e-02 9.675142e-01
118 9.999877e-01 1.226982e-05
119 1.000000e+00 1.069932e-12
120 4.911480e-06 9.999951e-01
121 9.800988e-01 1.990121e-02
122 9.999984e-01 1.580283e-06
123 3.280626e-08 1.000000e+00
124 9.999990e-01 1.016716e-06
125 1.000000e+00 1.727983e-10
126 1.458032e-05 9.999854e-01
127 1.206210e-01 8.793790e-01
128 9.999991e-01 8.994777e-07
129 1.000000e+00 1.493769e-08
130 1.000000e+00 2.399307e-09
getPredictionResponse(bayesPred) # Extract predicted classes for Test Dataset
[1] democrat democrat republican democrat democrat
[6] democrat democrat republican democrat democrat
[11] democrat republican democrat republican republican
[16] republican republican democrat democrat republican
[21] democrat republican republican republican republican
[26] republican democrat republican republican democrat
[31] republican democrat republican republican republican
[36] republican republican democrat republican democrat
[41] democrat republican republican republican democrat
[46] democrat democrat democrat democrat republican
[51] democrat democrat republican republican democrat
[56] republican democrat democrat democrat republican
[61] republican democrat democrat democrat republican
[66] democrat democrat democrat republican democrat
[71] democrat democrat democrat republican republican
[76] democrat democrat democrat democrat democrat
[81] democrat republican republican republican democrat
[86] republican republican democrat democrat democrat
[91] democrat democrat republican republican democrat
[96] democrat democrat democrat democrat republican
[101] republican democrat republican republican republican
[106] republican republican democrat democrat democrat
[111] democrat democrat republican republican republican
[116] democrat republican democrat democrat republican
[121] democrat democrat republican democrat democrat
[126] republican republican democrat democrat democrat
Levels: democrat republican
# Confusion Matrix
calculateConfusionMatrix(bayesPred, # Result of predict()
relative = TRUE) # Also print relative frequencies
Relative confusion matrix (normalized by row/column):
predicted
true democrat republican -err.-
democrat 0.89/0.97 0.11/0.16 0.11
republican 0.04/0.03 0.96/0.84 0.04
-err.- 0.03 0.16 0.08
Absolute confusion matrix:
predicted
true democrat republican -err.-
democrat 71 9 9
republican 2 48 2
-err.- 2 9 11
# Performance Measure
performance(bayesPred, # Result of predict()
measures = list(mmce, acc)) # mmce : mean misclassification error, acc : accuracy
mmce acc
0.08461538 0.91538462
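These numbers can be verified directly from the absolute confusion matrix above: 9 + 2 = 11 of the 130 test observations are misclassified.

```r
# Recompute mmce and acc from the confusion-matrix counts
(9 + 2) / 130   # mmce: 0.08461538
(71 + 48) / 130 # acc : 0.91538462
```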
Caution!
The evaluation measures that can be passed to the measures argument of the function performance are listed here.
# ROC Curve
df <- generateThreshVsPerfData(bayesPred, measures = list(fpr, tpr))
plotROCCurves(df) +
theme_bw()
# AUC
performance(bayesPred, # Result of predict()
auc)
auc
0.97975
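The AUC of about 0.98 is estimated from a single 7:3 hold-out split. mlr can also estimate performance by cross-validation on the task; the sketch below (the names kFold and bayesCV are illustrative, not from the original) runs stratified 10-fold CV repeated 10 times:

```r
# Stratified, repeated 10-fold cross-validation of the Naive Bayes learner
kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 10,
                          stratify = TRUE)
bayesCV <- resample(learner = bayes, task = votesTask,
                    resampling = kFold,
                    measures = list(mmce, acc))
bayesCV$aggr # Aggregated mean misclassification error and accuracy
```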
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".