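Advantages of Support Vector Machines

They overcome a limitation of decision trees, whose decision boundaries are restricted to axis-aligned rectangles.
They are useful for learning complex nonlinear decision boundaries.
They make no distributional assumptions about the predictors.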

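Disadvantages of Support Vector Machines

There are many hyperparameters, and the model is sensitive to them.
- Finding the best model requires evaluating many combinations of kernels and hyperparameters.
Training is slow.
Only continuous predictors are supported.
- Categorical predictors must first be converted with dummy or one-hot encoding.
The result is a complex black-box model that is hard to interpret.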
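
Because performance depends so strongly on the kernel and its hyperparameters, a cross-validated grid search is a common way to compare combinations. The sketch below uses tune.svm() from e1071 on a two-class subset of the built-in iris data. The dataset and the grid values are illustrative assumptions, not settings taken from this tutorial.

```r
library(e1071)

# Illustrative data (assumption): two classes of the built-in iris dataset
d <- droplevels(subset(iris, Species != "setosa"))

set.seed(1)
tuned <- tune.svm(Species ~ ., data = d,
                  kernel = "radial",
                  gamma = 10^(-3:0),      # candidate gamma values (illustrative)
                  cost  = 10^(0:2))       # candidate cost values (illustrative)

tuned$best.parameters                      # gamma/cost pair with the lowest cross-validated error
head(tuned$performances)                   # cross-validated error for each combination
```

The grid grows multiplicatively with every added hyperparameter, which is one reason tuning an SVM can be time-consuming.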

Practice data: records for 2,500 customers of Universal Bank (source: Data Mining for Business Intelligence, Shmueli et al. 2010). The file has 14 columns: an ID plus 13 variables (12 predictors and the target). The target is Personal Loan.

1. Loading the Data

pacman::p_load("data.table", "dplyr",
               "caret",
               "ggplot2", "GGally",
               "e1071")


UB <- fread("../Universal Bank_Main.csv")                               # Load the data

UB %>%
  as_tibble

# A tibble: 2,500 × 14
      ID   Age Experience Income `ZIP Code` Family CCAvg Education
   <int> <int>      <int>  <int>      <int>  <int> <dbl>     <int>
 1     1    25          1     49      91107      4   1.6         1
 2     2    45         19     34      90089      3   1.5         1
 3     3    39         15     11      94720      1   1           1
 4     4    35          9    100      94112      1   2.7         2
 5     5    35          8     45      91330      4   1           2
 6     6    37         13     29      92121      4   0.4         2
 7     7    53         27     72      91711      2   1.5         2
 8     8    50         24     22      93943      1   0.3         3
 9     9    35         10     81      90089      3   0.6         2
10    10    34          9    180      93023      1   8.9         3
# ℹ 2,490 more rows
# ℹ 6 more variables: Mortgage <int>, `Personal Loan` <int>,
#   `Securities Account` <int>, `CD Account` <int>, Online <int>,
#   CreditCard <int>

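2. Data Preprocessing I

UB <- UB %>%                                                            # Plain assignment: %<>% needs magrittr, which is not loaded above
  data.frame() %>%                                                      # Convert to a data frame (names become syntactic, e.g. Personal.Loan)
  mutate(Personal.Loan = ifelse(Personal.Loan == 1, "yes", "no")) %>%   # Recode the target as a character variable
  select(-1)                                                            # Drop the ID column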

# 1. Convert to Factor
fac.col <- c("Family", "Education", "Securities.Account", 
             "CD.Account", "Online", "CreditCard",
             # Target
             "Personal.Loan")

UB <- UB %>% 
  mutate(across(all_of(fac.col), as.factor))                            # Convert to factors

glimpse(UB)                                                             # Check the data structure

Rows: 2,500
Columns: 13
$ Age                <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6…
$ Experience         <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5…
$ Income             <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,…
$ ZIP.Code           <int> 91107, 90089, 94720, 94112, 91330, 92121,…
$ Family             <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,…
$ CCAvg              <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0…
$ Education          <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,…
$ Mortgage           <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0…
$ Personal.Loan      <fct> no, no, no, no, no, no, no, no, no, yes, …
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ CD.Account         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Online             <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,…
$ CreditCard         <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…

# 2. One-hot encode the categorical predictors
dummies <- dummyVars(formula = ~ .,                                     # formula: ~ predictors; "." means every variable in data
                     data = UB[,-9],                                    # Predictors only: column 9 (Personal.Loan, the target) is excluded
                     fullRank = FALSE)                                  # fullRank = TRUE: dummy variables, fullRank = FALSE: one-hot encoding

UB.Var   <- predict(dummies, newdata = UB) %>%                          # Apply the one-hot encoding to the categorical predictors
  data.frame()                                                          # Convert to a data frame

glimpse(UB.Var)                                                         # Check the data structure

Rows: 2,500
Columns: 21
$ Age                  <dbl> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34,…
$ Experience           <dbl> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39,…
$ Income               <dbl> 49, 34, 11, 100, 45, 29, 72, 22, 81, 18…
$ ZIP.Code             <dbl> 91107, 90089, 94720, 94112, 91330, 9212…
$ Family.1             <dbl> 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, …
$ Family.2             <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, …
$ Family.3             <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, …
$ Family.4             <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, …
$ CCAvg                <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3,…
$ Education.1          <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Education.2          <dbl> 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, …
$ Education.3          <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, …
$ Mortgage             <dbl> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0,…
$ Securities.Account.0 <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, …
$ Securities.Account.1 <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ CD.Account.0         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ CD.Account.1         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Online.0             <dbl> 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, …
$ Online.1             <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, …
$ CreditCard.0         <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, …
$ CreditCard.1         <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, …
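
The difference between the two fullRank settings is easiest to see on a small example. The sketch below applies dummyVars() to a hypothetical toy data frame (the values are made up for illustration): one-hot encoding keeps one column per level, while dummy coding drops the first level of each factor as the reference.

```r
library(caret)

# Hypothetical toy data with two categorical predictors
toy <- data.frame(Education = factor(c(1, 2, 3, 2)),
                  Online    = factor(c(0, 1, 1, 0)))

onehot <- predict(dummyVars(~ ., data = toy, fullRank = FALSE), newdata = toy)
dummy  <- predict(dummyVars(~ ., data = toy, fullRank = TRUE),  newdata = toy)

colnames(onehot)   # one column per level of each factor
colnames(dummy)    # first level of each factor dropped as the reference
ncol(onehot)       # 3 + 2 levels
ncol(dummy)        # (3 - 1) + (2 - 1) levels
```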

# 3. Combine the target with the encoded predictors
UB.df <- data.frame(Personal.Loan = UB$Personal.Loan, 
                    UB.Var)

UB.df %>%
  as_tibble

# A tibble: 2,500 × 22
   Personal.Loan   Age Experience Income ZIP.Code Family.1 Family.2
   <fct>         <dbl>      <dbl>  <dbl>    <dbl>    <dbl>    <dbl>
 1 no               25          1     49    91107        0        0
 2 no               45         19     34    90089        0        0
 3 no               39         15     11    94720        1        0
 4 no               35          9    100    94112        1        0
 5 no               35          8     45    91330        0        0
 6 no               37         13     29    92121        0        0
 7 no               53         27     72    91711        0        1
 8 no               50         24     22    93943        1        0
 9 no               35         10     81    90089        0        0
10 yes              34          9    180    93023        1        0
# ℹ 2,490 more rows
# ℹ 15 more variables: Family.3 <dbl>, Family.4 <dbl>, CCAvg <dbl>,
#   Education.1 <dbl>, Education.2 <dbl>, Education.3 <dbl>,
#   Mortgage <dbl>, Securities.Account.0 <dbl>,
#   Securities.Account.1 <dbl>, CD.Account.0 <dbl>,
#   CD.Account.1 <dbl>, Online.0 <dbl>, Online.1 <dbl>,
#   CreditCard.0 <dbl>, CreditCard.1 <dbl>

glimpse(UB.df)                                                          # Check the data structure

Rows: 2,500
Columns: 22
$ Personal.Loan        <fct> no, no, no, no, no, no, no, no, no, yes…
$ Age                  <dbl> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34,…
$ Experience           <dbl> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39,…
$ Income               <dbl> 49, 34, 11, 100, 45, 29, 72, 22, 81, 18…
$ ZIP.Code             <dbl> 91107, 90089, 94720, 94112, 91330, 9212…
$ Family.1             <dbl> 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, …
$ Family.2             <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, …
$ Family.3             <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, …
$ Family.4             <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, …
$ CCAvg                <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3,…
$ Education.1          <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Education.2          <dbl> 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, …
$ Education.3          <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, …
$ Mortgage             <dbl> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0,…
$ Securities.Account.0 <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, …
$ Securities.Account.1 <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ CD.Account.0         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ CD.Account.1         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Online.0             <dbl> 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, …
$ Online.1             <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, …
$ CreditCard.0         <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, …
$ CreditCard.1         <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, …

3. Exploring the Data

ggpairs(UB,                                           # UB as prepared in step 2 (# 1. Convert to Factor)
        columns = c("Age", "Experience", "Income",    # Numeric predictors
                    "ZIP.Code", "CCAvg", "Mortgage"),                            
        aes(colour = Personal.Loan)) +                # Color by target class
  theme_bw()

ggpairs(UB,                                           # UB as prepared in step 2 (# 1. Convert to Factor)
        columns = c("Age", "Experience", "Income",    # Numeric predictors
                    "ZIP.Code", "CCAvg", "Mortgage"), 
        aes(colour = Personal.Loan)) +                # Color by target class
  scale_color_brewer(palette="Purples") +             # Use a specific palette
  scale_fill_brewer(palette="Purples") +              # Use a specific palette
  theme_bw()

ggpairs(UB,                                           # UB as prepared in step 2 (# 1. Convert to Factor)
        columns = c("Age", "Income",                  # Numeric predictors
                    "Family", "Education"),           # Categorical predictors
        aes(colour = Personal.Loan, alpha = 0.8)) +   # Color by target class
  scale_colour_manual(values = c("purple","cyan4")) + # Use specific colors
  scale_fill_manual(values = c("purple","cyan4")) +   # Use specific colors
  theme_bw()

4. Splitting the Data

# Partition (Training Dataset : Test Dataset = 7:3)
y      <- UB.df$Personal.Loan                            # Target
 
set.seed(200)
ind    <- createDataPartition(y, p = 0.7, list = TRUE)   # Row indices for a 7:3 split
UB.trd <- UB.df[ind$Resample1,]                          # Training Dataset
UB.ted <- UB.df[-ind$Resample1,]                         # Test Dataset
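
When the target is a factor, createDataPartition() samples within each class, so both parts keep roughly the same class shares as the full data. The sketch below checks this on a hypothetical imbalanced target; the 900/100 class counts are made up for illustration.

```r
library(caret)

# Hypothetical imbalanced target: 10% "yes"
y.toy <- factor(rep(c("no", "yes"), times = c(900, 100)))

set.seed(200)
idx <- createDataPartition(y.toy, p = 0.7, list = FALSE)   # matrix of training row numbers

prop.table(table(y.toy))            # class shares in the full data
prop.table(table(y.toy[idx]))       # class shares in the training part
prop.table(table(y.toy[-idx]))      # class shares in the test part
```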

5. Data Preprocessing II

# Standardization
preProcValues <- preProcess(UB.trd, 
                            method = c("center", "scale"))  # Define the standardization: means and standard deviations are computed from the Training Dataset

UB.trd <- predict(preProcValues, UB.trd)                    # Standardization for Training Dataset
UB.ted <- predict(preProcValues, UB.ted)                    # Standardization for Test Dataset

glimpse(UB.trd)                                             # Check the data structure

Rows: 1,751
Columns: 22
$ Personal.Loan        <fct> no, no, no, no, no, no, no, yes, no, no…
$ Age                  <dbl> -0.05431273, -0.57446728, -0.92123699, …
$ Experience           <dbl> -0.12175295, -0.46882565, -0.98943471, …
$ Income               <dbl> -0.85867297, -1.35649686, 0.56986515, -…
$ ZIP.Code             <dbl> -1.75250883, 0.88354520, 0.53745994, -1…
$ Family.1             <dbl> -0.6355621, 1.5725118, 1.5725118, -0.63…
$ Family.2             <dbl> -0.5774051, -0.5774051, -0.5774051, -0.…
$ Family.3             <dbl> 2.0037210, -0.4987865, -0.4987865, -0.4…
$ Family.4             <dbl> -0.5967491, -0.5967491, -0.5967491, 1.6…
$ CCAvg                <dbl> -0.25119120, -0.53150921, 0.42157204, -…
$ Education.1          <dbl> 1.1482386, 1.1482386, -0.8704018, -0.87…
$ Education.2          <dbl> -0.6196534, -0.6196534, 1.6128838, 1.61…
$ Education.3          <dbl> -0.6408777, -0.6408777, -0.6408777, -0.…
$ Mortgage             <dbl> -0.5664192, -0.5664192, -0.5664192, -0.…
$ Securities.Account.0 <dbl> -2.7998134, 0.3569627, 0.3569627, 0.356…
$ Securities.Account.1 <dbl> 2.7998134, -0.3569627, -0.3569627, -0.3…
$ CD.Account.0         <dbl> 0.2613337, 0.2613337, 0.2613337, 0.2613…
$ CD.Account.1         <dbl> -0.2613337, -0.2613337, -0.2613337, -0.…
$ Online.0             <dbl> 1.2486195, 1.2486195, 1.2486195, 1.2486…
$ Online.1             <dbl> -1.2486195, -1.2486195, -1.2486195, -1.…
$ CreditCard.0         <dbl> 0.6408777, 0.6408777, 0.6408777, -1.559…
$ CreditCard.1         <dbl> -0.6408777, -0.6408777, -0.6408777, 1.5…

glimpse(UB.ted)                                             # Check the data structure

Rows: 749
Columns: 22
$ Personal.Loan        <fct> no, no, no, no, no, no, no, no, no, no,…
$ Age                  <dbl> -1.7881612, -0.7478521, 1.2460737, 0.81…
$ Experience           <dbl> -1.68358012, -0.64236200, 0.83269699, 0…
$ Income               <dbl> -0.53400522, -0.96689556, -1.11840718, …
$ ZIP.Code             <dbl> -1.17304370, -0.59585545, 1.07366441, 0…
$ Family.1             <dbl> -0.6355621, -0.6355621, 1.5725118, 1.57…
$ Family.2             <dbl> -0.5774051, -0.5774051, -0.5774051, -0.…
$ Family.3             <dbl> -0.4987865, -0.4987865, -0.4987865, -0.…
$ Family.4             <dbl> 1.6747892, 1.6747892, -0.5967491, -0.59…
$ CCAvg                <dbl> -0.19512759, -0.86789083, -0.25119120, …
$ Education.1          <dbl> 1.1482386, -0.8704018, -0.8704018, -0.8…
$ Education.2          <dbl> -0.6196534, 1.6128838, -0.6196534, 1.61…
$ Education.3          <dbl> -0.6408777, -0.6408777, 1.5594690, -0.6…
$ Mortgage             <dbl> -0.5664192, 0.9609885, -0.5664192, -0.5…
$ Securities.Account.0 <dbl> -2.7998134, 0.3569627, 0.3569627, -2.79…
$ Securities.Account.1 <dbl> 2.7998134, -0.3569627, -0.3569627, 2.79…
$ CD.Account.0         <dbl> 0.2613337, 0.2613337, 0.2613337, 0.2613…
$ CD.Account.1         <dbl> -0.2613337, -0.2613337, -0.2613337, -0.…
$ Online.0             <dbl> 1.2486195, -0.8004271, -0.8004271, 1.24…
$ Online.1             <dbl> -1.2486195, 0.8004271, 0.8004271, -1.24…
$ CreditCard.0         <dbl> 0.6408777, 0.6408777, -1.5594690, -1.55…
$ CreditCard.1         <dbl> -0.6408777, -0.6408777, 1.5594690, 1.55…
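
The key point of this step is that the Test Dataset is transformed with the means and standard deviations of the Training Dataset, not its own. A minimal base R sketch of the same idea, using a hypothetical Income column with made-up values:

```r
# Hypothetical numeric predictor split into training and test parts
train <- data.frame(Income = c(20, 40, 60, 80))
test  <- data.frame(Income = c(50, 100))

mu    <- mean(train$Income)     # training mean
sigma <- sd(train$Income)       # training standard deviation

train$Income.z <- (train$Income - mu) / sigma
test$Income.z  <- (test$Income  - mu) / sigma   # reuse the training statistics

mean(train$Income.z)   # 0 (up to rounding) on the training data by construction
sd(train$Income.z)     # 1 on the training data by construction
test$Income.z          # test values are generally not centered at 0
```

Reusing the training statistics keeps information from the test cases out of the preprocessing step.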

6. Training the Model

The "e1071" package makes "libsvm", an efficient implementation of Support Vector Machines, available in R, and its svm() function fits a Support Vector Machine. For the full list of options, see the e1071 documentation for svm().

svm(formula, data, kernel, cost, gamma, probability, ...)

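formula : Describes the relationship between the target and the predictors, usually written as Target ~ predictors.
data : The dataset (data frame) containing the variables used in formula.
kernel : The kernel function.
- "linear" : $k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^{\top}\boldsymbol{x}'$
- "polynomial" : $k(\boldsymbol{x}, \boldsymbol{x}') = (\gamma\, \boldsymbol{x}^{\top}\boldsymbol{x}' + \text{coef0})^{\text{degree}}$
- "radial" : $k(\boldsymbol{x}, \boldsymbol{x}') = \exp\left(-\gamma\|\boldsymbol{x}-\boldsymbol{x}'\|^2 \right)$
- "sigmoid" : $k(\boldsymbol{x}, \boldsymbol{x}') = \tanh(\gamma\, \boldsymbol{x}^{\top}\boldsymbol{x}' + \text{coef0})$
cost : The penalty paid for violating the margin, including misclassified cases. Larger values penalize violations more heavily.
gamma : The kernel parameter used by every kernel except "linear". For the radial kernel it controls how far the influence of an individual case on the decision boundary reaches; larger values make that influence more local.
probability : Whether the fitted model should support predicted probabilities.
- TRUE : predict() can then produce predicted probabilities for the Test Dataset.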
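
To get a feel for gamma, the radial kernel can be evaluated by hand. The sketch below writes the formula above as a small helper; rbf_kernel and the two example cases are hypothetical, introduced only for illustration.

```r
# Radial kernel written out directly from its definition (hypothetical helper)
rbf_kernel <- function(x, x2, gamma) exp(-gamma * sum((x - x2)^2))

# Two hypothetical standardized cases that differ by 1 in each of 3 features
a <- c(0, 0, 0)
b <- c(1, 1, 1)

rbf_kernel(a, a, gamma = 2)       # identical cases always give 1
rbf_kernel(a, b, gamma = 0.05)    # exp(-0.15): the cases still look similar
rbf_kernel(a, b, gamma = 2)       # exp(-6): almost no similarity left
```

With a large gamma, even moderately distant cases have a kernel value near zero, so each training case mainly influences its own neighborhood.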

svm.model.rd <- svm(Personal.Loan ~.,     
                    data = UB.trd,  
                    kernel = "radial", 
                    cost = 1,              
                    gamma = 2,
                    probability = TRUE)

summary(svm.model.rd)


Call:
svm(formula = Personal.Loan ~ ., data = UB.trd, kernel = "radial", 
    cost = 1, gamma = 2, probability = TRUE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 

Number of Support Vectors:  1729

 ( 1549 180 )


Number of Classes:  2 

Levels: 
 no yes

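Result! The support vectors are the training cases that lie on or inside the margin around the decision boundary, and "Number of Support Vectors" counts them. Here there are 1,729 support vectors out of 1,751 training cases: 1,549 with "Personal.Loan = no" and 180 with "Personal.Loan = yes". Their row numbers are available in svm.model.rd$index.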

# Support Vector Index
svm.model.rd$index

   [1]    1    2    3    4    5    6    7    9   10   11   12   13
  [13]   15   17   18   19   20   21   22   23   25   26   27   28
  [25]   29   30   31   33   35   36   37   39   40   41   42   44
  [37]   45   47   48   49   50   51   52   53   54   55   56   57
  [49]   58   59   60   61   62   63   64   65   66   67   68   70
  [61]   71   72   73   74   75   76   77   78   79   80   81   82
  [73]   83   84   85   86   87   88   89   90   91   92   93   94
  [85]   95   97   98   99  101  102  103  104  105  106  107  108
  [97]  109  111  112  113  114  116  117  118  119  120  121  122
 [109]  123  124  125  126  127  129  131  132  133  134  135  136
 [121]  137  138  139  140  141  142  144  145  146  147  148  149
 [133]  150  151  152  153  154  155  156  157  158  159  160  161
 [145]  162  163  164  165  166  167  169  170  171  173  174  175
 [157]  176  177  178  179  180  181  182  183  184  185  186  187
 [169]  188  189  190  191  192  193  194  195  196  197  198  199
 [181]  201  202  203  205  206  208  209  211  212  213  214  215
 [193]  216  217  218  221  222  223  227  228  230  231  232  233
 [205]  234  235  237  238  239  240  241  242  243  244  247  249
 [217]  250  251  252  253  254  255  256  257  258  259  260  261
 [229]  262  263  264  265  266  267  269  270  271  272  273  274
 [241]  275  276  278  279  280  281  282  283  284  285  286  287
 [253]  288  289  290  292  294  295  296  297  298  299  300  301
 [265]  302  303  304  306  307  309  310  311  312  313  314  315
 [277]  316  317  318  319  320  321  322  323  324  325  328  329
 [289]  330  331  332  333  335  337  338  339  340  341  342  343
 [301]  344  345  346  347  348  349  350  351  352  353  354  355
 [313]  356  357  358  360  361  362  363  364  365  366  367  368
 [325]  369  370  371  372  373  374  376  377  378  379  381  382
 [337]  383  384  385  386  387  388  389  390  391  392  393  394
 [349]  395  396  397  399  400  401  403  404  405  406  407  408
 [361]  409  410  411  412  413  414  415  416  417  418  419  420
 [373]  421  422  423  424  425  426  427  428  429  430  431  432
 [385]  433  434  435  437  438  439  440  443  444  445  446  447
 [397]  448  449  450  451  452  453  454  456  457  459  460  461
 [409]  462  464  465  466  467  468  469  470  471  473  474  475
 [421]  476  477  478  479  480  481  482  483  484  485  486  487
 [433]  488  489  490  491  492  494  495  496  497  498  499  500
 [445]  501  502  503  504  505  506  507  508  510  511  512  513
 [457]  514  515  516  519  520  521  522  523  524  525  526  527
 [469]  528  529  530  531  532  533  534  535  536  537  538  540
 [481]  542  544  545  546  548  549  552  553  554  555  556  557
 [493]  558  559  560  561  562  564  565  566  568  569  570  571
 [505]  572  573  574  575  576  577  578  579  580  581  582  584
 [517]  585  586  587  588  589  590  591  592  593  594  595  596
 [529]  597  598  599  600  601  602  603  604  605  606  607  608
 [541]  609  610  611  612  613  614  615  616  617  618  620  621
 [553]  623  624  625  626  627  628  630  631  632  633  634  635
 [565]  636  637  638  639  640  641  643  644  645  646  647  648
 [577]  649  650  651  652  653  654  655  656  657  658  659  660
 [589]  661  662  665  666  667  668  669  671  673  674  675  676
 [601]  677  678  680  681  682  683  685  686  687  688  689  690
 [613]  692  693  694  695  696  697  698  699  700  701  702  703
 [625]  704  705  706  707  709  710  711  713  714  716  717  718
 [637]  719  721  722  723  724  725  726  728  729  730  731  732
 [649]  733  734  735  736  737  738  739  740  743  746  747  748
 [661]  749  750  752  753  754  755  756  757  758  759  760  761
 [673]  762  763  764  765  766  767  768  769  771  772  773  774
 [685]  775  776  777  778  779  780  781  784  785  786  787  788
 [697]  789  791  792  793  795  797  798  799  800  801  802  803
 [709]  804  805  806  807  808  810  811  812  813  814  815  817
 [721]  818  819  820  821  822  823  824  825  826  827  828  830
 [733]  832  833  835  836  837  838  839  840  841  842  843  844
 [745]  845  846  847  848  849  850  851  853  854  855  856  857
 [757]  858  859  860  862  863  864  865  866  867  868  869  870
 [769]  871  872  873  874  875  876  878  879  880  881  882  883
 [781]  884  885  886  888  889  890  891  892  893  894  895  896
 [793]  897  898  899  900  901  902  903  904  905  906  907  908
 [805]  910  911  912  914  915  917  918  919  920  921  922  923
 [817]  925  926  927  928  929  930  931  933  934  935  936  937
 [829]  938  939  940  941  942  943  944  945  946  947  948  949
 [841]  950  953  954  955  956  957  958  959  960  961  962  963
 [853]  964  965  966  967  968  970  971  972  974  975  977  980
 [865]  981  983  984  985  986  987  988  989  990  991  992  993
 [877]  994  995  996  997  998  999 1000 1001 1002 1003 1004 1005
 [889] 1006 1008 1009 1011 1012 1013 1014 1015 1016 1017 1018 1019
 [901] 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031
 [913] 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1043 1044
 [925] 1046 1047 1048 1050 1051 1052 1053 1054 1055 1056 1057 1058
 [937] 1060 1061 1062 1063 1065 1066 1067 1068 1069 1070 1071 1072
 [949] 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085
 [961] 1086 1087 1088 1089 1091 1092 1093 1094 1095 1096 1097 1098
 [973] 1099 1101 1102 1104 1105 1106 1108 1109 1111 1113 1114 1115
 [985] 1116 1117 1118 1119 1120 1121 1123 1124 1125 1126 1127 1128
 [997] 1129 1130 1132 1134 1135 1136 1137 1139 1140 1141 1143 1144
[1009] 1145 1147 1148 1149 1150 1151 1152 1153 1156 1157 1158 1159
[1021] 1160 1162 1163 1164 1165 1167 1168 1169 1171 1173 1174 1176
[1033] 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188
[1045] 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200
[1057] 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212
[1069] 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225
[1081] 1226 1227 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238
[1093] 1239 1240 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251
[1105] 1252 1254 1255 1256 1258 1259 1261 1263 1264 1265 1266 1267
[1117] 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279
[1129] 1280 1282 1285 1286 1287 1288 1289 1291 1292 1293 1294 1295
[1141] 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307
[1153] 1309 1310 1311 1312 1313 1314 1315 1316 1318 1319 1320 1321
[1165] 1322 1323 1324 1325 1327 1328 1329 1330 1331 1332 1333 1334
[1177] 1335 1336 1337 1338 1339 1340 1341 1343 1344 1346 1347 1348
[1189] 1349 1350 1351 1352 1353 1354 1355 1356 1357 1359 1361 1362
[1201] 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374
[1213] 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386
[1225] 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398
[1237] 1399 1400 1401 1402 1404 1405 1406 1407 1408 1409 1410 1411
[1249] 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424
[1261] 1426 1427 1428 1430 1431 1432 1433 1434 1435 1436 1437 1439
[1273] 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1451 1452
[1285] 1453 1454 1456 1457 1459 1460 1461 1462 1463 1465 1466 1467
[1297] 1468 1469 1470 1472 1473 1474 1475 1476 1477 1478 1479 1480
[1309] 1481 1482 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493
[1321] 1496 1497 1498 1499 1500 1501 1502 1503 1506 1507 1508 1509
[1333] 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521
[1345] 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1534
[1357] 1535 1536 1537 1538 1539 1541 1542 1543 1544 1545 1546 1548
[1369] 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561
[1381] 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1574
[1393] 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1587 1588
[1405] 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601
[1417] 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613
[1429] 1614 1615 1616 1617 1618 1619 1620 1621 1623 1624 1625 1626
[1441] 1627 1628 1629 1631 1632 1633 1634 1635 1636 1637 1638 1639
[1453] 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1652
[1465] 1653 1656 1657 1658 1659 1660 1661 1662 1663 1664 1666 1667
[1477] 1668 1669 1670 1671 1672 1673 1675 1676 1677 1678 1679 1681
[1489] 1682 1683 1684 1685 1686 1687 1688 1690 1691 1692 1693 1694
[1501] 1695 1696 1698 1699 1700 1701 1702 1704 1705 1706 1707 1710
[1513] 1711 1712 1714 1715 1716 1717 1718 1721 1722 1723 1724 1725
[1525] 1726 1727 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738
[1537] 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750
[1549] 1751    8   14   16   24   32   34   38   43   46   69   96
[1561]  110  115  128  130  143  168  172  200  207  210  219  220
[1573]  224  225  226  245  246  248  268  277  293  305  308  326
[1585]  327  334  336  359  375  380  398  402  436  442  455  458
[1597]  463  472  493  509  517  539  541  543  547  550  551  563
[1609]  567  583  619  622  629  642  663  664  670  672  679  684
[1621]  691  708  712  715  720  727  742  744  745  751  770  782
[1633]  783  790  794  796  809  816  829  834  861  877  887  909
[1645]  916  924  932  951  952  969  973  976  978  979  982 1007
[1657] 1042 1045 1049 1059 1064 1090 1100 1103 1107 1110 1112 1122
[1669] 1131 1138 1142 1146 1154 1155 1161 1166 1170 1172 1175 1213
[1681] 1228 1241 1253 1257 1260 1262 1281 1283 1284 1290 1308 1317
[1693] 1326 1342 1345 1358 1360 1403 1412 1425 1429 1438 1450 1458
[1705] 1464 1471 1495 1504 1505 1533 1540 1547 1549 1573 1585 1586
[1717] 1622 1630 1651 1654 1655 1665 1674 1680 1689 1709 1713 1719
[1729] 1728
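
The counts in the summary are also stored in the fitted object. The sketch below reads them from an svm() fit on a two-class subset of the built-in iris data; the data and the settings are assumptions for illustration. In this tutorial the share is 1,729 / 1,751, about 99% of the training cases, which can be a sign that gamma is large for the number of features.

```r
library(e1071)

# Illustrative data (assumption): two classes of the built-in iris dataset
d <- droplevels(subset(iris, Species != "setosa"))

fit <- svm(Species ~ ., data = d, kernel = "radial", cost = 1, gamma = 2)

fit$tot.nSV                      # total number of support vectors
fit$nSV                          # support vectors per class
fit$tot.nSV / nrow(d)            # share of training cases kept as support vectors
head(d[fit$index, ])             # the support vectors as rows of the training data
```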

7. Evaluating the Model

Caution! Model evaluation requires the predicted class/probability for the Test Dataset, which is generated with predict().

# Generate predicted classes
svm.rd.pred <- predict(svm.model.rd,
                       newdata = UB.ted[,-1])        # Test Dataset with predictors only; class labels are returned by default

svm.rd.pred %>%
  as_tibble

# A tibble: 749 × 1
   value
   <fct>
 1 no   
 2 no   
 3 no   
 4 no   
 5 no   
 6 no   
 7 no   
 8 no   
 9 no   
10 no   
# ℹ 739 more rows

7-1. ConfusionMatrix

CM   <- caret::confusionMatrix(svm.rd.pred, UB.ted$Personal.Loan, 
                               positive = "yes")     # confusionMatrix(predicted class, actual class, positive = "class of interest")
CM

Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  673  76
       yes   0   0
                                          
               Accuracy : 0.8985          
                 95% CI : (0.8746, 0.9192)
    No Information Rate : 0.8985          
    P-Value [Acc > NIR] : 0.5305          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.0000          
            Specificity : 1.0000          
         Pos Pred Value :    NaN          
         Neg Pred Value : 0.8985          
             Prevalence : 0.1015          
         Detection Rate : 0.0000          
   Detection Prevalence : 0.0000          
      Balanced Accuracy : 0.5000          
                                          
       'Positive' Class : yes
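
The main statistics can be recomputed by hand from the four counts in the table. The sketch below copies the counts from the output above. Every test case is predicted as "no", so the accuracy equals the no information rate and the sensitivity is 0.

```r
# Counts copied from the confusion matrix above (rows: prediction, columns: reference)
cm <- matrix(c(673, 0, 76, 0), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

TP <- cm["yes", "yes"]; FN <- cm["no", "yes"]
TN <- cm["no", "no"];   FP <- cm["yes", "no"]

accuracy    <- (TP + TN) / sum(cm)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
nir         <- max(colSums(cm)) / sum(cm)   # no information rate: share of the largest class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, nir = nir), 4)
```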

7-2. ROC Curve

# Generate predicted probabilities
test.svm.prob <- predict(svm.model.rd, 
                         newdata = UB.ted[,-1],      # Test Dataset with predictors only
                         probability = TRUE)         # Also return predicted probabilities

attr(test.svm.prob, "probabilities") %>%
  as_tibble

# A tibble: 749 × 2
      no     yes
   <dbl>   <dbl>
 1 0.895 0.105  
 2 0.921 0.0786 
 3 0.995 0.00543
 4 0.841 0.159  
 5 0.951 0.0492 
 6 0.995 0.00489
 7 0.997 0.00270
 8 0.878 0.122  
 9 0.835 0.165  
10 0.998 0.00248
# ℹ 739 more rows

test.svm.prob <- attr(test.svm.prob, "probabilities")[, "yes"]   # Predicted probability of "Personal.Loan = yes", selected by column name

ac  <- UB.ted$Personal.Loan                                 # Actual class in the Test Dataset
pp  <- as.numeric(test.svm.prob)                            # Predicted probabilities as a numeric vector

1) Package “pROC”

pacman::p_load("pROC")

svm.roc  <- roc(ac, pp, plot = TRUE, col = "gray")          # roc(actual class, predicted probability)
auc      <- round(auc(svm.roc), 3)
legend("bottomright", legend = auc, bty = "n")

Caution! The ROC curve drawn with the "pROC" package can be customized with several functions.

# Using plot.roc()
plot.roc(svm.roc,   
         col="gray",                                        # Line color
         print.auc = TRUE,                                  # Whether to print the AUC
         print.auc.col = "red",                             # Color of the AUC text
         print.thres = TRUE,                                # Whether to print the cutoff value
         print.thres.pch = 19,                              # Symbol marking the cutoff value
         print.thres.col = "red",                           # Color of the cutoff symbol
         auc.polygon = TRUE,                                # Whether to shade the area under the curve
         auc.polygon.col = "gray90")                        # Color of the shaded area

# Using ggroc()
ggroc(svm.roc) +
  annotate(geom = "text", x = 0.9, y = 1.0,
           label = paste("AUC = ", auc),
           size = 5,
           color = "red") +
  theme_bw()

2) Package “Epi”

pacman::p_load("Epi")       
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")

ROC(pp, ac, plot = "ROC")                                   # ROC(predicted probability, actual class)

3) Package “ROCR”

pacman::p_load("ROCR")

svm.pred <- prediction(pp, ac)                              # prediction(predicted probability, actual class)

svm.perf <- performance(svm.pred, "tpr", "fpr")             # performance(, "sensitivity", "1 - specificity")
plot(svm.perf, col = "gray")                                # ROC Curve

perf.auc   <- performance(svm.pred, "auc")                  # AUC
auc        <- round(perf.auc@y.values[[1]], 3)              # y.values is a list; extract and round the number
legend("bottomright", legend = auc, bty = "n")
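
The AUC reported by these packages has a simple interpretation: it is the probability that a randomly chosen "yes" case receives a higher predicted probability than a randomly chosen "no" case. The base R sketch below computes it from ranks on hypothetical toy values; pp.toy and ac.toy are made up for illustration.

```r
# Hypothetical predicted probabilities and actual classes
pp.toy <- c(0.10, 0.40, 0.35, 0.80)
ac.toy <- factor(c("no", "no", "yes", "yes"), levels = c("no", "yes"))

r  <- rank(pp.toy)                       # ranks of the predicted probabilities
n1 <- sum(ac.toy == "yes")               # number of positives
n0 <- sum(ac.toy == "no")                # number of negatives

# Rank-sum (Mann-Whitney) form of the AUC
auc.toy <- (sum(r[ac.toy == "yes"]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc.toy
```

Of the four possible yes/no pairs in the toy data, three are ordered correctly, which gives 0.75.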

7-3. Lift Chart

1) Package “ROCR”

svm.perf <- performance(svm.pred, "lift", "rpp")            # Lift Chart
plot(svm.perf, main = "lift curve", 
     colorize = TRUE,                                       # Coloring according to cutoff
     lwd = 2)
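
The two quantities on this chart can be computed by hand: "rpp" is the share of cases predicted positive at a cutoff, and "lift" is the precision among those cases divided by the overall share of positives. A base R sketch on hypothetical scores and classes (the values are made up for illustration):

```r
# Hypothetical scores and actual classes (1 = yes, 0 = no)
score <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
truth <- c(1,   1,   0,   1,   0,   0,   0,   0,   0,   0)

ord  <- order(score, decreasing = TRUE)          # rank cases from highest to lowest score
hits <- cumsum(truth[ord])                       # positives captured among the top k cases
k    <- seq_along(score)

rpp  <- k / length(score)                        # rate of positive predictions
lift <- (hits / k) / mean(truth)                 # precision among the top k, relative to the base rate

data.frame(rpp, lift = round(lift, 2))
```

When every case is predicted positive (rpp = 1), the precision equals the base rate, so the lift is 1.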

Support Vector Machine with Radial Basis Kernel using Package e1071