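Advantages of Support Vector Machines

They overcome a limitation of decision trees, whose class boundaries can only be rectangular (axis-aligned) regions.
They are useful for learning complex nonlinear decision boundaries.
They make no distributional assumptions about the predictors.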
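
To make the nonlinear-boundary point concrete, here is a minimal base-R sketch of the radial basis function (RBF) kernel used later in this tutorial. It assumes the parameterization exp(-sigma * ||x - x'||^2), which is how kernlab documents its RBF kernel; the function name rbf_kernel and the toy points are made up for illustration.

```r
# RBF kernel: similarity decays with squared Euclidean distance.
# Assumed parameterization: k(x, x') = exp(-sigma * ||x - x'||^2)
rbf_kernel <- function(x1, x2, sigma) {
  exp(-sigma * sum((x1 - x2)^2))
}

a  <- c(0, 0)
b  <- c(1, 1)   # squared distance to a is 2
c2 <- c(3, 4)   # squared distance to a is 25

k_same <- rbf_kernel(a, a,  sigma = 0.02)   # identical points give 1
k_near <- rbf_kernel(a, b,  sigma = 0.02)   # exp(-0.04)
k_far  <- rbf_kernel(a, c2, sigma = 0.02)   # exp(-0.5)

print(c(same = k_same, near = k_near, far = k_far))
```

A larger sigma makes the similarity fall off faster, so the fitted boundary becomes more local; this is why sigma is tuned together with the cost C.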

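Disadvantages of Support Vector Machines

Several hyperparameters (kernel type, kernel parameters, cost) must be chosen, and the results are sensitive to them.
- Finding the best model requires evaluating many combinations of kernels and hyperparameters.
Model training is slow.
Only continuous predictors can be used.
- Categorical predictors must first be converted with dummy or one-hot encoding.
The result is a complex black-box model that is hard to interpret.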
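
The difference between dummy and one-hot encoding can be illustrated with base R's model.matrix(). This is only a sketch on a hypothetical three-level factor named edu; it is not the caret-based encoding applied to the bank data in this tutorial.

```r
# Toy categorical predictor with three levels (hypothetical data)
edu <- factor(c("1", "2", "3", "1", "2"))

# One-hot encoding: one 0/1 column per level (intercept removed)
one_hot <- model.matrix(~ edu - 1)

# Dummy encoding: the first level is the reference and gets no column
dummy <- model.matrix(~ edu)[, -1, drop = FALSE]

print(one_hot)
print(dummy)
```

One-hot encoding keeps a column for every level, so each row sums to 1; dummy encoding drops one level to avoid that redundancy.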

Practice data: records for 2,500 customers of Universal Bank (source: Data Mining for Business Intelligence, Shmueli et al. 2010), containing 13 variables in addition to a customer ID column. The target in this dataset is Personal Loan.

1. Loading the Data

pacman::p_load("data.table", 
               "tidyverse", 
               "dplyr",
               "ggplot2", "GGally",
               "caret",
               "doParallel", "parallel")                                # For parallel processing

registerDoParallel(cores=detectCores())                                 # Set the number of cores to use

UB <- fread("../Universal Bank_Main.csv")                               # Load the data

UB %>%
  as_tibble

# A tibble: 2,500 × 14
      ID   Age Experience Income `ZIP Code` Family CCAvg Education
   <int> <int>      <int>  <int>      <int>  <int> <dbl>     <int>
 1     1    25          1     49      91107      4   1.6         1
 2     2    45         19     34      90089      3   1.5         1
 3     3    39         15     11      94720      1   1           1
 4     4    35          9    100      94112      1   2.7         2
 5     5    35          8     45      91330      4   1           2
 6     6    37         13     29      92121      4   0.4         2
 7     7    53         27     72      91711      2   1.5         2
 8     8    50         24     22      93943      1   0.3         3
 9     9    35         10     81      90089      3   0.6         2
10    10    34          9    180      93023      1   8.9         3
# ℹ 2,490 more rows
# ℹ 6 more variables: Mortgage <int>, `Personal Loan` <int>,
#   `Securities Account` <int>, `CD Account` <int>, Online <int>,
#   CreditCard <int>

2. Data Preprocessing

UB <- UB %>%
  data.frame() %>%                                                      # Convert to a data frame 
  mutate(Personal.Loan = ifelse(Personal.Loan == 1, "yes", "no")) %>%   # Recode the target as character labels
  select(-1)                                                            # Drop the ID variable

# 1. Convert to Factor
fac.col <- c("Family", "Education", "Securities.Account", 
             "CD.Account", "Online", "CreditCard",
             # Target
             "Personal.Loan")

UB <- UB %>% 
  mutate(across(all_of(fac.col), as.factor))                            # Convert to factors

glimpse(UB)                                                             # Inspect the data structure

Rows: 2,500
Columns: 13
$ Age                <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6…
$ Experience         <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5…
$ Income             <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,…
$ ZIP.Code           <int> 91107, 90089, 94720, 94112, 91330, 92121,…
$ Family             <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,…
$ CCAvg              <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0…
$ Education          <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,…
$ Mortgage           <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0…
$ Personal.Loan      <fct> no, no, no, no, no, no, no, no, no, yes, …
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ CD.Account         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Online             <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,…
$ CreditCard         <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…

# 2. One-hot encode the categorical predictors
dummies <- dummyVars(formula = ~ .,                                     # formula : ~ predictors / "." : all variables in data
                     data = UB[,-9],                                    # Dataset with predictors only -> target excluded
                     fullRank = FALSE)                                  # fullRank = TRUE : dummy variables, fullRank = FALSE : one-hot encoding

UB.Var   <- predict(dummies, newdata = UB) %>%                          # Apply one-hot encoding to the categorical predictors
  data.frame()                                                          # Convert to a data frame 

glimpse(UB.Var)                                                         # Inspect the data structure

Rows: 2,500
Columns: 21
$ Age                  <dbl> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34,…
$ Experience           <dbl> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39,…
$ Income               <dbl> 49, 34, 11, 100, 45, 29, 72, 22, 81, 18…
$ ZIP.Code             <dbl> 91107, 90089, 94720, 94112, 91330, 9212…
$ Family.1             <dbl> 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, …
$ Family.2             <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, …
$ Family.3             <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, …
$ Family.4             <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, …
$ CCAvg                <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3,…
$ Education.1          <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Education.2          <dbl> 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, …
$ Education.3          <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, …
$ Mortgage             <dbl> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0,…
$ Securities.Account.0 <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, …
$ Securities.Account.1 <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ CD.Account.0         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ CD.Account.1         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Online.0             <dbl> 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, …
$ Online.1             <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, …
$ CreditCard.0         <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, …
$ CreditCard.1         <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, …

# 3. Combine the target with the encoded predictors
UB.df <- data.frame(Personal.Loan = UB$Personal.Loan, 
                    UB.Var)

UB.df %>%
  as_tibble

# A tibble: 2,500 × 22
   Personal.Loan   Age Experience Income ZIP.Code Family.1 Family.2
   <fct>         <dbl>      <dbl>  <dbl>    <dbl>    <dbl>    <dbl>
 1 no               25          1     49    91107        0        0
 2 no               45         19     34    90089        0        0
 3 no               39         15     11    94720        1        0
 4 no               35          9    100    94112        1        0
 5 no               35          8     45    91330        0        0
 6 no               37         13     29    92121        0        0
 7 no               53         27     72    91711        0        1
 8 no               50         24     22    93943        1        0
 9 no               35         10     81    90089        0        0
10 yes              34          9    180    93023        1        0
# ℹ 2,490 more rows
# ℹ 15 more variables: Family.3 <dbl>, Family.4 <dbl>, CCAvg <dbl>,
#   Education.1 <dbl>, Education.2 <dbl>, Education.3 <dbl>,
#   Mortgage <dbl>, Securities.Account.0 <dbl>,
#   Securities.Account.1 <dbl>, CD.Account.0 <dbl>,
#   CD.Account.1 <dbl>, Online.0 <dbl>, Online.1 <dbl>,
#   CreditCard.0 <dbl>, CreditCard.1 <dbl>

glimpse(UB.df)                                                          # Inspect the data structure

Rows: 2,500
Columns: 22
$ Personal.Loan        <fct> no, no, no, no, no, no, no, no, no, yes…
$ Age                  <dbl> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34,…
$ Experience           <dbl> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39,…
$ Income               <dbl> 49, 34, 11, 100, 45, 29, 72, 22, 81, 18…
$ ZIP.Code             <dbl> 91107, 90089, 94720, 94112, 91330, 9212…
$ Family.1             <dbl> 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, …
$ Family.2             <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, …
$ Family.3             <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, …
$ Family.4             <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, …
$ CCAvg                <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3,…
$ Education.1          <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Education.2          <dbl> 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, …
$ Education.3          <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, …
$ Mortgage             <dbl> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0,…
$ Securities.Account.0 <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, …
$ Securities.Account.1 <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ CD.Account.0         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ CD.Account.1         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Online.0             <dbl> 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, …
$ Online.1             <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, …
$ CreditCard.0         <dbl> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, …
$ CreditCard.1         <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, …

3. Exploratory Data Analysis

ggpairs(UB,                                               # UB from step 2-1
        columns = c("Age", "Experience", "Income",        # Numeric predictors
                    "ZIP.Code", "CCAvg", "Mortgage"),                            
        aes(colour = Personal.Loan)) +                    # Color by target class
  theme_bw()

ggpairs(UB,                                               # UB from step 2-1
        columns = c("Age", "Experience", "Income",        # Numeric predictors
                    "ZIP.Code", "CCAvg", "Mortgage"), 
        aes(colour = Personal.Loan)) +                    # Color by target class
  scale_colour_manual(values = c("#00798c", "#d1495b")) + # Set specific colors
  scale_fill_manual(values = c("#00798c", "#d1495b")) +   # Set specific colors
  theme_bw()

ggpairs(UB,                                               # UB from step 2-1
        columns = c("Age", "Income",                      # Numeric predictors
                    "Family", "Education"),               # Categorical predictors
        aes(colour = Personal.Loan, alpha = 0.8)) +       # Color by target class
  scale_colour_manual(values = c("#E69F00", "#56B4E9")) + # Set specific colors
  scale_fill_manual(values = c("#E69F00", "#56B4E9")) +   # Set specific colors
  theme_bw()

4. Data Partitioning

# Partition (Training Dataset : Test Dataset = 7:3)
y      <- UB.df$Personal.Loan                            # Target
 
set.seed(200)
ind    <- createDataPartition(y, p = 0.7, list = TRUE)   # Indices for a 7:3 split
UB.trd <- UB.df[ind$Resample1,]                          # Training Dataset
UB.ted <- UB.df[-ind$Resample1,]                         # Test Dataset
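
Because only about 10% of customers accepted a loan, it matters that both parts of the split keep a similar class ratio. The base-R sketch below shows the idea of a stratified 7:3 split on hypothetical labels (y_toy); it is an illustration of the principle, not caret's exact sampling algorithm.

```r
# Stratified 70/30 split in base R (a sketch, not caret's exact algorithm)
set.seed(200)
y_toy <- factor(rep(c("no", "yes"), times = c(900, 100)))  # hypothetical 10% positives

# Sample 70% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(y_toy), y_toy), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_y <- y_toy[train_idx]
test_y  <- y_toy[-train_idx]

print(prop.table(table(train_y)))   # class shares in the training part
print(prop.table(table(test_y)))    # class shares in the test part
```

Sampling within each class keeps the share of "yes" cases the same in both parts, which a purely random split only achieves approximately.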

5. Model Training

The package "caret" provides a unified API that makes it very practical to run machine learning in R. To search for the best combination of hyperparameters, "caret" offers grid search, random search, and a user-defined search range. Here, a grid search was run first to find a good combination of the hyperparameters sigma and C (cost), and a custom search range was then set based on that result. The output of the grid search is shown below.

fitControl <- trainControl(method = "cv", number = 5,  # 5-Fold Cross Validation (5-Fold CV)
                           allowParallel = TRUE,       # Parallel processing
                           classProbs = TRUE)          # Needed to produce predicted probabilities

set.seed(100)                                          # For CV
svm.rd.fit <- train(Personal.Loan ~ ., data = UB.trd, 
                    trControl = fitControl,
                    method = "svmRadial",
                    preProc = c("center", "scale"))    # Standardize the predictors

svm.rd.fit

Support Vector Machines with Radial Basis Function Kernel 

1751 samples
  21 predictor
   2 classes: 'no', 'yes' 

Pre-processing: centered (21), scaled (21) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  C     Accuracy   Kappa    
  0.25  0.9640130  0.7993425
  0.50  0.9708685  0.8335374
  1.00  0.9754400  0.8591523

Tuning parameter 'sigma' was held constant at a value of 0.03323139
Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were sigma = 0.03323139 and C = 1.

plot(svm.rd.fit)                                       # plot

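Caution! With the option method = "svmRadial", the function train() fits the Support Vector Machine using the function ksvm() from the package "kernlab". For the radial basis kernel, sigma is not searched over a grid by default: a single value is estimated from the training data and then held constant, as the output above states.
Result! The output shows the accuracy for three hyperparameter combinations (sigma, C), built from the three C values in the default grid (0.25, 0.5, 1) and a single sigma value. Accuracy is highest at (sigma = 0.03323139, C = 1). Since the best C lies at the upper edge of the grid, values near (sigma = 0.03323139, C = 1) can be set as a new search range and the training repeated.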
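
The three C values in the output above follow a simple pattern. As far as can be told from caret's model definition for "svmRadial", the default grid uses C = 2^((1:len) - 3) with len equal to tuneLength (3 by default); treat this formula as an assumption and check getModelInfo("svmRadial") in the installed version. The sketch reproduces the values in base R.

```r
# Assumed form of caret's default C grid for "svmRadial":
# C = 2^((1:len) - 3), where len is tuneLength (default 3)
tune_length <- 3
default_C <- 2^(seq_len(tune_length) - 3)
print(default_C)

# The same formula with a longer grid, e.g. tuneLength = 6
longer_C <- 2^(seq_len(6) - 3)
print(longer_C)
```

Under this assumption, raising tuneLength in train() is a quick way to extend the search to larger C values before writing a custom grid.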

customGrid <- expand.grid(sigma = seq(0.02, 0.04, 0.01), # Search range for sigma
                          C = 1:3)                       # Search range for C

set.seed(100)                                            # For CV
svm.rd.grid.fit <- train(Personal.Loan ~ ., data = UB.trd, 
                         trControl = fitControl,
                         tuneGrid = customGrid,
                         method = "svmRadial",
                         preProc = c("center", "scale")) # Standardize the predictors
svm.rd.grid.fit

Support Vector Machines with Radial Basis Function Kernel 

1751 samples
  21 predictor
   2 classes: 'no', 'yes' 

Pre-processing: centered (21), scaled (21) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401 
Resampling results across tuning parameters:

  sigma  C  Accuracy   Kappa    
  0.02   1  0.9725828  0.8417210
  0.02   2  0.9760147  0.8617339
  0.02   3  0.9765861  0.8635022
  0.03   1  0.9731559  0.8461398
  0.03   2  0.9743020  0.8519862
  0.03   3  0.9748751  0.8550152
  0.04   1  0.9737257  0.8482358
  0.04   2  0.9737322  0.8497276
  0.04   3  0.9743036  0.8532982

Accuracy was used to select the optimal model using the
 largest value.
The final values used for the model were sigma = 0.02 and C = 3.

plot(svm.rd.grid.fit)                                    # Plot

svm.rd.grid.fit$bestTune                                 # Best hyperparameter combination

  sigma C
3  0.02 3

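Result! Accuracy is highest at (sigma = 0.02, C = 3), so the model with (sigma = 0.02, C = 3) is selected as the final trained model. Note that the differences between combinations are small, and the selected values again lie at the edge of the search range.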
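
The selection rule ("largest value" of Accuracy) is easy to reproduce by hand. The sketch below copies the cross-validated accuracies from the output above into a data frame and picks the best row; the object names results and best are illustrative only.

```r
# Cross-validated accuracies copied from the tuning output above.
# expand.grid varies its first argument fastest, matching the table order.
results <- expand.grid(C = 1:3, sigma = c(0.02, 0.03, 0.04))
results$Accuracy <- c(0.9725828, 0.9760147, 0.9765861,
                      0.9731559, 0.9743020, 0.9748751,
                      0.9737257, 0.9737322, 0.9743036)

# Pick the row with the largest accuracy
best <- results[which.max(results$Accuracy), c("sigma", "C")]
print(best)

# Gap between the best and worst combination in this grid
print(diff(range(results$Accuracy)))
```

The gap between the best and worst combination is well under half a percentage point, so the choice within this grid has limited practical impact.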

6. Model Evaluation

Caution! Model evaluation requires predicted classes and predicted probabilities for the test dataset, which are generated with the function predict().

# Generate predicted classes
svm.rd.pred <- predict(svm.rd.grid.fit,                                        
                       newdata = UB.ted[,-1])            # Test dataset with predictors only     

svm.rd.pred %>%
  as_tibble

# A tibble: 749 × 1
   value
   <fct>
 1 no   
 2 no   
 3 no   
 4 no   
 5 no   
 6 no   
 7 no   
 8 no   
 9 no   
10 no   
# ℹ 739 more rows

6-1. Confusion Matrix

CM   <- caret::confusionMatrix(svm.rd.pred, UB.ted$Personal.Loan, 
                               positive = "yes")         # confusionMatrix(predicted class, actual class, positive = "class of interest")
CM

Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  669  14
       yes   4  62
                                          
               Accuracy : 0.976           
                 95% CI : (0.9623, 0.9857)
    No Information Rate : 0.8985          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.86            
                                          
 Mcnemar's Test P-Value : 0.03389         
                                          
            Sensitivity : 0.81579         
            Specificity : 0.99406         
         Pos Pred Value : 0.93939         
         Neg Pred Value : 0.97950         
             Prevalence : 0.10147         
         Detection Rate : 0.08278         
   Detection Prevalence : 0.08812         
      Balanced Accuracy : 0.90492         
                                          
       'Positive' Class : yes
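
The statistics in this output can be recomputed directly from the four cell counts, which helps when reading them. The following base-R sketch uses the counts printed above with "yes" as the positive class; the variable names are illustrative.

```r
# Counts copied from the confusion matrix above (positive class = "yes")
TP <- 62;  FN <- 14   # actual "yes": predicted yes / predicted no
TN <- 669; FP <- 4    # actual "no":  predicted no  / predicted yes
n  <- TP + FN + TN + FP

accuracy    <- (TP + TN) / n
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
ppv         <- TP / (TP + FP)
npv         <- TN / (TN + FN)
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
p_chance  <- ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / n^2
kappa_hat <- (accuracy - p_chance) / (1 - p_chance)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              balanced = balanced, kappa = kappa_hat), 5))
```

Although accuracy is about 0.976, sensitivity is about 0.816: 14 of the 76 customers who actually took a loan are missed, which the high overall accuracy hides because the "no" class dominates.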

6-2. ROC Curve

# Generate predicted probabilities 
test.svm.prob <- predict(svm.rd.grid.fit, 
                         newdata = UB.ted[,-1],          # Test dataset with predictors only     
                         type = "prob")                  # Return predicted probabilities 

test.svm.prob %>%
  as_tibble

# A tibble: 749 × 2
      no       yes
   <dbl>     <dbl>
 1 0.998 0.00221  
 2 1.00  0.0000217
 3 1.00  0.0000142
 4 1.00  0.0000868
 5 1.00  0.0000108
 6 1.00  0.000139 
 7 1.00  0.000108 
 8 0.995 0.00496  
 9 0.930 0.0702   
10 0.999 0.000530 
# ℹ 739 more rows

test.svm.prob <- test.svm.prob[,2]                       # Predicted probability of "Personal.Loan = yes"

ac  <- UB.ted$Personal.Loan                              # Actual classes in the test dataset 
pp  <- as.numeric(test.svm.prob)                         # Convert the predicted probabilities to numeric

1) Package “pROC”

pacman::p_load("pROC")

svm.roc  <- roc(ac, pp, plot = TRUE, col = "gray")       # roc(actual class, predicted probability)
auc      <- round(auc(svm.roc), 3)
legend("bottomright", legend = auc, bty = "n")
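
For intuition about what the AUC measures, it can be computed without any package as the share of (positive, negative) pairs in which the positive case receives the higher score. The sketch below uses the rank-based (Mann-Whitney) form on hypothetical labels and scores; auc_rank is an illustrative helper, not a pROC function.

```r
# AUC as the probability that a random positive outranks a random negative
# (Mann-Whitney form); toy labels and scores are hypothetical.
auc_rank <- function(actual, score, positive = "yes") {
  is_pos <- actual == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r      <- rank(score)                        # average ranks handle ties
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

actual_toy <- factor(c("no", "no", "yes", "yes"))
score_toy  <- c(0.10, 0.40, 0.35, 0.80)

auc_toy <- auc_rank(actual_toy, score_toy)
print(auc_toy)
```

In the toy data, three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.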

Caution! ROC curves drawn with the package "pROC" can be customized using various functions.

# Using the function plot.roc()
plot.roc(svm.roc,   
         col="gray",                                     # Line color
         print.auc = TRUE,                               # Whether to print the AUC
         print.auc.col = "red",                          # AUC text color
         print.thres = TRUE,                             # Whether to print the cutoff value
         print.thres.pch = 19,                           # Point symbol marking the cutoff value
         print.thres.col = "red",                        # Color of the cutoff marker
         auc.polygon = TRUE,                             # Whether to shade the area under the curve
         auc.polygon.col = "gray90")                     # Fill color of the area under the curve
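
The cutoff marked by print.thres = TRUE is, by default in pROC, the "best" threshold. Assuming the default criterion is Youden's J (sensitivity + specificity - 1), the idea can be sketched in base R as below. The helper youden_best and the toy data are hypothetical, and the sketch uses the observed scores as candidate cutoffs, so the reported value can differ slightly from the one pROC prints.

```r
# Youden's J = sensitivity + specificity - 1, evaluated at each candidate cutoff
youden_best <- function(actual, score, positive = "yes") {
  cutoffs <- sort(unique(score))
  j <- sapply(cutoffs, function(cut) {
    pred_pos <- score >= cut                   # positive at or above the cutoff
    sens <- sum(pred_pos & actual == positive) / sum(actual == positive)
    spec <- sum(!pred_pos & actual != positive) / sum(actual != positive)
    sens + spec - 1
  })
  data.frame(threshold = cutoffs[which.max(j)], J = max(j))
}

# Hypothetical labels and scores
actual_toy <- c("no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes")
score_toy  <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)

best_cut <- youden_best(actual_toy, score_toy)
print(best_cut)
```

A cutoff chosen this way weights sensitivity and specificity equally, which may not match the business cost of missing a likely borrower.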

# Using the function ggroc()
ggroc(svm.roc) +
  annotate(geom = "text", x = 0.9, y = 1.0,
           label = paste("AUC = ", auc),
           size = 5,
           color = "red") +
  theme_bw()

2) Package “Epi”

pacman::p_load("Epi")       
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")

ROC(pp, ac, plot = "ROC")                                # ROC(predicted probability, actual class)

3) Package “ROCR”

pacman::p_load("ROCR")

svm.pred <- prediction(pp, ac)                           # prediction(predicted probability, actual class)    

svm.perf <- performance(svm.pred, "tpr", "fpr")          # performance(, "sensitivity", "1 - specificity")                      
plot(svm.perf, col = "gray")                             # ROC Curve

perf.auc   <- performance(svm.pred, "auc")               # AUC
auc        <- round(perf.auc@y.values[[1]], 3)           # Extract the AUC as a rounded number
legend("bottomright", legend = auc, bty = "n")

6-3. Lift Chart

1) Package “ROCR”

svm.perf <- performance(svm.pred, "lift", "rpp")         # Lift Chart
plot(svm.perf, main = "lift curve", 
     colorize = TRUE,                                    # Coloring according to cutoff
     lwd = 2)
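
To read the lift chart, it helps to compute a single point by hand. In the sketch below, lift is the share of actual positives among the k highest-scored cases divided by the overall share of positives, and rpp is the fraction of cases predicted positive (k / n). The helper lift_at and the toy data are hypothetical.

```r
# Lift at a given rate of positive predictions (rpp):
# precision among the top-scored cases divided by the overall positive rate.
lift_at <- function(actual, score, k, positive = "yes") {
  top <- order(score, decreasing = TRUE)[seq_len(k)]   # k highest-scored cases
  precision_top <- mean(actual[top] == positive)
  prevalence    <- mean(actual == positive)
  c(rpp = k / length(score), lift = precision_top / prevalence)
}

# Hypothetical labels and scores
actual_toy <- c("no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes")
score_toy  <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)

lift_top3 <- lift_at(actual_toy, score_toy, k = 3)
lift_all  <- lift_at(actual_toy, score_toy, k = 9)
print(rbind(top3 = lift_top3, all = lift_all))
```

Targeting only the top third of the toy cases finds positives at 1.8 times the base rate, while targeting everyone gives a lift of 1, which is why lift curves always end at 1.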

Support Vector Machine with Radial Basis Kernel using Package caret