R code using Caret Package for Support Vector Machine
The "caret" package gathers a wide range of machine-learning methods into a single interface, and trainControl can be used to guard against overfitting. "caret" can fit support vector machines with a variety of kernels and methods; here the example data are analyzed with the most widely used kernels: Linear, Polynomial, and Radial Basis. Details of the available models can be found here.
The example dataset is "Universal Bank_Main", data on customers of Universal Bank (source: Data Mining for Business Intelligence, Shmueli et al. 2010). It contains 2,500 observations and 13 variables, and the target is Personal.Loan.
pacman::p_load("data.table", "dplyr")
UB <- fread(paste(getwd(), "Universal Bank_Main.csv", sep="/")) %>% # Load the data
  data.frame() %>% # Convert to a data frame
  mutate(Personal.Loan = ifelse(Personal.Loan==1, "yes", "no")) %>% # Character for classification
  select(-1) # Remove the ID variable
cols <- c("Family", "Education", "Personal.Loan", "Securities.Account",
          "CD.Account", "Online", "CreditCard")
UB <- UB %>%
  mutate_at(cols, as.factor) # Convert the categorical variables to factors
glimpse(UB) # Data structure
Rows: 2,500
Columns: 13
$ Age <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6~
$ Experience <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5~
$ Income <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,~
$ ZIP.Code <int> 91107, 90089, 94720, 94112, 91330, 92121,~
$ Family <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,~
$ CCAvg <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0~
$ Education <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,~
$ Mortgage <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0~
$ Personal.Loan <fct> no, no, no, no, no, no, no, no, no, yes, ~
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
$ CD.Account <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Online <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,~
$ CreditCard <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,~
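The modeling code below uses a training set UB.trd and a test set UB.ted, which are not created in this section. A minimal sketch of how such a split could be made, assuming a stratified 70/30 partition (the training set later reported by caret has 1,751 of the 2,500 observations); the seed used here is hypothetical:
pacman::p_load("caret")
set.seed(200) # hypothetical seed; the seed of the original split is not shown
ind <- createDataPartition(UB$Personal.Loan, p = 0.7, list = FALSE) # stratified 70/30 split on the target
UB.trd <- UB[ind, ]  # Training Data
UB.ted <- UB[-ind, ] # Test Data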
To find the optimal parameters for a support vector machine with the Linear kernel, a "Random Search" is performed first.
fitControl <- trainControl(method="cv", number=5,
                           search = "random", classProbs = TRUE) # 5-Fold Cross Validation
# classProbs = TRUE is required to output predicted probabilities
set.seed(100) # Fix the seed for cross validation
caret.rd.li <- train(Personal.Loan~., # Tune parameter : Cost
                     data = UB.trd,
                     method = "svmLinear", # svmLinear : ksvm / svmLinear2 : svm
                     trControl = fitControl,
                     tuneLength = 10) # tuneLength : number of candidate parameter values to search
caret.rd.li
Support Vector Machines with Linear Kernel
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
C Accuracy Kappa
0.05616236 0.9583036 0.7496066
0.18351002 0.9583053 0.7452690
1.46897318 0.9600179 0.7554191
4.07906715 0.9600195 0.7520339
4.77851063 0.9605877 0.7570026
9.17926682 0.9588767 0.7462176
9.74617752 0.9583036 0.7447797
20.74867286 0.9594465 0.7504374
145.61743228 0.9605877 0.7583088
300.76289267 0.9594449 0.7499907
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was C = 4.778511.
plot(caret.rd.li) # Accuracy
Accuracy is highest at C = 4.778511. Using C = 4.778511 as a reference, a Grid Search over nearby candidate values is performed to find the optimal parameter.
fitControl <- trainControl(method="cv", number=5, classProbs = TRUE) # 5-Fold Cross Validation
customGrid <- expand.grid(C = seq(4.68, 4.88, by=0.01)) # Search around the best parameter from the Random Search
set.seed(100) # Fix the seed for cross validation
caret.gd.li <- train(Personal.Loan~.,
data=UB.trd,
method="svmLinear",
trControl = fitControl,
tuneGrid = customGrid)
caret.gd.li
Support Vector Machines with Linear Kernel
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
C Accuracy Kappa
4.68 0.9611591 0.7598194
4.69 0.9600179 0.7535061
4.70 0.9588734 0.7473399
4.71 0.9605893 0.7559827
4.72 0.9588751 0.7480268
4.73 0.9605893 0.7562346
4.74 0.9605877 0.7570026
4.75 0.9594481 0.7505010
4.76 0.9588751 0.7480323
4.77 0.9605893 0.7559754
4.78 0.9588734 0.7467413
4.79 0.9605893 0.7562346
4.80 0.9605893 0.7583547
4.81 0.9600195 0.7558497
4.82 0.9600195 0.7520339
4.83 0.9600195 0.7532591
4.84 0.9588767 0.7449410
4.85 0.9588751 0.7469430
4.86 0.9583053 0.7451216
4.87 0.9600179 0.7516993
4.88 0.9588751 0.7467558
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was C = 4.68.
plot(caret.gd.li) # Accuracy
Accuracy is highest at C = 4.68, an increase of about 0.001 over C = 4.778511.
# Final model
caret.gd.li$finalModel
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 4.68
Linear (vanilla) kernel function.
Number of Support Vectors : 185
Objective Function Value : -821.0561
Training error : 0.035408
Probability model included.
# Predict the classes of the Test Data with the fitted model
caret.gd.li.pred <- predict(caret.gd.li, newdata=UB.ted) # predict(SVM model, Test Data)
confusionMatrix(caret.gd.li.pred, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics
Reference
Prediction no yes
no 667 29
yes 6 47
Accuracy : 0.9533
95% CI : (0.9356, 0.9672)
No Information Rate : 0.8985
P-Value [Acc > NIR] : 3.36e-08
Kappa : 0.704
Mcnemar's Test P-Value : 0.0002003
Sensitivity : 0.61842
Specificity : 0.99108
Pos Pred Value : 0.88679
Neg Pred Value : 0.95833
Prevalence : 0.10147
Detection Rate : 0.06275
Detection Prevalence : 0.07076
Balanced Accuracy : 0.80475
'Positive' Class : yes
pacman::p_load("pROC")
test.li.prob <- predict(caret.gd.li, newdata = UB.ted, type="prob") # Predicted probability of each class of the Test Data, from the model fitted on the Training Data
test.li.prob <- test.li.prob[,2] # Predicted probability of "yes"
ac <- UB.ted$Personal.Loan # Actual class
pp <- as.numeric(test.li.prob) # Predicted probability of "yes"
tree.roc <- roc(ac, pp, plot = T, col = "red") # roc(actual class, predicted probability)
auc <- round(auc(tree.roc),3)
legend("bottomright",legend=auc, bty="n")
detach(package:pROC)
pacman::p_load("devtools", "Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(pp, ac, plot="ROC") # ROC(predicted probability, actual class) / can also estimate the optimal cutoff value
detach(package:Epi)
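Epi's ROC() also returns its results invisibly, so the curve coordinates can be inspected directly. A minimal sketch of recovering the plotted cutoff programmatically, assuming the pp and ac objects from above and taking maximization of sensitivity + specificity as the criterion; epi.roc and opt are hypothetical names:
pacman::p_load("Epi")
epi.roc <- ROC(pp, ac, plot="ROC") # same call as above, result stored this time
opt <- which.max(epi.roc$res$sens + epi.roc$res$spec) # row maximizing sensitivity + specificity
epi.roc$res$lr.eta[opt] # estimated optimal cutoff on the predicted-probability scale
detach(package:Epi)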
pacman::p_load("ROCR")
li.pred <- prediction(test.li.prob, UB.ted$Personal.Loan) # prediction(predicted probability, actual class)
li.perf <- performance(li.pred, "tpr", "fpr") # performance(prediction object, "sensitivity", "1-specificity")
plot(li.perf, col="red") # ROC Curve
abline(0,1, col="black")
perf.auc <- performance(li.pred, "auc") # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright", legend=auc, bty="n")
li.perf <- performance(li.pred, "lift", "rpp") # Lift Chart
plot(li.perf, colorize=T, lwd=2)
detach(package:ROCR)
pacman::p_load("lift")
ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes", 1, 0) # Convert the actual class to numeric
plotLift(test.li.prob, ac.numeric, cumulative = T, n.buckets = 24) # plotLift(predicted probability, actual class)
TopDecileLift(test.li.prob, ac.numeric) # Lift for the top 10%
[1] 7.359
detach(package:lift)
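The top-decile lift reported by TopDecileLift() can be sanity-checked by hand: it is the "yes" rate among the 10% of test cases with the highest predicted probabilities, divided by the overall "yes" rate. A minimal sketch reusing test.li.prob and ac.numeric (n.top and top10 are hypothetical names; small differences from 7.359 can arise from rounding and tie handling):
n.top <- ceiling(0.1 * length(ac.numeric)) # size of the top decile
top10 <- ac.numeric[order(test.li.prob, decreasing = TRUE)][1:n.top] # actual classes in the top decile
mean(top10) / mean(ac.numeric) # hand-computed top-decile lift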
To find the optimal parameters for a support vector machine with the Polynomial kernel, a "Random Search" is again performed first.
fitControl <- trainControl(method="cv", number=5,
                           search = "random", classProbs = TRUE) # 5-Fold Cross Validation
# classProbs = TRUE is required to output predicted probabilities
set.seed(100) # Fix the seed for cross validation
caret.rd.pl <- train(Personal.Loan~.,
                     data = UB.trd,
                     method ="svmPoly", # Tune parameters : Degree, Scale, Cost
                     trControl = fitControl,
                     tuneLength = 10) # tuneLength : number of candidate parameter values to search
caret.rd.pl
Support Vector Machines with Polynomial Kernel
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
degree scale C Accuracy Kappa
1 8.046154e-04 9.4247259 0.9525910 0.7201751
2 8.104079e-05 324.4389194 0.9583036 0.7496066
2 1.215221e-04 93.9957842 0.9560179 0.7369367
2 6.923903e-03 5.0092337 0.9611608 0.7671047
2 7.141716e-03 1.1731437 0.9560212 0.7395297
2 4.563003e-02 0.5609159 0.9737273 0.8491704
2 9.339513e-02 635.7719748 0.9640228 0.7886702
3 7.856877e-04 300.1008960 0.9634449 0.7804680
3 1.686438e-03 43.0849494 0.9651591 0.7900019
3 5.861281e-02 486.9318718 0.9617387 0.7681053
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were degree = 2, scale
= 0.04563003 and C = 0.5609159.
plot(caret.rd.pl) # Accuracy
Accuracy is highest at degree = 2, scale = 0.04563003, and C = 0.5609159. Using these values as a reference, a Grid Search over nearby candidate values is performed to find the optimal parameters.
fitControl <- trainControl(method="cv", number=5, classProbs = TRUE) # 5-Fold Cross Validation
customGrid <- expand.grid(degree = 1:3,
                          scale = seq(0.044, 0.048, by=0.001),
                          C = seq(0.54, 0.58, by=0.01)) # Search around the best parameters from the Random Search
set.seed(100) # Fix the seed for cross validation
caret.gd.pl <- train(Personal.Loan~.,
data=UB.trd,
method="svmPoly",
trControl = fitControl,
tuneGrid = customGrid)
caret.gd.pl
Support Vector Machines with Polynomial Kernel
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
degree scale C Accuracy Kappa
1 0.044 0.54 0.9548767 0.7294403
1 0.044 0.55 0.9543053 0.7253132
1 0.044 0.56 0.9571591 0.7411896
1 0.044 0.57 0.9548767 0.7309967
1 0.044 0.58 0.9548767 0.7295914
1 0.045 0.54 0.9543053 0.7253132
1 0.045 0.55 0.9548767 0.7280351
1 0.045 0.56 0.9543053 0.7255243
1 0.045 0.57 0.9554481 0.7337949
1 0.045 0.58 0.9548767 0.7294403
1 0.046 0.54 0.9543053 0.7255243
1 0.046 0.55 0.9554481 0.7337949
1 0.046 0.56 0.9554481 0.7306133
1 0.046 0.57 0.9565910 0.7374654
1 0.046 0.58 0.9554481 0.7306133
1 0.047 0.54 0.9548767 0.7280351
1 0.047 0.55 0.9554481 0.7322385
1 0.047 0.56 0.9543053 0.7269296
1 0.047 0.57 0.9560179 0.7347746
1 0.047 0.58 0.9560179 0.7377363
1 0.048 0.54 0.9560179 0.7377363
1 0.048 0.55 0.9554481 0.7322385
1 0.048 0.56 0.9548767 0.7280351
1 0.048 0.57 0.9560195 0.7348168
1 0.048 0.58 0.9565893 0.7397349
2 0.044 0.54 0.9720130 0.8381218
2 0.044 0.55 0.9714416 0.8336485
2 0.044 0.56 0.9714416 0.8335522
2 0.044 0.57 0.9725845 0.8394412
2 0.044 0.58 0.9725845 0.8411238
2 0.045 0.54 0.9714416 0.8347900
2 0.045 0.55 0.9731559 0.8435490
2 0.045 0.56 0.9720147 0.8388161
2 0.045 0.57 0.9720130 0.8374021
2 0.045 0.58 0.9725845 0.8413149
2 0.046 0.54 0.9725861 0.8423059
2 0.046 0.55 0.9725861 0.8412753
2 0.046 0.56 0.9725861 0.8412753
2 0.046 0.57 0.9737273 0.8478647
2 0.046 0.58 0.9731559 0.8448464
2 0.047 0.54 0.9720147 0.8375276
2 0.047 0.55 0.9720147 0.8375276
2 0.047 0.56 0.9725845 0.8413149
2 0.047 0.57 0.9725845 0.8401044
2 0.047 0.58 0.9720130 0.8373726
2 0.048 0.54 0.9731575 0.8450015
2 0.048 0.55 0.9737273 0.8491704
2 0.048 0.56 0.9731575 0.8450015
2 0.048 0.57 0.9725845 0.8402843
2 0.048 0.58 0.9720147 0.8385582
3 0.044 0.54 0.9697338 0.8250635
3 0.044 0.55 0.9708767 0.8314637
3 0.044 0.56 0.9708767 0.8315284
3 0.044 0.57 0.9720179 0.8380908
3 0.044 0.58 0.9731608 0.8447155
3 0.045 0.54 0.9714481 0.8354477
3 0.045 0.55 0.9725893 0.8410166
3 0.045 0.56 0.9725910 0.8412065
3 0.045 0.57 0.9725910 0.8412188
3 0.045 0.58 0.9731608 0.8434497
3 0.046 0.54 0.9725893 0.8413628
3 0.046 0.55 0.9714465 0.8341349
3 0.046 0.56 0.9720179 0.8364597
3 0.046 0.57 0.9720195 0.8378405
3 0.046 0.58 0.9731624 0.8441446
3 0.047 0.54 0.9703053 0.8268941
3 0.047 0.55 0.9714465 0.8352712
3 0.047 0.56 0.9720195 0.8383592
3 0.047 0.57 0.9714465 0.8335216
3 0.047 0.58 0.9714481 0.8337327
3 0.048 0.54 0.9708767 0.8308504
3 0.048 0.55 0.9720195 0.8381266
3 0.048 0.56 0.9720179 0.8369867
3 0.048 0.57 0.9708767 0.8308504
3 0.048 0.58 0.9720179 0.8361489
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were degree = 2, scale =
0.048 and C = 0.55.
plot(caret.gd.pl) # Accuracy
Accuracy is highest at degree = 2, scale = 0.048, and C = 0.55, and is identical to the Random Search result.
# Final model
caret.gd.pl$finalModel
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 0.55
Polynomial kernel function.
Hyperparameters : degree = 2 scale = 0.048 offset = 1
Number of Support Vectors : 211
Objective Function Value : -80.5831
Training error : 0.022273
Probability model included.
# Predict the classes of the Test Data with the fitted model
caret.gd.pl.pred <- predict(caret.gd.pl, newdata=UB.ted) # predict(SVM model, Test Data)
confusionMatrix(caret.gd.pl.pred, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics
Reference
Prediction no yes
no 668 17
yes 5 59
Accuracy : 0.9706
95% CI : (0.9559, 0.9815)
No Information Rate : 0.8985
P-Value [Acc > NIR] : 3.493e-14
Kappa : 0.8268
Mcnemar's Test P-Value : 0.01902
Sensitivity : 0.77632
Specificity : 0.99257
Pos Pred Value : 0.92188
Neg Pred Value : 0.97518
Prevalence : 0.10147
Detection Rate : 0.07877
Detection Prevalence : 0.08545
Balanced Accuracy : 0.88444
'Positive' Class : yes
pacman::p_load("pROC")
test.pl.prob <- predict(caret.gd.pl, newdata = UB.ted, type="prob") # Predicted probability of each class of the Test Data, from the model fitted on the Training Data
test.pl.prob <- test.pl.prob[,2] # Predicted probability of "yes"
ac <- UB.ted$Personal.Loan # Actual class
pp <- as.numeric(test.pl.prob) # Predicted probability of "yes"
tree.roc <- roc(ac, pp, plot = T, col = "red") # roc(actual class, predicted probability)
auc <- round(auc(tree.roc),3)
legend("bottomright",legend=auc, bty="n")
detach(package:pROC)
pacman::p_load("devtools", "Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(pp, ac, plot="ROC") # ROC(predicted probability, actual class) / can also estimate the optimal cutoff value
detach(package:Epi)
pacman::p_load("ROCR")
pl.pred <- prediction(test.pl.prob, UB.ted$Personal.Loan) # prediction(predicted probability, actual class)
pl.perf <- performance(pl.pred, "tpr", "fpr") # performance(prediction object, "sensitivity", "1-specificity")
plot(pl.perf, col="red") # ROC Curve
abline(0,1, col="black")
perf.auc <- performance(pl.pred, "auc") # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright", legend=auc, bty="n")
pl.perf <- performance(pl.pred, "lift", "rpp") # Lift Chart
plot(pl.perf, colorize=T, lwd=2)
detach(package:ROCR)
pacman::p_load("lift")
ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes", 1, 0) # Convert the actual class to numeric
plotLift(test.pl.prob, ac.numeric, cumulative = T, n.buckets = 24) # plotLift(predicted probability, actual class)
TopDecileLift(test.pl.prob, ac.numeric) # Lift for the top 10%
[1] 8.278
detach(package:lift)
To find the optimal parameters for a support vector machine with the Radial Basis kernel, a "Random Search" is again performed first.
fitControl <- trainControl(method="cv", number=5,
                           search = "random", classProbs = TRUE) # 5-Fold Cross Validation
# classProbs = TRUE is required to output predicted probabilities
set.seed(100) # Fix the seed for cross validation
caret.rd.rbf <- train(Personal.Loan~.,
                      data=UB.trd,
                      method="svmRadial", # Tune parameters : Sigma, Cost
                      trControl = fitControl,
                      tuneLength=20) # tuneLength : number of candidate parameter values to search
caret.rd.rbf
Support Vector Machines with Radial Basis Function Kernel
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
sigma C Accuracy Kappa
0.008214323 0.60401863 0.9560147 0.7451182
0.009768217 905.87377461 0.9714432 0.8335970
0.010432256 1.38905870 0.9605910 0.7673322
0.014619919 22.18681328 0.9743036 0.8517002
0.015575790 159.84518460 0.9754432 0.8595213
0.016518708 18.86122090 0.9748751 0.8541785
0.020658156 31.49843758 0.9754449 0.8596771
0.020863473 332.66065502 0.9685926 0.8160143
0.025179572 4.40210886 0.9720212 0.8387323
0.028615078 3.20586906 0.9714497 0.8357256
0.028821683 1.76570925 0.9743036 0.8532309
0.031304626 4.82668520 0.9725893 0.8404095
0.046563920 3.96452937 0.9685926 0.8192963
0.047259075 75.37896521 0.9680260 0.8158490
0.062120614 0.76206785 0.9680163 0.8161385
0.070528378 1.32445041 0.9697322 0.8263388
0.106177393 0.42395243 0.9668718 0.8185654
0.118250539 0.09815856 0.9645861 0.8103345
0.126313873 98.11634013 0.9645942 0.7950592
0.144773401 63.82696989 0.9634514 0.7914127
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were sigma = 0.02065816 and C
= 31.49844.
plot(caret.rd.rbf) # Accuracy
Accuracy is highest at sigma = 0.02065816 and C = 31.49844. Using these values as a reference, a Grid Search over nearby candidate values is performed to find the optimal parameters.
fitControl <- trainControl(method="cv", number=5, classProbs = TRUE) # 5-Fold Cross Validation
customGrid <- expand.grid(sigma = seq(0.018, 0.022, by=0.001),
                          C = seq(31.48, 31.52, by=0.01)) # Search around the best parameters from the Random Search
set.seed(100) # Fix the seed for cross validation
caret.gd.rbf <- train(Personal.Loan~.,
data=UB.trd,
method="svmRadial",
trControl = fitControl,
tuneGrid = customGrid)
caret.gd.rbf
Support Vector Machines with Radial Basis Function Kernel
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
sigma C Accuracy Kappa
0.018 31.48 0.9743020 0.8530644
0.018 31.49 0.9754449 0.8589044
0.018 31.50 0.9754449 0.8587737
0.018 31.51 0.9748734 0.8566035
0.018 31.52 0.9754449 0.8589044
0.019 31.48 0.9748734 0.8565388
0.019 31.49 0.9760163 0.8622776
0.019 31.50 0.9743020 0.8530208
0.019 31.51 0.9754449 0.8586279
0.019 31.52 0.9754449 0.8580155
0.020 31.48 0.9748734 0.8564963
0.020 31.49 0.9754449 0.8593869
0.020 31.50 0.9754449 0.8593869
0.020 31.51 0.9754449 0.8593869
0.020 31.52 0.9760163 0.8622776
0.021 31.48 0.9743020 0.8530208
0.021 31.49 0.9754449 0.8593869
0.021 31.50 0.9743020 0.8530208
0.021 31.51 0.9754449 0.8587301
0.021 31.52 0.9754449 0.8587301
0.022 31.48 0.9748734 0.8558395
0.022 31.49 0.9737306 0.8507691
0.022 31.50 0.9754449 0.8587301
0.022 31.51 0.9760163 0.8625677
0.022 31.52 0.9760163 0.8625677
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were sigma = 0.019 and C = 31.49.
plot(caret.gd.rbf) # Accuracy
Accuracy is highest at sigma = 0.019 and C = 31.49, an increase of about 0.001 over the Random Search result.
# Final model
caret.gd.rbf$finalModel
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 31.49
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.019
Number of Support Vectors : 164
Objective Function Value : -2119.071
Training error : 0.009709
Probability model included.
# Predict the classes of the Test Data with the fitted model
caret.gd.rbf.pred <- predict(caret.gd.rbf, newdata=UB.ted) # predict(SVM model, Test Data)
confusionMatrix(caret.gd.rbf.pred, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics
Reference
Prediction no yes
no 667 11
yes 6 65
Accuracy : 0.9773
95% CI : (0.9639, 0.9867)
No Information Rate : 0.8985
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8718
Mcnemar's Test P-Value : 0.332
Sensitivity : 0.85526
Specificity : 0.99108
Pos Pred Value : 0.91549
Neg Pred Value : 0.98378
Prevalence : 0.10147
Detection Rate : 0.08678
Detection Prevalence : 0.09479
Balanced Accuracy : 0.92317
'Positive' Class : yes
pacman::p_load("pROC")
test.rbf.prob <- predict(caret.gd.rbf, newdata = UB.ted, type="prob") # Predicted probability of each class of the Test Data, from the model fitted on the Training Data
test.rbf.prob <- test.rbf.prob[,2] # Predicted probability of "yes"
ac <- UB.ted$Personal.Loan # Actual class
pp <- as.numeric(test.rbf.prob) # Predicted probability of "yes"
tree.roc <- roc(ac, pp, plot = T, col = "red") # roc(actual class, predicted probability)
auc <- round(auc(tree.roc),3)
legend("bottomright",legend=auc, bty="n")
detach(package:pROC)
pacman::p_load("devtools", "Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(pp, ac, plot="ROC") # ROC(predicted probability, actual class) / can also estimate the optimal cutoff value
detach(package:Epi)
pacman::p_load("ROCR")
rbf.pred <- prediction(test.rbf.prob, UB.ted$Personal.Loan) # prediction(predicted probability, actual class)
rbf.perf <- performance(rbf.pred, "tpr", "fpr") # performance(prediction object, "sensitivity", "1-specificity")
plot(rbf.perf, col="red") # ROC Curve
abline(0,1, col="black")
perf.auc <- performance(rbf.pred, "auc") # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright", legend=auc, bty="n")
rbf.perf <- performance(rbf.pred, "lift", "rpp") # Lift Chart
plot(rbf.perf, colorize=T, lwd=2)
detach(package:ROCR)
pacman::p_load("lift")
ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes", 1, 0) # Convert the actual class to numeric
plotLift(test.rbf.prob, ac.numeric, cumulative = T, n.buckets = 24) # plotLift(predicted probability, actual class)
TopDecileLift(test.rbf.prob, ac.numeric) # Lift for the top 10%
[1] 8.804
detach(package:lift)
Finally, the three fitted models are compared on the Test Data, first by test error rate and then by ROC curve and AUC.
pacman::p_load("tidyverse")
prev.class <- data.frame(linear = caret.gd.li.pred, poly = caret.gd.pl.pred, radial = caret.gd.rbf.pred, obs = UB.ted$Personal.Loan)
prev.class %>%
  summarise_all(list(err = ~mean(obs != .))) %>% # Test error rate of each model
  select(-obs_err) %>%
  round(3)
linear_err poly_err radial_err
1 0.047 0.029 0.023
pacman::p_load("plotROC")
prev.prob <- data.frame(linear=test.li.prob, poly=test.pl.prob,radial=test.rbf.prob,obs=UB.ted$Personal.Loan)
df.roc <- prev.prob %>%
  gather(key=Method, value=score, linear, poly, radial) # score : predicted probability
ggroc <- ggplot(df.roc) +
aes(d=obs,m=score,color=Method) +
geom_roc() + # label : Cutoff Value
theme_classic()
ggroc
calc_auc(ggroc) # AUC
PANEL group AUC
1 1 1 0.9395284
2 1 2 0.9739188
3 1 3 0.9821889
Groups 1-3 correspond to the linear, poly, and radial methods; the Radial Basis kernel attains the highest AUC, in line with its lowest test error rate.