R code using Caret Package for Support Vector Machine
The "caret" package gathers a wide range of machine-learning methods into a single interface, and trainControl can be used to guard against overfitting. "caret" can fit support vector machines with a variety of kernels and methods; here the example data are analyzed with the most widely used kernels: Linear, Polynomial, and Radial Basis. Details of the available models can be found here.
The example dataset is "Universal Bank_Main", data on customers of Universal Bank (source: Data Mining for Business Intelligence, Shmueli et al. 2010). It contains 2,500 observations and 13 variables, and the target is Personal.Loan.
pacman::p_load("data.table", "dplyr")
UB <- fread(paste(getwd(), "Universal Bank_Main.csv", sep="/")) %>% # Load the data
  data.frame() %>% # Convert to a data frame
  mutate(Personal.Loan = ifelse(Personal.Loan==1, "yes", "no")) %>% # Character for classification
  select(-1) # Remove the ID variable
cols <- c("Family", "Education", "Personal.Loan", "Securities.Account",
          "CD.Account", "Online", "CreditCard")
UB <- UB %>%
  mutate_at(cols, as.factor) # Convert the categorical variables to factors
glimpse(UB) # Data structure
Rows: 2,500
Columns: 13
$ Age <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 6~
$ Experience <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5~
$ Income <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180,~
$ ZIP.Code <int> 91107, 90089, 94720, 94112, 91330, 92121,~
$ Family <fct> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4,~
$ CCAvg <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0~
$ Education <fct> 1, 1, 1, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2,~
$ Mortgage <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0~
$ Personal.Loan <fct> no, no, no, no, no, no, no, no, no, yes, ~
$ Securities.Account <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
$ CD.Account <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ Online <fct> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1,~
$ CreditCard <fct> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,~
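The modeling code below uses a training set UB.trd and a test set UB.ted, which are not created in this section. A minimal sketch of how such a split could be made, assuming a stratified 70/30 partition (the training set later reported by caret has 1,751 of the 2,500 observations); the seed used here is hypothetical:
pacman::p_load("caret")
set.seed(200) # hypothetical seed; the seed of the original split is not shown
ind <- createDataPartition(UB$Personal.Loan, p = 0.7, list = FALSE) # stratified 70/30 split on the target
UB.trd <- UB[ind, ]  # Training Data
UB.ted <- UB[-ind, ] # Test Data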
To find the optimal parameters for a support vector machine with the Linear kernel, a "Random Search" is performed first.
fitControl <- trainControl(method="cv", number=5,
                           search = "random", classProbs = TRUE) # 5-Fold Cross Validation
# classProbs = TRUE is required to output predicted probabilities
set.seed(100) # Fix the seed for cross validation
caret.rd.li <- train(Personal.Loan~., # Tune parameter : Cost
                     data = UB.trd,
                     method = "svmLinear", # svmLinear : ksvm / svmLinear2 : svm
                     trControl = fitControl,
                     tuneLength = 10) # tuneLength : number of candidate parameter values to search
caret.rd.li
Support Vector Machines with Linear Kernel
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
C Accuracy Kappa
0.05616236 0.9583036 0.7496066
0.18351002 0.9583053 0.7452690
1.46897318 0.9600179 0.7554191
4.07906715 0.9600195 0.7520339
4.77851063 0.9605877 0.7570026
9.17926682 0.9588767 0.7462176
9.74617752 0.9583036 0.7447797
20.74867286 0.9594465 0.7504374
145.61743228 0.9605877 0.7583088
300.76289267 0.9594449 0.7499907
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was C = 4.778511.
plot(caret.rd.li) # Accuracy
Accuracy is highest at C = 4.778511. Using C = 4.778511 as a reference, a Grid Search over nearby candidate values is performed to find the optimal parameter.
fitControl <- trainControl(method="cv", number=5, classProbs = TRUE) # 5-Fold Cross Validation
customGrid <- expand.grid(C = seq(4.68, 4.88, by=0.01)) # Search around the best parameter from the Random Search
set.seed(100) # Fix the seed for cross validation
caret.gd.li <- train(Personal.Loan~.,
data=UB.trd,
method="svmLinear",
trControl = fitControl,
tuneGrid = customGrid)
caret.gd.li
Support Vector Machines with Linear Kernel
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
C Accuracy Kappa
4.68 0.9611591 0.7598194
4.69 0.9600179 0.7535061
4.70 0.9588734 0.7473399
4.71 0.9605893 0.7559827
4.72 0.9588751 0.7480268
4.73 0.9605893 0.7562346
4.74 0.9605877 0.7570026
4.75 0.9594481 0.7505010
4.76 0.9588751 0.7480323
4.77 0.9605893 0.7559754
4.78 0.9588734 0.7467413
4.79 0.9605893 0.7562346
4.80 0.9605893 0.7583547
4.81 0.9600195 0.7558497
4.82 0.9600195 0.7520339
4.83 0.9600195 0.7532591
4.84 0.9588767 0.7449410
4.85 0.9588751 0.7469430
4.86 0.9583053 0.7451216
4.87 0.9600179 0.7516993
4.88 0.9588751 0.7467558
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was C = 4.68.
plot(caret.gd.li) # Accuracy
Accuracy is highest at C = 4.68, an increase of about 0.001 over C = 4.778511.
# Final model
caret.gd.li$finalModel
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 4.68
Linear (vanilla) kernel function.
Number of Support Vectors : 185
Objective Function Value : -821.0561
Training error : 0.035408
Probability model included.
# Predict the classes of the Test Data with the fitted model
caret.gd.li.pred <- predict(caret.gd.li, newdata=UB.ted) # predict(SVM model, Test Data)
confusionMatrix(caret.gd.li.pred, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics
Reference
Prediction no yes
no 667 29
yes 6 47
Accuracy : 0.9533
95% CI : (0.9356, 0.9672)
No Information Rate : 0.8985
P-Value [Acc > NIR] : 3.36e-08
Kappa : 0.704
Mcnemar's Test P-Value : 0.0002003
Sensitivity : 0.61842
Specificity : 0.99108
Pos Pred Value : 0.88679
Neg Pred Value : 0.95833
Prevalence : 0.10147
Detection Rate : 0.06275
Detection Prevalence : 0.07076
Balanced Accuracy : 0.80475
'Positive' Class : yes
pacman::p_load("pROC")
test.li.prob <- predict(caret.gd.li, newdata = UB.ted, type="prob") # Predicted probability of each class of the Test Data, from the model fitted on the Training Data
test.li.prob <- test.li.prob[,2] # Predicted probability of "yes"
ac <- UB.ted$Personal.Loan # Actual class
pp <- as.numeric(test.li.prob) # Predicted probability of "yes"
tree.roc <- roc(ac, pp, plot = T, col = "red") # roc(actual class, predicted probability)
auc <- round(auc(tree.roc),3)
legend("bottomright",legend=auc, bty="n")
detach(package:pROC)
pacman::p_load("devtools", "Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(pp, ac, plot="ROC") # ROC(predicted probability, actual class) / can also estimate the optimal cutoff value
detach(package:Epi)
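Epi's ROC() also returns its results invisibly, so the curve coordinates can be inspected directly. A minimal sketch of recovering the plotted cutoff programmatically, assuming the pp and ac objects from above and taking maximization of sensitivity + specificity as the criterion; epi.roc and opt are hypothetical names:
pacman::p_load("Epi")
epi.roc <- ROC(pp, ac, plot="ROC") # same call as above, result stored this time
opt <- which.max(epi.roc$res$sens + epi.roc$res$spec) # row maximizing sensitivity + specificity
epi.roc$res$lr.eta[opt] # estimated optimal cutoff on the predicted-probability scale
detach(package:Epi)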
pacman::p_load("ROCR")
li.pred <- prediction(test.li.prob, UB.ted$Personal.Loan) # prediction(predicted probability, actual class)
li.perf <- performance(li.pred, "tpr", "fpr") # performance(prediction object, "sensitivity", "1-specificity")
plot(li.perf, col="red") # ROC Curve
abline(0,1, col="black")
perf.auc <- performance(li.pred, "auc") # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright", legend=auc, bty="n")
li.perf <- performance(li.pred, "lift", "rpp") # Lift Chart
plot(li.perf, colorize=T, lwd=2)
detach(package:ROCR)
pacman::p_load("lift")
ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes", 1, 0) # Convert the actual class to numeric
plotLift(test.li.prob, ac.numeric, cumulative = T, n.buckets = 24) # plotLift(predicted probability, actual class)
TopDecileLift(test.li.prob, ac.numeric) # Lift for the top 10%
[1] 7.359
detach(package:lift)
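The top-decile lift reported by TopDecileLift() can be sanity-checked by hand: it is the "yes" rate among the 10% of test cases with the highest predicted probabilities, divided by the overall "yes" rate. A minimal sketch reusing test.li.prob and ac.numeric (n.top and top10 are hypothetical names; small differences from 7.359 can arise from rounding and tie handling):
n.top <- ceiling(0.1 * length(ac.numeric)) # size of the top decile
top10 <- ac.numeric[order(test.li.prob, decreasing = TRUE)][1:n.top] # actual classes in the top decile
mean(top10) / mean(ac.numeric) # hand-computed top-decile lift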
To find the optimal parameters for a support vector machine with the Polynomial kernel, a "Random Search" is again performed first.
fitControl <- trainControl(method="cv", number=5,
                           search = "random", classProbs = TRUE) # 5-Fold Cross Validation
# classProbs = TRUE is required to output predicted probabilities
set.seed(100) # Fix the seed for cross validation
caret.rd.pl <- train(Personal.Loan~.,
                     data = UB.trd,
                     method ="svmPoly", # Tune parameters : Degree, Scale, Cost
                     trControl = fitControl,
                     tuneLength = 10) # tuneLength : number of candidate parameter values to search
caret.rd.pl
Support Vector Machines with Polynomial Kernel
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
degree scale C Accuracy Kappa
1 8.046154e-04 9.4247259 0.9525910 0.7201751
2 8.104079e-05 324.4389194 0.9583036 0.7496066
2 1.215221e-04 93.9957842 0.9560179 0.7369367
2 6.923903e-03 5.0092337 0.9611608 0.7671047
2 7.141716e-03 1.1731437 0.9560212 0.7395297
2 4.563003e-02 0.5609159 0.9737273 0.8491704
2 9.339513e-02 635.7719748 0.9640228 0.7886702
3 7.856877e-04 300.1008960 0.9634449 0.7804680
3 1.686438e-03 43.0849494 0.9651591 0.7900019
3 5.861281e-02 486.9318718 0.9617387 0.7681053
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were degree = 2, scale
= 0.04563003 and C = 0.5609159.
plot(caret.rd.pl) # Accuracy
Accuracy is highest at degree = 2, scale = 0.04563003, and C = 0.5609159. Using these values as a reference, a Grid Search over nearby candidate values is performed to find the optimal parameters.
fitControl <- trainControl(method="cv", number=5, classProbs = TRUE) # 5-Fold Cross Validation
customGrid <- expand.grid(degree = 1:3,
                          scale = seq(0.044, 0.048, by=0.001),
                          C = seq(0.54, 0.58, by=0.01)) # Search around the best parameters from the Random Search
set.seed(100) # Fix the seed for cross validation
caret.gd.pl <- train(Personal.Loan~.,
data=UB.trd,
method="svmPoly",
trControl = fitControl,
tuneGrid = customGrid)
caret.gd.pl
Support Vector Machines with Polynomial Kernel
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
degree scale C Accuracy Kappa
1 0.044 0.54 0.9548767 0.7294403
1 0.044 0.55 0.9543053 0.7253132
1 0.044 0.56 0.9571591 0.7411896
1 0.044 0.57 0.9548767 0.7309967
1 0.044 0.58 0.9548767 0.7295914
1 0.045 0.54 0.9543053 0.7253132
1 0.045 0.55 0.9548767 0.7280351
1 0.045 0.56 0.9543053 0.7255243
1 0.045 0.57 0.9554481 0.7337949
1 0.045 0.58 0.9548767 0.7294403
1 0.046 0.54 0.9543053 0.7255243
1 0.046 0.55 0.9554481 0.7337949
1 0.046 0.56 0.9554481 0.7306133
1 0.046 0.57 0.9565910 0.7374654
1 0.046 0.58 0.9554481 0.7306133
1 0.047 0.54 0.9548767 0.7280351
1 0.047 0.55 0.9554481 0.7322385
1 0.047 0.56 0.9543053 0.7269296
1 0.047 0.57 0.9560179 0.7347746
1 0.047 0.58 0.9560179 0.7377363
1 0.048 0.54 0.9560179 0.7377363
1 0.048 0.55 0.9554481 0.7322385
1 0.048 0.56 0.9548767 0.7280351
1 0.048 0.57 0.9560195 0.7348168
1 0.048 0.58 0.9565893 0.7397349
2 0.044 0.54 0.9720130 0.8381218
2 0.044 0.55 0.9714416 0.8336485
2 0.044 0.56 0.9714416 0.8335522
2 0.044 0.57 0.9725845 0.8394412
2 0.044 0.58 0.9725845 0.8411238
2 0.045 0.54 0.9714416 0.8347900
2 0.045 0.55 0.9731559 0.8435490
2 0.045 0.56 0.9720147 0.8388161
2 0.045 0.57 0.9720130 0.8374021
2 0.045 0.58 0.9725845 0.8413149
2 0.046 0.54 0.9725861 0.8423059
2 0.046 0.55 0.9725861 0.8412753
2 0.046 0.56 0.9725861 0.8412753
2 0.046 0.57 0.9737273 0.8478647
2 0.046 0.58 0.9731559 0.8448464
2 0.047 0.54 0.9720147 0.8375276
2 0.047 0.55 0.9720147 0.8375276
2 0.047 0.56 0.9725845 0.8413149
2 0.047 0.57 0.9725845 0.8401044
2 0.047 0.58 0.9720130 0.8373726
2 0.048 0.54 0.9731575 0.8450015
2 0.048 0.55 0.9737273 0.8491704
2 0.048 0.56 0.9731575 0.8450015
2 0.048 0.57 0.9725845 0.8402843
2 0.048 0.58 0.9720147 0.8385582
3 0.044 0.54 0.9697338 0.8250635
3 0.044 0.55 0.9708767 0.8314637
3 0.044 0.56 0.9708767 0.8315284
3 0.044 0.57 0.9720179 0.8380908
3 0.044 0.58 0.9731608 0.8447155
3 0.045 0.54 0.9714481 0.8354477
3 0.045 0.55 0.9725893 0.8410166
3 0.045 0.56 0.9725910 0.8412065
3 0.045 0.57 0.9725910 0.8412188
3 0.045 0.58 0.9731608 0.8434497
3 0.046 0.54 0.9725893 0.8413628
3 0.046 0.55 0.9714465 0.8341349
3 0.046 0.56 0.9720179 0.8364597
3 0.046 0.57 0.9720195 0.8378405
3 0.046 0.58 0.9731624 0.8441446
3 0.047 0.54 0.9703053 0.8268941
3 0.047 0.55 0.9714465 0.8352712
3 0.047 0.56 0.9720195 0.8383592
3 0.047 0.57 0.9714465 0.8335216
3 0.047 0.58 0.9714481 0.8337327
3 0.048 0.54 0.9708767 0.8308504
3 0.048 0.55 0.9720195 0.8381266
3 0.048 0.56 0.9720179 0.8369867
3 0.048 0.57 0.9708767 0.8308504
3 0.048 0.58 0.9720179 0.8361489
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were degree = 2, scale =
0.048 and C = 0.55.
plot(caret.gd.pl) # Accuracy
Accuracy is highest at degree = 2, scale = 0.048, and C = 0.55, and is identical to the Random Search result.
# Final model
caret.gd.pl$finalModel
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 0.55
Polynomial kernel function.
Hyperparameters : degree = 2 scale = 0.048 offset = 1
Number of Support Vectors : 211
Objective Function Value : -80.5831
Training error : 0.022273
Probability model included.
# Predict the classes of the Test Data with the fitted model
caret.gd.pl.pred <- predict(caret.gd.pl, newdata=UB.ted) # predict(SVM model, Test Data)
confusionMatrix(caret.gd.pl.pred, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics
Reference
Prediction no yes
no 668 17
yes 5 59
Accuracy : 0.9706
95% CI : (0.9559, 0.9815)
No Information Rate : 0.8985
P-Value [Acc > NIR] : 3.493e-14
Kappa : 0.8268
Mcnemar's Test P-Value : 0.01902
Sensitivity : 0.77632
Specificity : 0.99257
Pos Pred Value : 0.92188
Neg Pred Value : 0.97518
Prevalence : 0.10147
Detection Rate : 0.07877
Detection Prevalence : 0.08545
Balanced Accuracy : 0.88444
'Positive' Class : yes
pacman::p_load("pROC")
test.pl.prob <- predict(caret.gd.pl, newdata = UB.ted, type="prob") # Predicted probability of each class of the Test Data, from the model fitted on the Training Data
test.pl.prob <- test.pl.prob[,2] # Predicted probability of "yes"
ac <- UB.ted$Personal.Loan # Actual class
pp <- as.numeric(test.pl.prob) # Predicted probability of "yes"
tree.roc <- roc(ac, pp, plot = T, col = "red") # roc(actual class, predicted probability)
auc <- round(auc(tree.roc),3)
legend("bottomright",legend=auc, bty="n")
detach(package:pROC)
pacman::p_load("devtools", "Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(pp, ac, plot="ROC") # ROC(predicted probability, actual class) / can also estimate the optimal cutoff value
detach(package:Epi)
pacman::p_load("ROCR")
pl.pred <- prediction(test.pl.prob, UB.ted$Personal.Loan) # prediction(predicted probability, actual class)
pl.perf <- performance(pl.pred, "tpr", "fpr") # performance(prediction object, "sensitivity", "1-specificity")
plot(pl.perf, col="red") # ROC Curve
abline(0,1, col="black")
perf.auc <- performance(pl.pred, "auc") # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright", legend=auc, bty="n")
pl.perf <- performance(pl.pred, "lift", "rpp") # Lift Chart
plot(pl.perf, colorize=T, lwd=2)
detach(package:ROCR)
pacman::p_load("lift")
ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes", 1, 0) # Convert the actual class to numeric
plotLift(test.pl.prob, ac.numeric, cumulative = T, n.buckets = 24) # plotLift(predicted probability, actual class)
TopDecileLift(test.pl.prob, ac.numeric) # Lift for the top 10%
[1] 8.278
detach(package:lift)
To find the optimal parameters for a support vector machine with the Radial Basis kernel, a "Random Search" is again performed first.
fitControl <- trainControl(method="cv", number=5,
                           search = "random", classProbs = TRUE) # 5-Fold Cross Validation
# classProbs = TRUE is required to output predicted probabilities
set.seed(100) # Fix the seed for cross validation
caret.rd.rbf <- train(Personal.Loan~.,
                      data=UB.trd,
                      method="svmRadial", # Tune parameters : Sigma, Cost
                      trControl = fitControl,
                      tuneLength=20) # tuneLength : number of candidate parameter values to search
caret.rd.rbf
Support Vector Machines with Radial Basis Function Kernel
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
sigma C Accuracy Kappa
0.008214323 0.60401863 0.9560147 0.7451182
0.009768217 905.87377461 0.9714432 0.8335970
0.010432256 1.38905870 0.9605910 0.7673322
0.014619919 22.18681328 0.9743036 0.8517002
0.015575790 159.84518460 0.9754432 0.8595213
0.016518708 18.86122090 0.9748751 0.8541785
0.020658156 31.49843758 0.9754449 0.8596771
0.020863473 332.66065502 0.9685926 0.8160143
0.025179572 4.40210886 0.9720212 0.8387323
0.028615078 3.20586906 0.9714497 0.8357256
0.028821683 1.76570925 0.9743036 0.8532309
0.031304626 4.82668520 0.9725893 0.8404095
0.046563920 3.96452937 0.9685926 0.8192963
0.047259075 75.37896521 0.9680260 0.8158490
0.062120614 0.76206785 0.9680163 0.8161385
0.070528378 1.32445041 0.9697322 0.8263388
0.106177393 0.42395243 0.9668718 0.8185654
0.118250539 0.09815856 0.9645861 0.8103345
0.126313873 98.11634013 0.9645942 0.7950592
0.144773401 63.82696989 0.9634514 0.7914127
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were sigma = 0.02065816 and C
= 31.49844.
plot(caret.rd.rbf) # Accuracy
Accuracy is highest at sigma = 0.02065816 and C = 31.49844. Using these values as a reference, a Grid Search over nearby candidate values is performed to find the optimal parameters.
fitControl <- trainControl(method="cv", number=5, classProbs = TRUE) # 5-Fold Cross Validation
customGrid <- expand.grid(sigma = seq(0.018, 0.022, by=0.001),
                          C = seq(31.48, 31.52, by=0.01)) # Search around the best parameters from the Random Search
set.seed(100) # Fix the seed for cross validation
caret.gd.rbf <- train(Personal.Loan~.,
data=UB.trd,
method="svmRadial",
trControl = fitControl,
tuneGrid = customGrid)
caret.gd.rbf
Support Vector Machines with Radial Basis Function Kernel
1751 samples
12 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1401, 1401, 1400, 1401, 1401
Resampling results across tuning parameters:
sigma C Accuracy Kappa
0.018 31.48 0.9743020 0.8530644
0.018 31.49 0.9754449 0.8589044
0.018 31.50 0.9754449 0.8587737
0.018 31.51 0.9748734 0.8566035
0.018 31.52 0.9754449 0.8589044
0.019 31.48 0.9748734 0.8565388
0.019 31.49 0.9760163 0.8622776
0.019 31.50 0.9743020 0.8530208
0.019 31.51 0.9754449 0.8586279
0.019 31.52 0.9754449 0.8580155
0.020 31.48 0.9748734 0.8564963
0.020 31.49 0.9754449 0.8593869
0.020 31.50 0.9754449 0.8593869
0.020 31.51 0.9754449 0.8593869
0.020 31.52 0.9760163 0.8622776
0.021 31.48 0.9743020 0.8530208
0.021 31.49 0.9754449 0.8593869
0.021 31.50 0.9743020 0.8530208
0.021 31.51 0.9754449 0.8587301
0.021 31.52 0.9754449 0.8587301
0.022 31.48 0.9748734 0.8558395
0.022 31.49 0.9737306 0.8507691
0.022 31.50 0.9754449 0.8587301
0.022 31.51 0.9760163 0.8625677
0.022 31.52 0.9760163 0.8625677
Accuracy was used to select the optimal model using the
largest value.
The final values used for the model were sigma = 0.019 and C = 31.49.
plot(caret.gd.rbf) # Accuracy
Accuracy is highest at sigma = 0.019 and C = 31.49, an increase of about 0.001 over the Random Search result.
# Final model
caret.gd.rbf$finalModel
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 31.49
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.019
Number of Support Vectors : 164
Objective Function Value : -2119.071
Training error : 0.009709
Probability model included.
# Predict the classes of the Test Data with the fitted model
caret.gd.rbf.pred <- predict(caret.gd.rbf, newdata=UB.ted) # predict(SVM model, Test Data)
confusionMatrix(caret.gd.rbf.pred, UB.ted$Personal.Loan, positive = "yes") # confusionMatrix(predicted class, actual class, positive = "class of interest")
Confusion Matrix and Statistics
Reference
Prediction no yes
no 667 11
yes 6 65
Accuracy : 0.9773
95% CI : (0.9639, 0.9867)
No Information Rate : 0.8985
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8718
Mcnemar's Test P-Value : 0.332
Sensitivity : 0.85526
Specificity : 0.99108
Pos Pred Value : 0.91549
Neg Pred Value : 0.98378
Prevalence : 0.10147
Detection Rate : 0.08678
Detection Prevalence : 0.09479
Balanced Accuracy : 0.92317
'Positive' Class : yes
pacman::p_load("pROC")
test.rbf.prob <- predict(caret.gd.rbf, newdata = UB.ted, type="prob") # Predicted probability of each class of the Test Data, from the model fitted on the Training Data
test.rbf.prob <- test.rbf.prob[,2] # Predicted probability of "yes"
ac <- UB.ted$Personal.Loan # Actual class
pp <- as.numeric(test.rbf.prob) # Predicted probability of "yes"
tree.roc <- roc(ac, pp, plot = T, col = "red") # roc(actual class, predicted probability)
auc <- round(auc(tree.roc),3)
legend("bottomright",legend=auc, bty="n")
detach(package:pROC)
pacman::p_load("devtools", "Epi")
# install_version("etm", version = "1.1", repos = "http://cran.us.r-project.org")
ROC(pp, ac, plot="ROC") # ROC(predicted probability, actual class) / can also estimate the optimal cutoff value
detach(package:Epi)
pacman::p_load("ROCR")
rbf.pred <- prediction(test.rbf.prob, UB.ted$Personal.Loan) # prediction(predicted probability, actual class)
rbf.perf <- performance(rbf.pred, "tpr", "fpr") # performance(prediction object, "sensitivity", "1-specificity")
plot(rbf.perf, col="red") # ROC Curve
abline(0,1, col="black")
perf.auc <- performance(rbf.pred, "auc") # AUC
auc <- attributes(perf.auc)$y.values
legend("bottomright", legend=auc, bty="n")
rbf.perf <- performance(rbf.pred, "lift", "rpp") # Lift Chart
plot(rbf.perf, colorize=T, lwd=2)
detach(package:ROCR)
pacman::p_load("lift")
ac.numeric <- ifelse(UB.ted$Personal.Loan=="yes", 1, 0) # Convert the actual class to numeric
plotLift(test.rbf.prob, ac.numeric, cumulative = T, n.buckets = 24) # plotLift(predicted probability, actual class)
TopDecileLift(test.rbf.prob, ac.numeric) # Lift for the top 10%
[1] 8.804
detach(package:lift)
Finally, the three fitted models are compared on the Test Data, first by test error rate and then by ROC curve and AUC.
pacman::p_load("tidyverse")
prev.class <- data.frame(linear = caret.gd.li.pred, poly = caret.gd.pl.pred, radial = caret.gd.rbf.pred, obs = UB.ted$Personal.Loan)
prev.class %>%
  summarise_all(list(err = ~mean(obs != .))) %>% # Test error rate of each model
  select(-obs_err) %>%
  round(3)
linear_err poly_err radial_err
1 0.047 0.029 0.023
pacman::p_load("plotROC")
prev.prob <- data.frame(linear=test.li.prob, poly=test.pl.prob,radial=test.rbf.prob,obs=UB.ted$Personal.Loan)
df.roc <- prev.prob %>%
  gather(key=Method, value=score, linear, poly, radial) # score : predicted probability
ggroc <- ggplot(df.roc) +
aes(d=obs,m=score,color=Method) +
geom_roc() + # label : Cutoff Value
theme_classic()
ggroc
calc_auc(ggroc) # AUC
PANEL group AUC
1 1 1 0.9395284
2 1 2 0.9739188
3 1 3 0.9821889
Groups 1-3 correspond to the linear, poly, and radial methods; the Radial Basis kernel attains the highest AUC, in line with its lowest test error rate.