Stacking based on Tidymodels

Machine Learning

R Code Using the Tidymodels Package for Stacking

Yeongeun Jeon, Jung In Seo
05-10-2022

Package tidymodels (ver 0.2.0) is a collection of packages that makes it possible to perform machine learning in R following tidyverse principles. Because it bundles most of the packages essential to modeling, every step from data preprocessing through visualization, modeling, and prediction can be carried out within a tidy framework. It also fully replaces Package caret, allowing modeling with faster and more intuitive code. To illustrate how to perform stacking with Package tidymodels, the "Heart Disease Prediction" data are used as an example. This data set covers 918 patients and, together with the target, comprises 10 variables for predicting heart disease (source: Package MLDataR, Gary Hutson 2021). The target is HeartDisease.



Individual models used for stacking: Support Vector Machine, Random Forest, XGBoost


1. Loading the Data

# install.packages("tidymodels")
pacman::p_load("MLDataR",                                              # For Data
               "data.table", "magrittr",
               "tidymodels",
               "stacks",                                               # For Stacking
               "doParallel", "parallel")

registerDoParallel(cores = detectCores())                              # Register a parallel backend


data(heartdisease)
data <- heartdisease %>%
  mutate(HeartDisease = ifelse(HeartDisease == 0, "no", "yes"))        # Recode the target as no/yes


cols <- c("Sex", "RestingECG", "Angina", "HeartDisease")               # Categorical variables

data <- data %>% 
  mutate_at(cols, as.factor)                                           # Convert them to factors


glimpse(data)                                                          
Rows: 918
Columns: 10
$ Age              <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37,~
$ Sex              <fct> M, F, M, F, M, M, F, M, M, F, F, M, M, M, F~
$ RestingBP        <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140~
$ Cholesterol      <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207~
$ FastingBS        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ RestingECG       <fct> Normal, Normal, ST, Normal, Normal, Normal,~
$ MaxHR            <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130,~
$ Angina           <fct> N, N, N, Y, N, N, N, N, Y, N, N, Y, N, Y, N~
$ HeartPeakReading <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5~
$ HeartDisease     <fct> no, yes, no, yes, no, no, no, no, yes, no, ~

Caution! For binary classification problems in stacking, the class of interest is the second level of the target, so the factor levels must be set accordingly.
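For example, the level order can be verified as below (a minimal sketch, not part of the original code); here "yes" is already the second level because factor levels are sorted alphabetically, and the commented line shows how to force the order if it were not.

levels(data$HeartDisease)                                              # The second level is the event class
# [1] "no"  "yes"

# Hypothetical: force the desired level order if it were not already correct
# data$HeartDisease <- factor(data$HeartDisease, levels = c("no", "yes"))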


2. Splitting the Data

set.seed(100)                                                          
data.split <- initial_split(data, prop = 0.7, strata = HeartDisease)   # 70/30 split stratified by the target
HD.train   <- training(data.split)
HD.test    <- testing(data.split)
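As a quick sanity check (a sketch, not in the original analysis), the stratified split should preserve the class proportions of HeartDisease in both sets:

# Compare the class proportions of the target across the two sets
HD.train %>% count(HeartDisease) %>% mutate(prop = n / sum(n))
HD.test  %>% count(HeartDisease) %>% mutate(prop = n / sum(n))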

3. Stacking


3-1. Stacking with fit_resamples()


3-1-1. Defining and Evaluating the Candidate Models


3-1-1-1. Defining the Preprocessing

rec  <- recipe(HeartDisease ~ ., data = HD.train) %>%                  # recipe(formula, data)
  step_normalize(all_numeric_predictors()) %>%                         # Standardize numeric predictors
  step_dummy(all_nominal_predictors(), one_hot = TRUE)                 # One-hot encode categorical predictors
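To see what the recipe actually produces, the preprocessed training set can be inspected (an optional sketch, not in the original code):

# Estimate the preprocessing steps on the training data and inspect the result
rec %>% prep() %>% juice() %>% glimpse()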

3-1-1-2. Defining the Models

# Support Vector Machine (Kernel : Radial Basis Function)
svm.rbf.mod <- svm_rbf(cost      = 2.95,                               
                       rbf_sigma = 0.00460) %>%                        
  set_mode("classification") %>%                                       
  set_engine("kernlab")                                                


svm.rbf.mod %>% 
  translate()
Radial Basis Function Support Vector Machine Specification (classification)

Main Arguments:
  cost = 2.95
  rbf_sigma = 0.0046

Computational engine: kernlab 

Model fit template:
kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = 2.95, 
    kernel = "rbfdot", prob.model = TRUE, kpar = list(sigma = ~0.0046))
# Random Forest
rf.mod <- rand_forest(mtry  = 5,                                       
                      trees = 551,                                     
                      min_n = 5) %>%                                   
  set_mode("classification") %>%                                       
  set_engine("randomForest" ,                                          
             importance = TRUE)


rf.mod %>%
  translate()
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 5
  trees = 551
  min_n = 5

Engine-Specific Arguments:
  importance = TRUE

Computational engine: randomForest 

Model fit template:
randomForest::randomForest(x = missing_arg(), y = missing_arg(), 
    mtry = min_cols(~5, x), ntree = 551, nodesize = min_rows(~5, 
        x), importance = TRUE)
# XGBoost
xgb.mod <- boost_tree(mtry           = 12,                             
                      trees          = 881,                           
                      tree_depth     = 9,                              
                      learn_rate     = 0.00250,                        
                      min_n          = 2,                              
                      loss_reduction = 1.88,                           
                      sample_size    = 0.628,                          
                      stop_iter      = 10) %>%                         
  set_mode("classification") %>%                                       
  set_engine("xgboost")                                                           


xgb.mod %>% 
  translate()
Boosted Tree Model Specification (classification)

Main Arguments:
  mtry = 12
  trees = 881
  min_n = 2
  tree_depth = 9
  learn_rate = 0.0025
  loss_reduction = 1.88
  sample_size = 0.628
  stop_iter = 10

Computational engine: xgboost 

Model fit template:
parsnip::xgb_train(x = missing_arg(), y = missing_arg(), colsample_bynode = 12, 
    nrounds = 881, min_child_weight = 2, max_depth = 9, eta = 0.0025, 
    gamma = 1.88, subsample = 0.628, early_stop = 10, nthread = 1, 
    verbose = 0)

3-1-1-3. Defining the Workflow Set

ml.models <- workflow_set(preproc = list(basic = rec),                 
                          models  = list(svm  = svm.rbf.mod,          
                                         rf   = rf.mod,  
                                         xgb  = xgb.mod),   
                          cross = TRUE)                                

ml.models                                                           
# A workflow set/tibble: 3 x 4
  wflow_id  info             option    result    
  <chr>     <list>           <list>    <list>    
1 basic_svm <tibble [1 x 4]> <opts[0]> <list [0]>
2 basic_rf  <tibble [1 x 4]> <opts[0]> <list [0]>
3 basic_xgb <tibble [1 x 4]> <opts[0]> <list [0]>

Result! A total of three workflows were created.


3-1-1-4. Resampling

ml.models.re <- ml.models %>%
  workflow_map("fit_resamples",                                             
               seed = 100, verbose = TRUE,                                  # Options to workflow_map()
               # Options to fit_resamples() 
               resamples = vfold_cv(HD.train, v = 5),                       
               control   = control_resamples(save_pred = TRUE,              
                                             parallel_over = "everything",  
                                             save_workflow = TRUE),        
               metrics = metric_set(roc_auc, accuracy))                    

ml.models.re   
# A workflow set/tibble: 3 x 4
  wflow_id  info             option    result   
  <chr>     <list>           <list>    <list>   
1 basic_svm <tibble [1 x 4]> <opts[3]> <rsmp[+]>
2 basic_rf  <tibble [1 x 4]> <opts[3]> <rsmp[+]>
3 basic_xgb <tibble [1 x 4]> <opts[3]> <rsmp[+]>

Caution! For classification problems, the prediction results for each class are saved for every candidate model.
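Before stacking, the resampling performance of the candidates can also be compared directly (an optional sketch, not in the original code):

# Summarize the resampling metrics of each candidate workflow
collect_metrics(ml.models.re)

# Rank the candidate workflows by AUC
rank_results(ml.models.re, rank_metric = "roc_auc")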


3-1-2. Building the Data Set for the Meta-Model (Adding the Candidate Models' Predictions)

3-1-2-1. Initializing the Data Stack Object

stack.candidate <- stacks()                                            # Initialize an empty data stack
stack.candidate
# A data stack with 0 model definitions and 0 candidate members.

3-1-2-2. Adding the Candidate Models' Predictions to the Data Stack

stack.candidate %<>%                                                   # %<>% : magrittr assignment pipe
  add_candidates(ml.models.re)                                         # Add the candidates' resampling predictions
stack.candidate
# A data stack with 3 model definitions and 3 candidate members:
#   basic_svm: 1 model configuration
#   basic_rf: 1 model configuration
#   basic_xgb: 1 model configuration
# Outcome: HeartDisease (factor)
as_tibble(stack.candidate)                                             # The data stack as a tibble
# A tibble: 642 x 7
   HeartDisease .pred_no_basic_svm_~ .pred_yes_basic~ .pred_no_basic_~
   <fct>                       <dbl>            <dbl>            <dbl>
 1 no                          0.855           0.145             0.877
 2 no                          0.757           0.243             0.746
 3 no                          0.672           0.328             0.644
 4 no                          0.878           0.122             0.735
 5 no                          0.912           0.0877            0.978
 6 no                          0.833           0.167             0.926
 7 no                          0.794           0.206             0.784
 8 no                          0.916           0.0839            0.996
 9 no                          0.700           0.300             0.666
10 no                          0.862           0.138             0.969
# ... with 632 more rows, and 3 more variables:
#   .pred_yes_basic_rf_1_1 <dbl>, .pred_no_basic_xgb_1_1 <dbl>,
#   .pred_yes_basic_xgb_1_1 <dbl>

Result! The first column is the actual target of the assessment sets, and the columns from the second onward hold the candidate models' predictions.


3-1-3. Fitting the Meta-Model

set.seed(100)
ens <- blend_predictions(stack.candidate,
                         penalty = 10^(-2:-1),                         # Candidate penalty values for the meta-learner
                         mixture = 1,                                  # 1 = LASSO (pure L1) penalty
                         non_negative = TRUE)                          # Constrain stacking coefficients to be nonnegative
ens  
# A tibble: 3 x 3
  member                  type        weight
  <chr>                   <chr>        <dbl>
1 .pred_yes_basic_xgb_1_1 boost_tree   74.1 
2 .pred_yes_basic_rf_1_1  rand_forest   2.19
3 .pred_yes_basic_svm_1_1 svm_rbf       1.80

Caution! For binary classification problems, the predicted probabilities for the first class (level) are dropped, which is why the class of interest must be the second level. In this example, the second level of HeartDisease is "yes", so the predicted probabilities for "yes" are combined.
Result! The stacking coefficients (weights) of the candidate models are all nonzero, so the predictions of all candidates are combined to produce the final prediction.

autoplot(ens) +
  theme_bw()
autoplot(ens, "weights") +
  geom_text(aes(x = weight + 0.01, label = model), hjust = 0) +
  theme_bw() + 
  theme(legend.position = "none") +
  lims(x=c(-0.01, 80))

Caution! The optimal penalty value is selected using bootstrap resampling.
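If a more stable choice is desired, the number of bootstrap resamples can be increased through the times argument of blend_predictions() (a hedged sketch; in stacks, times defaults to 25):

# Hypothetical: select the penalty with 50 bootstrap resamples instead of 25
# ens <- blend_predictions(stack.candidate, penalty = 10^(-2:-1), mixture = 1, times = 50)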

ens[["equations"]][["prob"]][[".pred_yes"]]
stats::binomial()$linkinv(-38.9986290185502 + (.pred_yes_basic_svm_1_1 * 
    1.80175373899096) + (.pred_yes_basic_rf_1_1 * 2.19381682641345) + 
    (.pred_yes_basic_xgb_1_1 * 74.1497432054049))
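That is, the meta-model is a logistic regression: the stacked probability is the inverse logit of a weighted sum of the members' predicted probabilities for "yes". A hand computation with hypothetical member probabilities illustrates this:

# Hypothetical predicted probabilities for "yes" from the three members
svm.p <- 0.9; rf.p <- 0.8; xgb.p <- 0.7

# Inverse logit of the weighted sum, using the coefficients printed above
stats::binomial()$linkinv(-38.9986290185502 + 1.80175373899096 * svm.p +
                            2.19381682641345 * rf.p + 74.1497432054049 * xgb.p)
# Close to 1, so the ensemble would predict "yes"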

3-1-4. Fitting the Final Member Models

set.seed(100)
fit.ens <- fit_members(ens)                                            # Fit the members with nonzero weights on the whole training set
fit.ens
# A tibble: 3 x 3
  member                  type        weight
  <chr>                   <chr>        <dbl>
1 .pred_yes_basic_xgb_1_1 boost_tree   74.1 
2 .pred_yes_basic_rf_1_1  rand_forest   2.19
3 .pred_yes_basic_svm_1_1 svm_rbf       1.80

3-1-5. Prediction

ens.pred <- predict(fit.ens, HD.test) %>%                                    # predict(, type = "class" / "prob")
  bind_cols(HD.test)
ens.pred
# A tibble: 276 x 11
   .pred_class   Age Sex   RestingBP Cholesterol FastingBS RestingECG
   <fct>       <dbl> <fct>     <dbl>       <dbl>     <dbl> <fct>     
 1 no             54 M           110         208         0 Normal    
 2 yes            37 M           140         207         0 Normal    
 3 no             37 F           130         211         0 Normal    
 4 no             39 M           120         204         0 Normal    
 5 yes            49 M           140         234         0 Normal    
 6 no             42 F           115         211         0 ST        
 7 no             60 M           100         248         0 Normal    
 8 no             36 M           120         267         0 Normal    
 9 no             43 F           100         223         0 Normal    
10 no             36 M           130         209         0 Normal    
# ... with 266 more rows, and 4 more variables: MaxHR <dbl>,
#   Angina <fct>, HeartPeakReading <dbl>, HeartDisease <fct>
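Class probabilities are available as well (a sketch; as the comment above notes, predict() for a model stack also accepts type = "prob"):

# Predicted probabilities instead of hard class labels
predict(fit.ens, HD.test, type = "prob") %>% head()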
classification_metrics <- metric_set(accuracy, mcc, 
                                     f_meas, kap,
                                     sens, spec)                             # Assessment Measure
classification_metrics(ens.pred, truth = HeartDisease,                       
                       estimate = .pred_class)              
# A tibble: 6 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.793
2 mcc      binary         0.582
3 f_meas   binary         0.767
4 kap      binary         0.582
5 sens     binary         0.764
6 spec     binary         0.817
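A confusion matrix presents the same performance as raw counts (an optional sketch, not in the original code):

# Cross-tabulate predicted vs. actual classes
conf_mat(ens.pred, truth = HeartDisease, estimate = .pred_class)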
member_preds <- HD.test %>%
  select(HeartDisease) %>%
  bind_cols(predict(fit.ens, HD.test, members = TRUE))                       # predict(, members = TRUE)
member_preds
# A tibble: 276 x 5
   HeartDisease .pred_class .pred_class_basic_svm_1_1 .pred_class_bas~
   <fct>        <fct>       <fct>                     <fct>           
 1 no           no          no                        no              
 2 yes          yes         yes                       yes             
 3 no           no          no                        no              
 4 no           no          no                        no              
 5 yes          yes         yes                       yes             
 6 no           no          no                        no              
 7 yes          no          no                        yes             
 8 yes          no          no                        no              
 9 no           no          no                        no              
10 no           no          no                        no              
# ... with 266 more rows, and 1 more variable:
#   .pred_class_basic_xgb_1_1 <fct>

Caution! predict(, members = TRUE) outputs each fitted member model's predictions on the test data.

map_dfr(member_preds, 
        accuracy,                                                                             
        truth = HeartDisease, data = member_preds) %>%               
  mutate(member = colnames(member_preds))
# A tibble: 5 x 4
  .metric  .estimator .estimate member                   
  <chr>    <chr>          <dbl> <chr>                    
1 accuracy binary         1     HeartDisease             
2 accuracy binary         0.793 .pred_class              
3 accuracy binary         0.790 .pred_class_basic_svm_1_1
4 accuracy binary         0.797 .pred_class_basic_rf_1_1 
5 accuracy binary         0.790 .pred_class_basic_xgb_1_1
# Ref. https://www.hfshr.xyz/posts/2020-11-30-model-stacking/
multi_metric <- metric_set(accuracy, kap)

map_dfr(
  member_preds,
  ~ multi_metric(
    member_preds,
    truth = HeartDisease,                                                   
    estimate = .x
  ),
  .id = "model"
) %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  filter(model != "HeartDisease") %>%                                        
  mutate(model = if_else(model == ".pred_class", "model_stack", model)) %>% 
  arrange(desc(accuracy)) %>% 
  mutate(across(where(is.numeric), ~ round(.x, 2)))
# A tibble: 4 x 3
  model                     accuracy   kap
  <chr>                        <dbl> <dbl>
1 .pred_class_basic_rf_1_1      0.8   0.59
2 model_stack                   0.79  0.58
3 .pred_class_basic_svm_1_1     0.79  0.57
4 .pred_class_basic_xgb_1_1     0.79  0.58

3-2. Stacking with tune_grid()


3-2-1. Defining and Evaluating the Candidate Models


3-2-1-1. Defining the Preprocessing

rec  <- recipe(HeartDisease ~ ., data = HD.train) %>%                  # recipe(formula, data)
  step_normalize(all_numeric_predictors()) %>%                         # Standardize numeric predictors
  step_dummy(all_nominal_predictors(), one_hot = TRUE)                 # One-hot encode categorical predictors

3-2-1-2. Defining the Models

# Support Vector Machine (Kernel : Radial Basis Function)
svm.rbf.mod <- svm_rbf(cost      = tune(),                            
                       rbf_sigma = tune()) %>%                         
  set_mode("classification") %>%                                      
  set_engine("kernlab")                                                



svm.rbf.mod %>% 
  translate()
Radial Basis Function Support Vector Machine Specification (classification)

Main Arguments:
  cost = tune()
  rbf_sigma = tune()

Computational engine: kernlab 

Model fit template:
kernlab::ksvm(x = missing_arg(), data = missing_arg(), C = tune(), 
    kernel = "rbfdot", prob.model = TRUE, kpar = list(sigma = ~tune()))
# Random Forest
rf.mod <- rand_forest(mtry  = tune(),                                 
                      trees = tune(),                                  
                      min_n = tune()) %>%                              
  set_mode("classification") %>%                                       
  set_engine("randomForest" ,                                          
             importance = TRUE)


rf.mod %>%
  translate()
Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()
  min_n = tune()

Engine-Specific Arguments:
  importance = TRUE

Computational engine: randomForest 

Model fit template:
randomForest::randomForest(x = missing_arg(), y = missing_arg(), 
    mtry = min_cols(~tune(), x), ntree = tune(), nodesize = min_rows(~tune(), 
        x), importance = TRUE)
# XGBoost
xgb.mod <- boost_tree(mtry           = tune(),                               
                      trees          = tune(),                         
                      tree_depth     = tune(),                         
                      learn_rate     = tune(),                         
                      min_n          = tune(),                         
                      loss_reduction = tune(),                         
                      sample_size    = tune(),                         
                      stop_iter      = tune()) %>%                    
  set_mode("classification") %>%                                       
  set_engine("xgboost")                                                  


xgb.mod %>% 
  translate()
Boosted Tree Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()
  min_n = tune()
  tree_depth = tune()
  learn_rate = tune()
  loss_reduction = tune()
  sample_size = tune()
  stop_iter = tune()

Computational engine: xgboost 

Model fit template:
parsnip::xgb_train(x = missing_arg(), y = missing_arg(), colsample_bynode = tune(), 
    nrounds = tune(), min_child_weight = tune(), max_depth = tune(), 
    eta = tune(), gamma = tune(), subsample = tune(), early_stop = tune(), 
    nthread = 1, verbose = 0)

3-2-1-3. Defining the Workflow Set

ml.models <- workflow_set(preproc = list(basic = rec),                
                          models  = list(svm  = svm.rbf.mod,          
                                         rf   = rf.mod,  
                                         xgb  = xgb.mod),   
                          cross = TRUE)                                

ml.models                                                             
# A workflow set/tibble: 3 x 4
  wflow_id  info             option    result    
  <chr>     <list>           <list>    <list>    
1 basic_svm <tibble [1 x 4]> <opts[0]> <list [0]>
2 basic_rf  <tibble [1 x 4]> <opts[0]> <list [0]>
3 basic_xgb <tibble [1 x 4]> <opts[0]> <list [0]>

Result! A total of three workflows were created.


3-2-1-4. Updating the Parameter Ranges

rf.param <- extract_parameter_set_dials(rf.mod) %>%                          
  update(mtry = mtry(c(1,                                              # Update the range of mtry:
                       ncol(select(juice(prep(rec)), -HeartDisease))   # upper bound = number of predictors after preprocessing
  ))) 

# rf.param %>%
#   extract_parameter_dials("mtry")

xgb.param <- extract_parameter_set_dials(xgb.mod) %>%                          
  update(mtry = mtry(c(1,                                              # Update the range of mtry:
                       ncol(select(juice(prep(rec)), -HeartDisease))   # upper bound = number of predictors after preprocessing
  ))) 

ml.models %<>%
  option_add(param_info = rf.param, id  = "basic_rf") %>%
  option_add(param_info = xgb.param, id = "basic_xgb")

ml.models 
# A workflow set/tibble: 3 x 4
  wflow_id  info             option    result    
  <chr>     <list>           <list>    <list>    
1 basic_svm <tibble [1 x 4]> <opts[0]> <list [0]>
2 basic_rf  <tibble [1 x 4]> <opts[1]> <list [0]>
3 basic_xgb <tibble [1 x 4]> <opts[1]> <list [0]>

Result! The option column of "basic_rf" and "basic_xgb" changed to opts[1].


3-2-1-5. Tuning the Parameters


ml.models.tune <- ml.models %>%
  workflow_map("tune_grid",                                                 
               seed = 100, verbose = TRUE,                                  # Options to workflow_map()
               # Options to tune_grid() 
               grid = 10,                                                   # Number of candidate parameter sets per workflow
               resamples = vfold_cv(HD.train, v = 5),                      
               control   = control_grid(save_pred = TRUE,                   
                                        parallel_over = "everything",       
                                        save_workflow = TRUE),              
               metrics = metric_set(roc_auc, accuracy))                    
ml.models.tune        
# A workflow set/tibble: 3 x 4
  wflow_id  info             option    result   
  <chr>     <list>           <list>    <list>   
1 basic_svm <tibble [1 x 4]> <opts[4]> <tune[+]>
2 basic_rf  <tibble [1 x 4]> <opts[5]> <tune[+]>
3 basic_xgb <tibble [1 x 4]> <opts[5]> <tune[+]>

Caution! For classification problems, the prediction results for each class are saved for every candidate model.
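The tuning results can also be compared visually across workflows before stacking (an optional sketch, not in the original code):

# Plot the resampled metrics of all tuned candidates
autoplot(ml.models.tune, metric = "roc_auc") +
  theme_bw()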


3-2-2. Building the Data Set for the Meta-Model (Adding the Candidate Models' Predictions)

3-2-2-1. Initializing the Data Stack Object

stack.candidate <- stacks()                                            # Initialize an empty data stack
stack.candidate
# A data stack with 0 model definitions and 0 candidate members.

3-2-2-2. Adding the Candidate Models' Predictions to the Data Stack

stack.candidate %<>%                                                   # %<>% : magrittr assignment pipe
  add_candidates(ml.models.tune)                                       # Add the tuned candidates' resampling predictions
stack.candidate
# A data stack with 3 model definitions and 30 candidate members:
#   basic_svm: 10 model configurations
#   basic_rf: 10 model configurations
#   basic_xgb: 10 model configurations
# Outcome: HeartDisease (factor)

Result! Because tuning found 10 different parameter sets for each model, 30 (10 × 3 models) candidate models were added.

as_tibble(stack.candidate)                                             # The data stack as a tibble
# A tibble: 642 x 61
   HeartDisease .pred_no_basic_svm_~ .pred_no_basic_~ .pred_no_basic_~
   <fct>                       <dbl>            <dbl>            <dbl>
 1 no                          0.451            0.451            0.991
 2 no                          0.450            0.450            0.799
 3 no                          0.430            0.430            0.928
 4 no                          0.441            0.441            0.990
 5 no                          0.430            0.430            0.997
 6 no                          0.463            0.463            0.993
 7 no                          0.463            0.463            0.993
 8 no                          0.450            0.450            0.996
 9 no                          0.441            0.441            0.971
10 no                          0.450            0.450            0.997
# ... with 632 more rows, and 57 more variables:
#   .pred_no_basic_svm_1_10 <dbl>, .pred_no_basic_svm_1_01 <dbl>,
#   .pred_no_basic_svm_1_03 <dbl>, .pred_no_basic_svm_1_05 <dbl>,
#   .pred_no_basic_svm_1_06 <dbl>, .pred_no_basic_svm_1_08 <dbl>,
#   .pred_no_basic_svm_1_09 <dbl>, .pred_yes_basic_svm_1_04 <dbl>,
#   .pred_yes_basic_svm_1_07 <dbl>, .pred_yes_basic_svm_1_02 <dbl>,
#   .pred_yes_basic_svm_1_10 <dbl>, ...

Result! The first column is the actual target of the assessment sets, and the columns from the second onward hold the candidate models' predictions.

collect_parameters(stack.candidate, "basic_rf")                                           
# A tibble: 10 x 4
   member         mtry trees min_n
   <chr>         <int> <int> <int>
 1 basic_rf_1_01     8  1762    22
 2 basic_rf_1_02    12   392    33
 3 basic_rf_1_03    10  1810    14
 4 basic_rf_1_04    13  1372     9
 5 basic_rf_1_05     1  1133    36
 6 basic_rf_1_06     4   733    25
 7 basic_rf_1_07     2    82    17
 8 basic_rf_1_08     5   551     5
 9 basic_rf_1_09     7   883    13
10 basic_rf_1_10     7  1537    30

Caution! The parameter values of the candidate models can be checked with collect_parameters().


3-2-3. Fitting the Meta-Model

set.seed(100)
ens <- blend_predictions(stack.candidate,
                         penalty = 10^(-2:-1),                         # Candidate penalty values for the meta-learner
                         mixture = 1,                                  # 1 = LASSO (pure L1) penalty
                         non_negative = TRUE)                          # Constrain stacking coefficients to be nonnegative
ens     
# A tibble: 2 x 3
  member                  type        weight
  <chr>                   <chr>        <dbl>
1 .pred_yes_basic_rf_1_07 rand_forest  2.14 
2 .pred_yes_basic_rf_1_09 rand_forest  0.813

Caution! For binary classification problems, the predicted probabilities for the first class (level) are dropped, which is why the class of interest must be the second level. In this example, the second level of HeartDisease is "yes", so the predicted probabilities for "yes" are combined.
Result! Only two models have nonzero stacking coefficients (weights). In other words, of the 30 candidate models, the predictions of just two are combined to produce the final prediction; because mixture = 1 corresponds to a LASSO penalty, most candidates are shrunk to exactly zero weight.

autoplot(ens) +
  theme_bw()
autoplot(ens, "weights") +
  geom_text(aes(x = weight + 0.01, label = model), hjust = 0) +
  theme_bw() + 
  theme(legend.position = "none") +
  lims(x = c(-0.01, 2.5)) 

Caution! The optimal penalty value is selected using bootstrap resampling.

ens[["equations"]][["prob"]][[".pred_yes"]]
stats::binomial()$linkinv(-1.37396293891493 + (.pred_yes_basic_rf_1_07 * 
    2.13901404602268) + (.pred_yes_basic_rf_1_09 * 0.812546524162881))

3-2-4. Fitting the Final Member Models


set.seed(100)
fit.ens <- fit_members(ens)                                            # Fit the members with nonzero weights on the whole training set
fit.ens
# A tibble: 2 x 3
  member                  type        weight
  <chr>                   <chr>        <dbl>
1 .pred_yes_basic_rf_1_07 rand_forest  2.14 
2 .pred_yes_basic_rf_1_09 rand_forest  0.813
fit.ens$member_fits$basic_rf_1_07 %>%
  extract_fit_engine()

Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~82L, mtry = min_cols(~2L,      x), nodesize = min_rows(~17L, x), importance = ~TRUE) 
               Type of random forest: classification
                     Number of trees: 82
No. of variables tried at each split: 2

        OOB estimate of  error rate: 18.54%
Confusion matrix:
     no yes class.error
no  229  58   0.2020906
yes  61 294   0.1718310

Caution! The underlying engine-level model can be inspected with extract_fit_engine().
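Because the extracted fit is a plain randomForest object, engine-level tools apply to it directly; for example (a sketch, not in the original code), variable importance:

# Variable importance from the underlying randomForest fit
fit.ens$member_fits$basic_rf_1_07 %>%
  extract_fit_engine() %>%
  randomForest::importance()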


3-2-5. Prediction

ens.pred <- predict(fit.ens, HD.test) %>%                                    # Recompute the test-set predictions with the re-fitted ensemble
  bind_cols(HD.test)

classification_metrics <- metric_set(accuracy, mcc, 
                                     f_meas, kap,
                                     sens, spec)                             # Assessment Measure
classification_metrics(ens.pred, truth = HeartDisease,                       
                       estimate = .pred_class)              
# A tibble: 6 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.793
2 mcc      binary         0.582
3 f_meas   binary         0.767
4 kap      binary         0.582
5 sens     binary         0.764
6 spec     binary         0.817
member_preds <- HD.test %>%
  select(HeartDisease) %>%
  bind_cols(predict(fit.ens, HD.test, members = TRUE))                       # predict(, members = TRUE)
member_preds
# A tibble: 276 x 4
   HeartDisease .pred_class .pred_class_basic_rf_1_07 .pred_class_bas~
   <fct>        <fct>       <fct>                     <fct>           
 1 no           no          no                        no              
 2 yes          yes         yes                       yes             
 3 no           no          no                        no              
 4 no           no          no                        no              
 5 yes          yes         yes                       yes             
 6 no           no          no                        no              
 7 yes          yes         yes                       yes             
 8 yes          no          no                        no              
 9 no           no          no                        no              
10 no           no          no                        no              
# ... with 266 more rows

Caution! predict(, members = TRUE) outputs each fitted member model's predictions on the test data.

map_dfr(member_preds, 
        accuracy,                                                                                
        truth = HeartDisease, data = member_preds) %>%                     
  mutate(member = colnames(member_preds))
# A tibble: 4 x 4
  .metric  .estimator .estimate member                   
  <chr>    <chr>          <dbl> <chr>                    
1 accuracy binary         1     HeartDisease             
2 accuracy binary         0.812 .pred_class              
3 accuracy binary         0.808 .pred_class_basic_rf_1_07
4 accuracy binary         0.812 .pred_class_basic_rf_1_09
# Ref. https://www.hfshr.xyz/posts/2020-11-30-model-stacking/
multi_metric <- metric_set(accuracy, kap)

map_dfr(
  member_preds,
  ~ multi_metric(
    member_preds,
    truth = HeartDisease,                                                   
    estimate = .x
  ),
  .id = "model"
) %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  filter(model != "HeartDisease") %>%                                        
  mutate(model = if_else(model == ".pred_class", "model_stack", model)) %>%  
  arrange(desc(accuracy)) %>% 
  mutate(across(where(is.numeric), ~ round(.x, 2)))
# A tibble: 3 x 3
  model                     accuracy   kap
  <chr>                        <dbl> <dbl>
1 model_stack                   0.81  0.62
2 .pred_class_basic_rf_1_09     0.81  0.62
3 .pred_class_basic_rf_1_07     0.81  0.61

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".