Package tidymodels (Ver 0.2.0)는 R에서 머신러닝(Machine Learning)을 tidyverse principle로 수행할 수 있게끔 해주는 패키지 묶음이다. 특히, 모델링에 필요한 필수 패키지들을 대부분 포함하고 있기 때문에 데이터 전처리부터 시각화, 모델링, 예측까지 모든 과정을 tidy framework로 진행할 수 있다. 또한, Package caret을 완벽하게 대체하며 보다 더 빠르고 직관적인 코드로 모델링을 수행할 수 있다. Package tidymodels를 이용하여 XGBoost를 수행하는 방법을 설명하기 위해 “Heart Disease Prediction” 데이터를 예제로 사용한다. 이 데이터는 환자의 심장병을 예측하기 위해 총 918명의 환자에 대한 10개의 예측변수로 이루어진 데이터이다(출처 : Package MLDataR, Gary Hutson 2021). 여기서 Target은 HeartDisease이다.

0. Schematic Diagram

1. 데이터 불러오기

# install.packages("tidymodels")
pacman::p_load("MLDataR",                                              # For Data
               "data.table", "magrittr",
               "tidymodels",
               "doParallel", "parallel")

registerDoParallel(cores=detectCores())


data(heartdisease)
data <- heartdisease %>%
  mutate(HeartDisease = ifelse(HeartDisease==0, "no", "yes"))


cols <- c("Sex", "RestingECG", "Angina", "HeartDisease")

data   <- data %>% 
  mutate_at(cols, as.factor)                                           # 범주형 변수 변환

glimpse(data)                                                          # 데이터 구조

Rows: 918
Columns: 10
$ Age              <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37,~
$ Sex              <fct> M, F, M, F, M, M, F, M, M, F, F, M, M, M, F~
$ RestingBP        <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140~
$ Cholesterol      <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207~
$ FastingBS        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ RestingECG       <fct> Normal, Normal, ST, Normal, Normal, Normal,~
$ MaxHR            <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130,~
$ Angina           <fct> N, N, N, Y, N, N, N, N, Y, N, N, Y, N, Y, N~
$ HeartPeakReading <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5~
$ HeartDisease     <fct> no, yes, no, yes, no, no, no, no, yes, no, ~

2. 데이터 분할

set.seed(100)                                                          # seed 고정
data.split <- initial_split(data, prop = 0.7, strata = HeartDisease)   # initial_split(, strata = 층화추출할 변수)NAHD.train   <- training(data.split)
HD.test    <- testing(data.split)

3. XGBoost

3-1. 전처리 정의

Workflow를 이용하기 위해 먼저 전처리를 정의한다.

rec  <- recipe(HeartDisease ~ ., data = HD.train) %>%                  # recipe(formula, data)
  step_normalize(all_numeric_predictors()) %>%                         # 모든 수치형 예측변수들을 표준화  step_dummy(all_nominal_predictors(), one_hot = TRUE)                 # 모든 범주형 예측변수들에 대해 원-핫 인코딩 더미변수 생성NA

3-2. 모형 정의

모형을 구축하기 위해 모형을 먼저 정의한다.
- 모형을 정의하기 위해 모형 타입(Type)과 모형 종류(set_mode) 그리고 사용할 패키지(set_engine)가 필요하다.
  - 모형 타입 : 사용하고자하는 머신러닝 함수 정의
    - XGBoost는 함수 boost_tree()를 사용한다.
  - 모형 종류 : Target 유형 정의
    - 분류(Classification) 또는 회귀(Regresssion) 중 하나를 선택한다.
  - 사용할 패키지 : 사용하고자하는 Package 정의
    - XGBoost는 Package xgboost를 사용할 수 있다.
XGBoost의 모수에는 mtry, trees, tree_depth, learn_rate, min_n, loss_reduction, sample_size, stop_iter이 있다.
- mtry : 노드를 분할할 때 랜덤하게 선택되는 예측변수 개수
- trees : 생성하고자 하는 트리의 개수
- tree_depth : 생성된 트리의 최대 깊이
- learn_rate : 학습률
- min_n : 노드를 분할하기 위해 필요한 최소 가중치 합
- loss_reduction : 노드 분할을 위한 최소 손실 감소값
- sample_size : 각 부스팅 단계에서 사용되는 훈련 데이터의 비율
- stop_iter : 조기종료를 위한 부스팅 횟수
튜닝하고 싶은 모수는 함수 tune()으로 지정한다.

xgb.tune.mod <- boost_tree(mtry           = tune(),                        # mtry(colsample_bynode) : 노드를 분할할 때 랜덤하게 선택되는 예측변수의 개수       
                           trees          = tune(),                        # trees(nrounds) : 생성하고자 하는 트리의 개수                            tree_depth     = tune(),                        # tree_depth(max_depth) : 생성된 트리의 최대 깊이                           learn_rate     = tune(),                        # learn_rate(eta) : 학습률                           min_n          = tune(),                        # min_n(min_child_weight) : 노드를 분할하기 위해 필요한 최소 가중치 합 NA= tune(),                        # loss_reduction(gamma) : 노드 분할을 위한 최소 손실 감소값                            sample_size    = tune(),                        # sample_size(subsample) : 각 부스팅 단계에서 사용되는 훈련 데이터의 비율
                           stop_iter      = tune()) %>%                    # stop_iter(early_stop) : 조기종료를 위한 부스팅 횟수  
  set_mode("classification") %>%                                           # Target 유형 정의(classification / regression)NAset_engine("xgboost")                                                    # 사용하고자하는 패키지 정의                                     NA# 실제 패키지에 어떻게 적용되는지 확인NAxgb.tune.mod %>% 
  translate()

Boosted Tree Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()
  min_n = tune()
  tree_depth = tune()
  learn_rate = tune()
  loss_reduction = tune()
  sample_size = tune()
  stop_iter = tune()

Computational engine: xgboost 

Model fit template:
parsnip::xgb_train(x = missing_arg(), y = missing_arg(), colsample_bynode = tune(), 
    nrounds = tune(), min_child_weight = tune(), max_depth = tune(), 
    eta = tune(), gamma = tune(), subsample = tune(), early_stop = tune(), 
    nthread = 1, verbose = 0)

Caution! 함수 translate()를 통해 위에서 정의한 “xgb.tune.mod”가 실제로 Package xgboost의 함수 xgb_train()에 어떻게 적용되는지 확인할 수 있다.

3-3. Workflow 정의

앞에서 정의한 전처리와 모형을 이용하여 Workflow를 정의한다.

xgb.tune.wflow <- workflow() %>%                                           # Workflow 이용  add_recipe(rec) %>%                                                      # 3-1에서 정의의
  add_model(xgb.tune.mod)                                                  # 3-2에서 정의의

3-4. 모수 범위 확인

함수 extract_parameter_set_dials()를 이용하여 모수들의 정보를 확인할 수 있다.

xgb.param <- extract_parameter_set_dials(xgb.tune.wflow)                 
xgb.param

Collection of 8 parameters for tuning

     identifier           type    object
           mtry           mtry nparam[?]
          trees          trees nparam[+]
          min_n          min_n nparam[+]
     tree_depth     tree_depth nparam[+]
     learn_rate     learn_rate nparam[+]
 loss_reduction loss_reduction nparam[+]
    sample_size    sample_size nparam[+]
      stop_iter      stop_iter nparam[+]

Model parameters needing finalization:
   # Randomly Selected Predictors ('mtry')

See `?dials::finalize` or `?dials::update.parameters` for more information.

Result! object열에서 nparam은 모수값이 수치형임을 의미한다. 또한, nparam[+]는 해당 모수의 범위가 명확하게 주어졌음을 의미하고, nparam[?]는 모수의 범위에서 상한 또는 하한의 값이 명확하지 않다는 것을 의미한다. 이러한 경우, 상한 또는 하한의 값을 명확하게 결정하여야 한다.

함수 extract_parameter_dials()를 이용하여 모수의 범위를 자세히 확인할 수 있다.

xgb.param %>%
  extract_parameter_dials("mtry")

# Randomly Selected Predictors (quantitative)
Range: [1, ?]

Result! mtry의 상한이 ?이므로 상한값을 결정하여야 한다.

함수 update()를 이용하여 직접 범위를 지정할 수 있다.

# 함수 update()를 이용한 수정NA## 전처리가 적용된 데이터의 예측변수 개수가 상한이 되도록 설정정
xgb.param %<>%
  update(mtry =  mtry(c(1L, 
                        ncol(select(juice(prep(rec)), -HeartDisease))      # juice(prep(rec)) : Recipe 적용 -> 전처리가 적용된 데이터셋 생성, ncol(select(., -Target)) : 전처리가 적용된 데이터의 예측변수 개수  ))) 

xgb.param %>%
  extract_parameter_dials("mtry")

# Randomly Selected Predictors (quantitative)
Range: [1, 13]

Result! mtry의 상한이 13으로 수정되었다.

3-5. 모형 적합

3-5-1. Resampling 정의

XGBoost의 최적의 모수 조합을 찾기 위해 Resampling 방법 중 하나인 K-Fold Cross-Validation을 사용한다.

set.seed(100)
train.fold    <- vfold_cv(HD.train, v = 5)

3-5-2. 최적의 모수 조합 찾기

최적의 모수 조합을 찾기 위해 Regular Grid, Latin Hypercube, Expand Grid를 사용한다.

3-5-2-1. Regular Grid

set.seed(100)
grid <-  xgb.param %>%                                                     
  grid_regular(levels = c(rep(1, 7), 2))
grid

# A tibble: 2 x 8
   mtry trees min_n tree_depth learn_rate loss_reduction sample_size
  <int> <int> <int>      <int>      <dbl>          <dbl>       <dbl>
1     1     1     2          1      0.001   0.0000000001         0.1
2     1     1     2          1      0.001   0.0000000001         0.1
# ... with 1 more variable: stop_iter <int>

Result! 각 모수별로 (1, 1, 1, 1, 1, 1, 1, 2)개씩 후보값을 랜덤하게 할당함으로써 총 2(1 \(\times\) 1 \(\times\) 1 \(\times\) 1 \(\times\) 1 \(\times\) 1 \(\times\) 1 \(\times\) 2)개의 후보 모수 조합을 생성하였다.

# 모형 적합NAset.seed(100)
xgb.tune.grid.fit <- xgb.tune.wflow %>%                                    # 3-3에서 정의의
  tune_grid(
    train.fold,                                                            # 3-5-1에서 정의 : Resampling -> 5-Cross-Validationn
    grid = grid,                                                           # 3-5-2-1에서 정의 : 후보 모수 집합     control = control_grid(save_pred = TRUE,                               # Resampling의 Assessment 결과 저장NA= "everything"),                  # 병렬 처리(http:://tune.tidymodels.org/reference/control_grid.html) ) 
    metrics = metric_set(roc_auc, accuracy)                                # Assessment 그룹에 대한 Assessment Measure
  )

# 그래프
autoplot(xgb.tune.grid.fit) + 
  scale_color_viridis_d(direction = -1) + 
  theme(legend.position = "top") +
  theme_bw()

# 지정된 Metric 측면에서 성능이 우수한 모형을 순서대로 확인NAshow_best(xgb.tune.grid.fit, metric = "roc_auc")                           # show_best(, "accuracy")

# A tibble: 2 x 14
   mtry trees min_n tree_depth learn_rate loss_reduction sample_size
  <int> <int> <int>      <int>      <dbl>          <dbl>       <dbl>
1     1     1     2          1      0.001   0.0000000001         0.1
2     1     1     2          1      0.001   0.0000000001         0.1
# ... with 7 more variables: stop_iter <int>, .metric <chr>,
#   .estimator <chr>, mean <dbl>, n <int>, std_err <dbl>,
#   .config <fct>

# 최적의 모수 조합 확인NAbest.xgb.grid <- xgb.tune.grid.fit %>% 
  select_best("roc_auc")
best.xgb.grid

# A tibble: 1 x 9
   mtry trees min_n tree_depth learn_rate loss_reduction sample_size
  <int> <int> <int>      <int>      <dbl>          <dbl>       <dbl>
1     1     1     2          1      0.001   0.0000000001         0.1
# ... with 2 more variables: stop_iter <int>, .config <fct>

Result! mtry = 1, trees = 1,min_n = 2, tree_depth = 1, learn_rate = 0.001, loss_reduction = 0.0000000001, sample_size = 0.1, stop_iter = 3일 때 “ROC AUC” 측면에서 가장 우수한 성능을 보여준다.

3-5-2-2. Latin Hypercube

set.seed(100)
random <- xgb.param %>%                                                    
  grid_latin_hypercube(size = 10)
random

# A tibble: 10 x 8
    mtry trees min_n tree_depth learn_rate loss_reduction sample_size
   <int> <int> <int>      <int>      <dbl>          <dbl>       <dbl>
 1     3   563     7          3    0.120         5.48e- 3       0.499
 2     3  1191    10         11    0.00116       3.62e+ 0       0.429
 3    10  1610    33          6    0.0836        6.89e- 7       0.938
 4    11   372    39         11    0.0442        3.22e- 9       0.262
 5     7   733    17         13    0.0201        5.19e- 8       0.680
 6     5  1933    25          7    0.00906       1.55e- 1       0.766
 7    12   881     2          9    0.00250       1.88e+ 0       0.628
 8     9  1550    32          4    0.197         5.35e- 4       0.370
 9     2    83    17         15    0.0154        9.41e- 6       0.899
10     6  1338    23          2    0.00508       5.19e-10       0.119
# ... with 1 more variable: stop_iter <int>

Result! 10개의 후보 모수 조합을 랜덤하게 생성하였다.

# 모형 적합NAset.seed(100)
xgb.tune.random.fit <- xgb.tune.wflow %>%                                  # 3-3에서 정의의
  tune_grid(
    train.fold,                                                            # 3-5-1에서 정의 : Resampling -> 5-Cross-Validationn
    grid = random,                                                         # 3-5-2-2에서 정의 : 후보 모수 집합     control = control_grid(save_pred = TRUE,                               # Resampling의 Assessment 결과 저장NA= "everything"),                  # 병렬 처리(http:://tune.tidymodels.org/reference/control_grid.html) ) 
    metrics = metric_set(roc_auc, accuracy)                                # Assessment 그룹에 대한 Assessment Measure
  )

# 그래프
autoplot(xgb.tune.random.fit) + 
  scale_color_viridis_d(direction = -1) + 
  theme(legend.position = "top") +
  theme_bw()

# 지정된 Metric 측면에서 성능이 우수한 모형을 순서대로 확인NAshow_best(xgb.tune.random.fit, "roc_auc")                                   # show_best(, "accuracy")

# A tibble: 5 x 14
   mtry trees min_n tree_depth learn_rate loss_reduction sample_size
  <int> <int> <int>      <int>      <dbl>          <dbl>       <dbl>
1     2    83    17         15    0.0154    0.00000941         0.899
2    12   881     2          9    0.00250   1.88               0.628
3     3   563     7          3    0.120     0.00548            0.499
4     3  1191    10         11    0.00116   3.62               0.429
5     7   733    17         13    0.0201    0.0000000519       0.680
# ... with 7 more variables: stop_iter <int>, .metric <chr>,
#   .estimator <chr>, mean <dbl>, n <int>, std_err <dbl>,
#   .config <fct>

# 최적의 모수 조합 확인NAbest.xgb.random <- xgb.tune.random.fit %>% 
  select_best("roc_auc")                                                    # select_best("accuracy")
best.xgb.random

# A tibble: 1 x 9
   mtry trees min_n tree_depth learn_rate loss_reduction sample_size
  <int> <int> <int>      <int>      <dbl>          <dbl>       <dbl>
1     2    83    17         15     0.0154     0.00000941       0.899
# ... with 2 more variables: stop_iter <int>, .config <fct>

Result! mtry = 12, trees = 881,min_n = 2, tree_depth = 9, learn_rate = 0.0025, loss_reduction = 1.88, sample_size = 0.628, stop_iter = 10일 때 “ROC AUC” 측면에서 가장 우수한 성능을 보여준다.

3-5-2-3. Expand Grid

Latin Hypercube 방법에서 최적의 모수 조합인 mtry = 12, trees = 881,min_n = 2, tree_depth = 9, learn_rate = 0.0025, loss_reduction = 1.88, sample_size = 0.628, stop_iter = 10을 기준으로 다양한 후보값을 생성한다.

egrid <- expand.grid(mtry           = 11:12,
                     trees          = 881,
                     min_n          = 2,
                     tree_depth     = 8:9,
                     learn_rate     = 0.0025,
                     loss_reduction = 1.88,
                     sample_size    = 0.628,
                     stop_iter      = 9:10)
egrid

  mtry trees min_n tree_depth learn_rate loss_reduction sample_size
1   11   881     2          8     0.0025           1.88       0.628
2   12   881     2          8     0.0025           1.88       0.628
3   11   881     2          9     0.0025           1.88       0.628
4   12   881     2          9     0.0025           1.88       0.628
5   11   881     2          8     0.0025           1.88       0.628
6   12   881     2          8     0.0025           1.88       0.628
7   11   881     2          9     0.0025           1.88       0.628
8   12   881     2          9     0.0025           1.88       0.628
  stop_iter
1         9
2         9
3         9
4         9
5        10
6        10
7        10
8        10

Result! 후보 모수값들의 집합이 생성되었다.

# 모형 적합NAset.seed(100)
xgb.tune.egrid.fit <- xgb.tune.wflow %>%                                   # 3-3에서 정의의
  tune_grid(
    train.fold,                                                            # 3-5-1에서 정의 : Resampling -> 5-Cross-Validationn
    grid = egrid,                                                          # 3-5-2-3에서 정의 : 후보 모수 집합     control = control_grid(save_pred = TRUE,                               # Resampling의 Assessment 결과 저장NA= "everything"),                  # 병렬 처리(http:://tune.tidymodels.org/reference/control_grid.html) ) 
    metrics = metric_set(roc_auc, accuracy)                                # Assessment 그룹에 대한 Assessment Measure
  )

# 그래프
autoplot(xgb.tune.egrid.fit) + 
  scale_color_viridis_d(direction = -1) + 
  theme(legend.position = "top") +
  theme_bw()

# Ref. https://juliasilge.com/blog/xgboost-tune-volleyball/
xgb.tune.egrid.fit %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  select(mean, mtry:stop_iter) %>%
  pivot_longer(mtry:stop_iter,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC") +
  theme_bw()

# 지정된 Metric 측면에서 성능이 우수한 모형을 순서대로 확인NAshow_best(xgb.tune.egrid.fit, "roc_auc")                                    # show_best(, "accuracy")

# A tibble: 5 x 14
   mtry trees min_n tree_depth learn_rate loss_reduction sample_size
  <int> <dbl> <dbl>      <int>      <dbl>          <dbl>       <dbl>
1    12   881     2          9     0.0025           1.88       0.628
2    12   881     2          8     0.0025           1.88       0.628
3    11   881     2          9     0.0025           1.88       0.628
4    11   881     2          9     0.0025           1.88       0.628
5    12   881     2          9     0.0025           1.88       0.628
# ... with 7 more variables: stop_iter <int>, .metric <chr>,
#   .estimator <chr>, mean <dbl>, n <int>, std_err <dbl>,
#   .config <fct>

# 최적의 모수 조합 확인NAbest.xgb.egrid <- xgb.tune.egrid.fit %>% 
  select_best("roc_auc")
best.xgb.egrid

# A tibble: 1 x 9
   mtry trees min_n tree_depth learn_rate loss_reduction sample_size
  <int> <dbl> <dbl>      <int>      <dbl>          <dbl>       <dbl>
1    12   881     2          9     0.0025           1.88       0.628
# ... with 2 more variables: stop_iter <int>, .config <fct>

Result! mtry = 12, trees = 881,min_n = 2, tree_depth = 9, learn_rate = 0.0025, loss_reduction = 1.88, sample_size = 0.628, stop_iter = 9일 때 “ROC AUC” 측면에서 가장 우수한 성능을 보여준다.

3-5-3. 최적의 모수 조합을 이용한 모형 적합

최적의 모수 조합 mtry = 12, trees = 881,min_n = 2, tree_depth = 9, learn_rate = 0.0025, loss_reduction = 1.88, sample_size = 0.628, stop_iter = 9를 이용하여 모형을 구축한다.
함수 finalize_workflow()을 이용하여 앞에서 정의한 “workflow(xgb.tune.wflow)”를 최적의 모수 조합을 가지는 “workflow”로 업데이트한다.

# Workflow에 최적의 모수값 업데이트final.xgb.wflow <- xgb.tune.wflow %>%                                      # 3-3에서 정의의
  finalize_workflow(best.xgb.egrid)                                        # finalize_workflow : 최적의 모수 조합을 가지는 workflow로 업데이트NAfinal.xgb.wflow

== Workflow ==========================================================
Preprocessor: Recipe
Model: boost_tree()

-- Preprocessor ------------------------------------------------------
2 Recipe Steps

* step_normalize()
* step_dummy()

-- Model -------------------------------------------------------------
Boosted Tree Model Specification (classification)

Main Arguments:
  mtry = 12
  trees = 881
  min_n = 2
  tree_depth = 9
  learn_rate = 0.0025
  loss_reduction = 1.88
  sample_size = 0.628
  stop_iter = 9

Computational engine: xgboost

Caution! 함수 last_fit()은 최적의 모수 조합에 대해 Training Data를 이용한 모형 적합과 Test Data에 대한 예측을 한 번에 수행할 수 있지만 seed 고정이 되지 않아 Reproducibility (재생산성)가 만족되지 않는다. 따라서, 모형 적합(함수 fit())과 예측(함수 augment())을 각각 수행하였다.

# 모형 적합NAset.seed(100)
final.xgb <- final.xgb.wflow %>% 
  fit(data = HD.train)
final.xgb

== Workflow [trained] ================================================
Preprocessor: Recipe
Model: boost_tree()

-- Preprocessor ------------------------------------------------------
2 Recipe Steps

* step_normalize()
* step_dummy()

-- Model -------------------------------------------------------------
##### xgb.Booster
raw: 18.8 Kb 
call:
  xgboost::xgb.train(params = list(eta = 0.0025, max_depth = 9L, 
    gamma = 1.88, colsample_bytree = 1, colsample_bynode = 0.923076923076923, 
    min_child_weight = 2, subsample = 0.628, objective = "binary:logistic"), 
    data = x$data, nrounds = 881, watchlist = x$watchlist, verbose = 0, 
    early_stopping_rounds = 9L, nthread = 1)
params (as set within xgb.train):
  eta = "0.0025", max_depth = "9", gamma = "1.88", colsample_bytree = "1", colsample_bynode = "0.923076923076923", min_child_weight = "2", subsample = "0.628", objective = "binary:logistic", nthread = "1", silent = "1"
xgb.attributes:
  best_iteration, best_msg, best_ntreelimit, best_score, niter
callbacks:
  cb.evaluation.log()
  cb.early.stop(stopping_rounds = early_stopping_rounds, maximize = maximize, 
    verbose = verbose)
# of features: 13 
niter: 14
best_iteration : 5 
best_ntreelimit : 5 
best_score : 0.138629 
nfeatures : 13 
evaluation_log:
    iter training_error
       1       0.190031
       2       0.179128
---                    
      13       0.152648
      14       0.151090

# 최종 모형NAfinal.xgb %>% 
  extract_fit_engine()

##### xgb.Booster
raw: 18.8 Kb 
call:
  xgboost::xgb.train(params = list(eta = 0.0025, max_depth = 9L, 
    gamma = 1.88, colsample_bytree = 1, colsample_bynode = 0.923076923076923, 
    min_child_weight = 2, subsample = 0.628, objective = "binary:logistic"), 
    data = x$data, nrounds = 881, watchlist = x$watchlist, verbose = 0, 
    early_stopping_rounds = 9L, nthread = 1)
params (as set within xgb.train):
  eta = "0.0025", max_depth = "9", gamma = "1.88", colsample_bytree = "1", colsample_bynode = "0.923076923076923", min_child_weight = "2", subsample = "0.628", objective = "binary:logistic", nthread = "1", silent = "1"
xgb.attributes:
  best_iteration, best_msg, best_ntreelimit, best_score, niter
callbacks:
  cb.evaluation.log()
  cb.early.stop(stopping_rounds = early_stopping_rounds, maximize = maximize, 
    verbose = verbose)
# of features: 13 
niter: 14
best_iteration : 5 
best_ntreelimit : 5 
best_score : 0.138629 
nfeatures : 13 
evaluation_log:
    iter training_error
       1       0.190031
       2       0.179128
---                    
      13       0.152648
      14       0.151090

Caution! mtry = 12로 설정했기 때문에 12(mtry)/13(총 예측변수 개수)의 값이 colsample_bynode = 0.923076923076923로 출력된다.

3-5-3-1. 변수 중요도

final.xgb %>% 
  extract_fit_engine() %>%
  vip::vip() +
  theme_bw()

3-6. 예측

xgb.pred <- augment(final.xgb, HD.test)  
xgb.pred

# A tibble: 276 x 13
     Age Sex   RestingBP Cholesterol FastingBS RestingECG MaxHR Angina
   <dbl> <fct>     <dbl>       <dbl>     <dbl> <fct>      <dbl> <fct> 
 1    54 M           110         208         0 Normal       142 N     
 2    37 M           140         207         0 Normal       130 Y     
 3    37 F           130         211         0 Normal       142 N     
 4    39 M           120         204         0 Normal       145 N     
 5    49 M           140         234         0 Normal       140 Y     
 6    42 F           115         211         0 ST           137 N     
 7    60 M           100         248         0 Normal       125 N     
 8    36 M           120         267         0 Normal       160 N     
 9    43 F           100         223         0 Normal       142 N     
10    36 M           130         209         0 Normal       178 N     
# ... with 266 more rows, and 5 more variables:
#   HeartPeakReading <dbl>, HeartDisease <fct>, .pred_class <fct>,
#   .pred_no <dbl>, .pred_yes <dbl>

3-7. 모형 평가

3-7-1. 평가 척도

conf_mat(xgb.pred, truth = HeartDisease, estimate = .pred_class)       # truth : 실제 클래스,  estimate : 예측 클래스 클래스

          Truth
Prediction  no yes
       no   98  27
       yes  25 126

conf_mat(xgb.pred, truth = HeartDisease, estimate = .pred_class) %>%
  autoplot(type = "mosaic")                                            # autoplot(type = "heatmap")

classification_metrics <- metric_set(accuracy, mcc, 
                                     f_meas, kap,
                                     sens, spec, roc_auc)              # Test Data에 대한 Assessment Measure
classification_metrics(xgb.pred, truth = HeartDisease,                 # truth : 실제 클래스,  estimate : 예측 클래스 클래스
                       estimate = .pred_class,
                       .pred_yes, event_level = "second")              # For roc_auc

# A tibble: 7 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.812
2 mcc      binary         0.619
3 f_meas   binary         0.829
4 kap      binary         0.619
5 sens     binary         0.824
6 spec     binary         0.797
7 roc_auc  binary         0.878

Caution! “ROC AUC”를 계산하기 위해서는 관심 클래스에 대한 예측 확률이 필요하다. 예제 데이터에서 관심 클래스는 “yes”이므로 “yes”에 대한 예측 확률 결과인 .pred_yes가 사용되었다. 또한, Target인 “HeartDisease” 변수의 유형을 “Factor” 변환하면 알파벳순으로 클래스를 부여하기 때문에 관심 클래스 “yes”가 두 번째 클래스가 된다. 따라서 옵션 event_level = "second"을 사용하여 관심 클래스가 “yes”임을 명시해주어야 한다.

3-7-2. 그래프

Caution! 함수 “roc_curve(), gain_curve(), lift_curve(), pr_curve()”에서는 첫번째 클래스(Level)를 관심 클래스로 인식한다. R에서는 함수 Factor()를 이용하여 변수 유형을 변환하면 알파벳순(영어) 또는 오름차순(숫자)으로 클래스를 부여하므로 “HeartDisease” 변수의 경우 “no”가 첫번째 클래스가 되고 “yes”가 두번째 클래스가 된다. 따라서, 예제 데이터에서 관심 클래스는 “yes”이기 때문에 옵션 event_level = "second"을 사용하여 관심 클래스가 “yes”임을 명시해주어야 한다.

3-7-2-1. ROC Curve

xgb.pred %>% 
  roc_curve(truth = HeartDisease, .pred_yes,                           # truth : 실제 클래스, 관심 클래스 예측 확률측 확률
            event_level = "second") %>%                                
  autoplot()

3-7-2-2. Gain Curve

xgb.pred %>% 
  gain_curve(truth = HeartDisease, .pred_yes,                          # truth : 실제 클래스, 관심 클래스 예측 확률측 확률
             event_level = "second") %>%                               
  autoplot()

3-7-2-3. Lift Curve

xgb.pred %>% 
  lift_curve(truth = HeartDisease, .pred_yes,                          # truth : 실제 클래스, 관심 클래스 예측 확률측 확률
             event_level = "second") %>%                               
  autoplot()

3-7-2-4. Precision Recall Curve

xgb.pred %>% 
  pr_curve(truth = HeartDisease, .pred_yes,                            # truth : 실제 클래스, 관심 클래스 예측 확률  확률 
           event_level = "second") %>%                                
  autoplot()

XGBoost based on Tidymodels