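The tidymodels package (version 0.2.0) is a collection of packages for doing machine learning in R according to tidyverse principles. Because it bundles most of the packages needed for modeling, every step from data preprocessing through visualization, modeling, and prediction can be carried out within a single tidy framework. It is also positioned as the successor to the caret package and is designed to make modeling code more consistent and intuitive.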

Source: https://cehs-research.github.io/PSY-6600_public/slides/ch0_getting_started_r.html#26

Source: https://rpubs.com/hoanganhngo610/553547

To illustrate machine learning with tidymodels, this tutorial uses the “Heart Disease Prediction” data set. It contains 918 patients and 10 variables (9 predictors plus the target) for predicting heart disease (source: package MLDataR, Gary Hutson 2021). The target is HeartDisease.

1. Loading the data

# install.packages("tidymodels")
pacman::p_load("MLDataR",                                              # For Data
               "data.table", "magrittr",
               "tidymodels",
               "doParallel", "parallel")

registerDoParallel(cores=detectCores())


data(heartdisease)
data <- heartdisease %>%
  mutate(HeartDisease = ifelse(HeartDisease==0, "no", "yes"))


cols <- c("Sex", "RestingECG", "Angina", "HeartDisease")

data   <- data %>% 
  mutate_at(cols, as.factor)                                           # convert the categorical variables to factors

glimpse(data)                                                          # inspect the data structure

Rows: 918
Columns: 10
$ Age              <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37,~
$ Sex              <fct> M, F, M, F, M, M, F, M, M, F, F, M, M, M, F~
$ RestingBP        <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140~
$ Cholesterol      <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207~
$ FastingBS        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ RestingECG       <fct> Normal, Normal, ST, Normal, Normal, Normal,~
$ MaxHR            <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130,~
$ Angina           <fct> N, N, N, Y, N, N, N, N, Y, N, N, Y, N, Y, N~
$ HeartPeakReading <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5~
$ HeartDisease     <fct> no, yes, no, yes, no, no, no, no, yes, no, ~

2. Splitting the data

tidymodels bundles several packages; data splitting uses the function initial_split() from the rsample package.

set.seed(100)                                                          # fix the seed
data.split <- initial_split(data, prop = 0.7, strata = HeartDisease)   # strata = variable used for stratified sampling
HD.train   <- training(data.split)
HD.test    <- testing(data.split)
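The idea behind a stratified split can be sketched in base R. This is only an illustration on hypothetical toy data (`toy`, `train_idx`, and the 60/40 class balance are assumptions), not the algorithm rsample uses internally: 70% of the rows are sampled separately within each class, so both parts keep the original class proportions.

```r
# Toy data (hypothetical): 100 rows with a 60/40 class balance.
set.seed(100)
toy <- data.frame(id = 1:100,
                  y  = factor(rep(c("no", "yes"), times = c(60, 40))))

# Sample 70% of the row ids separately within each class.
train_idx <- unlist(lapply(split(toy$id, toy$y), function(ids) {
  sample(ids, size = round(0.7 * length(ids)))
}))

toy_train <- toy[toy$id %in% train_idx, ]
toy_test  <- toy[!toy$id %in% train_idx, ]

# The class proportions are the same in both parts.
prop.table(table(toy_train$y))
prop.table(table(toy_test$y))
```

Without stratification, a small or unbalanced data set can end up with noticeably different class proportions in the training and test parts.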

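3. Preprocessing the data

Preprocessing uses the recipes package.
For an overview of the recipes workflow, see the recipes package documentation.
1. Gather the ingredients.
  - Use the function recipe().
2. Decide which recipe steps to apply to the ingredients.
  - Use the step_*() functions.
3. Prepare the recipe before cooking.
  - Use the function prep().
4. Cook with the prepared recipe.
  - Use the functions juice() and bake().

3-1. Defining the variables

Gather the ingredients. \(\rightarrow\) Declare the target and the predictors with the function recipe().

rec  <- recipe(HeartDisease ~ ., data = HD.train)                      # recipe(formula, data): HeartDisease is the target, all other variables are predictors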
rec

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          9

summary(rec)

# A tibble: 10 x 4
   variable         type    role      source  
   <chr>            <chr>   <chr>     <chr>   
 1 Age              numeric predictor original
 2 Sex              nominal predictor original
 3 RestingBP        numeric predictor original
 4 Cholesterol      numeric predictor original
 5 FastingBS        numeric predictor original
 6 RestingECG       nominal predictor original
 7 MaxHR            numeric predictor original
 8 Angina           nominal predictor original
 9 HeartPeakReading numeric predictor original
10 HeartDisease     nominal outcome   original

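3-2. Defining the preprocessing steps

Decide which recipe steps to apply to the ingredients. \(\rightarrow\) Define which preprocessing steps will be carried out.
For the preprocessing functions available for each kind of variable, see the recipes function reference.

rec1 <- rec %>%
  step_normalize(all_numeric_predictors()) %>%                         # standardize all numeric predictors
  step_dummy(all_nominal_predictors(), one_hot = TRUE)                 # one-hot encode all nominal predictors

rec1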

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          9

Operations:

Centering and scaling for all_numeric_predictors()
Dummy variables from all_nominal_predictors()

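3-3. Estimating the preprocessing

Prepare the recipe before cooking. \(\rightarrow\) Compute or estimate whatever is needed to apply the preprocessing defined above to data.
- For example, the mean and standard deviation of the Training Data are computed and stored for standardization.
Calling prep(, retain = TRUE) also keeps (stores) the preprocessed Training Data.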

rec.prepped   <- prep(rec1, training = HD.train, retain = TRUE)
rec.prepped

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          9

Training data contained 642 data points and no missing data.

Operations:

Centering and scaling for Age, RestingBP, Cholesterol, FastingB... [trained]
Dummy variables from Sex, RestingECG, Angina [trained]
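What "estimating" a standardization step means can be sketched in base R. The values below are hypothetical toy numbers and the variable names (`train_x`, `test_x`, `mu`, `sigma`) are assumptions made for illustration; the point is only that the statistics come from the training data and are then reused unchanged for any other data.

```r
# Toy numeric predictor (hypothetical values).
train_x <- c(40, 49, 37, 48, 54)
test_x  <- c(39, 45)

# "prep" stage: estimate the statistics from the training data only.
mu    <- mean(train_x)
sigma <- sd(train_x)

# "bake" stage: apply the stored statistics to any data set.
train_z <- (train_x - mu) / sigma
test_z  <- (test_x  - mu) / sigma

mean(train_z)   # approximately 0: the training data is centred
sd(train_z)     # approximately 1: and scaled to unit standard deviation
mean(test_z)    # generally not 0, because mu comes from the training data
```

Reusing the training statistics for the test data keeps information from the test set out of the preprocessing.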

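3-4. Generating the preprocessed data

Cook with the prepared recipe. \(\rightarrow\) Generate the result of applying the preprocessing.
- The functions juice() and bake() are used here.
  - juice() can be applied only to the Training Data, whereas bake() is used to apply the recipe to new data.
Because the preprocessed Training Data was stored with the option retain = TRUE when the recipe was prepped, juice() returns the preprocessed Training Data.
- This is equivalent to bake(prep(), Training Data). In recent versions of recipes, juice() is superseded by bake(rec.prepped, new_data = NULL).
For the Test Data, bake() applies the preprocessing estimated from the Training Data and returns the preprocessed Test Data.

train.prepped <- juice(rec.prepped)                                    # preprocessed Training Data
test.prepped  <- bake(rec.prepped, HD.test)                            # preprocessed Test Data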

glimpse(train.prepped)

Rows: 642
Columns: 14
$ Age               <dbl> -1.47736664, -1.79906343, 0.02388507, -1.5~
$ RestingBP         <dbl> 0.4331599, -0.1020377, 0.9683574, -0.63723~
$ Cholesterol       <dbl> 0.802221831, 0.746860505, -0.065105609, 1.~
$ FastingBS         <dbl> -0.5445247, -0.5445247, -0.5445247, -0.544~
$ MaxHR             <dbl> 1.36842135, -1.54125781, -0.59757808, 1.28~
$ HeartPeakReading  <dbl> -0.84518925, -0.84518925, -0.84518925, -0.~
$ HeartDisease      <fct> no, no, no, no, no, no, no, no, no, no, no~
$ Sex_F             <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, ~
$ Sex_M             <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, ~
$ RestingECG_LVH    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ RestingECG_Normal <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, ~
$ RestingECG_ST     <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
$ Angina_N          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, ~
$ Angina_Y          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~

glimpse(test.prepped)

Rows: 276
Columns: 14
$ Age               <dbl> 0.02388507, -1.79906343, -1.79906343, -1.5~
$ RestingBP         <dbl> -1.1724328, 0.4331599, -0.1020377, -0.6372~
$ Cholesterol       <dbl> 0.05484393, 0.04561704, 0.08252459, 0.0179~
$ FastingBS         <dbl> -0.5445247, -0.5445247, -0.5445247, -0.544~
$ MaxHR             <dbl> 0.188821690, -0.283018173, 0.188821690, 0.~
$ HeartPeakReading  <dbl> -0.84518925, 0.53667365, -0.84518925, -0.8~
$ HeartDisease      <fct> no, yes, no, no, yes, no, yes, yes, no, no~
$ Sex_F             <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, ~
$ Sex_M             <dbl> 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, ~
$ RestingECG_LVH    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ RestingECG_Normal <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, ~
$ RestingECG_ST     <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, ~
$ Angina_N          <dbl> 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, ~
$ Angina_Y          <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, ~

4. Fitting a model

Model fitting uses the parsnip package.

4-1. Defining the model

Before a model is fitted, it has to be defined.
- A model definition needs a model type, a mode (set_mode), and an engine (set_engine).
  - Model type : the machine learning function to use
    - For example, a Random Forest uses the function rand_forest().
  - Mode : the kind of target
    - Choose either classification or regression.
  - Engine : the package that does the fitting
    - For a Random Forest, the randomForest, ranger, and spark engines are available.
For the models available in tidymodels, see the tidymodels documentation.

# Model definition
rf.mod <- rand_forest(mtry = 12, trees = 100) %>%                      # model type
  set_mode("classification") %>%                                       # kind of target (classification / regression)
  set_engine("randomForest",                                           # engine (randomForest / ranger / spark)
             importance = TRUE)                                        # option passed on to the randomForest function

# The same model type, mode, and engine in a single call
# rf.mod <- rand_forest(mode = "classification", engine = "randomForest", mtry = 12, trees = 100)

rf.mod %>%
  translate()

Random Forest Model Specification (classification)

Main Arguments:
  mtry = 12
  trees = 100

Engine-Specific Arguments:
  importance = TRUE

Computational engine: randomForest 

Model fit template:
randomForest::randomForest(x = missing_arg(), y = missing_arg(), 
    mtry = min_cols(~12, x), ntree = 100, importance = TRUE)

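Caution! The function translate() shows how the “rf.mod” defined above maps onto the function randomForest() of the randomForest package.

The model parameters can be checked with ?rand_forest.
A Random Forest has the parameters mtry, trees, and min_n.
- mtry : the number of candidate predictors sampled at random at each split
- trees : the number of trees to grow
- min_n : the minimum number of data points a node must contain to be split further

?rand_forest                                                           # shows which parameters the model function accepts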

4-2. Fitting the model

set.seed(100)
rf.fit <- rf.mod %>%
  fit(HeartDisease ~ ., data = train.prepped)                         # fit(model_spec, formula, data); fit_xy(x = , y = ) is the non-formula counterpart
rf.fit

parsnip model object


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~100, mtry = min_cols(~12,      x), importance = ~TRUE) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 12

        OOB estimate of  error rate: 19.63%
Confusion matrix:
     no yes class.error
no  226  61   0.2125436
yes  65 290   0.1830986

5. Prediction

Prediction and model evaluation use the fitted object rf.fit obtained in 4-2.

5-1. Predicted classes

rf.pred.class <- predict(rf.fit, test.prepped)
rf.pred.class

# A tibble: 276 x 1
   .pred_class
   <fct>      
 1 no         
 2 yes        
 3 no         
 4 no         
 5 yes        
 6 no         
 7 yes        
 8 no         
 9 no         
10 no         
# ... with 266 more rows

5-2. Predicted probabilities

rf.pred.prob <- predict(rf.fit, test.prepped, type = "prob")
rf.pred.prob

# A tibble: 276 x 2
   .pred_no .pred_yes
      <dbl>     <dbl>
 1     0.91      0.09
 2     0.3       0.7 
 3     0.99      0.01
 4     0.95      0.05
 5     0.29      0.71
 6     0.97      0.03
 7     0.45      0.55
 8     0.68      0.32
 9     0.99      0.01
10     0.95      0.05
# ... with 266 more rows

5-3. All prediction results at once

The function augment() returns the predicted classes and the predicted probabilities together with the data in one call.

rf.pred <- augment(rf.fit, test.prepped)
rf.pred

# A tibble: 276 x 17
       Age RestingBP Cholesterol FastingBS    MaxHR HeartPeakReading
     <dbl>     <dbl>       <dbl>     <dbl>    <dbl>            <dbl>
 1  0.0239    -1.17       0.0548    -0.545  0.189            -0.845 
 2 -1.80       0.433      0.0456    -0.545 -0.283             0.537 
 3 -1.80      -0.102      0.0825    -0.545  0.189            -0.845 
 4 -1.58      -0.637      0.0179    -0.545  0.307            -0.845 
 5 -0.512      0.433      0.295     -0.545  0.110             0.0761
 6 -1.26      -0.905      0.0825    -0.545 -0.00778          -0.845 
 7  0.667     -1.71       0.424     -0.545 -0.480             0.0761
 8 -1.91      -0.637      0.599     -0.545  0.897             1.92  
 9 -1.16      -1.71       0.193     -0.545  0.189            -0.845 
10 -1.91      -0.102      0.0641    -0.545  1.60             -0.845 
# ... with 266 more rows, and 11 more variables: HeartDisease <fct>,
#   Sex_F <dbl>, Sex_M <dbl>, RestingECG_LVH <dbl>,
#   RestingECG_Normal <dbl>, RestingECG_ST <dbl>, Angina_N <dbl>,
#   Angina_Y <dbl>, .pred_class <fct>, .pred_no <dbl>,
#   .pred_yes <dbl>

rf.pred %>%
  select(contains(".pred"))                                      # keep only the prediction columns

# A tibble: 276 x 3
   .pred_class .pred_no .pred_yes
   <fct>          <dbl>     <dbl>
 1 no              0.91      0.09
 2 yes             0.3       0.7 
 3 no              0.99      0.01
 4 no              0.95      0.05
 5 yes             0.29      0.71
 6 no              0.97      0.03
 7 yes             0.45      0.55
 8 no              0.68      0.32
 9 no              0.99      0.01
10 no              0.95      0.05
# ... with 266 more rows

6. Model evaluation

Source: https://velog.io/@hajeongjj/Eval-Metrics

For the metrics available for classification and regression, and for detailed examples, see the yardstick documentation.

6-1. Metrics

6-1-1. Confusion Matrix

conf_mat(rf.pred, truth = HeartDisease, estimate = .pred_class)        # truth: actual class, estimate: predicted class

          Truth
Prediction  no yes
       no   97  28
       yes  26 125

conf_mat(rf.pred, truth = HeartDisease, estimate = .pred_class) %>%
  autoplot(type = "mosaic")                                            # autoplot(type = "heatmap")

6-1-2. Accuracy

accuracy(rf.pred, truth = HeartDisease, estimate = .pred_class)        # truth: actual class, estimate: predicted class

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.804

6-1-3. Several metrics at once

The function metric_set() combines several metrics into a single call. See the yardstick documentation for the details of its arguments.

classification_metrics <- metric_set(accuracy, mcc, 
                                     f_meas, kap,
                                     sens, spec, roc_auc)              # assessment measures for the Test Data
classification_metrics(rf.pred, 
                       truth = HeartDisease, estimate = .pred_class,   # truth: actual class, estimate: predicted class
                       .pred_yes, event_level = "second")              # for roc_auc

# A tibble: 7 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.804
2 mcc      binary         0.605
3 f_meas   binary         0.822
4 kap      binary         0.605
5 sens     binary         0.817
6 spec     binary         0.789
7 roc_auc  binary         0.869

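Caution! Computing “ROC AUC” requires the predicted probability of the class of interest. In the example data the class of interest is “yes”, so its predicted probability, .pred_yes, is passed. In addition, when the target “HeartDisease” is converted to a factor, the levels are ordered alphabetically, which makes “yes” the second level. The option event_level = "second" is therefore needed to declare “yes” as the class of interest.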
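The class-based metrics in the table above can be checked by hand from the confusion matrix in 6-1-1. The sketch below uses the standard textbook formulas with “yes” as the event; the variable names (`TP`, `FN`, `FP`, `TN`) are introduced here for illustration only.

```r
# Counts taken from the confusion matrix in 6-1-1, with "yes" as the event.
TP <- 125  # predicted yes, truth yes
FN <- 28   # predicted no,  truth yes
FP <- 26   # predicted yes, truth no
TN <- 97   # predicted no,  truth no

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
f_measure   <- 2 * precision * sensitivity / (precision + sensitivity)
mcc         <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

round(c(accuracy = accuracy, sens = sensitivity, spec = specificity,
        f_meas = f_measure, mcc = mcc), 3)
```

The rounded values agree with the accuracy, sens, spec, f_meas, and mcc rows of the metric_set() output (0.804, 0.817, 0.789, 0.822, 0.605). If “no” were treated as the event instead, sensitivity and specificity would swap.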

6-2. Plots

Caution! The functions roc_curve(), gain_curve(), lift_curve(), and pr_curve() treat the first level as the class of interest by default. In R, factor() orders the levels alphabetically (characters) or in ascending order (numbers), so for “HeartDisease” the first level is “no” and the second is “yes”. Because the class of interest in the example data is “yes”, the option event_level = "second" is needed to declare it.
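The level ordering described here is easy to verify in base R. The vector `y` below is a hypothetical toy example; it also shows an alternative to event_level = "second", namely setting the level order explicitly when the factor is created.

```r
# A character vector in which "yes" happens to appear first.
y <- c("yes", "no", "yes", "no")

# factor() sorts the unique values, so "no" becomes the first level.
f <- factor(y)
levels(f)

# Alternative: make "yes" the first level explicitly.
f2 <- factor(y, levels = c("yes", "no"))
levels(f2)
```

With the second ordering the default first-level convention would already treat “yes” as the class of interest, but the columns of any confusion matrix would then be ordered yes/no rather than no/yes.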

6-2-1. ROC Curve

rf.pred %>% 
  roc_curve(truth = HeartDisease, .pred_yes,                           # truth: actual class, .pred_yes: predicted probability of the class of interest
            event_level = "second") %>%                                
  autoplot()

6-2-2. Gain Curve

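The Gain Curve plots, for each quantile of the data, the share of all cases of the class of interest that has been found up to that quantile.

rf.pred %>% 
  gain_curve(truth = HeartDisease, .pred_yes,                          # truth: actual class, .pred_yes: predicted probability of the class of interest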
             event_level = "second") %>%                               
  autoplot()

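Caution! The observations are sorted by the predicted probability of the class of interest in descending order before the curve is drawn. See the yardstick documentation for details on gain_curve().
Result! The x-axis, %Tested, is the quantile of the Test Data; the y-axis, %Found, is the share of the class of interest found up to that quantile (the number of cases of the class of interest up to that quantile / the number of such cases in the Test Data). The upper edge of the gray triangular region is the ‘Perfect’ Gain Curve, that is, the curve of a model that ranks every case of the class of interest ahead of all other cases.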

6-2-3. Lift Curve

The Lift Curve plots the response rate up to each quantile relative to the overall response rate.

rf.pred %>% 
  lift_curve(truth = HeartDisease, .pred_yes,                          # truth: actual class, .pred_yes: predicted probability of the class of interest
             event_level = "second") %>%                               
  autoplot()

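Caution! The observations are sorted by the predicted probability of the class of interest in descending order before the curve is drawn. See the yardstick documentation for details on lift_curve().
Result! The x-axis, %Tested, is the quantile of the Test Data; the y-axis, Lift, is the y-axis value of gain_curve(), %Found, divided by the x-axis value, %Tested (Lift = %Found / %Tested). It expresses the share of the class of interest up to that quantile relative to its overall share.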
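The relationship between the gain and lift values can be traced on a small example. The sketch below uses hypothetical labels and probabilities, and the names `tested`, `found`, and `lift` are chosen here for illustration; they are not the column names that yardstick returns.

```r
# Toy test set (hypothetical): true labels and predicted probabilities of "yes".
truth <- c("yes", "yes", "no", "yes", "no", "no", "yes", "no", "no", "no")
prob  <- c(0.95, 0.90, 0.80, 0.75, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10)

# Sort by predicted probability in descending order.
ord    <- order(prob, decreasing = TRUE)
is_yes <- truth[ord] == "yes"

tested <- seq_along(is_yes) / length(is_yes)   # x-axis: share of the data tested
found  <- cumsum(is_yes) / sum(is_yes)         # gain: share of all "yes" cases found
lift   <- found / tested                       # lift: gain divided by share tested

data.frame(pct_tested = 100 * tested,
           pct_found  = 100 * found,
           lift       = round(lift, 2))
```

Both curves end at the same point for any model: once all data has been tested, every case has been found and the lift is 1.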

6-2-4. Precision Recall Curve

rf.pred %>% 
  pr_curve(truth = HeartDisease, .pred_yes,                            # truth: actual class, .pred_yes: predicted probability of the class of interest
           event_level = "second") %>%                                 
  autoplot()

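7. Resampling methods

Resampling methods repeatedly use one part of the Training Data to build a model and another part to evaluate it. They are used to measure model performance and, in parameter tuning, to find the best parameter combination.

Source: https://www.tmwr.org/resampling.html

As in the figure from the source above, the Training Data is split into an Analysis group and an Assessment group.
- Analysis group : the data set used to build the model
- Assessment group : the data set used to evaluate the built model
If resampling produces 20 models, it also produces 20 values of the evaluation metric.
In that case, the final value of the metric is the average of the 20 values.
- Averaging over the resamples gives a more stable estimate of how well the model generalizes than a single split does.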

7-1. K-Fold Cross-Validation

K-Fold Cross-Validation splits the Training Data into K folds, builds a model on K-1 folds (the Analysis group), and evaluates it on the remaining fold (the Assessment group).
This is repeated K times, so K values of the evaluation metric are obtained.

set.seed(100)
vfold_cv(data, v)

data : (Training) data as a data frame
v : number of folds
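The fold logic can be sketched in base R. This is a simplified illustration with hypothetical sizes (`n`, `k`), not what vfold_cv() does internally: every row is assigned to exactly one fold, and each fold serves once as the Assessment group.

```r
set.seed(100)
n <- 20   # number of training rows (hypothetical)
k <- 5    # number of folds

# Randomly assign every row to exactly one fold.
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold, the rows in that fold form the Assessment group
# and all remaining rows form the Analysis group.
folds <- lapply(seq_len(k), function(i) {
  list(analysis   = which(fold_id != i),
       assessment = which(fold_id == i))
})

sapply(folds, function(f) length(f$analysis))     # rows in each Analysis group
sapply(folds, function(f) length(f$assessment))   # rows in each Assessment group
```

Because each row appears in exactly one Assessment group, every observation is used for evaluation exactly once across the K models.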

7-2. Repeated K-Fold Cross-Validation

This resampling method repeats K-Fold Cross-Validation several times.

vfold_cv(data, v, repeats)

data : (Training) data as a data frame
v : number of folds
repeats : number of times K-Fold Cross-Validation is repeated

7-3. Leave-One-Out Cross-Validation

The model is evaluated on a single data point of the Training Data and built on all remaining data points.
This is repeated once per data point, so there are as many metric values as data points.

loo_cv(data)

data : (Training) data as a data frame

7-4. Monte Carlo Cross-Validation

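The Training Data is split at random into an Analysis group and an Assessment group, and this random split is repeated.
As a result, the same data point can be selected for the Assessment group more than once.

mc_cv(data, prop, times)

data : (Training) data as a data frame
prop : proportion of the data assigned to the Analysis group
times : number of resamples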

7-5. Validation Set

The full data is split into Test Data and Not Test Data, and the Not Test Data is split further into Training Data and Validation Data.
- Training Data : the data set used to build the model
- Validation Data : the data set used to evaluate the built model

validation_split(data, prop)

data : (Not Test) data as a data frame
prop : proportion of the data assigned to the Training Data

7-6. Bootstrapping

An Analysis group of the same size as the Training Data is drawn at random from the Training Data with replacement.
- Because sampling is with replacement, a data point can appear in the Analysis group more than once.
Data points that are not drawn into the Analysis group form the Assessment group and are used to evaluate the built model.

bootstraps(data, times)

data : (Training) data as a data frame
times : number of resamples
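One bootstrap resample can be sketched in base R as follows. The size `n` is hypothetical and the code illustrates the idea only, not the internals of bootstraps().

```r
set.seed(100)
n <- 20   # number of training rows (hypothetical)

# Analysis group: n row indices drawn with replacement.
analysis <- sample(seq_len(n), size = n, replace = TRUE)

# Assessment group: rows that were never drawn ("out-of-bag" rows).
assessment <- setdiff(seq_len(n), analysis)

length(analysis)      # same size as the training data
table(analysis)       # counts above 1 mark rows drawn more than once
length(assessment)    # varies from resample to resample
```

Unlike K-Fold Cross-Validation, the size of the Assessment group is not fixed, because it depends on which rows happen to be left out of each draw.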

7-7. Fitting a model with resampling

As an example, the most widely used of the methods above, K-Fold Cross-Validation, is applied.

Caution! The Assessment-group metric values computed here differ from those obtained with a Workflow (tidymodels_Ver.2). Consequently, the best parameter combination found in tuning may also differ from the one found with a Workflow.

7-7-1. Defining the resampling method

set.seed(100)                                                          # fix the seed
train.fold <- vfold_cv(train.prepped, v = 5)                         
train.fold

#  5-fold cross-validation 
# A tibble: 5 x 2
  splits            id   
  <list>            <chr>
1 <split [513/129]> Fold1
2 <split [513/129]> Fold2
3 <split [514/128]> Fold3
4 <split [514/128]> Fold4
5 <split [514/128]> Fold5

Caution! The data is first split into 5 folds. The model is then built on the Analysis group (the 513 or 514 patients in 4 folds) and evaluated on the Assessment group (the 129 or 128 patients in the remaining fold).

7-7-2. Fitting the model

The function fit_resamples() fits the model to each resample and computes the evaluation metrics.

set.seed(100)
rf.fit.rs <- fit_resamples(object = rf.mod,                            # defined in 4-1
                           preprocessor = HeartDisease ~ .,            # preprocessor: formula or recipe
                           resamples = train.fold)                     # defined in 7-7-1: the resampling scheme

rf.fit.rs

# Resampling results
# 5-fold cross-validation 
# A tibble: 5 x 4
  splits            id    .metrics         .notes          
  <list>            <chr> <list>           <list>          
1 <split [513/129]> Fold1 <tibble [2 x 4]> <tibble [0 x 3]>
2 <split [513/129]> Fold2 <tibble [2 x 4]> <tibble [0 x 3]>
3 <split [514/128]> Fold3 <tibble [2 x 4]> <tibble [0 x 3]>
4 <split [514/128]> Fold4 <tibble [2 x 4]> <tibble [0 x 3]>
5 <split [514/128]> Fold5 <tibble [2 x 4]> <tibble [0 x 3]>

Result! The .metrics column holds the metric values for the Assessment groups, and the .notes column holds any errors or warnings raised during resampling.

# Average assessment measures over the Assessment groups
collect_metrics(rf.fit.rs)

# A tibble: 2 x 6
  .metric  .estimator  mean     n std_err .config             
  <chr>    <chr>      <dbl> <int>   <dbl> <fct>               
1 accuracy binary     0.812     5 0.00282 Preprocessor1_Model1
2 roc_auc  binary     0.865     5 0.00348 Preprocessor1_Model1

# Assessment measures for each Assessment group
collect_metrics(rf.fit.rs, summarize = FALSE)

# A tibble: 10 x 5
   id    .metric  .estimator .estimate .config             
   <chr> <chr>    <chr>          <dbl> <fct>               
 1 Fold1 accuracy binary         0.806 Preprocessor1_Model1
 2 Fold1 roc_auc  binary         0.867 Preprocessor1_Model1
 3 Fold2 accuracy binary         0.814 Preprocessor1_Model1
 4 Fold2 roc_auc  binary         0.875 Preprocessor1_Model1
 5 Fold3 accuracy binary         0.805 Preprocessor1_Model1
 6 Fold3 roc_auc  binary         0.869 Preprocessor1_Model1
 7 Fold4 accuracy binary         0.812 Preprocessor1_Model1
 8 Fold4 roc_auc  binary         0.855 Preprocessor1_Model1
 9 Fold5 accuracy binary         0.820 Preprocessor1_Model1
10 Fold5 roc_auc  binary         0.860 Preprocessor1_Model1

Caution! Calling collect_metrics(, summarize = FALSE) returns the metric values for each Assessment group separately.
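The summarized values can be approximated from the per-fold output above. The sketch below assumes that the summary is the plain mean of the fold values and that std_err is the standard error of that mean, sd / sqrt(n); the per-fold accuracies are copied from the printed table, so they are rounded to three digits.

```r
# Per-fold accuracy values copied from the output above (rounded to 3 digits).
acc <- c(Fold1 = 0.806, Fold2 = 0.814, Fold3 = 0.805, Fold4 = 0.812, Fold5 = 0.820)

n        <- length(acc)
mean_acc <- mean(acc)
std_err  <- sd(acc) / sqrt(n)   # standard error of the mean

c(mean = mean_acc, std_err = std_err)
```

The results are close to the reported mean of 0.812 and std_err of 0.00282; the small differences come from the rounding of the printed fold values.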

7-7-3. Results for each fold

The function control_resamples() can be used to keep the predictions for the Assessment groups.

re.control <- control_resamples(save_pred = TRUE,                      # keep the Assessment-group predictions
                                parallel_over = "everything")          # parallel processing (https://tune.tidymodels.org/reference/control_grid.html)
set.seed(100)
fit_resamples(object = rf.mod,                                         # defined in 4-1
              preprocessor = HeartDisease ~ .,                         # preprocessor: formula or recipe
              resamples = train.fold,                                  # defined in 7-7-1: the resampling scheme
              control = re.control) %>%
  collect_predictions()

# A tibble: 642 x 7
   id    .pred_no .pred_yes  .row .pred_class HeartDisease .config    
   <chr>    <dbl>     <dbl> <int> <fct>       <fct>        <fct>      
 1 Fold1     0.89      0.11     5 no          no           Preprocess~
 2 Fold1     1         0       11 no          no           Preprocess~
 3 Fold1     0.66      0.34    13 no          no           Preprocess~
 4 Fold1     0.55      0.45    14 no          no           Preprocess~
 5 Fold1     1         0       22 no          no           Preprocess~
 6 Fold1     0.98      0.02    28 no          no           Preprocess~
 7 Fold1     0.86      0.14    33 no          no           Preprocess~
 8 Fold1     0.81      0.19    39 no          no           Preprocess~
 9 Fold1     0.87      0.13    46 no          no           Preprocess~
10 Fold1     0.92      0.08    52 no          no           Preprocess~
# ... with 632 more rows

Caution! The function collect_predictions() returns the predictions for each Assessment group.

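8. Parameter tuning

A model's parameters have to be specified before it can be built.
To find the best parameter combination, resampling is applied to a set of candidate combinations and the model is evaluated for each of them.

8-1. Regular Grid

Use the function grid_regular() from the dials package.
Candidate values are specified per parameter and then combined.
- For example, for a Random Forest with mtry = 2, 4, trees = 100, 200, and min_n = 5, 6, the candidate values are first specified per parameter and then the combinations {(mtry, trees, min_n) : (2, 100, 5), (2, 100, 6), (2, 200, 5), (2, 200, 6), (4, 100, 5), (4, 100, 6), (4, 200, 5), (4, 200, 6)} are generated.

grid_regular(x, levels)

x : a param object, a list, or parameters
levels : number of candidate values per parameter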
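The eight combinations in the Random Forest example can be enumerated with base R's expand.grid(). Note that this is only a way to see the full-factorial structure: grid_regular() itself takes parameter objects with ranges and a number of levels rather than explicit candidate values.

```r
# Candidate values for each parameter, as in the example above.
# expand.grid() varies its first argument fastest, so min_n is listed first.
grid <- expand.grid(min_n = c(5, 6),
                    trees = c(100, 200),
                    mtry  = c(2, 4))

grid <- grid[, c("mtry", "trees", "min_n")]   # reorder the columns for readability
nrow(grid)                                    # 2 * 2 * 2 combinations
grid
```

The number of combinations grows multiplicatively with the number of parameters and levels, which is the main practical drawback of a regular grid.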

8-2. Irregular Grid

8-2-1. Random Grid

Candidate parameter combinations are generated at random.

grid_random(x, size)

x : a param object, a list, or parameters
size : number of candidate parameter combinations

8-2-2. Latin Hypercube

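Like a Random Grid, a Latin Hypercube generates candidate parameter combinations at random, but it spreads them over the parameter space so that candidates are unlikely to fall in overlapping regions. The same number of candidates therefore covers the space more evenly than a Random Grid.

grid_latin_hypercube(x, size)

x : a param object, a list, or parameters
size : number of candidate parameter combinations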
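The basic Latin Hypercube idea can be sketched in base R. This is a simplified version, not the design that dials produces: the helper `lhs_1d()` is hypothetical, and the ranges 1 to 13 for mtry and 1 to 2000 for trees are assumptions made for illustration. Each parameter range is cut into `size` equal strata, one value is drawn per stratum, and the strata are paired at random across parameters.

```r
set.seed(100)
size <- 5   # number of candidate combinations (hypothetical)

# Draw one value in each of `size` equal-width strata of [lower, upper],
# then shuffle the order so strata are paired at random across parameters.
lhs_1d <- function(size, lower, upper) {
  u <- (seq_len(size) - 1 + runif(size)) / size   # one point per stratum of [0, 1]
  sample(lower + u * (upper - lower))
}

grid <- data.frame(mtry  = round(lhs_1d(size, 1, 13)),
                   trees = round(lhs_1d(size, 1, 2000)))
grid
```

A purely random grid can place several candidates close together by chance; the stratification guarantees that every part of each parameter's range is represented.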

8-3. Expand Grid

Candidate values are assigned directly for each parameter.

expand.grid(...)

... : parameter names with their candidate values, given as vectors or factors

8-4. Fitting a model with parameter tuning

Candidate parameter combinations are generated at random with a Latin Hypercube, and each combination is evaluated with K-Fold Cross-Validation.

8-4-1. Defining the model

Parameters to be tuned are marked with the function tune().

# Model definition
rf.tune.mod <- rand_forest(mtry = tune(), trees = tune()) %>%          # mark the parameters to tune with tune()
  set_mode("classification") %>%                                         
  set_engine("randomForest")

8-4-2. Checking the parameter ranges

The function extract_parameter_set_dials() shows information about the parameters.

rf.param    <- extract_parameter_set_dials(rf.tune.mod)               
rf.param

Collection of 2 parameters for tuning

 identifier  type    object
       mtry  mtry nparam[?]
      trees trees nparam[+]

Model parameters needing finalization:
   # Randomly Selected Predictors ('mtry')

See `?dials::finalize` or `?dials::update.parameters` for more information.

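Result! In the object column, nparam indicates that the parameter is numeric. nparam[+] means that the parameter's range is fully specified, whereas nparam[?] means that the lower or upper bound of the range is not yet known. In that case the missing bound has to be set explicitly.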

rf.param %>%
  extract_parameter_dials("mtry")

# Randomly Selected Predictors (quantitative)
Range: [1, ?]

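Caution! The function extract_parameter_dials() shows a parameter's range in detail.
Result! The upper bound of mtry is ?, so it has to be set.

8-4-3. Adjusting the parameter range

There are two ways to adjust a parameter's range.
1. Assign the lower and upper bounds directly with the function update().
  - update(parameter name = parameter function(c(lower, upper)))
2. Let the function finalize() determine the bounds from the data set.

# 1. Assigning the bounds directly
## Set the upper bound to the number of predictors in the preprocessed data
rf.param %<>%
  update(mtry =  mtry(c(1, ncol(train.prepped)-1)))                    

# 2. Determining the bounds from the data set
# rf.param %<>%
#   finalize(train.prepped)                                               # upper bound becomes 14, the number of columns in train.prepped (the target column is counted too)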

rf.param %>%
  extract_parameter_dials("mtry")

# Randomly Selected Predictors (quantitative)
Range: [1, 13]

Result! The upper bound of mtry is now 13.

8-4-4. Evaluating the candidate parameter combinations

The function tune_grid() evaluates the model for each candidate parameter combination.
As an example, 5-fold cross-validation is applied to each candidate combination.

set.seed(100)
train.fold <- vfold_cv(train.prepped, v = 5)                           # 5-fold cross-validation

set.seed(100)
rf.tune.fit <- tune_grid(object = rf.tune.mod,                         # defined in 8-4-1
                         preprocessor = HeartDisease ~.,               # preprocessor: formula or recipe
                         resamples = train.fold,
                         grid = rf.param %>%                           # candidate parameter combinations
                           grid_latin_hypercube(size = 10),
                         control = control_grid(save_pred = TRUE,      # keep the Assessment-group predictions
                                                parallel_over = "everything")  # parallel processing (https://tune.tidymodels.org/reference/control_grid.html)
                         # metrics = metric_set(roc_auc, accuracy)     # assessment measures for the Assessment groups
)

rf.tune.fit

# Tuning results
# 5-fold cross-validation 
# A tibble: 5 x 5
  splits            id    .metrics          .notes   .predictions
  <list>            <chr> <list>            <list>   <list>      
1 <split [513/129]> Fold1 <tibble [20 x 6]> <tibble> <tibble>    
2 <split [513/129]> Fold2 <tibble [20 x 6]> <tibble> <tibble>    
3 <split [514/128]> Fold3 <tibble [20 x 6]> <tibble> <tibble>    
4 <split [514/128]> Fold4 <tibble [20 x 6]> <tibble> <tibble>    
5 <split [514/128]> Fold5 <tibble [20 x 6]> <tibble> <tibble>

# Average assessment measures over the Assessment groups
collect_metrics(rf.tune.fit)

# A tibble: 20 x 8
    mtry trees .metric  .estimator  mean     n std_err .config        
   <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <fct>          
 1     6  1363 accuracy binary     0.819     5 0.00855 Preprocessor1_~
 2     6  1363 roc_auc  binary     0.870     5 0.00556 Preprocessor1_~
 3     4  1791 accuracy binary     0.821     5 0.0112  Preprocessor1_~
 4     4  1791 roc_auc  binary     0.873     5 0.00776 Preprocessor1_~
 5     8  1010 accuracy binary     0.819     5 0.0102  Preprocessor1_~
 6     8  1010 roc_auc  binary     0.868     5 0.00486 Preprocessor1_~
 7     2   572 accuracy binary     0.827     5 0.0161  Preprocessor1_~
 8     2   572 roc_auc  binary     0.879     5 0.00892 Preprocessor1_~
 9     8   333 accuracy binary     0.810     5 0.00871 Preprocessor1_~
10     8   333 roc_auc  binary     0.868     5 0.00444 Preprocessor1_~
11    10  1533 accuracy binary     0.810     5 0.00817 Preprocessor1_~
12    10  1533 roc_auc  binary     0.866     5 0.00406 Preprocessor1_~
13     2   681 accuracy binary     0.829     5 0.0162  Preprocessor1_~
14     2   681 roc_auc  binary     0.879     5 0.00893 Preprocessor1_~
15    11   151 accuracy binary     0.813     5 0.00722 Preprocessor1_~
16    11   151 roc_auc  binary     0.861     5 0.00423 Preprocessor1_~
17    13  1882 accuracy binary     0.812     5 0.00473 Preprocessor1_~
18    13  1882 roc_auc  binary     0.863     5 0.00429 Preprocessor1_~
19     5   938 accuracy binary     0.821     5 0.00844 Preprocessor1_~
20     5   938 roc_auc  binary     0.871     5 0.00655 Preprocessor1_~

Result! The performance of each candidate parameter combination can be inspected. By default the metrics are “Accuracy” and “ROC AUC”.

# Plot
autoplot(rf.tune.fit) + 
  scale_color_viridis_d(direction = -1) + 
  theme(legend.position = "top") + 
  theme_bw()

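8-4-5. Finding the best parameter combination

The function show_best() lists the candidate combinations from best to worst predictive performance.
The function select_best() returns the single best-performing combination.
The example below uses “ROC AUC” as the criterion.

# Candidate combinations ordered by predictive performance on the chosen metric
show_best(rf.tune.fit, metric = "roc_auc")                             # show_best(, metric = "accuracy")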

# A tibble: 5 x 8
   mtry trees .metric .estimator  mean     n std_err .config          
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <fct>            
1     2   681 roc_auc binary     0.879     5 0.00893 Preprocessor1_Mo~
2     2   572 roc_auc binary     0.879     5 0.00892 Preprocessor1_Mo~
3     4  1791 roc_auc binary     0.873     5 0.00776 Preprocessor1_Mo~
4     5   938 roc_auc binary     0.871     5 0.00655 Preprocessor1_Mo~
5     6  1363 roc_auc binary     0.870     5 0.00556 Preprocessor1_Mo~

# Best parameter combination
best.rf <- rf.tune.fit %>% 
  select_best(metric = "roc_auc")                                      # select_best(metric = "accuracy")
best.rf

# A tibble: 1 x 3
   mtry trees .config              
  <int> <int> <fct>                
1     2   681 Preprocessor1_Model07

Result! In terms of “ROC AUC”, the model predicts best with mtry = 2 and trees = 681.

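8-4-6. Fitting the model with the best parameter combination

The model is now built with the best parameter combination found above.
The function finalize_model() updates the “rf.tune.mod” defined in 8-4-1 into a model specification that carries the best parameter combination.

final.rf <- rf.tune.mod %>%                                            # defined in 8-4-1
  finalize_model(best.rf)                                              # finalize_model: plug in the best parameter combination
final.rf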

Random Forest Model Specification (classification)

Main Arguments:
  mtry = 2
  trees = 681

Computational engine: randomForest

# Fit the model
set.seed(100)
final.rf.fit <- final.rf %>% 
  fit(HeartDisease ~ ., data = train.prepped)
final.rf.fit

parsnip model object


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~681L, mtry = min_cols(~2L,      x)) 
               Type of random forest: classification
                     Number of trees: 681
No. of variables tried at each split: 2

        OOB estimate of  error rate: 17.45%
Confusion matrix:
     no yes class.error
no  234  53   0.1846690
yes  59 296   0.1661972

# Final model
final.rf.fit %>% 
  extract_fit_engine()


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~681L, mtry = min_cols(~2L,      x)) 
               Type of random forest: classification
                     Number of trees: 681
No. of variables tried at each split: 2

        OOB estimate of  error rate: 17.45%
Confusion matrix:
     no yes class.error
no  234  53   0.1846690
yes  59 296   0.1661972

8-5. Prediction

pred <- augment(final.rf.fit, test.prepped)                             # predict(final.rf.fit, test.prepped) 
pred

# A tibble: 276 x 17
       Age RestingBP Cholesterol FastingBS    MaxHR HeartPeakReading
     <dbl>     <dbl>       <dbl>     <dbl>    <dbl>            <dbl>
 1  0.0239    -1.17       0.0548    -0.545  0.189            -0.845 
 2 -1.80       0.433      0.0456    -0.545 -0.283             0.537 
 3 -1.80      -0.102      0.0825    -0.545  0.189            -0.845 
 4 -1.58      -0.637      0.0179    -0.545  0.307            -0.845 
 5 -0.512      0.433      0.295     -0.545  0.110             0.0761
 6 -1.26      -0.905      0.0825    -0.545 -0.00778          -0.845 
 7  0.667     -1.71       0.424     -0.545 -0.480             0.0761
 8 -1.91      -0.637      0.599     -0.545  0.897             1.92  
 9 -1.16      -1.71       0.193     -0.545  0.189            -0.845 
10 -1.91      -0.102      0.0641    -0.545  1.60             -0.845 
# ... with 266 more rows, and 11 more variables: HeartDisease <fct>,
#   Sex_F <dbl>, Sex_M <dbl>, RestingECG_LVH <dbl>,
#   RestingECG_Normal <dbl>, RestingECG_ST <dbl>, Angina_N <dbl>,
#   Angina_Y <dbl>, .pred_class <fct>, .pred_no <dbl>,
#   .pred_yes <dbl>

8-6. Fitting and predicting in one step

The function last_fit() fits the model with the best parameter combination on the Training Data and predicts the Test Data in a single call.

# Ref. https://github.com/tidymodels/tune/issues/300 (results of last_fit and fit differ)
#      https://github.com/tidymodels/tune/pull/323 (last_fit: seed not applied)

final.rf.last <- last_fit(final.rf, 
                          preprocessor = HeartDisease ~ .,             # preprocessor: formula or recipe
                          split = data.split)
final.rf.last

# Resampling results
# Manual resampling 
# A tibble: 1 x 6
  splits            id       .metrics .notes   .predictions .workflow 
  <list>            <chr>    <list>   <list>   <list>       <list>    
1 <split [642/276]> train/t~ <tibble> <tibble> <tibble>     <workflow>

# Fitted model
final.rf.last %>%
  extract_fit_engine()


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~681L, mtry = min_cols(~2L,      x)) 
               Type of random forest: classification
                     Number of trees: 681
No. of variables tried at each split: 2

        OOB estimate of  error rate: 17.45%
Confusion matrix:
     no yes class.error
no  231  56   0.1951220
yes  56 299   0.1577465

# Predictions
final.rf.last %>%
  collect_predictions()

# A tibble: 276 x 7
   id        .pred_no .pred_yes  .row .pred_class HeartDisease .config
   <chr>        <dbl>     <dbl> <int> <fct>       <fct>        <fct>  
 1 train/te~    0.931    0.0690     8 no          no           Prepro~
 2 train/te~    0.314    0.686      9 yes         yes          Prepro~
 3 train/te~    0.977    0.0235    11 no          no           Prepro~
 4 train/te~    0.909    0.0910    13 no          no           Prepro~
 5 train/te~    0.307    0.693     14 yes         yes          Prepro~
 6 train/te~    0.990    0.0103    15 no          no           Prepro~
 7 train/te~    0.517    0.483     19 no          yes          Prepro~
 8 train/te~    0.633    0.367     20 no          yes          Prepro~
 9 train/te~    0.968    0.0323    21 no          no           Prepro~
10 train/te~    0.952    0.0485    26 no          no           Prepro~
# ... with 266 more rows

# Assessment measures for the Test Data
final.rf.last %>%
  collect_metrics()

# A tibble: 2 x 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <fct>               
1 accuracy binary         0.812 Preprocessor1_Model1
2 roc_auc  binary         0.886 Preprocessor1_Model1

Caution! In the version used here, last_fit() does not honour a fixed seed, so its results are not reproducible.

Machine Learning Analysis using Tidymodels Package (Ver. 1)