6. モデルの評価とハイパーパラメータのチューニングのベストプラクティス

クイープ訳, 福島真太朗監訳の「Python機械学習プログラミング: 達人データサイエンティストによる理論と実践（第三版）」（インプレス, 2016） ISBN: 978-4-295-01007-4 をtidymodelsで実装してみる。

原著について

著者: Sebastian Raschka and Vahid Mirjalili
書名: Python Machine Learning
出版社: Packt Publishing
出版年: 2019
リポジトリ: https://github.com/rasbt/python-machine-learning-book-3rd-edition (コードライセンス: MIT)

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ recipes      1.0.1
✔ dials        1.0.0     ✔ rsample      1.0.0
✔ dplyr        1.0.9     ✔ tibble       3.1.8
✔ ggplot2      3.3.6     ✔ tidyr        1.2.0
✔ infer        1.0.2     ✔ tune         1.0.0
✔ modeldata    1.0.0     ✔ workflows    1.0.0
✔ parsnip      1.0.0     ✔ workflowsets 1.0.0
✔ purrr        0.3.4     ✔ yardstick    1.0.0

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/

# rsample, recipes, parsnip, yardstick, workflows

6.1 パイプラインによるワークフローの効率化

6.1.1 Breast Cancer Wisconsinデータセットを読み込む

Breast Cancer WisconsinデータセットはUCIから提供されている。 Rではmlbenchパッケージによってこのデータを利用可能になる。

# Breast Cancer Wisconsinデータセットの読み込み
library(mlbench)
data("BreastCancer")
# 変数の数に注意
dim(BreastCancer)

[1] 699  11

tbl_BreastCancer <- 
  BreastCancer |> 
  tibble::as_tibble() |> 
  janitor::clean_names() |> 
  mutate(across(.cols = c(contains("_"), "mitoses"), .fns = as.numeric, .names = "{.col}"))

# 書籍とは件数が異なるので注意（書籍では569件）
nrow(tbl_BreastCancer)

[1] 699

# class変数に悪性腫瘍(malignant)か良性腫瘍(benign)かのラベルが割り当てられる
levels(tbl_BreastCancer$class)

[1] "benign"    "malignant"

データセットの分割はrsampleパッケージを使って行う。

# 訓練データ80%、テストデータ20%
set.seed(123)
cancer_split <- 
  initial_split(tbl_BreastCancer, prop = 0.8, strata = class)

cancer_train <-
  training(cancer_split)
cancer_test <-
  testing(cancer_split)

6.1.2 パイプラインで変換器と推定器を結合する

tidymodelsでのパイプラインは前処理とモデルの実行・評価を分けて実行する。前処理にはrecipes、モデルの定義と実行はparsnip、評価はyardstickパッケージの関数をそれぞれ利用する。また、これらのパッケージや処理をワーフフローとしてまとめるためにworkflowパッケージが役に立つ。

pca_rec <- 
  recipe(class ~ ., data = cancer_train) |>
  # id列の識別
  update_role(id, new_role = "id") |> 
  # 標準化
  step_normalize(all_numeric()) |> 
  # 欠損を含む行の除外
  step_naomit(all_numeric()) |> 
  # PCA
  step_pca(all_numeric(), num_comp = 2) |> 
  prep()

# GLM
glm_spec <-
  logistic_reg() |>
  set_mode("classification") |>
  set_engine("glm")

# ワークフローの定義
cancer_wflow <- 
  # 宣言
  workflow() %>%
  # 前処理レシピ
  add_recipe(pca_rec) |> 
  # モデルの仕様
  add_model(glm_spec)

# workflow()関数に直接レシピとモデルを指定しても良い
# workflow(preprocessor = pca_rec, spec = glm_spec)

# 学習データをモデルに当てはめてモデルの訓練を行う
cancer_fit <- 
  cancer_wflow %>% 
  # 与えるデータは前処理適用前の学習データ
  fit(data = cancer_train)

# テストデータ（評価データ）を使った予測
# predict(cancer_fit, cancer_test)

cancer_aug <- 
  augment(cancer_fit, cancer_test)

# yardstickパッケージの関数による性能評価
cancer_aug %>%
  accuracy(class, .pred_class)

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.985

6.2 k分割交差検証を使ったモデルの性能の評価

6.2.1 ホールドアウト法

6.2.2 k分割交差検証

rsamplesパッケージでのk分割交差検証を実行する関数はvfold_cv()。分割する

set.seed(123)
# 学習データに対して、さらに分割データセットを作成する
cancer_kfold <-
  cancer_train |> 
  # classによる層化
  vfold_cv(v = 10, strata = class)

cancer_kfold

#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits           id    
   <list>           <chr> 
 1 <split [501/57]> Fold01
 2 <split [501/57]> Fold02
 3 <split [502/56]> Fold03
 4 <split [502/56]> Fold04
 5 <split [502/56]> Fold05
 6 <split [502/56]> Fold06
 7 <split [503/55]> Fold07
 8 <split [503/55]> Fold08
 9 <split [503/55]> Fold09
10 <split [503/55]> Fold10

交差検証によるリサンプリングデータに対してワークフローを適用するには、tune::fit_resamples()関数を使うまた、交差検証用のデータからモデルの評価指標を得るにはtune::collect_metrics()を用いる。

set.seed(123)
glm_fit_fold <- 
  cancer_wflow %>% 
  fit_resamples(cancer_kfold,
                control = control_resamples(
                  save_pred = TRUE))

# glm_fit_fold |> 
#   nplyr::nest_filter(.metrics, .metric == "accuracy") |> 
#   tidyr::unnest(.metrics)

# 各foldで求められた評価指標の平均値を最終的な指標として採用する
collect_metrics(glm_fit_fold)

# A tibble: 2 × 6
  .metric  .estimator  mean     n std_err .config             
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy binary     0.965    10 0.00426 Preprocessor1_Model1
2 roc_auc  binary     0.994    10 0.00172 Preprocessor1_Model1

tidymodelsの枠組みにおいて、前処理の手順、複数のモデルを管理するには workflowsetsパッケージを利用すると便利。このパッケージでは、recipes、parsnipによるレシピとモデルの仕様を組み合わせたモデルの管理・運用が可能となる。

6.3 学習曲線と検証曲線によるアルゴリズムの検証

6.3.1 学習曲線を使ってバイアスとバリアンスの問題を診断する

6.3.2 検証曲線を使って過学習と学習不足を明らかにする

6.4 グリッドサーチによる機械学習モデルのチューニング

6.4.1 グリッドサーチを使ったハイパーパラメータのチューニング

# bare_nuclei
std_rec <- 
  recipe(class ~ ., 
         data = cancer_train) |>
  update_role(id, new_role = "id") |>
  step_rm(bare_nuclei) |> 
  step_normalize(all_predictors())

spec_rbf_svm <- 
  svm_rbf() |> 
  set_engine("kernlab", scaled = TRUE) |> 
  set_mode("classification")

# svm_linear_spec <- 
#   svm_poly(degree = 1) %>%
#   set_mode("classification") %>%
#   set_engine("kernlab")

# spec_rbf_svm |>
#   fit(class ~ ., data = cancer_train)
# spec_rbf_svm |> 
#   set_args(cost = 0.1) |> 
#   fit(class ~ ., data = cancer_train)

spec_rbf_svm_tune <-
  spec_rbf_svm |> 
  set_args(cost = tune())

# param_grid <- 
#   grid_regular(cost(), levels = 10)
# 
# 
cancer_tune_wflow <-
  workflow(preprocessor = std_rec, spec = spec_rbf_svm_tune)

tune_res <- 
  tune_grid(
  cancer_tune_wflow,
  metrics = metric_set(accuracy, roc_auc),
  resamples = cancer_kfold)

collect_metrics(tune_res)

# A tibble: 20 × 7
       cost .metric  .estimator  mean     n std_err .config              
      <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
 1  0.0767  accuracy binary     0.939    10 0.0106  Preprocessor1_Model01
 2  0.0767  roc_auc  binary     0.972    10 0.00576 Preprocessor1_Model01
 3 12.2     accuracy binary     0.945    10 0.0109  Preprocessor1_Model02
 4 12.2     roc_auc  binary     0.967    10 0.00781 Preprocessor1_Model02
 5  0.0289  accuracy binary     0.939    10 0.0108  Preprocessor1_Model03
 6  0.0289  roc_auc  binary     0.977    10 0.00573 Preprocessor1_Model03
 7  0.00321 accuracy binary     0.656    10 0.00146 Preprocessor1_Model04
 8  0.00321 roc_auc  binary     0.976    10 0.00552 Preprocessor1_Model04
 9  1.25    accuracy binary     0.952    10 0.00797 Preprocessor1_Model05
10  1.25    roc_auc  binary     0.973    10 0.00584 Preprocessor1_Model05
11  0.00203 accuracy binary     0.656    10 0.00146 Preprocessor1_Model06
12  0.00203 roc_auc  binary     0.974    10 0.00591 Preprocessor1_Model06
13  8.02    accuracy binary     0.950    10 0.00968 Preprocessor1_Model07
14  8.02    roc_auc  binary     0.970    10 0.00610 Preprocessor1_Model07
15  2.14    accuracy binary     0.953    10 0.00716 Preprocessor1_Model08
16  2.14    roc_auc  binary     0.973    10 0.00569 Preprocessor1_Model08
17  0.0186  accuracy binary     0.893    10 0.0404  Preprocessor1_Model09
18  0.0186  roc_auc  binary     0.975    10 0.00581 Preprocessor1_Model09
19  0.219   accuracy binary     0.943    10 0.00900 Preprocessor1_Model10
20  0.219   roc_auc  binary     0.973    10 0.00575 Preprocessor1_Model10

show_best(tune_res, metric = "accuracy")

# A tibble: 5 × 7
    cost .metric  .estimator  mean     n std_err .config              
   <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
1  2.14  accuracy binary     0.953    10 0.00716 Preprocessor1_Model08
2  1.25  accuracy binary     0.952    10 0.00797 Preprocessor1_Model05
3  8.02  accuracy binary     0.950    10 0.00968 Preprocessor1_Model07
4 12.2   accuracy binary     0.945    10 0.0109  Preprocessor1_Model02
5  0.219 accuracy binary     0.943    10 0.00900 Preprocessor1_Model10

show_best(tune_res, metric = "roc_auc")

# A tibble: 5 × 7
     cost .metric .estimator  mean     n std_err .config              
    <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1 0.0289  roc_auc binary     0.977    10 0.00573 Preprocessor1_Model03
2 0.00321 roc_auc binary     0.976    10 0.00552 Preprocessor1_Model04
3 0.0186  roc_auc binary     0.975    10 0.00581 Preprocessor1_Model09
4 0.00203 roc_auc binary     0.974    10 0.00591 Preprocessor1_Model06
5 0.219   roc_auc binary     0.973    10 0.00575 Preprocessor1_Model10

svm_mod <-
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

cancer_tune_wflow2 <-
  workflow(preprocessor = std_rec, spec = svm_mod)

svm_res <- 
  tune_grid(
  cancer_tune_wflow2,
  resamples = cancer_kfold, 
  grid = 7)

show_best(svm_res, metric = "accuracy")

# A tibble: 5 × 8
     cost rbf_sigma .metric  .estimator  mean     n std_err .config             
    <dbl>     <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
1 0.0595   2.87e- 2 accuracy binary     0.952    10 0.00743 Preprocessor1_Model3
2 0.203    9.96e- 1 accuracy binary     0.941    10 0.00990 Preprocessor1_Model6
3 0.0117   5.59e-10 accuracy binary     0.656    10 0.00146 Preprocessor1_Model1
4 0.00229  1.70e- 4 accuracy binary     0.656    10 0.00146 Preprocessor1_Model2
5 2.43     5.07e- 7 accuracy binary     0.656    10 0.00146 Preprocessor1_Model4

ctrl <- control_grid(verbose = FALSE, save_pred = TRUE)
svm_tune_res <-
  tune_grid(
    cancer_tune_wflow2,
    resamples = cancer_kfold,
    control = ctrl)

show_best(svm_tune_res, metric = "accuracy")

# A tibble: 5 × 8
      cost rbf_sigma .metric  .estimator  mean     n std_err .config            
     <dbl>     <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>              
1 10.3     0.00484   accuracy binary     0.962    10 0.00494 Preprocessor1_Mode…
2  0.178   0.0343    accuracy binary     0.957    10 0.00607 Preprocessor1_Mode…
3 26.8     0.000100  accuracy binary     0.952    10 0.00743 Preprocessor1_Mode…
4  0.0722  0.120     accuracy binary     0.952    10 0.00896 Preprocessor1_Mode…
5  0.00265 0.0000816 accuracy binary     0.656    10 0.00146 Preprocessor1_Mode…

select_best(svm_tune_res, metric = "accuracy")

# A tibble: 1 × 3
   cost rbf_sigma .config              
  <dbl>     <dbl> <chr>                
1  10.3   0.00484 Preprocessor1_Model05

# autoplot(tune_res)

6.4.2 入れ子式の交差検証によるアルゴリズムの選択

6.5 さまざまな性能評価指標

6.5.1 混同行列を解釈する

cm <- 
  cancer_aug |> 
  conf_mat(class, .pred_class)
# cm <- 
#   collect_predictions(glm_fit_fold) |> 
#   conf_mat(class, .pred_class)

cm

           Truth
Prediction  benign malignant
  benign        87         1
  malignant      1        48

# autoplot(cm, type = "mosaic")
autoplot(cm, type = "heatmap")

6.5.2 分類モデルの適合率と再現率を最適化する

cancer_aug |> 
  f_meas(class, .pred_class)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 f_meas  binary         0.989

collect_predictions(glm_fit_fold) |> 
  f_meas(class, .pred_class)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 f_meas  binary         0.973

set.seed(123)
cancer_wflow %>% 
  fit_resamples(cancer_kfold,
                metrics = metric_set(f_meas, precision)) |> 
  collect_metrics()

# A tibble: 2 × 6
  .metric   .estimator  mean     n std_err .config             
  <chr>     <chr>      <dbl> <int>   <dbl> <chr>               
1 f_meas    binary     0.973    10 0.00325 Preprocessor1_Model1
2 precision binary     0.978    10 0.00790 Preprocessor1_Model1

collect_predictions(glm_fit_fold) |>  
  group_by(id == "Fold01") |> 
  roc_curve(class, .pred_class) |> 
  autoplot()

6.5.4 多クラス分類のための性能評価

6.6 クラスの不均衡に対処する

tbl_BreastCancer |> 
  count(class)

# A tibble: 2 × 2
  class         n
  <fct>     <int>
1 benign      458
2 malignant   241

# 不均衡なデータの作成
# 良性腫瘍のすべてのデータと悪性腫瘍40件分のデータを結合
tbl_BreastCancer_imbalance <- 
  tbl_BreastCancer |> 
  filter(class == "benign") |> 
  bind_rows(
    tbl_BreastCancer |> 
      filter(class == "malignant") |> 
      slice_head(n = 40))

tbl_BreastCancer_imbalance |> 
  count(class)

# A tibble: 2 × 2
  class         n
  <fct>     <int>
1 benign      458
2 malignant    40

cancer_split_imb <- 
  tbl_BreastCancer_imbalance |> 
  initial_split(strata = class)
cancer_train_imb <- 
  training(cancer_split_imb)
cancer_test_imb <- 
  training(cancer_split_imb)

cancer_train_imb |> 
  count(class)

# A tibble: 2 × 2
  class         n
  <fct>     <int>
1 benign      342
2 malignant    31

# アップサンプリング
# 多い方のクラス件数となるまで、少ないクラスのデータを復元抽出する
up_rec <- 
  cancer_train_imb |> 
  recipe(class ~ .) |> 
  themis::step_upsample(class) |> 
  prep()
# ダウンサンプリング
# 少ないクラス件数に寄せる
dw_rec <- 
  cancer_train_imb |> 
  recipe(class ~ .) |> 
  themis::step_downsample(class) |> 
  prep()

up_rec |> 
  bake(new_data = NULL) |> 
  count(class)

# A tibble: 2 × 2
  class         n
  <fct>     <int>
1 benign      342
2 malignant   342

dw_rec |> 
  bake(new_data = NULL) |> 
  count(class)

# A tibble: 2 × 2
  class         n
  <fct>     <int>
1 benign       31
2 malignant    31