Back to Article
Article Notebook
Download Source

An investigation of the predictors of thyroid cancer in patients with thyroid nodules

Authors
Affiliations

Ovie Edafe

Department of Oncology & Metabolism, University of Sheffield

Neil Shephard

Research Software Engineer, Department of Computer Science, University of Sheffield

Karen Sisley

Senior Lecturer, Clinical Medicine, School of Medicine and Population Health, University of Sheffield

Sabapathy P Balasubramanian

Directorate of General Surgery, Sheffield Teaching Hospitals NHS Foundation Trust

Published

April 26, 2024

Abstract

An abstract summarising the work undertaken and the overall conclusions can be placed here. Sub-headings are currently removed because they conflict with those in the body of the text and mess up the links in the Table of Contents.

Keywords

Thyroid nodules, Thyroid cancer

In [1]:
## Libraries for data manipulation, plotting and tabulating (sorted alphabetically)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(ggdark)
library(gtsummary)
library(Hmisc)

Attaching package: 'Hmisc'
The following objects are masked from 'package:dplyr':

    src, summarize
The following objects are masked from 'package:base':

    format.pval, units
library(knitr)
library(mice)

Attaching package: 'mice'
The following object is masked from 'package:stats':

    filter
The following objects are masked from 'package:base':

    cbind, rbind
library(readr)
library(rmarkdown)
library(visdat)
library(naniar)

## Libraries for Tidymodelling
library(dials)
Loading required package: scales

Attaching package: 'scales'
The following object is masked from 'package:readr':

    col_factor
library(kernlab)

Attaching package: 'kernlab'
The following object is masked from 'package:scales':

    alpha
The following object is masked from 'package:mice':

    convergence
The following object is masked from 'package:ggplot2':

    alpha
library(knitr)
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.6     ✔ tibble       3.2.1
✔ infer        1.0.7     ✔ tidyr        1.3.1
✔ modeldata    1.4.0     ✔ tune         1.2.1
✔ parsnip      1.2.1     ✔ workflows    1.1.4
✔ purrr        1.0.2     ✔ workflowsets 1.1.0
✔ recipes      1.1.0     ✔ yardstick    1.3.1
✔ rsample      1.2.1     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ recipes::all_double()   masks gtsummary::all_double()
✖ recipes::all_factor()   masks gtsummary::all_factor()
✖ recipes::all_integer()  masks gtsummary::all_integer()
✖ recipes::all_logical()  masks gtsummary::all_logical()
✖ recipes::all_numeric()  masks gtsummary::all_numeric()
✖ kernlab::alpha()        masks scales::alpha(), ggplot2::alpha()
✖ purrr::cross()          masks kernlab::cross()
✖ purrr::discard()        masks scales::discard()
✖ mice::filter()          masks dplyr::filter(), stats::filter()
✖ dplyr::lag()            masks stats::lag()
✖ .GlobalEnv::set_names() masks purrr::set_names()
✖ yardstick::spec()       masks readr::spec()
✖ Hmisc::src()            masks dplyr::src()
✖ recipes::step()         masks stats::step()
✖ Hmisc::summarize()      masks dplyr::summarize()
✖ parsnip::translate()    masks Hmisc::translate()
✖ .GlobalEnv::view()      masks tibble::view()
• Use tidymodels_prefer() to resolve common conflicts.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ kernlab::alpha()     masks scales::alpha(), ggplot2::alpha()
✖ scales::col_factor() masks readr::col_factor()
✖ purrr::cross()       masks kernlab::cross()
✖ purrr::discard()     masks scales::discard()
✖ mice::filter()       masks dplyr::filter(), stats::filter()
✖ stringr::fixed()     masks recipes::fixed()
✖ dplyr::lag()         masks stats::lag()
✖ yardstick::spec()    masks readr::spec()
✖ Hmisc::src()         masks dplyr::src()
✖ Hmisc::summarize()   masks dplyr::summarize()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(vip)

Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi
## Set global options
options(digits = 3)
train <- 0.75
test <- 0.25

## Set directories based on current location
base_dir <- getwd()
data_dir <- paste(base_dir, "data", sep = "/")
csv_dir <- paste(data_dir, "csv", sep = "/")
r_dir <- paste(data_dir, "r", sep = "/")
r_scripts <- paste(base_dir, "r", sep = "/")


## Load data
##
## The following line runs the `r/shf_thy_nod.R` script which reads the data from CSV and cleans/tidies it.
## If something has changed in that file or the underlying data (in `data/csv/sheffield_thyroid_module.R`) then this
## line should be uncommented and the code will run. At the end of the file it saves the data to `data/r/clean.rds`.
source("r/shf_thy_nod.R")
Rows: 1364 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (26): gender, ethnicity, eligibility, incidental_nodule, palpable_nodule...
dbl  (7): study_id, age_at_scan, albumin, tsh_value, lymphocytes, monocyte, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## If nothing has changed in the underlying data or the cleaning/tidying process then the last version of the data,
## saved to `data/r/clean.rds` can be loaded by commenting out the line above and uncommenting the line below.
#df <- readRDS(paste(r_dir, "clean.rds", sep="/"))


## @ns-rse 2024-06-18:
## We want this table to match the model, therefore rather than repeat ourselves we move the subsetting to give us
## df_complete to here and assign to df_complete. This is then passed into gtsummary::tbl_summary() and is available for
## the next code chunk.
##
## This is the point at which you should subset your data for those who have data available for the variables of
## interest. The variables here should include the outcome `final_pathology` and the predictor variables that are set in
## the code chunk `recipe`. May want to move this to another earlier in the processing so that the number of rows can be
## counted and reported.
df_complete <- df |>
  dplyr::select(
    age_at_scan,
    gender,
    ethnicity,
    incidental_nodule,
    palpable_nodule,
    rapid_enlargement,
    compressive_symptoms,
    hypertension,
    vocal_cord_paresis,
    graves_disease,
    hashimotos_thyroiditis,
    family_history_thyroid_cancer,
    exposure_radiation,
    albumin,
    tsh_value,
    lymphocytes,
    monocyte,
    bta_u_classification,
    solitary_nodule,
    size_nodule_mm,
    cervical_lymphadenopathy,
    thy_classification,
    final_pathology) |>
## @ns-rse 2024-06-14 :
## I would consider removing this dplyr::filter() and instead using the recipes::step_filter_missing() as is now in
## place below
## dplyr::filter(if_any(everything(), is.na))
## Instead I think it might be useful to remove individuals who do not have a value for final_pathology here
  dplyr::filter(!is.na(final_pathology))

1 Introduction

Thyroid nodules are common. The challenge in the management of thyroid nodules is differentiating between benign and malignant nodule thyroid nodules.The use fine needle aspiration and cytology (FNAC) still leaves around 20% of patients that cannot be clearly classified as either benign or malignant. This scenario traditionally leads to diagnostic hemithyroidectomy for definitive histology. Other clinical variables such as patients’ demographics, clinical and biochemical factors have been shown to be associated with thyroid cancer in patients with thyroid nodules. This has been utilised in studies evaluating predictors of thyroid cancer with a view of creating a model to aid prediction. Standard practice on the management of thyroid nodules does not utilise these non ultrasound and non cytological factors. Combination of these variables considered to be significant with ultrasound and cytological characteristics may improve management of patients with thyroid nodules. Thyroid nodules are increasingly being incidentally detected with increased use of imaging in the evaluation of non thyroid related pathologies. Thus, leading to increase investigation of thyroid nodules and subsequent increased number of thyroid operations in non diagnostic cases. There are morbidities associated with thyroid surgery including scar, recurrent laryngeal nerve injury, hypothyroidism and hypoparathyroidism. We performed a systematic review to evaluate for predictors of thyroid cancer specifically in patients presenting with thyroid nodules. The systematic review a number of potential important variables that may be useful in the prediction of thyroid cancer in patients with thyroid nodules. The aim of this study was to evaluate the predictors of thyroid cancer with a view of improving prediction of thyroid cancer using computer age statistical inference techniques (Efron and Hastie (2016)).

2 Methods

This study was reported as per the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines

2.1 Study design

This was a retrospective cohort study.

2.2 Setting

The study was conducted at the Sheffield Teaching hospitals NHS Foundation Trusts. This is a tertiary referral centre for the management of thyroid cancer

2.3 Participants

We included all consecutive patients who presented with thyroid nodule(s) or that were found to have thyroid nodule(s) on ultrasound done for thyroid pathology or for other non thyroid related pathologies

2.4 Variables

Variable evaluated was based on findings from a systematic review evaluating predictors of thyroid cancer in patients with thyroid nodules. Data on the following variables were collected: patient demographics (age, gender, ethnicity), nodule presentation (incidental nodule, palpable nodule, rapid enlargement, compressive symptoms, vocal paresis), past medical history (hypertension, Graves’ disease, Hashimotos’ thyroiditis, family history of thyroid cancer, exposure to neck radiation), biochemistry (thyroid stimulating hormone, lymphocytes, monocytes), ultrasound characteristics (British Thyroid Association ultrasound (BTA U), nodule size, solitary nodule, nodule consistency, cervical lymphadenopathy), Royal College of Pathology (RCP) FNAC classification, type of thyroid surgery, and histological diagnosis.

2.5 Data source

Data was collected from patients’ case notes and electronic patients’ database using a standardised data collection proforma. This was initially piloted on 30 patients and revised to improve data entry. In addition a number of variables that were not standard collected during workout of patients were not further checked; these include body mass index (BMI), serum thyroglobulin, serum triiodothyronine (T3), thyroxine (T4), thyroglobulin antibody (TgAb), thyroid peroxidase antibody (TP0Ab), and urinary iodine.

2.6 Study size

We sought to have a large data set of at least 100 thyroid nodules with a cancer diagnosis using consecutive sampling technique. We aimed for a total of 1500 patients with thyroid nodules to achieve our target sample size. With the use of modern statistical techniques, we proposed such number will be appropriate to detect important variables if it exists.

2.7 Data analysis

Data was cleaned and analysed using the R Statistical Software R Core Team (2023) and the Tidyverse (Wickham et al. (2019)), Tidymodels (Kuhn and Wickham (2020)) collection of packages.

2.8 Imputation

The dataset is incomplete and there are missing observations across all variables to varying degrees. In order to maximise the sample available for analysis imputation was used to infer missing values. The Multivariat Imputation via Chained Equations (MICE and implemented in the eponymous R package Buuren and Groothuis-Oudshoorn (2011)) was employed which assumes data is missing at random (a difficult assumption to formally test). The approach takes each variable with missing data and attempts to predict it using statistical modelling based on the observed values. In essence it is the same approach as the statistical methods being employed to try and predict Thyroid Cancer and there are a range of statistical techniques available which include

2.9 Modelling

We used a selection of statistic modelling techniques to evaluate association between variables and thyroid cancer in patients with thyroid nodules. The patient population was split into training and testing cohorts in a ratio of 0.75:0.25 and each model is fitted using the training cohort. This split ratio is generally used in traditional machine learning techniques. The training set of the data was used to estimate the relation between variables and thyroid cancer. The larger the training data, the better it is for the model to learn the trends. The test set was used to determine the accuracy of the model in predicting thyroid cancer; the bigger the test data the more confidence we have in the model prognostic values. We used simple randomisation technique for the split to prevent bias in the data split. We ensured that there was no duplicate in the data sets so any test data was not accidentally trained. Furthermore, cross validation was used to estimate the accuracy of the various machine learning models. The k-fold techniques splits the data in ?10 folds, and the data was trained on all but one of the the fold, and the one fold not trained is used to test the data. This was repeated multiple times using a different fold for test and the others for training until all the folds is utilised for training and testing. Following multiple training process with k-fold, we selected the model that has the best predictive value for thyroid cancer in the test cohort. We also used the leave one out (loo) cross-validation to train and test the data set.In this technique, all but one observation is use to train the data set and one observation is use to test the data; this is repeated until all the data test is used for testing and training. The model with the best predictive value was selected.

2.9.1 LASSO / Elastic Net

LASSO (Least Absolute Shrinkage and Selection Operatror) and Elastic Net Zou and Hastie (2005) are regression methods that perform variable selection. The original LASSO method proposed by Regression Shrinkage and Selection via the Lasso (1996) allows the coefficients for independent/predictor variables to “shrink” down towards zero, effectively eliminating them from influencing the model, this is often referred to as L1 regularisation. The Elastic Net Zou and Hastie (2005) improves on the LASSO by balancing L1 regularisation with ridge-regression or L2 regularisation which helps avoid over-fitting.

Both methods avoid many of the shortcomings/pitfalls of stepwise variable selection Thompson (1995) Smith (2018) and have been shown to be more accurate in clinical decision making in small datasets with well code, externally selected variables Steyerberg et al. (2001)

2.9.2 Random Forest

To add reference The random forest plot is an extension of the decision tree methodology to reduce variance. Decision trees are very sensitive to the training data set and can lead to high variance; thus potential issues with generalisation of the model. The random forest plot selects random observation of the dataset to create multiple decision trees. Random variables are selected for each tree in the training of the data set. The aggregated output of the generated decision trees is then used to create an estimate.

2.9.3 Gradient Boosting

Gradient boosting is a machine learning algorithm that uses decision tree as a base model. The data is initially trained on this decision tree, but the initial prediction is weak, thus termed a weak based model. In gradient boosting the process is iterative; a sequence of decision trees is added to the initial tree. Each tree learns from the prior tree(s) to improve the model, increasing strength and minimising error.

2.9.4 SVM

Support Vector Machines is an approach that allows observation with a binary classifications to be separated using a hyperplane. It finds a hyperplane that best stratify the two classes i.e benign versus malignant nodules. SVM finds the hyperplane with the maximum margin of separation between the two classes. The support vectors are the data point that are positioned close to the margin of the hyperplane and these used to select the most appropraite hyperplane. The support vectors are the only data points that have an influence on the maximum margin in SVM.

2.9.5 Comparision

3 Results

In [2]:
n_obs <- nrow(df)

?@tbl-patient-demographics shows the demographics of patients included in this study. A total of 1364 patients were included in this study with a median (IQR) age of 55 ( 41-69). ?@tbl-clinical-characteristics shows the distribution of clinical variables evaluated between benign and malignant thyroid nodules.

3.1 Data Description

A summary of the variables that are available in this data set can be found in Table 3.

The completeness of the original data is shown in tables ?@tbl-imputation-summary-pmm, ?@tbl-imputation-summary-cart, ?@tbl-imputation-summary-rf, along with summaries from four rounds of imputation for each of three imputation methods. Where variables continuous (e.g. age or size_nodule_mm) basic summary statistics in the form of mean, standard deviation, median and inter-quartile range are given. For categorical variables that are logical TRUE/FALSE (e.g. palpable_nodule) the number of TRUE observations and the percentage (of those with observed data for that variable) are shown along with the number that are Unknown. For categorical variables such as gender percentages in each category are reported. For all variables an indication of the number of missing observations is also given and it is worth noting that there are 214 instances where the final_pathology is not known which reduces the sample size to 1150.

3.1.1 Missing Data

More detailed tabulations of missing data by variable are shown in Table 1 which shows the number and percentage of missing data for each variable and by case in Table 2 which shows how much missing data each case has. A visualisation of this is shown in Figure 1 .

NB - Currently there is a bug in the stable release of Quarto which prevents rendering of the missing data figures. It is fixed in development version v1.6.1 (currently available as pre-release, so if things don’t render try upgrading).

In [3]:
In [4]:
naniar::miss_var_summary(df_complete) |>
  knitr::kable(col.names=c("Variable", "N", "%"),
               caption="Summary of missing data by variable.")
Table 1: Summary of missing data by variable.
Variable N %
thy_classification 915 79.6
albumin 515 44.8
tsh_value 413 35.9
monocyte 363 31.6
lymphocytes 359 31.2
size_nodule_mm 319 27.7
family_history_thyroid_cancer 281 24.4
hypertension 126 11.0
exposure_radiation 121 10.5
compressive_symptoms 106 9.22
vocal_cord_paresis 76 6.61
hashimotos_thyroiditis 73 6.35
graves_disease 67 5.83
palpable_nodule 58 5.04
bta_u_classification 50 4.35
ethnicity 48 4.17
rapid_enlargement 43 3.74
cervical_lymphadenopathy 9 0.783
solitary_nodule 8 0.696
incidental_nodule 5 0.435
age_at_scan 0 0
gender 0 0
final_pathology 0 0
In [5]:
In [6]:
naniar::miss_case_table(df_complete) |>
  knitr::kable(col.names=c("Missing Variables", "N", "%"),
               caption="Summary of missing data by case, how much missing data is there per person?")
Table 2: Summary of missing data by case, how much missing data is there per person?
Missing Variables N %
0 65 5.652
1 227 19.739
2 229 19.913
3 181 15.739
4 139 12.087
5 102 8.870
6 76 6.609
7 35 3.043
8 30 2.609
9 13 1.130
10 19 1.652
11 15 1.304
12 9 0.783
13 4 0.348
14 3 0.261
15 2 0.174
16 1 0.087
In [7]:
visdat::vis_miss(df_complete)

The MICE package also provides tools for visualising missing data and these are shown in figures Figure 2, ?@fig-mice-vis-missing-biomarker and Figure 4.

The columns of these plots, labelled along the top, show the variable, if a cell is blue it indicates data is present, if it is red it indicates there is missing data. The left-hand side shows the total number of observations for that rows particular combination of variables with number of missing variables indicated on the right. The first row shows that for these variables there are 604 observations with zero missing data across the listed variables, the second row indicates there are 166 observations with just family_history_thyroid_cancer but there are some with this missing and other variables. The numbers on the bottom of the figure indicate the total number of missing observations for that variable (e.g. for family_history_thyroid_cancer there is a total of 281 missing observations).

TODO - Workout why out-width: "80%" isn’t applied to these figures and/or how to make the All figure readable.

In [8]:
mice_missing_clinical <- df_complete |>
  dplyr::select(final_pathology,
                age_at_scan,
                gender,
                incidental_nodule,
                palpable_nodule,
                rapid_enlargement,
                compressive_symptoms,
                hashimotos_thyroiditis,
                family_history_thyroid_cancer,
                cervical_lymphadenopathy) |>
  mice::md.pattern(rotate.names = TRUE)
In [9]:
mice_missing_biomarkers <- df_complete |>
  dplyr::select(final_pathology,
                age_at_scan,
                gender,
                tsh_value,
                albumin,
                lymphocytes,
                monocyte) |>
  mice::md.pattern(rotate.names = TRUE)
In [10]:
mice_missing_ultrasound <- df_complete |>
  dplyr::select(final_pathology,
                age_at_scan,
                gender,
                bta_u_classification,
                thy_classification,
                size_nodule_mm,
                solitary_nodule) |>
  mice::md.pattern(rotate.names = TRUE)
In [11]:
mice_missing_ultrasound <- df_complete |>
  dplyr::select(final_pathology,
                age_at_scan,
                gender,
                incidental_nodule,
                palpable_nodule,
                rapid_enlargement,
                compressive_symptoms,
                hashimotos_thyroiditis,
                family_history_thyroid_cancer,
                cervical_lymphadenopathy,
                tsh_value,
                albumin,
                lymphocytes,
                monocyte,
                bta_u_classification,
                thy_classification,
                size_nodule_mm,
                solitary_nodule) |>
  mice::md.pattern(rotate.names = TRUE)

3.1.2 Imputation

The MICE package@mice offers a number of different methods for imputing variables (see [documentation][mice_details]) we have investigated Predictive Mean Matching (PMM), Classification and Regression Trees (CART) and Random Forests (RF). Four rounds of imputation using each method were made.

A comparison of distributions/proportions before and after imputation are presented below to allow assessment of the utility of each method.

In [12]:
## Setup MICE mids for various methods
##

#' Impute missing data and plot the results using the mice package.
#'
#' This is a wrapper around the functionality of the \href{https://amices.org/mice}{mice} and
#' \href{https://amices.org/ggmice}{ggmice} packages for Multivariate Imputation by Chained Equations and visualisation
#' of the resulting imputed datasets. Users should refer to the documentation for details of the options available.
#'
#' The wrapper uses \code{\link[mice]{futuremice}} to perform imputation in parallel using the same number of cores as
#' the requested number of imputations. This speeds up the process but be wary if your computer has limited cores.
#'
#' @param df data.frame|tibble Data frame or tibble of original data.
#' @param imputations int Number of imputations to perform.
#' @param iterations int Number of iterations to perform for imputation.
#' @param method str Method of imputation to use, see the Details section of \code{\link[mice]{mice}}.
#' @param action str Action to take when running \code{\link[mice]{complete}}.
#' @param include bool Logical of whether to include the original dataset in the output.
#' @param seed int Seed for random number generation
#'
#'
impute_data <- function(df = df_complete,
                        outcome_var = "final_pathology",
                        imputations = 4,
                        iterations = 5,
                        method = "pmm",
                        action = "long",
                        include = TRUE,
                        continuous = c("albumin", "tsh_value", "lymphocytes", "monocyte", "size_nodule_mm"),
                        categorical = c("ethnicity",
                                        "incidental_nodule",
                                        "palpable_nodule",
                                        "rapid_enlargement",
                                        "compressive_symptoms",
                                        "hypertension",
                                        "vocal_cord_paresis",
                                        "graves_disease",
                                        "hashimotos_thyroiditis",
                                        "family_history_thyroid_cancer",
                                        "exposure_radiation",
                                        "bta_u_classification",
                                        "solitary_nodule",
                                        "cervical_lymphadenopathy",
                                        "thy_classification"),
                        seed = 123) {
    results <- list()
    ## Setup imputation
    results$mids <- df |>
        dplyr::select(-{{ outcome_var }}) |>
        mice::futuremice(m = imputations,
                         method = method,
                         n.core = imputations,
                         iterations = iterations,
                         parallelseed  = seed)
    ## Generate output dataset,
    results$imputed <- results$mids |>
        mice::complete(action = "long", include = include)
    ## Convert the .imp variable which indicates the imputation set to factor with original dataset labelled as such
    results$imputed <- results$imputed |>
         dplyr::mutate(.imp = factor(.imp,
                                     levels = seq(0, imputations, 1),
                                     labels = c("Original", as.character(seq(1, imputations, 1)))))
    ## We need to bind the outcome variable to each imputed dataset so they can be used in analyses
    outcome = df[[outcome_var]]
    n = imputations
    while(n > 0) {
        outcome = append(outcome, df[[outcome_var]])
        n <- n - 1
    }
    results$imputed <- cbind(results$imputed, outcome)
    colnames(results$imputed) <- stringr::str_replace(colnames(results$imputed), "outcome", outcome_var)
    ## Plot traces of the imputation over iteration
    results$trace <- ggmice::plot_trace(results$mid)
    ## Plot correlation between variables
    results$corr <- ggmice::plot_corr(df,
                                      label = TRUE,
                                      square = TRUE,
                                      rotate = TRUE,
                                      diagonal = FALSE)
    ## Plot histograms of continuous variables
    results$histogram <- list()
    for (var in continuous) {
        results$histogram[[var]] <- ggmice::ggmice(results$mids,
                                                   ggplot2::aes(x = .data[[var]],
                                                                group = .imp)) +
                                        ggplot2::geom_density()
    }
    ## Scatterplots and bar charts for categorical variables
    results$scatter <- list()
    results$bar_chart <- list()
    for (var in categorical) {
        results$scatter[[var]] <- ggmice::ggmice(results$mids,
                                                 ggplot2::aes(x = .imp,
                                                              y = .data[[var]])) +
                                        ggplot2::geom_jitter()
        results$bar_chart[[var]] <- ggmice::ggmice(results$mids,
                                                   ggplot2::aes(x = .data[[var]],
                                                                fill = .imp)) +
                                        ggplot2::geom_bar(position = "dodge")
    }
    results
}

## Impute using three different methods using the above impute_data() wrapper
imputations = 5
iterations = 5
mice_pmm <- impute_data(method = "pmm",
                        imputations = imputations,
                        iterations = iterations,
                        seed = 684613)
Warning: Number of logged events: 25
Warning: Number of logged events: 25
Warning: Number of logged events: 25
Warning: Number of logged events: 24
Warning: Number of logged events: 24
mice_cart <- impute_data(method = "cart",
                        imputations = imputations,
                        iterations = iterations,
                        seed = 1388466)
Warning: Number of logged events: 25
Warning: Number of logged events: 25
Warning: Number of logged events: 24
Warning: Number of logged events: 25
Warning: Number of logged events: 23
mice_rf <- impute_data(method = "rf",
                        imputations = imputations,
                        iterations = iterations,
                        seed = 3151358)
Warning: Number of logged events: 25
Warning: Number of logged events: 25
Warning: Number of logged events: 23
Warning: Number of logged events: 25
Warning: Number of logged events: 25

The convergence of the imputation methods are shown in figues ?@fig-mice-convergence-pmm, ?@fig-mice-convergence-cart, and ?@fig-mice-convergence-rf.

In [13]:
cowplot::plot_grid(mice_pmm$histogram$albumin,
                   mice_cart$histogram$albumin,
                   mice_rf$histogram$albumin,
                   nrow = 1,
                   ncol = 3)
In [14]:
cowplot::plot_grid(mice_pmm$histogram$monocyte,
                   mice_cart$histogram$monocyte,
                   mice_rf$histogram$monocyte,
                   nrow = 1,
                   ncol = 3)
In [15]:
cowplot::plot_grid(mice_pmm$histogram$lymphocytes,
                   mice_cart$histogram$lymphocytes,
                   mice_rf$histogram$lymphocytes,
                   nrow = 1,
                   ncol = 3)
In [16]:
cowplot::plot_grid(mice_pmm$histogram$tsh_value,
                   mice_cart$histogram$tsh_value,
                   mice_rf$histogram$tsh_value,
                   nrow = 1,
                   ncol = 3)
In [17]:
cowplot::plot_grid(mice_pmm$histogram$size_nodule_mm,
                   mice_cart$histogram$size_nodule_mm,
                   mice_rf$histogram$size_nodule_mm,
                   nrow = 1,
                   ncol = 3)

TODO - Extract the legends from individual plots and add them to the end of each row, see the cowplot shared legends article for pointers on how to do this. Should ideally also get the fill colours to align with those used by ggmice.

In [18]:
cowplot::plot_grid(mice_pmm$scatter$incidental_nodule + theme(legend.position = "None"),
                   mice_cart$scatter$incidental_nodule + theme(legend.position = "None"),
                   mice_rf$scatter$incidental_nodule + theme(legend.position = "None"),
                   mice_pmm$bar_chart$incidental_nodule + theme(legend.position = "None"),
                   mice_cart$bar_chart$incidental_nodule + theme(legend.position = "None"),
                   mice_rf$bar_chart$incidental_nodule + theme(legend.position = "None"),
                   nrow = 2,
                   ncol = 3)
In [19]:
cowplot::plot_grid(mice_pmm$scatter$palpable_nodule,
                   mice_cart$scatter$palpable_nodule,
                   mice_rf$scatter$palpable_nodule,
                   mice_pmm$bar_chart$palpable_nodule,
                   mice_cart$bar_chart$palpable_nodule,
                   mice_rf$bar_chart$palpable_nodule,
                   nrow = 2,
                   ncol = 3)
In [20]:
cowplot::plot_grid(mice_pmm$scatter$rapid_enlargement,
                   mice_cart$scatter$rapid_enlargement,
                   mice_rf$scatter$rapid_enlargement,
                   mice_pmm$bar_chart$rapid_enlargement,
                   mice_cart$bar_chart$rapid_enlargement,
                   mice_rf$bar_chart$rapid_enlargement,
                   nrow = 2,
                   ncol = 3)
In [21]:
cowplot::plot_grid(mice_pmm$scatter$compressive_symptoms,
                   mice_cart$scatter$compressive_symptoms,
                   mice_rf$scatter$compressive_symptoms,
                   mice_pmm$bar_chart$compressive_symptoms,
                   mice_cart$bar_chart$compressive_symptoms,
                   mice_rf$bar_chart$compressive_symptoms,
                   nrow = 2,
                   ncol = 3)
In [22]:
cowplot::plot_grid(mice_pmm$scatter$hypertension,
                   mice_cart$scatter$hypertension,
                   mice_rf$scatter$hypertension,
                   mice_pmm$bar_chart$hypertension,
                   mice_cart$bar_chart$hypertension,
                   mice_rf$bar_chart$hypertension,
                   nrow = 2,
                   ncol = 3)
In [23]:
cowplot::plot_grid(mice_pmm$scatter$vocal_cord_paresis,
                   mice_cart$scatter$vocal_cord_paresis,
                   mice_rf$scatter$vocal_cord_paresis,
                   mice_pmm$bar_chart$vocal_cord_paresis,
                   mice_cart$bar_chart$vocal_cord_paresis,
                   mice_rf$bar_chart$vocal_cord_paresis,
                   nrow = 2,
                   ncol = 3)
In [24]:
cowplot::plot_grid(mice_pmm$scatter$graves_disease,
                   mice_cart$scatter$graves_disease,
                   mice_rf$scatter$graves_disease,
                   mice_pmm$bar_chart$graves_disease,
                   mice_cart$bar_chart$graves_disease,
                   mice_rf$bar_chart$graves_disease,
                   nrow = 2,
                   ncol = 3)
In [25]:
cowplot::plot_grid(mice_pmm$scatter$hashimotos_thyroiditis,
                   mice_cart$scatter$hashimotos_thyroiditis,
                   mice_rf$scatter$hashimotos_thyroiditis,
                   mice_pmm$bar_chart$hashimotos_thyroiditis,
                   mice_cart$bar_chart$hashimotos_thyroiditis,
                   mice_rf$bar_chart$hashimotos_thyroiditis,
                   nrow = 2,
                   ncol = 3)
In [26]:
cowplot::plot_grid(mice_pmm$scatter$family_history,
                   mice_cart$scatter$family_history,
                   mice_rf$scatter$family_history,
                   mice_pmm$bar_chart$family_history,
                   mice_cart$bar_chart$family_history,
                   mice_rf$bar_chart$family_history,
                   nrow = 2,
                   ncol = 3)
In [27]:
cowplot::plot_grid(mice_pmm$scatter$exposure_radiation,
                   mice_cart$scatter$exposure_radiation,
                   mice_rf$scatter$exposure_radiation,
                   mice_pmm$bar_chart$exposure_radiation,
                   mice_cart$bar_chart$exposure_radiation,
                   mice_rf$bar_chart$exposure_radiation,
                   nrow = 2,
                   ncol = 3)
In [28]:
cowplot::plot_grid(mice_pmm$scatter$solitary_nodule,
                   mice_cart$scatter$solitary_nodule,
                   mice_rf$scatter$solitary_nodule,
                   mice_pmm$bar_chart$solitary_nodule,
                   mice_cart$bar_chart$solitary_nodule,
                   mice_rf$bar_chart$solitary_nodule,
                   nrow = 2,
                   ncol = 3)
In [29]:
cowplot::plot_grid(mice_pmm$scatter$bta_u_classification,
                   mice_cart$scatter$bta_u_classification,
                   mice_rf$scatter$bta_u_classification,
                   mice_pmm$bar_chart$bta_u_classification,
                   mice_cart$bar_chart$bta_u_classification,
                   mice_rf$bar_chart$bta_u_classification,
                   nrow = 2,
                   ncol = 3)
In [30]:
cowplot::plot_grid(mice_pmm$scatter$cervical_lymphadenopathy,
                   mice_cart$scatter$cervical_lymphadenopathy,
                   mice_rf$scatter$cervical_lymphadenopathy,
                   mice_pmm$bar_chart$cervical_lymphadenopathy,
                   mice_cart$bar_chart$cervical_lymphadenopathy,
                   mice_rf$bar_chart$cervical_lymphadenopathy,
                   nrow = 2,
                   ncol = 3)
In [31]:
mice_pmm$imputed |> gtsummary::tbl_summary(by=".imp")

Characteristic

Original, N = 1,150

1

1, N = 1,150

1

2, N = 1,150

1

3, N = 1,150

1

4, N = 1,150

1

5, N = 1,150

1
.id 576 (288, 863) 576 (288, 863) 576 (288, 863) 576 (288, 863) 576 (288, 863) 576 (288, 863)
Age 55 (41, 68) 55 (41, 68) 55 (41, 68) 55 (41, 68) 55 (41, 68) 55 (41, 68)
gender





    F 903 (79%) 903 (79%) 903 (79%) 903 (79%) 903 (79%) 903 (79%)
    M 247 (21%) 247 (21%) 247 (21%) 247 (21%) 247 (21%) 247 (21%)
ethnicity





    A 812 (74%) 843 (73%) 841 (73%) 839 (73%) 845 (73%) 842 (73%)
    B 3 (0.3%) 4 (0.3%) 4 (0.3%) 4 (0.3%) 3 (0.3%) 3 (0.3%)
    C 44 (4.0%) 45 (3.9%) 47 (4.1%) 44 (3.8%) 48 (4.2%) 45 (3.9%)
    D 2 (0.2%) 2 (0.2%) 2 (0.2%) 2 (0.2%) 2 (0.2%) 2 (0.2%)
    F 4 (0.4%) 4 (0.3%) 4 (0.3%) 4 (0.3%) 4 (0.3%) 4 (0.3%)
    G 7 (0.6%) 7 (0.6%) 7 (0.6%) 7 (0.6%) 7 (0.6%) 7 (0.6%)
    H 8 (0.7%) 8 (0.7%) 8 (0.7%) 9 (0.8%) 8 (0.7%) 8 (0.7%)
    J 46 (4.2%) 50 (4.3%) 49 (4.3%) 51 (4.4%) 50 (4.3%) 49 (4.3%)
    K 11 (1.0%) 11 (1.0%) 13 (1.1%) 12 (1.0%) 11 (1.0%) 12 (1.0%)
    L 16 (1.5%) 19 (1.7%) 16 (1.4%) 18 (1.6%) 16 (1.4%) 17 (1.5%)
    M 16 (1.5%) 16 (1.4%) 16 (1.4%) 17 (1.5%) 16 (1.4%) 18 (1.6%)
    N 18 (1.6%) 19 (1.7%) 18 (1.6%) 18 (1.6%) 19 (1.7%) 20 (1.7%)
    P 14 (1.3%) 14 (1.2%) 15 (1.3%) 16 (1.4%) 15 (1.3%) 14 (1.2%)
    R 6 (0.5%) 7 (0.6%) 6 (0.5%) 8 (0.7%) 7 (0.6%) 7 (0.6%)
    S 38 (3.4%) 40 (3.5%) 43 (3.7%) 42 (3.7%) 40 (3.5%) 40 (3.5%)
    Z 57 (5.2%) 61 (5.3%) 61 (5.3%) 59 (5.1%) 59 (5.1%) 62 (5.4%)
    Unknown 48 0 0 0 0 0
incidental_nodule 620 (54%) 623 (54%) 622 (54%) 623 (54%) 624 (54%) 623 (54%)
    Unknown 5 0 0 0 0 0
palpable_nodule 441 (40%) 471 (41%) 470 (41%) 470 (41%) 471 (41%) 470 (41%)
    Unknown 58 0 0 0 0 0
rapid_enlargement 19 (1.7%) 19 (1.7%) 21 (1.8%) 22 (1.9%) 20 (1.7%) 19 (1.7%)
    Unknown 43 0 0 0 0 0
compressive_symptoms 88 (8.4%) 107 (9.3%) 103 (9.0%) 102 (8.9%) 98 (8.5%) 101 (8.8%)
    Unknown 106 0 0 0 0 0
hypertension 262 (26%) 287 (25%) 301 (26%) 292 (25%) 293 (25%) 291 (25%)
    Unknown 126 0 0 0 0 0
vocal_cord_paresis 3 (0.3%) 5 (0.4%) 3 (0.3%) 6 (0.5%) 3 (0.3%) 3 (0.3%)
    Unknown 76 0 0 0 0 0
graves_disease 17 (1.6%) 20 (1.7%) 20 (1.7%) 19 (1.7%) 19 (1.7%) 18 (1.6%)
    Unknown 67 0 0 0 0 0
hashimotos_thyroiditis 7 (0.6%) 7 (0.6%) 7 (0.6%) 8 (0.7%) 9 (0.8%) 7 (0.6%)
    Unknown 73 0 0 0 0 0
family_history_thyroid_cancer 8 (0.9%) 10 (0.9%) 14 (1.2%) 11 (1.0%) 11 (1.0%) 16 (1.4%)
    Unknown 281 0 0 0 0 0
exposure_radiation 9 (0.9%) 9 (0.8%) 9 (0.8%) 9 (0.8%) 10 (0.9%) 10 (0.9%)
    Unknown 121 0 0 0 0 0
Albumin 45.0 (43.0, 47.0) 45.0 (43.0, 47.0) 45.0 (43.0, 47.0) 45.0 (43.0, 47.0) 45.0 (43.0, 47.0) 45.0 (43.0, 47.0)
    Unknown 515 0 0 0 0 0
TSH value 1.48 (0.85, 2.30) 1.50 (0.89, 2.30) 1.50 (0.86, 2.40) 1.50 (0.87, 2.50) 1.50 (0.89, 2.50) 1.40 (0.85, 2.30)
    Unknown 413 0 0 0 0 0
Lymphocytes 1.94 (1.51, 2.43) 1.96 (1.54, 2.46) 1.95 (1.53, 2.45) 1.94 (1.50, 2.42) 1.95 (1.53, 2.42) 1.93 (1.49, 2.42)
    Unknown 359 0 0 0 0 0
Monocytes 0.52 (0.42, 0.66) 0.53 (0.43, 0.66) 0.52 (0.42, 0.66) 0.53 (0.42, 0.66) 0.52 (0.43, 0.66) 0.52 (0.42, 0.66)
    Unknown 363 0 0 0 0 0
bta_u_classification





    U1 1 (<0.1%) 1 (<0.1%) 2 (0.2%) 1 (<0.1%) 1 (<0.1%) 1 (<0.1%)
    U2 860 (78%) 902 (78%) 895 (78%) 898 (78%) 893 (78%) 893 (78%)
    U3 210 (19%) 214 (19%) 222 (19%) 220 (19%) 223 (19%) 224 (19%)
    U4 22 (2.0%) 25 (2.2%) 23 (2.0%) 24 (2.1%) 26 (2.3%) 22 (1.9%)
    U5 7 (0.6%) 8 (0.7%) 8 (0.7%) 7 (0.6%) 7 (0.6%) 10 (0.9%)
    Unknown 50 0 0 0 0 0
solitary_nodule 320 (28%) 323 (28%) 322 (28%) 323 (28%) 323 (28%) 323 (28%)
    Unknown 8 0 0 0 0 0
Nodule size (mm) 14 (7, 28) 13 (6, 27) 13 (6, 27) 13 (6, 27) 12 (6, 26) 13 (6, 27)
    Unknown 319 0 0 0 0 0
cervical_lymphadenopathy 26 (2.3%) 27 (2.3%) 27 (2.3%) 26 (2.3%) 27 (2.3%) 26 (2.3%)
    Unknown 9 0 0 0 0 0
thy_classification





    Thy1 34 (14%) 362 (31%) 174 (15%) 242 (21%) 179 (16%) 220 (19%)
    Thy1c 8 (3.4%) 19 (1.7%) 35 (3.0%) 39 (3.4%) 44 (3.8%) 28 (2.4%)
    Thy2 63 (27%) 403 (35%) 412 (36%) 360 (31%) 373 (32%) 360 (31%)
    Thy2c 11 (4.7%) 29 (2.5%) 34 (3.0%) 44 (3.8%) 35 (3.0%) 38 (3.3%)
    Thy3a 18 (7.7%) 36 (3.1%) 44 (3.8%) 60 (5.2%) 61 (5.3%) 69 (6.0%)
    Thy3f 74 (31%) 193 (17%) 302 (26%) 296 (26%) 335 (29%) 336 (29%)
    Thy4 10 (4.3%) 17 (1.5%) 42 (3.7%) 24 (2.1%) 37 (3.2%) 29 (2.5%)
    Thy5 17 (7.2%) 91 (7.9%) 107 (9.3%) 85 (7.4%) 86 (7.5%) 70 (6.1%)
    Unknown 915 0 0 0 0 0
final_pathology





    Benign 1,050 (91%) 1,050 (91%) 1,050 (91%) 1,050 (91%) 1,050 (91%) 1,050 (91%)
    Cancer 100 (8.7%) 100 (8.7%) 100 (8.7%) 100 (8.7%) 100 (8.7%) 100 (8.7%)
1

Median (IQR); n (%)

In [32]:
mice_cart$imputed |> gtsummary::tbl_summary(by=".imp")

Characteristic

Original, N = 1,150

1

1, N = 1,150

1

2, N = 1,150

1

3, N = 1,150

1

4, N = 1,150

1

5, N = 1,150

1
.id 576 (288, 863) 576 (288, 863) 576 (288, 863) 576 (288, 863) 576 (288, 863) 576 (288, 863)
Age 55 (41, 68) 55 (41, 68) 55 (41, 68) 55 (41, 68) 55 (41, 68) 55 (41, 68)
gender





    F 903 (79%) 903 (79%) 903 (79%) 903 (79%) 903 (79%) 903 (79%)
    M 247 (21%) 247 (21%) 247 (21%) 247 (21%) 247 (21%) 247 (21%)
ethnicity





    A 812 (74%) 843 (73%) 842 (73%) 843 (73%) 845 (73%) 845 (73%)
    B 3 (0.3%) 3 (0.3%) 3 (0.3%) 4 (0.3%) 3 (0.3%) 3 (0.3%)
    C 44 (4.0%) 46 (4.0%) 46 (4.0%) 47 (4.1%) 47 (4.1%) 47 (4.1%)
    D 2 (0.2%) 2 (0.2%) 2 (0.2%) 2 (0.2%) 2 (0.2%) 2 (0.2%)
    F 4 (0.4%) 4 (0.3%) 4 (0.3%) 5 (0.4%) 4 (0.3%) 4 (0.3%)
    G 7 (0.6%) 7 (0.6%) 7 (0.6%) 7 (0.6%) 8 (0.7%) 7 (0.6%)
    H 8 (0.7%) 8 (0.7%) 9 (0.8%) 8 (0.7%) 8 (0.7%) 8 (0.7%)
    J 46 (4.2%) 48 (4.2%) 48 (4.2%) 52 (4.5%) 49 (4.3%) 46 (4.0%)
    K 11 (1.0%) 11 (1.0%) 11 (1.0%) 11 (1.0%) 12 (1.0%) 11 (1.0%)
    L 16 (1.5%) 16 (1.4%) 17 (1.5%) 16 (1.4%) 16 (1.4%) 17 (1.5%)
    M 16 (1.5%) 17 (1.5%) 17 (1.5%) 17 (1.5%) 18 (1.6%) 19 (1.7%)
    N 18 (1.6%) 19 (1.7%) 21 (1.8%) 19 (1.7%) 19 (1.7%) 18 (1.6%)
    P 14 (1.3%) 16 (1.4%) 14 (1.2%) 15 (1.3%) 14 (1.2%) 15 (1.3%)
    R 6 (0.5%) 8 (0.7%) 7 (0.6%) 6 (0.5%) 6 (0.5%) 6 (0.5%)
    S 38 (3.4%) 41 (3.6%) 39 (3.4%) 38 (3.3%) 41 (3.6%) 42 (3.7%)
    Z 57 (5.2%) 61 (5.3%) 63 (5.5%) 60 (5.2%) 58 (5.0%) 60 (5.2%)
    Unknown 48 0 0 0 0 0
incidental_nodule 620 (54%) 623 (54%) 624 (54%) 624 (54%) 623 (54%) 624 (54%)
    Unknown 5 0 0 0 0 0
palpable_nodule 441 (40%) 470 (41%) 472 (41%) 468 (41%) 472 (41%) 468 (41%)
    Unknown 58 0 0 0 0 0
rapid_enlargement 19 (1.7%) 19 (1.7%) 20 (1.7%) 19 (1.7%) 19 (1.7%) 21 (1.8%)
    Unknown 43 0 0 0 0 0
compressive_symptoms 88 (8.4%) 102 (8.9%) 105 (9.1%) 104 (9.0%) 101 (8.8%) 99 (8.6%)
    Unknown 106 0 0 0 0 0
hypertension 262 (26%) 291 (25%) 293 (25%) 292 (25%) 291 (25%) 292 (25%)
    Unknown 126 0 0 0 0 0
vocal_cord_paresis 3 (0.3%) 4 (0.3%) 3 (0.3%) 3 (0.3%) 3 (0.3%) 3 (0.3%)
    Unknown 76 0 0 0 0 0
graves_disease 17 (1.6%) 17 (1.5%) 19 (1.7%) 18 (1.6%) 17 (1.5%) 17 (1.5%)
    Unknown 67 0 0 0 0 0
hashimotos_thyroiditis 7 (0.6%) 9 (0.8%) 7 (0.6%) 9 (0.8%) 8 (0.7%) 8 (0.7%)
    Unknown 73 0 0 0 0 0
family_history_thyroid_cancer 8 (0.9%) 12 (1.0%) 9 (0.8%) 8 (0.7%) 11 (1.0%) 16 (1.4%)
    Unknown 281 0 0 0 0 0
exposure_radiation 9 (0.9%) 10 (0.9%) 9 (0.8%) 11 (1.0%) 11 (1.0%) 10 (0.9%)
    Unknown 121 0 0 0 0 0
Albumin 45.0 (43.0, 47.0) 45.0 (43.0, 47.0) 45.0 (43.0, 47.0) 45.0 (43.0, 47.0) 45.0 (43.0, 47.0) 45.0 (43.0, 47.0)
    Unknown 515 0 0 0 0 0
TSH value 1.48 (0.85, 2.30) 1.50 (0.88, 2.39) 1.50 (0.88, 2.50) 1.50 (0.91, 2.50) 1.50 (0.90, 2.40) 1.42 (0.85, 2.30)
    Unknown 413 0 0 0 0 0
Lymphocytes 1.94 (1.51, 2.43) 1.94 (1.54, 2.43) 1.94 (1.51, 2.43) 1.95 (1.53, 2.44) 1.91 (1.50, 2.41) 1.94 (1.50, 2.43)
    Unknown 359 0 0 0 0 0
Monocytes 0.52 (0.42, 0.66) 0.53 (0.42, 0.66) 0.53 (0.43, 0.66) 0.53 (0.43, 0.66) 0.52 (0.42, 0.66) 0.52 (0.42, 0.65)
    Unknown 363 0 0 0 0 0
bta_u_classification





    U1 1 (<0.1%) 1 (<0.1%) 1 (<0.1%) 1 (<0.1%) 1 (<0.1%) 2 (0.2%)
    U2 860 (78%) 894 (78%) 899 (78%) 903 (79%) 897 (78%) 897 (78%)
    U3 210 (19%) 226 (20%) 219 (19%) 215 (19%) 220 (19%) 220 (19%)
    U4 22 (2.0%) 22 (1.9%) 24 (2.1%) 23 (2.0%) 24 (2.1%) 24 (2.1%)
    U5 7 (0.6%) 7 (0.6%) 7 (0.6%) 8 (0.7%) 8 (0.7%) 7 (0.6%)
    Unknown 50 0 0 0 0 0
solitary_nodule 320 (28%) 322 (28%) 322 (28%) 322 (28%) 322 (28%) 323 (28%)
    Unknown 8 0 0 0 0 0
Nodule size (mm) 14 (7, 28) 13 (6, 26) 12 (6, 27) 13 (6, 26) 12 (6, 27) 13 (6, 26)
    Unknown 319 0 0 0 0 0
cervical_lymphadenopathy 26 (2.3%) 26 (2.3%) 26 (2.3%) 27 (2.3%) 26 (2.3%) 26 (2.3%)
    Unknown 9 0 0 0 0 0
thy_classification





    Thy1 34 (14%) 203 (18%) 192 (17%) 187 (16%) 221 (19%) 188 (16%)
    Thy1c 8 (3.4%) 46 (4.0%) 35 (3.0%) 36 (3.1%) 59 (5.1%) 63 (5.5%)
    Thy2 63 (27%) 278 (24%) 272 (24%) 281 (24%) 258 (22%) 237 (21%)
    Thy2c 11 (4.7%) 45 (3.9%) 70 (6.1%) 58 (5.0%) 70 (6.1%) 76 (6.6%)
    Thy3a 18 (7.7%) 60 (5.2%) 75 (6.5%) 74 (6.4%) 59 (5.1%) 125 (11%)
    Thy3f 74 (31%) 347 (30%) 321 (28%) 317 (28%) 325 (28%) 264 (23%)
    Thy4 10 (4.3%) 54 (4.7%) 69 (6.0%) 79 (6.9%) 68 (5.9%) 74 (6.4%)
    Thy5 17 (7.2%) 117 (10%) 116 (10%) 118 (10%) 90 (7.8%) 123 (11%)
    Unknown 915 0 0 0 0 0
final_pathology





    Benign 1,050 (91%) 1,050 (91%) 1,050 (91%) 1,050 (91%) 1,050 (91%) 1,050 (91%)
    Cancer 100 (8.7%) 100 (8.7%) 100 (8.7%) 100 (8.7%) 100 (8.7%) 100 (8.7%)
1

Median (IQR); n (%)

In [33]:
mice_rf$imputed |> gtsummary::tbl_summary(by = ".imp")

Characteristic

Original, N = 1,150

1

1, N = 1,150

1

2, N = 1,150

1

3, N = 1,150

1

4, N = 1,150

1

5, N = 1,150

1
.id 576 (288, 863) 576 (288, 863) 576 (288, 863) 576 (288, 863) 576 (288, 863) 576 (288, 863)
Age 55 (41, 68) 55 (41, 68) 55 (41, 68) 55 (41, 68) 55 (41, 68) 55 (41, 68)
gender





    F 903 (79%) 903 (79%) 903 (79%) 903 (79%) 903 (79%) 903 (79%)
    M 247 (21%) 247 (21%) 247 (21%) 247 (21%) 247 (21%) 247 (21%)
ethnicity





    A 812 (74%) 851 (74%) 845 (73%) 850 (74%) 849 (74%) 853 (74%)
    B 3 (0.3%) 3 (0.3%) 3 (0.3%) 4 (0.3%) 3 (0.3%) 3 (0.3%)
    C 44 (4.0%) 47 (4.1%) 46 (4.0%) 46 (4.0%) 45 (3.9%) 45 (3.9%)
    D 2 (0.2%) 2 (0.2%) 2 (0.2%) 2 (0.2%) 2 (0.2%) 2 (0.2%)
    F 4 (0.4%) 4 (0.3%) 4 (0.3%) 4 (0.3%) 4 (0.3%) 4 (0.3%)
    G 7 (0.6%) 7 (0.6%) 7 (0.6%) 7 (0.6%) 7 (0.6%) 8 (0.7%)
    H 8 (0.7%) 8 (0.7%) 8 (0.7%) 8 (0.7%) 8 (0.7%) 8 (0.7%)
    J 46 (4.2%) 47 (4.1%) 49 (4.3%) 47 (4.1%) 49 (4.3%) 48 (4.2%)
    K 11 (1.0%) 11 (1.0%) 15 (1.3%) 11 (1.0%) 11 (1.0%) 11 (1.0%)
    L 16 (1.5%) 16 (1.4%) 16 (1.4%) 16 (1.4%) 17 (1.5%) 16 (1.4%)
    M 16 (1.5%) 17 (1.5%) 17 (1.5%) 18 (1.6%) 18 (1.6%) 16 (1.4%)
    N 18 (1.6%) 19 (1.7%) 19 (1.7%) 19 (1.7%) 19 (1.7%) 20 (1.7%)
    P 14 (1.3%) 14 (1.2%) 15 (1.3%) 14 (1.2%) 14 (1.2%) 14 (1.2%)
    R 6 (0.5%) 6 (0.5%) 6 (0.5%) 6 (0.5%) 6 (0.5%) 6 (0.5%)
    S 38 (3.4%) 40 (3.5%) 39 (3.4%) 39 (3.4%) 40 (3.5%) 39 (3.4%)
    Z 57 (5.2%) 58 (5.0%) 59 (5.1%) 59 (5.1%) 58 (5.0%) 57 (5.0%)
    Unknown 48 0 0 0 0 0
incidental_nodule 620 (54%) 623 (54%) 623 (54%) 624 (54%) 624 (54%) 622 (54%)
    Unknown 5 0 0 0 0 0
palpable_nodule 441 (40%) 462 (40%) 472 (41%) 465 (40%) 469 (41%) 467 (41%)
    Unknown 58 0 0 0 0 0
rapid_enlargement 19 (1.7%) 19 (1.7%) 19 (1.7%) 19 (1.7%) 20 (1.7%) 20 (1.7%)
    Unknown 43 0 0 0 0 0
compressive_symptoms 88 (8.4%) 94 (8.2%) 100 (8.7%) 94 (8.2%) 89 (7.7%) 95 (8.3%)
    Unknown 106 0 0 0 0 0
hypertension 262 (26%) 282 (25%) 284 (25%) 285 (25%) 281 (24%) 286 (25%)
    Unknown 126 0 0 0 0 0
vocal_cord_paresis 3 (0.3%) 3 (0.3%) 3 (0.3%) 3 (0.3%) 3 (0.3%) 3 (0.3%)
    Unknown 76 0 0 0 0 0
graves_disease 17 (1.6%) 17 (1.5%) 17 (1.5%) 17 (1.5%) 17 (1.5%) 17 (1.5%)
    Unknown 67 0 0 0 0 0
hashimotos_thyroiditis 7 (0.6%) 7 (0.6%) 7 (0.6%) 7 (0.6%) 7 (0.6%) 7 (0.6%)
    Unknown 73 0 0 0 0 0
family_history_thyroid_cancer 8 (0.9%) 9 (0.8%) 8 (0.7%) 11 (1.0%) 9 (0.8%) 9 (0.8%)
    Unknown 281 0 0 0 0 0
exposure_radiation 9 (0.9%) 9 (0.8%) 9 (0.8%) 10 (0.9%) 9 (0.8%) 9 (0.8%)
    Unknown 121 0 0 0 0 0
Albumin 45.0 (43.0, 47.0) 45.0 (43.0, 47.0) 45.0 (43.0, 47.0) 45.0 (43.0, 47.0) 45.0 (43.0, 47.0) 45.0 (43.0, 47.0)
    Unknown 515 0 0 0 0 0
TSH value 1.48 (0.85, 2.30) 1.50 (0.87, 2.40) 1.50 (0.90, 2.30) 1.50 (0.88, 2.30) 1.50 (0.87, 2.30) 1.50 (0.85, 2.40)
    Unknown 413 0 0 0 0 0
Lymphocytes 1.94 (1.51, 2.43) 1.94 (1.54, 2.43) 1.94 (1.51, 2.44) 1.95 (1.54, 2.44) 1.95 (1.53, 2.42) 1.94 (1.51, 2.44)
    Unknown 359 0 0 0 0 0
Monocytes 0.52 (0.42, 0.66) 0.53 (0.42, 0.66) 0.52 (0.42, 0.66) 0.53 (0.43, 0.66) 0.52 (0.42, 0.66) 0.52 (0.42, 0.66)
    Unknown 363 0 0 0 0 0
bta_u_classification





    U1 1 (<0.1%) 1 (<0.1%) 1 (<0.1%) 1 (<0.1%) 1 (<0.1%) 1 (<0.1%)
    U2 860 (78%) 902 (78%) 904 (79%) 904 (79%) 904 (79%) 899 (78%)
    U3 210 (19%) 216 (19%) 216 (19%) 216 (19%) 215 (19%) 219 (19%)
    U4 22 (2.0%) 23 (2.0%) 22 (1.9%) 22 (1.9%) 22 (1.9%) 24 (2.1%)
    U5 7 (0.6%) 8 (0.7%) 7 (0.6%) 7 (0.6%) 8 (0.7%) 7 (0.6%)
    Unknown 50 0 0 0 0 0
solitary_nodule 320 (28%) 321 (28%) 321 (28%) 320 (28%) 323 (28%) 321 (28%)
    Unknown 8 0 0 0 0 0
Nodule size (mm) 14 (7, 28) 13 (6, 27) 12 (6, 26) 12 (6, 26) 13 (6, 26) 13 (6, 26)
    Unknown 319 0 0 0 0 0
cervical_lymphadenopathy 26 (2.3%) 26 (2.3%) 26 (2.3%) 26 (2.3%) 26 (2.3%) 26 (2.3%)
    Unknown 9 0 0 0 0 0
thy_classification





    Thy1 34 (14%) 206 (18%) 173 (15%) 136 (12%) 175 (15%) 147 (13%)
    Thy1c 8 (3.4%) 46 (4.0%) 38 (3.3%) 69 (6.0%) 36 (3.1%) 32 (2.8%)
    Thy2 63 (27%) 299 (26%) 283 (25%) 323 (28%) 333 (29%) 341 (30%)
    Thy2c 11 (4.7%) 57 (5.0%) 78 (6.8%) 71 (6.2%) 64 (5.6%) 40 (3.5%)
    Thy3a 18 (7.7%) 105 (9.1%) 115 (10%) 117 (10%) 85 (7.4%) 97 (8.4%)
    Thy3f 74 (31%) 291 (25%) 299 (26%) 306 (27%) 325 (28%) 359 (31%)
    Thy4 10 (4.3%) 48 (4.2%) 52 (4.5%) 47 (4.1%) 61 (5.3%) 48 (4.2%)
    Thy5 17 (7.2%) 98 (8.5%) 112 (9.7%) 81 (7.0%) 71 (6.2%) 86 (7.5%)
    Unknown 915 0 0 0 0 0
final_pathology





    Benign 1,050 (91%) 1,050 (91%) 1,050 (91%) 1,050 (91%) 1,050 (91%) 1,050 (91%)
    Cancer 100 (8.7%) 100 (8.7%) 100 (8.7%) 100 (8.7%) 100 (8.7%) 100 (8.7%)
1

Median (IQR); n (%)

3.2 Modelling

TODO - And in light of having removed ?@tbl-data-completness in favour of the imputed datesets this too has been removed? (@ns-rse 2024-07-11). TODO - This table feels like duplication of ?@tbl-data-completeness, perhaps have just one? (@ns-rse 2024-07-11).

The predictor variables selected to predict final_pathology are shown in ?@tbl-predictors

Section that sets up the modelling

In [34]:
## Prefer tidymodel commands (although in most places we use the convention <pkg>::<function>())
library(tidyverse)
library(tidymodels)
tidymodels::tidymodels_prefer()
set.seed(5039378)

## Use df_complete rather than df as this subset have data for all the variables of interest.
## split <- rsample::initial_split(df_complete, prop = 0.75)
## @ns-rse (2024-07-18) - Use an imputed dataset instead
split <- mice_rf$imputed |>
    dplyr::filter(.imp == 1) |>
    dplyr::select(-.imp, -.id) |>
    rsample::initial_split(prop = 0.75)
train <- rsample::training(split)
test <- rsample::testing(split)
In [35]:
cv_folds <- rsample::vfold_cv(train, v = 10, repeats = 10)
In [36]:
cv_loo <- rsample::loo_cv(train)
In [37]:
## NB This is the key section where the variables that are to be used in the model are defined. A dependent variable
## (the outcome of interest) is in this case the `final_pathology`, whether individuals have malignant or benign tumors,
## this appears on the left-hand side of the equation (before the tilde `~`). On the right of the equation are the
## predictor or dependant variables
##
## @ns-rse 2024-06-14 :
## Because we have used dplyr::select() to choose _just_ the columns of interest we can use the '.'
## notation to refer to "all other variables" as being predictors. This is useful as it saves duplication of writing
## everything out which leaves scope for some being missed.
thyroid_recipe <- recipes::recipe(final_pathology ~ ., data = train) |>
  ## @ns-rse 2024-06-14 :
  ## This step can be used to filter observations with missing data, see the manual pages for more details
  ## https://recipes.tidymodels.org/reference/step_filter_missing.html
  recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
  ## @ns-rse 2024-06-14 :
  ## We first normalise the data _before_ we generate dummies otherwise the dummies, which are numerical, get normalised
  ## too
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  recipes::step_dummy(recipes::all_nominal_predictors())
In [38]:
thyroid_workflow <- workflows::workflow() |>
  workflows::add_recipe(thyroid_recipe)
saveRDS(thyroid_workflow, file = paste(r_dir, "thyroid_workflow.rds", sep = "/"))

The following section is output from a Tidymodel approach to logistic regression to try and work out why variables are not being included.

In [39]:
## define binary logistic regression model
logistic_model <- logistic_reg() |>
  set_engine("glm")
## add the binary logistic regression model to the thyroid workflow
log_thyroid_workflow <- thyroid_workflow |>
  add_model(logistic_model)
## fit the the workflow to the training data
log_thyroid_fit <- fit(log_thyroid_workflow, data = train)
log_thyroid_fit
## to inspect the fit object
str(log_thyroid_fit)
## use fitted model to make prediction
logistic_model_predictions <- predict(log_thyroid_fit, test) |> bind_cols(test)
## examine the processing steps
log_thyroid_fit |> extract_recipe()
## examine the model
log_thyroid_fit |> extract_fit_parsnip()
## to check if the workflow has been trained
log_thyroid_fit$trained
## I can only see age and gender, unsure why the other variable not present.? excluded due to lack of data ?because they are simply not important ? issue with workflow set up

A total of 1150 patients had complete data for the selected predictor variables (see ?@tbl-predictors). Because of the volume of missing data which if a saturated model were used would include only ~350 people with complete data across all co-variates imputed datasets were analysed instead.

3.2.1 Logistic Regression

In [40]:
train |> colnames()
 [1] "age_at_scan"                   "gender"                       
 [3] "ethnicity"                     "incidental_nodule"            
 [5] "palpable_nodule"               "rapid_enlargement"            
 [7] "compressive_symptoms"          "hypertension"                 
 [9] "vocal_cord_paresis"            "graves_disease"               
[11] "hashimotos_thyroiditis"        "family_history_thyroid_cancer"
[13] "exposure_radiation"            "albumin"                      
[15] "tsh_value"                     "lymphocytes"                  
[17] "monocyte"                      "bta_u_classification"         
[19] "solitary_nodule"               "size_nodule_mm"               
[21] "cervical_lymphadenopathy"      "thy_classification"           
[23] "final_pathology"              
3.2.1.1 Clinical Characteristics
In [41]:
glm_clin <- glm(final_pathology ~ age_at_scan + gender + incidental_nodule + palpable_nodule +
rapid_enlargement + compressive_symptoms + hashimotos_thyroiditis + family_history_thyroid_cancer + exposure_radiation +
tsh_value + size_nodule_mm + solitary_nodule + cervical_lymphadenopathy,
    data = train,
    family = binomial(link = "logit"))

## Use the packages to summarise the output
gtsummary::tbl_regression(glm_clin,
    exponentiate = TRUE,
    show_single_row = c(gender,
                        incidental_nodule,
                        palpable_nodule,
                        rapid_enlargement,
                        compressive_symptoms,
                        hashimotos_thyroiditis,
                        family_history_thyroid_cancer,
                        exposure_radiation,
                        solitary_nodule,
                        cervical_lymphadenopathy))
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Characteristic

OR

1

95% CI

1

p-value

Age 0.98 0.96, 0.99 0.005
gender 1.97 1.11, 3.43 0.019
incidental_nodule 0.99 0.48, 2.02 >0.9
palpable_nodule 2.76 1.29, 6.08 0.010
rapid_enlargement 1.66 0.46, 5.37 0.4
compressive_symptoms 0.98 0.43, 2.09 >0.9
hashimotos_thyroiditis 1.19 0.02, 13.0 >0.9
family_history_thyroid_cancer 3.80 0.48, 21.7 0.15
exposure_radiation 0.00
>0.9
TSH value 1.00 0.94, 1.05 >0.9
Nodule size (mm) 1.03 1.01, 1.05 <0.001
solitary_nodule 1.48 0.85, 2.51 0.2
cervical_lymphadenopathy 6.74 2.37, 18.8 <0.001
1

OR = Odds Ratio, CI = Confidence Interval

3.2.1.2 Biomarkers
In [42]:
glm_biochem <- glm(final_pathology ~ age_at_scan + gender + tsh_value + albumin + lymphocytes + monocyte,
    data = train,
    family = binomial(link = "logit"))
gtsummary::tbl_regression(glm_biochem,
    exponentiate = TRUE,
    show_single_row = c(gender))

Characteristic

OR

1

95% CI

1

p-value

Age 0.98 0.97, 1.00 0.010
gender 1.95 1.15, 3.26 0.012
TSH value 1.01 0.96, 1.06 0.5
Albumin 1.03 0.96, 1.12 0.4
Lymphocytes 1.04 0.74, 1.43 0.8
Monocytes 0.18 0.04, 0.72 0.019
1

OR = Odds Ratio, CI = Confidence Interval

3.2.1.3 Ultrasound 1
In [43]:
glm_ultrasound <- glm(final_pathology ~ age_at_scan + gender + bta_u_classification + thy_classification,
    data = train,
    family = binomial(link = "logit"))
gtsummary::tbl_regression(glm_ultrasound,
    exponentiate = TRUE,
    show_single_row = c(gender))
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
collapsing to unique 'x' values
Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
collapsing to unique 'x' values

Characteristic

OR

1

95% CI

1

p-value

Age 0.99 0.97, 1.01 0.2
gender 1.23 0.63, 2.32 0.5
bta_u_classification


    U1
    U2 461,584 0.00, NA >0.9
    U3 7,642,048 0.00, NA >0.9
    U4 36,642,711 0.00, NA >0.9
    U5 126,534,639 0.00, NA >0.9
thy_classification


    Thy1
    Thy1c 1.63 0.19, 9.45 0.6
    Thy2 0.36 0.09, 1.25 0.11
    Thy2c 0.00 0.00, 0.00 >0.9
    Thy3a 2.24 0.66, 7.80 0.2
    Thy3f 3.01 1.21, 8.47 0.025
    Thy4 7.06 1.93, 26.6 0.003
    Thy5 9.52 3.19, 31.1 <0.001
1

OR = Odds Ratio, CI = Confidence Interval

3.2.1.4 Ultrasound 2
In [44]:
glm_ultrasound <- glm(final_pathology ~ age_at_scan + gender + incidental_nodule + tsh_value +
size_nodule_mm + solitary_nodule + cervical_lymphadenopathy + bta_u_classification + thy_classification,
    data = train,
    family = binomial(link = "logit"))
gtsummary::tbl_regression(glm_ultrasound,
    exponentiate = TRUE,
    show_single_row = c(
        gender,
        incidental_nodule,
        solitary_nodule,
        cervical_lymphadenopathy
    )
)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
collapsing to unique 'x' values

Characteristic

OR

1

95% CI

1

p-value

Age 0.99 0.97, 1.01 0.2
gender 1.19 0.58, 2.34 0.6
incidental_nodule 0.94 0.48, 1.84 0.9
TSH value 0.95 0.89, 1.00 0.091
Nodule size (mm) 1.04 1.02, 1.06 <0.001
solitary_nodule 1.49 0.77, 2.82 0.2
cervical_lymphadenopathy 6.61 1.45, 29.3 0.013
bta_u_classification


    U1
    U2 239,169 0.00, NA >0.9
    U3 3,105,400 0.00, NA >0.9
    U4 14,770,696 0.00, NA >0.9
    U5 39,781,959 0.00, NA >0.9
thy_classification


    Thy1
    Thy1c 0.97 0.06, 7.66 >0.9
    Thy2 0.25 0.06, 0.97 0.049
    Thy2c 0.00 0.00, 1,045,453 >0.9
    Thy3a 2.69 0.76, 9.91 0.13
    Thy3f 2.45 0.92, 7.39 0.089
    Thy4 8.34 2.18, 33.3 0.002
    Thy5 11.1 3.47, 39.6 <0.001
1

OR = Odds Ratio, CI = Confidence Interval

3.2.2 LASSO

In [45]:
## Specify the LASSO model using parsnip, the key here is the use of the glmnet engine which is the R package for
## fitting LASSO regression. Technically the package fits Elastic Net but with a mixture value of 1 it is equivalent to
## a plain LASSO (mixture value of 0 is equivalent to Ridge Regression in an Elastic Net)
tune_spec_lasso <- parsnip::logistic_reg(penalty = hardhat::tune(), mixture = 1) |>
  parsnip::set_engine("glmnet")

## Tune the LASSO parameters via cross-validation
lasso_grid <- tune::tune_grid(
  object = workflows::add_model(thyroid_workflow, tune_spec_lasso),
  resamples = cv_folds,
  grid = dials::grid_regular(penalty(), levels = 50)
)
In [46]:
## Plot the tuning search results, see https://tune.tidymodels.org/reference/autoplot.tune_results.html
##
## This shows how the evaluation metrics change over time with each iteration of the Lasso (we know we are running a
## LASSO because when we setup tune_spec_lasso the mixture = 1 when we define parsnip::logistic_reg()
tune::autoplot(lasso_grid)
Figure 24: Autoplot of LASSO grid search
In [47]:
lasso_kfold_roc_auc <- lasso_grid |>
  tune::select_best(metric = "roc_auc")
In [48]:
final_lasso_kfold <- tune::finalize_workflow(
  workflows::add_model(thyroid_workflow, tune_spec_lasso),
  lasso_kfold_roc_auc)
In [49]:
final_lasso_kfold |>
  fit(train) |>
  hardhat::extract_fit_parsnip() |>
  vip::vi(lambda = lasso_kfold_roc_auc$penalty) |>
  dplyr::mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance)
  ) |>
  ggplot(mapping = aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col()
Figure 25: Importance of variables fitted using LASSO

NB - We may wish to inspect the coefficients at each step of tuning. A related example of how to do this can be found in the Tidymodels documentation under the Tuning a glmnet model. This would be desirable as it looks like only two features are selected as being important by this method and so rather than just accepting this I would want to investigate and see how the coefficients changed over iterations. Another useful resource is the glmnet documentation, although note that since we are using the Tidymodels framework the model fit is wrapped up inside (hence the above article on how to extract this information).

In [50]:
save(tune_spec_lasso, lasso_grid, final_lasso_kfold, lasso_kfold_roc_auc,
  file = "data/r/lasso.RData"
)

3.2.3 Elastic Net

In [51]:
## Specify the Elastic Net model using parsnip, the key here is the use of the glmnet engine which is the R package for
## fitting Elastic Net regression. Technically the package fits Elastic Net but with a mixture value of 1 it is equivalent to
## a plain Elastic Net (mixture value of 0 is equivalent to Ridge Regression in an Elastic Net)
tune_spec_elastic <- parsnip::logistic_reg(penalty = hardhat::tune(), mixture = 0.5) |>
  parsnip::set_engine("glmnet")

## Tune the Elastic Net parameters via cross-validation
lasso_grid <- tune::tune_grid(
  object = workflows::add_model(thyroid_workflow, tune_spec_elastic),
  resamples = cv_folds,
  grid = dials::grid_regular(penalty(), levels = 50)
)
In [52]:
## Plot the tuning search results, see https://tune.tidymodels.org/reference/autoplot.tune_results.html
##
## This shows how the evaluation metrics change over time with each iteration of the Lasso (we know we are running a
## Elastic Net because when we setup tune_spec_elastic the mixture = 1 when we define parsnip::logistic_reg()
tune::autoplot(lasso_grid)
Figure 26: Autoplot of Elastic Net grid search
In [53]:
## K-fold best fit for Elastic Net
lasso_kfold_roc_auc <- lasso_grid |>
  tune::select_best(metric = "roc_auc")
In [54]:
## Fit the final Elastic Net model
final_elastic_kfold <- tune::finalize_workflow(
  workflows::add_model(thyroid_workflow, tune_spec_elastic),
  lasso_kfold_roc_auc
)
In [55]:
final_elastic_kfold |>
  fit(train) |>
  hardhat::extract_fit_parsnip() |>
  vip::vi(lambda = lasso_kfold_roc_auc$penalty) |>
  dplyr::mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance)
  ) |>
  ggplot(mapping = aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col() +
  dark_theme_minimal()
Inverted geom defaults of fill and color/colour.
To change them back, use invert_geom_defaults().
Figure 27: Importance of variables fitted using Elastic Net

NB - We may wish to inspect the coefficients at each step of tuning. A related example of how to do this can be found in the Tidymodels documentation under the Tuning a glmnet model. This would be desirable as it looks like only two features are selected as being important by this method and so rather than just accepting this I would want to investigate and see how the coefficients changed over iterations. Another useful resource is the glmnet documentation, although note that since we are using the Tidymodels framework the model fit is wrapped up inside (hence the above article on how to extract this information).

In [56]:
save(tune_spec_elastic, lasso_grid, final_elastic_kfold, lasso_kfold_roc_auc,
  file = "data/r/lasso.RData"
)

3.2.4 Random Forest

In [57]:
## Specify the Random Forest model
rf_tune <- parsnip::rand_forest(
  mtry = tune(),
  trees = 100,
  min_n = tune()
) |>
  set_mode("classification") |>
  set_engine("ranger", importance = "impurity")


## Tune the parameters via Cross-validation
rf_grid <- tune::tune_grid(
  add_model(thyroid_workflow, rf_tune),
  resamples = cv_folds, ## cv_loo,
  grid = grid_regular(mtry(range = c(5, 10)), # smaller ranges will run quicker
    min_n(range = c(2, 25)),
    levels = 3
  )
)

## Get the best fitting model with the highest ROC AUC
rf_highest_roc_auc <- rf_grid |>
  select_best()
Warning in select_best(rf_grid): No value of `metric` was given; "roc_auc" will
be used.
final_rf <- tune::finalize_workflow(
  add_model(thyroid_workflow, rf_tune),
  rf_highest_roc_auc
)
In [58]:
final_rf |>
  fit(train) |>
  hardhat::extract_fit_parsnip() |>
  vip::vi(lambda = final_rf$penalty) |>
  dplyr::mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance)
  ) |>
  ggplot(mapping = aes(x = Importance, y = Variable, fill = Importance)) +
  geom_col() +
  dark_theme_minimal()

In [59]:
save(rf_tune, rf_grid, final_rf, rf_highest_roc_auc,
  file = "data/r/random_forest.RData"
)

3.2.5 Gradient Boosting

In [60]:
## Specify the Gradient boosting model
model_xgboost <- parsnip::boost_tree(
    mode = "classification",
    trees = 100,
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune()
) |>
  set_engine("xgboost", objective = "binary:logistic")

## Specify the models tuning parameters using the `dials` package along (https://dials.tidymodels.org/) with the grid
## space. This helps identify the hyperparameters with the lowest prediction error.
xgboost_params <- dials::parameters(
    min_n(),
    tree_depth(),
    learn_rate(),
    loss_reduction()
)
xgboost_grid <- dials::grid_max_entropy(
    xgboost_params,
    size = 10
)

## Tune the model via cross-validation
xgboost_tuned <- tune::tune_grid(workflows::add_model(thyroid_workflow, spec = model_xgboost),
    resamples = cv_folds,
    grid = xgboost_grid,
    metrics = yardstick::metric_set(roc_auc, accuracy, ppv),
    control = tune::control_grid(verbose = FALSE)
)
In [61]:
## We get the best final fit from the Gradient Boosting model.
xgboost_highest_roc_auc <- xgboost_tuned |>
    tune::select_best()
Warning in tune::select_best(xgboost_tuned): No value of `metric` was given;
"roc_auc" will be used.
final_xgboost <- tune::finalize_workflow(
    add_model(thyroid_workflow, model_xgboost),
    xgboost_highest_roc_auc
)
final_fit <- final_xgboost |> fit (data = train)
summary(final_fit)
        Length Class      Mode   
pre     3      stage_pre  list   
fit     2      stage_fit  list   
post    1      stage_post list   
trained 1      -none-     logical
## it is unclear from the output what predictors were significant, so i need to examin the model to see what predictors were used
preprocessor <- final_fit |> extract_preprocessor()
## view the predictors evaluated
summary(preprocessor)
# A tibble: 23 × 4
   variable             type      role      source  
   <chr>                <list>    <chr>     <chr>   
 1 age_at_scan          <chr [2]> predictor original
 2 gender               <chr [3]> predictor original
 3 ethnicity            <chr [3]> predictor original
 4 incidental_nodule    <chr [3]> predictor original
 5 palpable_nodule      <chr [3]> predictor original
 6 rapid_enlargement    <chr [3]> predictor original
 7 compressive_symptoms <chr [3]> predictor original
 8 hypertension         <chr [3]> predictor original
 9 vocal_cord_paresis   <chr [3]> predictor original
10 graves_disease       <chr [3]> predictor original
# ℹ 13 more rows
## above showed the variables used
## the next line shows the extracts the fitted model from the work flow
model_xgboost <- extract_fit_parsnip(final_fit)$fit
summary(model_xgboost)
               Length Class              Mode       
handle             1  xgb.Booster.handle externalptr
raw            95013  -none-             raw        
niter              1  -none-             numeric    
evaluation_log     2  data.table         list       
call               8  -none-             call       
params            10  -none-             list       
callbacks          1  -none-             list       
feature_names     45  -none-             character  
nfeatures          1  -none-             numeric    
## the next code evaluate the important variables
importance_xgboost <- xgboost::xgb.importance(model = model_xgboost)
print(importance_xgboost)
                   Feature   Gain  Cover Frequency
                    <char>  <num>  <num>     <num>
1: bta_u_classification_U2 0.8201 0.5114    0.3106
2:               tsh_value 0.0900 0.0693    0.1677
3:          size_nodule_mm 0.0332 0.3619    0.3634
4:             age_at_scan 0.0314 0.0344    0.1025
5:                 albumin 0.0253 0.0230    0.0559
## the print shows age and gender as important
## to make a prediction on the train set
xgboost_predictions <- predict(final_fit, test, type = "prob") |>
    bind_cols(predict(final_fit, test)) |>
    bind_cols(test)
colnames(xgboost_predictions)
 [1] ".pred_Benign"                  ".pred_Cancer"                 
 [3] ".pred_class"                   "age_at_scan"                  
 [5] "gender"                        "ethnicity"                    
 [7] "incidental_nodule"             "palpable_nodule"              
 [9] "rapid_enlargement"             "compressive_symptoms"         
[11] "hypertension"                  "vocal_cord_paresis"           
[13] "graves_disease"                "hashimotos_thyroiditis"       
[15] "family_history_thyroid_cancer" "exposure_radiation"           
[17] "albumin"                       "tsh_value"                    
[19] "lymphocytes"                   "monocyte"                     
[21] "bta_u_classification"          "solitary_nodule"              
[23] "size_nodule_mm"                "cervical_lymphadenopathy"     
[25] "thy_classification"            "final_pathology"              
## analyse model performance using roc curves, need to figure out what "truth" and estimate refers to here, as final_pathology and .pred_Cancer do not run
## roc_auc <- xgboost_predictions |>
##    roc_auc(truth = , estimate = )
In [62]:
save(xgboost_tuned, xgboost_grid, final_xgboost, importance_xgboost, model_xgboost, ## xgboost_highest_roc_auc,
  file = "data/r/xgboost.RData"
)
``




#### SVM

<!-- {{< include sections/_svm.qmd >}} -->


#### Explainability

Which factors are important to classification can be assessed not just by the "importance" but by methods know as
[LIME](https://search.r-project.org/CRAN/refmans/lime/html/lime-package.html) (Local Interpretable Model-Agnostic
Explanations)  @ribeiro2016 and [Shapley
values](https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html)
@lundberg2017

#### Comparision

Comparing the sensitivity of the different models goes here.

+ Table of sensitivity/specificity/other metrics.
+ ROC curves

## Conclusion

The take-away message is....these things are hard!


## Appendix

### Data Dictionary
In [63]:
In [64]:
var_labels |>
  as.data.frame() |>
  kable(col.names = c("Description"),
        caption="Description of variables in the Sheffield Thyroid dataset.")
Table 3: Description of variables in the Sheffield Thyroid dataset.
Description
age_at_scan Age
albumin Albumin
bta_u_classification BTA U
cervical_lymphadenopathy Cervical Lymphadenopathy
compressive_symptoms Compressive symptoms
consistency_nodule Nodule consistency
eligibility Eligibility
ethinicity Ethinicity
ethnicity Ethnicity
exposure_radiation Exposure to radiation
family_history_thyroid_cancer Family history of thyroid cancer
final_pathology Final diagnosis
fna_done FNA done
gender Gender
graves_disease Graves’ disease
hashimotos_thyroiditis Hashimoto’s disease
hypertension Hypertension
incidental_nodule Incidental nodule
lymphocytes Lymphocytes
monocyte Monocytes
palpable_nodule Palpable nodule
rapid_enlargement Rapid enlargement
repeat_bta_u_classification Repeat BTA U
repeat_fna_done Repeat FNA
repeat_thy_classification Repeat Thy class
repeat_ultrasound Repeat ultrasound
size_nodule_mm Nodule size (mm)
solitary_nodule Solitary nodule
study_id Study ID
thy_classification Thy classification
thyroid_histology_diagnosis Histology
thyroid_surgery Thyroid surgery
tsh_value TSH value
vocal_cord_paresis Vocal cord paresis
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. mice: Multivariate Imputation by Chained Equations in R.” J. Stat. Soft. 45 (December): 1–67. https://doi.org/10.18637/jss.v045.i03.
Efron, Bradley, and Trevor Hastie. 2016. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge Core. Cambridge, England, UK: Cambridge University Press. https://doi.org/10.1017/CBO9781316576533.
Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Regression Shrinkage and Selection via the Lasso.” 1996. Journal of the Royal Statistical Society. Series B (Methodological). https://www.jstor.org/stable/2346178.
Smith, Gary. 2018. Step away from stepwise.” J. Big Data 5 (1): 1–12. https://doi.org/10.1186/s40537-018-0143-6.
Steyerberg, Ewout W., Marinus J. C. Eijkemans, Frank E. Harrell, and Dik. 2001. “Prognostic Modeling with Logistic Regression Analysis.” Medical Decision Making 21 (1): 45–56. https://doi.org/10.1177/0272989x0102100106.
Thompson, Bruce. 1995. “Stepwise Regression and Stepwise Discriminant Analysis Need Not Apply Here: A Guidelines Editorial.” Educational and Psychological Measurement 55 (4): 525–34. https://doi.org/10.1177/0013164495055004001.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Zou, Hui, and Trevor Hastie. 2005. Regularization and Variable Selection Via the Elastic Net.” J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 (2): 301–20. https://doi.org/10.1111/j.1467-9868.2005.00503.x.