An investigation of the predictors of thyroid cancer in patients with thyroid nodules

Ovie Edafe; Neil Shephard; Karen Sisley; Sabapathy P Balasubramanian

Abstract

An abstract summarising the work undertaken and the overall conclusions can be placed here. Sub-headings are currently removed because they conflict with those in the body of the text and mess up the links in the Table of Contents.

1 Introduction

Thyroid nodules are common. The challenge in the management of thyroid nodules is differentiating between benign and malignant nodule thyroid nodules.The use fine needle aspiration and cytology (FNAC) still leaves around 20% of patients that cannot be clearly classified as either benign or malignant. This scenario traditionally leads to diagnostic hemithyroidectomy for definitive histology. Other clinical variables such as patients’ demographics, clinical and biochemical factors have been shown to be associated with thyroid cancer in patients with thyroid nodules. This has been utilised in studies evaluating predictors of thyroid cancer with a view of creating a model to aid prediction. Standard practice on the management of thyroid nodules does not utilise these non ultrasound and non cytological factors. Combination of these variables considered to be significant with ultrasound and cytological characteristics may improve management of patients with thyroid nodules. Thyroid nodules are increasingly being incidentally detected with increased use of imaging in the evaluation of non thyroid related pathologies. Thus, leading to increase investigation of thyroid nodules and subsequent increased number of thyroid operations in non diagnostic cases. There are morbidities associated with thyroid surgery including scar, recurrent laryngeal nerve injury, hypothyroidism and hypoparathyroidism. We performed a systematic review to evaluate for predictors of thyroid cancer specifically in patients presenting with thyroid nodules. The systematic review a number of potential important variables that may be useful in the prediction of thyroid cancer in patients with thyroid nodules. The aim of this study was to evaluate the predictors of thyroid cancer with a view of improving prediction of thyroid cancer using computer age statistical inference techniques (Efron and Hastie (2016)).

2 Methods

This study was reported as per the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines

2.1 Study design

This was a retrospective cohort study.

2.2 Setting

The study was conducted at the Sheffield Teaching hospitals NHS Foundation Trusts. This is a tertiary referral centre for the management of thyroid cancer

2.3 Participants

We included all consecutive patients who presented with thyroid nodule(s) or that were found to have thyroid nodule(s) on ultrasound done for thyroid pathology or for other non thyroid related pathologies

2.4 Variables

Variable evaluated was based on findings from a systematic review evaluating predictors of thyroid cancer in patients with thyroid nodules. Data on the following variables were collected: patient demographics (age, gender, ethnicity), nodule presentation (incidental nodule, palpable nodule, rapid enlargement, compressive symptoms, vocal paresis), past medical history (hypertension, Graves’ disease, Hashimotos’ thyroiditis, family history of thyroid cancer, exposure to neck radiation), biochemistry (thyroid stimulating hormone, lymphocytes, monocytes), ultrasound characteristics (British Thyroid Association ultrasound (BTA U), nodule size, solitary nodule, nodule consistency, cervical lymphadenopathy), Royal College of Pathology (RCP) FNAC classification, type of thyroid surgery, and histological diagnosis.

2.5 Data source

Data was collected from patients’ case notes and electronic patients’ database using a standardised data collection proforma. This was initially piloted on 30 patients and revised to improve data entry. In addition a number of variables that were not standard collected during workout of patients were not further checked; these include body mass index (BMI), serum thyroglobulin, serum triiodothyronine (T3), thyroxine (T4), thyroglobulin antibody (TgAb), thyroid peroxidase antibody (TP0Ab), and urinary iodine.

2.6 Study size

We sought to have a large data set of at least 100 thyroid nodules with a cancer diagnosis using consecutive sampling technique. We aimed for a total of 1500 patients with thyroid nodules to achieve our target sample size. With the use of modern statistical techniques, we proposed such number will be appropriate to detect important variables if it exists.

2.7 Data analysis

Data was cleaned and analysed using the R Statistical Software R Core Team (2023) and the Tidyverse (Wickham et al. (2019)), Tidymodels (Kuhn and Wickham (2020)) collection of packages.

2.8 Imputation

The dataset is incomplete and there are missing observations across all variables to varying degrees. In order to maximise the sample available for analysis imputation was used to infer missing values. The Multivariat Imputation via Chained Equations (MICE and implemented in the eponymous R package Buuren and Groothuis-Oudshoorn (2011)) was employed which assumes data is missing at random (a difficult assumption to formally test). The approach takes each variable with missing data and attempts to predict it using statistical modelling based on the observed values. In essence it is the same approach as the statistical methods being employed to try and predict Thyroid Cancer and there are a range of statistical techniques available which include

2.9 Modelling

We used a selection of statistic modelling techniques to evaluate association between variables and thyroid cancer in patients with thyroid nodules. The patient population was split into training and testing cohorts in a ratio of 0.75:0.25 and each model is fitted using the training cohort. This split ratio is generally used in traditional machine learning techniques. The training set of the data was used to estimate the relation between variables and thyroid cancer. The larger the training data, the better it is for the model to learn the trends. The test set was used to determine the accuracy of the model in predicting thyroid cancer; the bigger the test data the more confidence we have in the model prognostic values. We used simple randomisation technique for the split to prevent bias in the data split. We ensured that there was no duplicate in the data sets so any test data was not accidentally trained. Furthermore, cross validation was used to estimate the accuracy of the various machine learning models. The k-fold techniques splits the data in ?10 folds, and the data was trained on all but one of the the fold, and the one fold not trained is used to test the data. This was repeated multiple times using a different fold for test and the others for training until all the folds is utilised for training and testing. Following multiple training process with k-fold, we selected the model that has the best predictive value for thyroid cancer in the test cohort. We also used the leave one out (loo) cross-validation to train and test the data set.In this technique, all but one observation is use to train the data set and one observation is use to test the data; this is repeated until all the data test is used for testing and training. The model with the best predictive value was selected.

2.9.1 LASSO / Elastic Net

LASSO (Least Absolute Shrinkage and Selection Operatror) and Elastic Net Zou and Hastie (2005) are regression methods that perform variable selection. The original LASSO method proposed by “Regression Shrinkage and Selection via the Lasso” (1996) allows the coefficients for independent/predictor variables to “shrink” down towards zero, effectively eliminating them from influencing the model, this is often referred to as L₁ regularisation. The Elastic Net Zou and Hastie (2005) improves on the LASSO by balancing L₁ regularisation with ridge-regression or L₂ regularisation which helps avoid over-fitting.

Both methods avoid many of the shortcomings/pitfalls of stepwise variable selection Thompson (1995) Smith (2018) and have been shown to be more accurate in clinical decision making in small datasets with well code, externally selected variables Steyerberg et al. (2001)

2.9.2 Random Forest

To add reference The random forest plot is an extension of the decision tree methodology to reduce variance. Decision trees are very sensitive to the training data set and can lead to high variance; thus potential issues with generalisation of the model. The random forest plot selects random observation of the dataset to create multiple decision trees. Random variables are selected for each tree in the training of the data set. The aggregated output of the generated decision trees is then used to create an estimate.

2.9.3 Gradient Boosting

Gradient boosting is a machine learning algorithm that uses decision tree as a base model. The data is initially trained on this decision tree, but the initial prediction is weak, thus termed a weak based model. In gradient boosting the process is iterative; a sequence of decision trees is added to the initial tree. Each tree learns from the prior tree(s) to improve the model, increasing strength and minimising error.

2.9.4 SVM

Support Vector Machines is an approach that allows observation with a binary classifications to be separated using a hyperplane. It finds a hyperplane that best stratify the two classes i.e benign versus malignant nodules. SVM finds the hyperplane with the maximum margin of separation between the two classes. The support vectors are the data point that are positioned close to the margin of the hyperplane and these used to select the most appropraite hyperplane. The support vectors are the only data points that have an influence on the maximum margin in SVM.

2.9.5 Comparision

3 Results

?@tbl-patient-demographics shows the demographics of patients included in this study. A total of 1364 patients were included in this study with a median (IQR) age of 55 ( 41-69). ?@tbl-clinical-characteristics shows the distribution of clinical variables evaluated between benign and malignant thyroid nodules.

3.1 Data Description

A summary of the variables that are available in this data set can be found in Table 3.

The completeness of the original data is shown in tables ?@tbl-imputation-summary-pmm, ?@tbl-imputation-summary-cart, ?@tbl-imputation-summary-rf, along with summaries from four rounds of imputation for each of three imputation methods. Where variables continuous (e.g. age or size_nodule_mm) basic summary statistics in the form of mean, standard deviation, median and inter-quartile range are given. For categorical variables that are logical TRUE/FALSE (e.g. palpable_nodule) the number of TRUE observations and the percentage (of those with observed data for that variable) are shown along with the number that are Unknown. For categorical variables such as gender percentages in each category are reported. For all variables an indication of the number of missing observations is also given and it is worth noting that there are 214 instances where the final_pathology is not known which reduces the sample size to 1150.

3.1.1 Missing Data

More detailed tabulations of missing data by variable are shown in Table 1 which shows the number and percentage of missing data for each variable and by case in Table 2 which shows how much missing data each case has. A visualisation of this is shown in Figure 1 .

NB - Currently there is a bug in the stable release of Quarto which prevents rendering of the missing data figures. It is fixed in development version v1.6.1 (currently available as pre-release, so if things don’t render try upgrading).

Table 1: Summary of missing data by variable.

Variable	N	%
thy_classification	915	79.6
albumin	515	44.8
tsh_value	413	35.9
monocyte	363	31.6
lymphocytes	359	31.2
size_nodule_mm	319	27.7
family_history_thyroid_cancer	281	24.4
hypertension	126	11.0
exposure_radiation	121	10.5
compressive_symptoms	106	9.22
vocal_cord_paresis	76	6.61
hashimotos_thyroiditis	73	6.35
graves_disease	67	5.83
palpable_nodule	58	5.04
bta_u_classification	50	4.35
ethnicity	48	4.17
rapid_enlargement	43	3.74
cervical_lymphadenopathy	9	0.783
solitary_nodule	8	0.696
incidental_nodule	5	0.435
age_at_scan	0	0
gender	0	0
final_pathology	0	0

Source: Article Notebook

Table 2: Summary of missing data by case, how much missing data is there per person?

Missing Variables	N	%
0	65	5.652
1	227	19.739
2	229	19.913
3	181	15.739
4	139	12.087
5	102	8.870
6	76	6.609
7	35	3.043
8	30	2.609
9	13	1.130
10	19	1.652
11	15	1.304
12	9	0.783
13	4	0.348
14	3	0.261
15	2	0.174
16	1	0.087

The MICE package also provides tools for visualising missing data and these are shown in figures Figure 2, ?@fig-mice-vis-missing-biomarker and Figure 4.

The columns of these plots, labelled along the top, show the variable, if a cell is blue it indicates data is present, if it is red it indicates there is missing data. The left-hand side shows the total number of observations for that rows particular combination of variables with number of missing variables indicated on the right. The first row shows that for these variables there are 604 observations with zero missing data across the listed variables, the second row indicates there are 166 observations with just family_history_thyroid_cancer but there are some with this missing and other variables. The numbers on the bottom of the figure indicate the total number of missing observations for that variable (e.g. for family_history_thyroid_cancer there is a total of 281 missing observations).

TODO - Workout why out-width: "80%" isn’t applied to these figures and/or how to make the All figure readable.

3.1.2 Imputation

The MICE package@mice offers a number of different methods for imputing variables (see [documentation][mice_details]) we have investigated Predictive Mean Matching (PMM), Classification and Regression Trees (CART) and Random Forests (RF). Four rounds of imputation using each method were made.

A comparison of distributions/proportions before and after imputation are presented below to allow assessment of the utility of each method.

The convergence of the imputation methods are shown in figues ?@fig-mice-convergence-pmm, ?@fig-mice-convergence-cart, and ?@fig-mice-convergence-rf.

TODO - Extract the legends from individual plots and add them to the end of each row, see the cowplot shared legends article for pointers on how to do this. Should ideally also get the fill colours to align with those used by ggmice.

Characteristic	Original, N = 1,150 ¹	1, N = 1,150 ¹	2, N = 1,150 ¹	3, N = 1,150 ¹	4, N = 1,150 ¹	5, N = 1,150 ¹
.id	576 (288, 863)	576 (288, 863)	576 (288, 863)	576 (288, 863)	576 (288, 863)	576 (288, 863)
Age	55 (41, 68)	55 (41, 68)	55 (41, 68)	55 (41, 68)	55 (41, 68)	55 (41, 68)
gender
F	903 (79%)	903 (79%)	903 (79%)	903 (79%)	903 (79%)	903 (79%)
M	247 (21%)	247 (21%)	247 (21%)	247 (21%)	247 (21%)	247 (21%)
ethnicity
A	812 (74%)	843 (73%)	841 (73%)	839 (73%)	845 (73%)	842 (73%)
B	3 (0.3%)	4 (0.3%)	4 (0.3%)	4 (0.3%)	3 (0.3%)	3 (0.3%)
C	44 (4.0%)	45 (3.9%)	47 (4.1%)	44 (3.8%)	48 (4.2%)	45 (3.9%)
D	2 (0.2%)	2 (0.2%)	2 (0.2%)	2 (0.2%)	2 (0.2%)	2 (0.2%)
F	4 (0.4%)	4 (0.3%)	4 (0.3%)	4 (0.3%)	4 (0.3%)	4 (0.3%)
G	7 (0.6%)	7 (0.6%)	7 (0.6%)	7 (0.6%)	7 (0.6%)	7 (0.6%)
H	8 (0.7%)	8 (0.7%)	8 (0.7%)	9 (0.8%)	8 (0.7%)	8 (0.7%)
J	46 (4.2%)	50 (4.3%)	49 (4.3%)	51 (4.4%)	50 (4.3%)	49 (4.3%)
K	11 (1.0%)	11 (1.0%)	13 (1.1%)	12 (1.0%)	11 (1.0%)	12 (1.0%)
L	16 (1.5%)	19 (1.7%)	16 (1.4%)	18 (1.6%)	16 (1.4%)	17 (1.5%)
M	16 (1.5%)	16 (1.4%)	16 (1.4%)	17 (1.5%)	16 (1.4%)	18 (1.6%)
N	18 (1.6%)	19 (1.7%)	18 (1.6%)	18 (1.6%)	19 (1.7%)	20 (1.7%)
P	14 (1.3%)	14 (1.2%)	15 (1.3%)	16 (1.4%)	15 (1.3%)	14 (1.2%)
R	6 (0.5%)	7 (0.6%)	6 (0.5%)	8 (0.7%)	7 (0.6%)	7 (0.6%)
S	38 (3.4%)	40 (3.5%)	43 (3.7%)	42 (3.7%)	40 (3.5%)	40 (3.5%)
Z	57 (5.2%)	61 (5.3%)	61 (5.3%)	59 (5.1%)	59 (5.1%)	62 (5.4%)
Unknown	48	0	0	0	0	0
incidental_nodule	620 (54%)	623 (54%)	622 (54%)	623 (54%)	624 (54%)	623 (54%)
Unknown	5	0	0	0	0	0
palpable_nodule	441 (40%)	471 (41%)	470 (41%)	470 (41%)	471 (41%)	470 (41%)
Unknown	58	0	0	0	0	0
rapid_enlargement	19 (1.7%)	19 (1.7%)	21 (1.8%)	22 (1.9%)	20 (1.7%)	19 (1.7%)
Unknown	43	0	0	0	0	0
compressive_symptoms	88 (8.4%)	107 (9.3%)	103 (9.0%)	102 (8.9%)	98 (8.5%)	101 (8.8%)
Unknown	106	0	0	0	0	0
hypertension	262 (26%)	287 (25%)	301 (26%)	292 (25%)	293 (25%)	291 (25%)
Unknown	126	0	0	0	0	0
vocal_cord_paresis	3 (0.3%)	5 (0.4%)	3 (0.3%)	6 (0.5%)	3 (0.3%)	3 (0.3%)
Unknown	76	0	0	0	0	0
graves_disease	17 (1.6%)	20 (1.7%)	20 (1.7%)	19 (1.7%)	19 (1.7%)	18 (1.6%)
Unknown	67	0	0	0	0	0
hashimotos_thyroiditis	7 (0.6%)	7 (0.6%)	7 (0.6%)	8 (0.7%)	9 (0.8%)	7 (0.6%)
Unknown	73	0	0	0	0	0
family_history_thyroid_cancer	8 (0.9%)	10 (0.9%)	14 (1.2%)	11 (1.0%)	11 (1.0%)	16 (1.4%)
Unknown	281	0	0	0	0	0
exposure_radiation	9 (0.9%)	9 (0.8%)	9 (0.8%)	9 (0.8%)	10 (0.9%)	10 (0.9%)
Unknown	121	0	0	0	0	0
Albumin	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)
Unknown	515	0	0	0	0	0
TSH value	1.48 (0.85, 2.30)	1.50 (0.89, 2.30)	1.50 (0.86, 2.40)	1.50 (0.87, 2.50)	1.50 (0.89, 2.50)	1.40 (0.85, 2.30)
Unknown	413	0	0	0	0	0
Lymphocytes	1.94 (1.51, 2.43)	1.96 (1.54, 2.46)	1.95 (1.53, 2.45)	1.94 (1.50, 2.42)	1.95 (1.53, 2.42)	1.93 (1.49, 2.42)
Unknown	359	0	0	0	0	0
Monocytes	0.52 (0.42, 0.66)	0.53 (0.43, 0.66)	0.52 (0.42, 0.66)	0.53 (0.42, 0.66)	0.52 (0.43, 0.66)	0.52 (0.42, 0.66)
Unknown	363	0	0	0	0	0
bta_u_classification
U1	1 (<0.1%)	1 (<0.1%)	2 (0.2%)	1 (<0.1%)	1 (<0.1%)	1 (<0.1%)
U2	860 (78%)	902 (78%)	895 (78%)	898 (78%)	893 (78%)	893 (78%)
U3	210 (19%)	214 (19%)	222 (19%)	220 (19%)	223 (19%)	224 (19%)
U4	22 (2.0%)	25 (2.2%)	23 (2.0%)	24 (2.1%)	26 (2.3%)	22 (1.9%)
U5	7 (0.6%)	8 (0.7%)	8 (0.7%)	7 (0.6%)	7 (0.6%)	10 (0.9%)
Unknown	50	0	0	0	0	0
solitary_nodule	320 (28%)	323 (28%)	322 (28%)	323 (28%)	323 (28%)	323 (28%)
Unknown	8	0	0	0	0	0
Nodule size (mm)	14 (7, 28)	13 (6, 27)	13 (6, 27)	13 (6, 27)	12 (6, 26)	13 (6, 27)
Unknown	319	0	0	0	0	0
cervical_lymphadenopathy	26 (2.3%)	27 (2.3%)	27 (2.3%)	26 (2.3%)	27 (2.3%)	26 (2.3%)
Unknown	9	0	0	0	0	0
thy_classification
Thy1	34 (14%)	362 (31%)	174 (15%)	242 (21%)	179 (16%)	220 (19%)
Thy1c	8 (3.4%)	19 (1.7%)	35 (3.0%)	39 (3.4%)	44 (3.8%)	28 (2.4%)
Thy2	63 (27%)	403 (35%)	412 (36%)	360 (31%)	373 (32%)	360 (31%)
Thy2c	11 (4.7%)	29 (2.5%)	34 (3.0%)	44 (3.8%)	35 (3.0%)	38 (3.3%)
Thy3a	18 (7.7%)	36 (3.1%)	44 (3.8%)	60 (5.2%)	61 (5.3%)	69 (6.0%)
Thy3f	74 (31%)	193 (17%)	302 (26%)	296 (26%)	335 (29%)	336 (29%)
Thy4	10 (4.3%)	17 (1.5%)	42 (3.7%)	24 (2.1%)	37 (3.2%)	29 (2.5%)
Thy5	17 (7.2%)	91 (7.9%)	107 (9.3%)	85 (7.4%)	86 (7.5%)	70 (6.1%)
Unknown	915	0	0	0	0	0
final_pathology
Benign	1,050 (91%)	1,050 (91%)	1,050 (91%)	1,050 (91%)	1,050 (91%)	1,050 (91%)
Cancer	100 (8.7%)	100 (8.7%)	100 (8.7%)	100 (8.7%)	100 (8.7%)	100 (8.7%)
¹ Median (IQR); n (%)

Characteristic	Original, N = 1,150 ¹	1, N = 1,150 ¹	2, N = 1,150 ¹	3, N = 1,150 ¹	4, N = 1,150 ¹	5, N = 1,150 ¹
.id	576 (288, 863)	576 (288, 863)	576 (288, 863)	576 (288, 863)	576 (288, 863)	576 (288, 863)
Age	55 (41, 68)	55 (41, 68)	55 (41, 68)	55 (41, 68)	55 (41, 68)	55 (41, 68)
gender
F	903 (79%)	903 (79%)	903 (79%)	903 (79%)	903 (79%)	903 (79%)
M	247 (21%)	247 (21%)	247 (21%)	247 (21%)	247 (21%)	247 (21%)
ethnicity
A	812 (74%)	843 (73%)	842 (73%)	843 (73%)	845 (73%)	845 (73%)
B	3 (0.3%)	3 (0.3%)	3 (0.3%)	4 (0.3%)	3 (0.3%)	3 (0.3%)
C	44 (4.0%)	46 (4.0%)	46 (4.0%)	47 (4.1%)	47 (4.1%)	47 (4.1%)
D	2 (0.2%)	2 (0.2%)	2 (0.2%)	2 (0.2%)	2 (0.2%)	2 (0.2%)
F	4 (0.4%)	4 (0.3%)	4 (0.3%)	5 (0.4%)	4 (0.3%)	4 (0.3%)
G	7 (0.6%)	7 (0.6%)	7 (0.6%)	7 (0.6%)	8 (0.7%)	7 (0.6%)
H	8 (0.7%)	8 (0.7%)	9 (0.8%)	8 (0.7%)	8 (0.7%)	8 (0.7%)
J	46 (4.2%)	48 (4.2%)	48 (4.2%)	52 (4.5%)	49 (4.3%)	46 (4.0%)
K	11 (1.0%)	11 (1.0%)	11 (1.0%)	11 (1.0%)	12 (1.0%)	11 (1.0%)
L	16 (1.5%)	16 (1.4%)	17 (1.5%)	16 (1.4%)	16 (1.4%)	17 (1.5%)
M	16 (1.5%)	17 (1.5%)	17 (1.5%)	17 (1.5%)	18 (1.6%)	19 (1.7%)
N	18 (1.6%)	19 (1.7%)	21 (1.8%)	19 (1.7%)	19 (1.7%)	18 (1.6%)
P	14 (1.3%)	16 (1.4%)	14 (1.2%)	15 (1.3%)	14 (1.2%)	15 (1.3%)
R	6 (0.5%)	8 (0.7%)	7 (0.6%)	6 (0.5%)	6 (0.5%)	6 (0.5%)
S	38 (3.4%)	41 (3.6%)	39 (3.4%)	38 (3.3%)	41 (3.6%)	42 (3.7%)
Z	57 (5.2%)	61 (5.3%)	63 (5.5%)	60 (5.2%)	58 (5.0%)	60 (5.2%)
Unknown	48	0	0	0	0	0
incidental_nodule	620 (54%)	623 (54%)	624 (54%)	624 (54%)	623 (54%)	624 (54%)
Unknown	5	0	0	0	0	0
palpable_nodule	441 (40%)	470 (41%)	472 (41%)	468 (41%)	472 (41%)	468 (41%)
Unknown	58	0	0	0	0	0
rapid_enlargement	19 (1.7%)	19 (1.7%)	20 (1.7%)	19 (1.7%)	19 (1.7%)	21 (1.8%)
Unknown	43	0	0	0	0	0
compressive_symptoms	88 (8.4%)	102 (8.9%)	105 (9.1%)	104 (9.0%)	101 (8.8%)	99 (8.6%)
Unknown	106	0	0	0	0	0
hypertension	262 (26%)	291 (25%)	293 (25%)	292 (25%)	291 (25%)	292 (25%)
Unknown	126	0	0	0	0	0
vocal_cord_paresis	3 (0.3%)	4 (0.3%)	3 (0.3%)	3 (0.3%)	3 (0.3%)	3 (0.3%)
Unknown	76	0	0	0	0	0
graves_disease	17 (1.6%)	17 (1.5%)	19 (1.7%)	18 (1.6%)	17 (1.5%)	17 (1.5%)
Unknown	67	0	0	0	0	0
hashimotos_thyroiditis	7 (0.6%)	9 (0.8%)	7 (0.6%)	9 (0.8%)	8 (0.7%)	8 (0.7%)
Unknown	73	0	0	0	0	0
family_history_thyroid_cancer	8 (0.9%)	12 (1.0%)	9 (0.8%)	8 (0.7%)	11 (1.0%)	16 (1.4%)
Unknown	281	0	0	0	0	0
exposure_radiation	9 (0.9%)	10 (0.9%)	9 (0.8%)	11 (1.0%)	11 (1.0%)	10 (0.9%)
Unknown	121	0	0	0	0	0
Albumin	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)
Unknown	515	0	0	0	0	0
TSH value	1.48 (0.85, 2.30)	1.50 (0.88, 2.39)	1.50 (0.88, 2.50)	1.50 (0.91, 2.50)	1.50 (0.90, 2.40)	1.42 (0.85, 2.30)
Unknown	413	0	0	0	0	0
Lymphocytes	1.94 (1.51, 2.43)	1.94 (1.54, 2.43)	1.94 (1.51, 2.43)	1.95 (1.53, 2.44)	1.91 (1.50, 2.41)	1.94 (1.50, 2.43)
Unknown	359	0	0	0	0	0
Monocytes	0.52 (0.42, 0.66)	0.53 (0.42, 0.66)	0.53 (0.43, 0.66)	0.53 (0.43, 0.66)	0.52 (0.42, 0.66)	0.52 (0.42, 0.65)
Unknown	363	0	0	0	0	0
bta_u_classification
U1	1 (<0.1%)	1 (<0.1%)	1 (<0.1%)	1 (<0.1%)	1 (<0.1%)	2 (0.2%)
U2	860 (78%)	894 (78%)	899 (78%)	903 (79%)	897 (78%)	897 (78%)
U3	210 (19%)	226 (20%)	219 (19%)	215 (19%)	220 (19%)	220 (19%)
U4	22 (2.0%)	22 (1.9%)	24 (2.1%)	23 (2.0%)	24 (2.1%)	24 (2.1%)
U5	7 (0.6%)	7 (0.6%)	7 (0.6%)	8 (0.7%)	8 (0.7%)	7 (0.6%)
Unknown	50	0	0	0	0	0
solitary_nodule	320 (28%)	322 (28%)	322 (28%)	322 (28%)	322 (28%)	323 (28%)
Unknown	8	0	0	0	0	0
Nodule size (mm)	14 (7, 28)	13 (6, 26)	12 (6, 27)	13 (6, 26)	12 (6, 27)	13 (6, 26)
Unknown	319	0	0	0	0	0
cervical_lymphadenopathy	26 (2.3%)	26 (2.3%)	26 (2.3%)	27 (2.3%)	26 (2.3%)	26 (2.3%)
Unknown	9	0	0	0	0	0
thy_classification
Thy1	34 (14%)	203 (18%)	192 (17%)	187 (16%)	221 (19%)	188 (16%)
Thy1c	8 (3.4%)	46 (4.0%)	35 (3.0%)	36 (3.1%)	59 (5.1%)	63 (5.5%)
Thy2	63 (27%)	278 (24%)	272 (24%)	281 (24%)	258 (22%)	237 (21%)
Thy2c	11 (4.7%)	45 (3.9%)	70 (6.1%)	58 (5.0%)	70 (6.1%)	76 (6.6%)
Thy3a	18 (7.7%)	60 (5.2%)	75 (6.5%)	74 (6.4%)	59 (5.1%)	125 (11%)
Thy3f	74 (31%)	347 (30%)	321 (28%)	317 (28%)	325 (28%)	264 (23%)
Thy4	10 (4.3%)	54 (4.7%)	69 (6.0%)	79 (6.9%)	68 (5.9%)	74 (6.4%)
Thy5	17 (7.2%)	117 (10%)	116 (10%)	118 (10%)	90 (7.8%)	123 (11%)
Unknown	915	0	0	0	0	0
final_pathology
Benign	1,050 (91%)	1,050 (91%)	1,050 (91%)	1,050 (91%)	1,050 (91%)	1,050 (91%)
Cancer	100 (8.7%)	100 (8.7%)	100 (8.7%)	100 (8.7%)	100 (8.7%)	100 (8.7%)
¹ Median (IQR); n (%)

Characteristic	Original, N = 1,150 ¹	1, N = 1,150 ¹	2, N = 1,150 ¹	3, N = 1,150 ¹	4, N = 1,150 ¹	5, N = 1,150 ¹
.id	576 (288, 863)	576 (288, 863)	576 (288, 863)	576 (288, 863)	576 (288, 863)	576 (288, 863)
Age	55 (41, 68)	55 (41, 68)	55 (41, 68)	55 (41, 68)	55 (41, 68)	55 (41, 68)
gender
F	903 (79%)	903 (79%)	903 (79%)	903 (79%)	903 (79%)	903 (79%)
M	247 (21%)	247 (21%)	247 (21%)	247 (21%)	247 (21%)	247 (21%)
ethnicity
A	812 (74%)	851 (74%)	845 (73%)	850 (74%)	849 (74%)	853 (74%)
B	3 (0.3%)	3 (0.3%)	3 (0.3%)	4 (0.3%)	3 (0.3%)	3 (0.3%)
C	44 (4.0%)	47 (4.1%)	46 (4.0%)	46 (4.0%)	45 (3.9%)	45 (3.9%)
D	2 (0.2%)	2 (0.2%)	2 (0.2%)	2 (0.2%)	2 (0.2%)	2 (0.2%)
F	4 (0.4%)	4 (0.3%)	4 (0.3%)	4 (0.3%)	4 (0.3%)	4 (0.3%)
G	7 (0.6%)	7 (0.6%)	7 (0.6%)	7 (0.6%)	7 (0.6%)	8 (0.7%)
H	8 (0.7%)	8 (0.7%)	8 (0.7%)	8 (0.7%)	8 (0.7%)	8 (0.7%)
J	46 (4.2%)	47 (4.1%)	49 (4.3%)	47 (4.1%)	49 (4.3%)	48 (4.2%)
K	11 (1.0%)	11 (1.0%)	15 (1.3%)	11 (1.0%)	11 (1.0%)	11 (1.0%)
L	16 (1.5%)	16 (1.4%)	16 (1.4%)	16 (1.4%)	17 (1.5%)	16 (1.4%)
M	16 (1.5%)	17 (1.5%)	17 (1.5%)	18 (1.6%)	18 (1.6%)	16 (1.4%)
N	18 (1.6%)	19 (1.7%)	19 (1.7%)	19 (1.7%)	19 (1.7%)	20 (1.7%)
P	14 (1.3%)	14 (1.2%)	15 (1.3%)	14 (1.2%)	14 (1.2%)	14 (1.2%)
R	6 (0.5%)	6 (0.5%)	6 (0.5%)	6 (0.5%)	6 (0.5%)	6 (0.5%)
S	38 (3.4%)	40 (3.5%)	39 (3.4%)	39 (3.4%)	40 (3.5%)	39 (3.4%)
Z	57 (5.2%)	58 (5.0%)	59 (5.1%)	59 (5.1%)	58 (5.0%)	57 (5.0%)
Unknown	48	0	0	0	0	0
incidental_nodule	620 (54%)	623 (54%)	623 (54%)	624 (54%)	624 (54%)	622 (54%)
Unknown	5	0	0	0	0	0
palpable_nodule	441 (40%)	462 (40%)	472 (41%)	465 (40%)	469 (41%)	467 (41%)
Unknown	58	0	0	0	0	0
rapid_enlargement	19 (1.7%)	19 (1.7%)	19 (1.7%)	19 (1.7%)	20 (1.7%)	20 (1.7%)
Unknown	43	0	0	0	0	0
compressive_symptoms	88 (8.4%)	94 (8.2%)	100 (8.7%)	94 (8.2%)	89 (7.7%)	95 (8.3%)
Unknown	106	0	0	0	0	0
hypertension	262 (26%)	282 (25%)	284 (25%)	285 (25%)	281 (24%)	286 (25%)
Unknown	126	0	0	0	0	0
vocal_cord_paresis	3 (0.3%)	3 (0.3%)	3 (0.3%)	3 (0.3%)	3 (0.3%)	3 (0.3%)
Unknown	76	0	0	0	0	0
graves_disease	17 (1.6%)	17 (1.5%)	17 (1.5%)	17 (1.5%)	17 (1.5%)	17 (1.5%)
Unknown	67	0	0	0	0	0
hashimotos_thyroiditis	7 (0.6%)	7 (0.6%)	7 (0.6%)	7 (0.6%)	7 (0.6%)	7 (0.6%)
Unknown	73	0	0	0	0	0
family_history_thyroid_cancer	8 (0.9%)	9 (0.8%)	8 (0.7%)	11 (1.0%)	9 (0.8%)	9 (0.8%)
Unknown	281	0	0	0	0	0
exposure_radiation	9 (0.9%)	9 (0.8%)	9 (0.8%)	10 (0.9%)	9 (0.8%)	9 (0.8%)
Unknown	121	0	0	0	0	0
Albumin	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)	45.0 (43.0, 47.0)
Unknown	515	0	0	0	0	0
TSH value	1.48 (0.85, 2.30)	1.50 (0.87, 2.40)	1.50 (0.90, 2.30)	1.50 (0.88, 2.30)	1.50 (0.87, 2.30)	1.50 (0.85, 2.40)
Unknown	413	0	0	0	0	0
Lymphocytes	1.94 (1.51, 2.43)	1.94 (1.54, 2.43)	1.94 (1.51, 2.44)	1.95 (1.54, 2.44)	1.95 (1.53, 2.42)	1.94 (1.51, 2.44)
Unknown	359	0	0	0	0	0
Monocytes	0.52 (0.42, 0.66)	0.53 (0.42, 0.66)	0.52 (0.42, 0.66)	0.53 (0.43, 0.66)	0.52 (0.42, 0.66)	0.52 (0.42, 0.66)
Unknown	363	0	0	0	0	0
bta_u_classification
U1	1 (<0.1%)	1 (<0.1%)	1 (<0.1%)	1 (<0.1%)	1 (<0.1%)	1 (<0.1%)
U2	860 (78%)	902 (78%)	904 (79%)	904 (79%)	904 (79%)	899 (78%)
U3	210 (19%)	216 (19%)	216 (19%)	216 (19%)	215 (19%)	219 (19%)
U4	22 (2.0%)	23 (2.0%)	22 (1.9%)	22 (1.9%)	22 (1.9%)	24 (2.1%)
U5	7 (0.6%)	8 (0.7%)	7 (0.6%)	7 (0.6%)	8 (0.7%)	7 (0.6%)
Unknown	50	0	0	0	0	0
solitary_nodule	320 (28%)	321 (28%)	321 (28%)	320 (28%)	323 (28%)	321 (28%)
Unknown	8	0	0	0	0	0
Nodule size (mm)	14 (7, 28)	13 (6, 27)	12 (6, 26)	12 (6, 26)	13 (6, 26)	13 (6, 26)
Unknown	319	0	0	0	0	0
cervical_lymphadenopathy	26 (2.3%)	26 (2.3%)	26 (2.3%)	26 (2.3%)	26 (2.3%)	26 (2.3%)
Unknown	9	0	0	0	0	0
thy_classification
Thy1	34 (14%)	206 (18%)	173 (15%)	136 (12%)	175 (15%)	147 (13%)
Thy1c	8 (3.4%)	46 (4.0%)	38 (3.3%)	69 (6.0%)	36 (3.1%)	32 (2.8%)
Thy2	63 (27%)	299 (26%)	283 (25%)	323 (28%)	333 (29%)	341 (30%)
Thy2c	11 (4.7%)	57 (5.0%)	78 (6.8%)	71 (6.2%)	64 (5.6%)	40 (3.5%)
Thy3a	18 (7.7%)	105 (9.1%)	115 (10%)	117 (10%)	85 (7.4%)	97 (8.4%)
Thy3f	74 (31%)	291 (25%)	299 (26%)	306 (27%)	325 (28%)	359 (31%)
Thy4	10 (4.3%)	48 (4.2%)	52 (4.5%)	47 (4.1%)	61 (5.3%)	48 (4.2%)
Thy5	17 (7.2%)	98 (8.5%)	112 (9.7%)	81 (7.0%)	71 (6.2%)	86 (7.5%)
Unknown	915	0	0	0	0	0
final_pathology
Benign	1,050 (91%)	1,050 (91%)	1,050 (91%)	1,050 (91%)	1,050 (91%)	1,050 (91%)
Cancer	100 (8.7%)	100 (8.7%)	100 (8.7%)	100 (8.7%)	100 (8.7%)	100 (8.7%)
¹ Median (IQR); n (%)

3.2 Modelling

TODO - And in light of having removed ?@tbl-data-completness in favour of the imputed datesets this too has been removed? (@ns-rse 2024-07-11). TODO - This table feels like duplication of ?@tbl-data-completeness, perhaps have just one? (@ns-rse 2024-07-11).

The predictor variables selected to predict final_pathology are shown in ?@tbl-predictors

Section that sets up the modelling

The following section is output from a Tidymodel approach to logistic regression to try and work out why variables are not being included.

A total of 1150 patients had complete data for the selected predictor variables (see ?@tbl-predictors). Because of the volume of missing data which if a saturated model were used would include only ~350 people with complete data across all co-variates imputed datasets were analysed instead.

3.2.1 Logistic Regression

3.2.1.1 Clinical Characteristics

Characteristic	OR ¹	95% CI ¹	p-value
Age	0.98	0.96, 0.99	0.005
gender	1.97	1.11, 3.43	0.019
incidental_nodule	0.99	0.48, 2.02	>0.9
palpable_nodule	2.76	1.29, 6.08	0.010
rapid_enlargement	1.66	0.46, 5.37	0.4
compressive_symptoms	0.98	0.43, 2.09	>0.9
hashimotos_thyroiditis	1.19	0.02, 13.0	>0.9
family_history_thyroid_cancer	3.80	0.48, 21.7	0.15
exposure_radiation	0.00		>0.9
TSH value	1.00	0.94, 1.05	>0.9
Nodule size (mm)	1.03	1.01, 1.05	<0.001
solitary_nodule	1.48	0.85, 2.51	0.2
cervical_lymphadenopathy	6.74	2.37, 18.8	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

3.2.1.2 Biomarkers

Characteristic	OR ¹	95% CI ¹	p-value
Age	0.98	0.97, 1.00	0.010
gender	1.95	1.15, 3.26	0.012
TSH value	1.01	0.96, 1.06	0.5
Albumin	1.03	0.96, 1.12	0.4
Lymphocytes	1.04	0.74, 1.43	0.8
Monocytes	0.18	0.04, 0.72	0.019
¹ OR = Odds Ratio, CI = Confidence Interval

3.2.1.3 Ultrasound 1

Characteristic	OR ¹	95% CI ¹	p-value
Age	0.99	0.97, 1.01	0.2
gender	1.23	0.63, 2.32	0.5
bta_u_classification
U1	—	—
U2	461,584	0.00, NA	>0.9
U3	7,642,048	0.00, NA	>0.9
U4	36,642,711	0.00, NA	>0.9
U5	126,534,639	0.00, NA	>0.9
thy_classification
Thy1	—	—
Thy1c	1.63	0.19, 9.45	0.6
Thy2	0.36	0.09, 1.25	0.11
Thy2c	0.00	0.00, 0.00	>0.9
Thy3a	2.24	0.66, 7.80	0.2
Thy3f	3.01	1.21, 8.47	0.025
Thy4	7.06	1.93, 26.6	0.003
Thy5	9.52	3.19, 31.1	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

3.2.1.4 Ultrasound 2

Characteristic	OR ¹	95% CI ¹	p-value
Age	0.99	0.97, 1.01	0.2
gender	1.19	0.58, 2.34	0.6
incidental_nodule	0.94	0.48, 1.84	0.9
TSH value	0.95	0.89, 1.00	0.091
Nodule size (mm)	1.04	1.02, 1.06	<0.001
solitary_nodule	1.49	0.77, 2.82	0.2
cervical_lymphadenopathy	6.61	1.45, 29.3	0.013
bta_u_classification
U1	—	—
U2	239,169	0.00, NA	>0.9
U3	3,105,400	0.00, NA	>0.9
U4	14,770,696	0.00, NA	>0.9
U5	39,781,959	0.00, NA	>0.9
thy_classification
Thy1	—	—
Thy1c	0.97	0.06, 7.66	>0.9
Thy2	0.25	0.06, 0.97	0.049
Thy2c	0.00	0.00, 1,045,453	>0.9
Thy3a	2.69	0.76, 9.91	0.13
Thy3f	2.45	0.92, 7.39	0.089
Thy4	8.34	2.18, 33.3	0.002
Thy5	11.1	3.47, 39.6	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

3.2.2 LASSO

## Specify the LASSO model using parsnip, the key here is the use of the glmnet engine which is the R package for
## fitting LASSO regression. Technically the package fits Elastic Net but with a mixture value of 1 it is equivalent to
## a plain LASSO (mixture value of 0 is equivalent to Ridge Regression in an Elastic Net)
tune_spec_lasso <- parsnip::logistic_reg(penalty = hardhat::tune(), mixture = 1) |>
  parsnip::set_engine("glmnet")

## Tune the LASSO parameters via cross-validation
lasso_grid <- tune::tune_grid(
  object = workflows::add_model(thyroid_workflow, tune_spec_lasso),
  resamples = cv_folds,
  grid = dials::grid_regular(penalty(), levels = 50)
)

Figure 24: Autoplot of LASSO grid search

Figure 25: Importance of variables fitted using LASSO

NB - We may wish to inspect the coefficients at each step of tuning. A related example of how to do this can be found in the Tidymodels documentation under the Tuning a glmnet model. This would be desirable as it looks like only two features are selected as being important by this method and so rather than just accepting this I would want to investigate and see how the coefficients changed over iterations. Another useful resource is the glmnet documentation, although note that since we are using the Tidymodels framework the model fit is wrapped up inside (hence the above article on how to extract this information).

3.2.3 Elastic Net

Figure 26: Autoplot of Elastic Net grid search

Figure 27: Importance of variables fitted using Elastic Net

3.2.4 Random Forest

3.2.5 Gradient Boosting

        Length Class      Mode   
pre     3      stage_pre  list   
fit     2      stage_fit  list   
post    1      stage_post list   
trained 1      -none-     logical

# A tibble: 23 × 4
   variable             type      role      source  
   <chr>                <list>    <chr>     <chr>   
 1 age_at_scan          <chr [2]> predictor original
 2 gender               <chr [3]> predictor original
 3 ethnicity            <chr [3]> predictor original
 4 incidental_nodule    <chr [3]> predictor original
 5 palpable_nodule      <chr [3]> predictor original
 6 rapid_enlargement    <chr [3]> predictor original
 7 compressive_symptoms <chr [3]> predictor original
 8 hypertension         <chr [3]> predictor original
 9 vocal_cord_paresis   <chr [3]> predictor original
10 graves_disease       <chr [3]> predictor original
# ℹ 13 more rows

               Length Class              Mode       
handle             1  xgb.Booster.handle externalptr
raw            95013  -none-             raw        
niter              1  -none-             numeric    
evaluation_log     2  data.table         list       
call               8  -none-             call       
params            10  -none-             list       
callbacks          1  -none-             list       
feature_names     45  -none-             character  
nfeatures          1  -none-             numeric

                   Feature   Gain  Cover Frequency
                    <char>  <num>  <num>     <num>
1: bta_u_classification_U2 0.8201 0.5114    0.3106
2:               tsh_value 0.0900 0.0693    0.1677
3:          size_nodule_mm 0.0332 0.3619    0.3634
4:             age_at_scan 0.0314 0.0344    0.1025
5:                 albumin 0.0253 0.0230    0.0559

 [1] ".pred_Benign"                  ".pred_Cancer"                 
 [3] ".pred_class"                   "age_at_scan"                  
 [5] "gender"                        "ethnicity"                    
 [7] "incidental_nodule"             "palpable_nodule"              
 [9] "rapid_enlargement"             "compressive_symptoms"         
[11] "hypertension"                  "vocal_cord_paresis"           
[13] "graves_disease"                "hashimotos_thyroiditis"       
[15] "family_history_thyroid_cancer" "exposure_radiation"           
[17] "albumin"                       "tsh_value"                    
[19] "lymphocytes"                   "monocyte"                     
[21] "bta_u_classification"          "solitary_nodule"              
[23] "size_nodule_mm"                "cervical_lymphadenopathy"     
[25] "thy_classification"            "final_pathology"

Table 3: Description of variables in the Sheffield Thyroid dataset.

	Description
age_at_scan	Age
albumin	Albumin
bta_u_classification	BTA U
cervical_lymphadenopathy	Cervical Lymphadenopathy
compressive_symptoms	Compressive symptoms
consistency_nodule	Nodule consistency
eligibility	Eligibility
ethinicity	Ethinicity
ethnicity	Ethnicity
exposure_radiation	Exposure to radiation
family_history_thyroid_cancer	Family history of thyroid cancer
final_pathology	Final diagnosis
fna_done	FNA done
gender	Gender
graves_disease	Graves’ disease
hashimotos_thyroiditis	Hashimoto’s disease
hypertension	Hypertension
incidental_nodule	Incidental nodule
lymphocytes	Lymphocytes
monocyte	Monocytes
palpable_nodule	Palpable nodule
rapid_enlargement	Rapid enlargement
repeat_bta_u_classification	Repeat BTA U
repeat_fna_done	Repeat FNA
repeat_thy_classification	Repeat Thy class
repeat_ultrasound	Repeat ultrasound
size_nodule_mm	Nodule size (mm)
solitary_nodule	Solitary nodule
study_id	Study ID
thy_classification	Thy classification
thyroid_histology_diagnosis	Histology
thyroid_surgery	Thyroid surgery
tsh_value	TSH value
vocal_cord_paresis	Vocal cord paresis

References

Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” J. Stat. Soft. 45 (December): 1–67. https://doi.org/10.18637/jss.v045.i03.

Efron, Bradley, and Trevor Hastie. 2016. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge Core. Cambridge, England, UK: Cambridge University Press. https://doi.org/10.1017/CBO9781316576533.

Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.

R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

“Regression Shrinkage and Selection via the Lasso.” 1996. Journal of the Royal Statistical Society. Series B (Methodological). https://www.jstor.org/stable/2346178.

Smith, Gary. 2018. “Step away from stepwise.” J. Big Data 5 (1): 1–12. https://doi.org/10.1186/s40537-018-0143-6.

Steyerberg, Ewout W., Marinus J. C. Eijkemans, Frank E. Harrell, and Dik. 2001. “Prognostic Modeling with Logistic Regression Analysis.” Medical Decision Making 21 (1): 45–56. https://doi.org/10.1177/0272989x0102100106.

Thompson, Bruce. 1995. “Stepwise Regression and Stepwise Discriminant Analysis Need Not Apply Here: A Guidelines Editorial.” Educational and Psychological Measurement 55 (4): 525–34. https://doi.org/10.1177/0013164495055004001.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection Via the Elastic Net.” J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 (2): 301–20. https://doi.org/10.1111/j.1467-9868.2005.00503.x.

Citation

BibTeX citation:

@online{edafe2024,
  author = {Edafe, Ovie and Shephard, Neil and Sisley, Karen and P
    Balasubramanian, Sabapathy},
  title = {An Investigation of the Predictors of Thyroid Cancer in
    Patients with Thyroid Nodules},
  date = {2024-04-26},
  langid = {en},
  abstract = {An abstract summarising the work undertaken and the
    overall conclusions can be placed here. Sub-headings are currently
    removed because they conflict with those in the body of the text and
    mess up the links in the Table of Contents.}
}

For attribution, please cite this work as:

Edafe, Ovie, Neil Shephard, Karen Sisley, and Sabapathy P Balasubramanian. 2024. “An Investigation of the Predictors of Thyroid Cancer in Patients with Thyroid Nodules.” Sheffield Study on Thyroid Nodules. April 26, 2024.