Automated Machine Learning in the Sonographic Diagnosis of Non-alcoholic Fatty Liver Disease

Objective: This study evaluated the performance of automated machine learning to diagnose non-alcoholic fatty liver disease (NAFLD) by ultrasound and compared these findings to radiologist performance. Methods: Ninety-six patients with histologic (33) or proton density fat fraction MRI (63) diagnosis of NAFLD and 100 patients without evidence of NAFLD were retrospectively identified. The “Fatty Liver” label included 96 patients with 405 images and the “Not Fatty Liver” label included 100 patients with 500 images. These 905 images made up a “Comprehensive Image” group. A “Radiology Selected Image” group was then created by selecting only images considered diagnostic by a blinded radiologist, resulting in 649 images. Cloud AutoML Vision beta (Google LLC, Mountain View, CA) was used for machine learning. The models were evaluated against three blinded radiologists. Results: The “Comprehensive Image” group model demonstrated a sensitivity of 88.6% (73.3–96.8%) and a specificity of 95.3% (84.2–99.4%). Radiologist performance on this image group included a sensitivity of 81.0% (74.3–87.6%) and specificity of 86.0% (72.6–99.5%). The model’s overall accuracy was 92.3% (84.0–97.1%), compared with mean individual performance (83.8%, 78.4–89.1%). The “Radiology Selected Image” group model demonstrated a sensitivity of 88.6% (73.3–96.8%) and specificity of 87.9% (71.8–96.6%). Mean radiologist sensitivity was 92.4% (86.9–97.9%) and specificity was 91.9% (83.4–100%). The model’s overall accuracy was 88.2% (78.1–94.8%), which was comparable to the individual radiologist performance (92.2%, 90.1–94.2%) and consensus performance (95.6%, 87.6–99.1%). Conclusions: An automated machine-learning algorithm may accurately detect NAFLD on ultrasound.

Non-alcoholic fatty liver disease (NAFLD) is defined as the presence of hepatic steatosis not attributable to other causes of liver dysfunction, including alcoholic liver disease, viral hepatitis, and hemochromatosis. NAFLD may be present without inflammation (non-alcoholic fatty liver, or NAFL) or with inflammation (non-alcoholic steatohepatitis, or NASH). Variable levels of fibrosis up to frank cirrhosis may be present. Importantly, NAFLD is now the leading cause of chronic liver disease in the U.S., representing up to 75.1% of cases with an overall prevalence as high as 11.0% [1,2].
The diagnosis of NASH is complicated by a lack of consensus on the role of laboratory findings, imaging, and histopathology from biopsy. The magnitude of liver function test (LFT) abnormalities does not correlate with degree of inflammation or fibrosis [3]. Additionally, the absence of LFT elevations does not exclude the presence of NAFLD. Imaging modalities for diagnosis include computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound. CT has a sensitivity of 50% and specificity of 83% in the detection of NAFLD, but radiation concerns may limit routine use [4]. MRI performs better with a sensitivity of 88% and specificity of 63% [4]. In-phase and out-of-phase sequences on MRI may be used to improve test statistics, with a sensitivity and specificity of 95% and 98%, respectively, by quantifying the degree of hepatic steatosis [5,6]. However, lack of access, high cost, and contraindications related to metallic implants are prevalent. While liver biopsy is considered the gold standard, the test is invasive and typically reserved for patients with an unclear diagnosis.
Ultrasound may be used to screen for significant steatosis in patients with NAFLD, with a sensitivity of up to 85% and specificity of up to 94% [7]. However, this sensitivity decreases to as low as 49% in the presence of morbid obesity [8]. Fatty liver may be diagnosed sonographically by hepatic hyperechogenicity relative to the renal cortex and spleen, ultrasound wave attenuation with a loss of diaphragm definition, poor delineation of the intrahepatic architecture, and a loss of echogenic fat within portal triads [7]. Compared with CT and MRI, ultrasound is less expensive, readily available, faster, and portable. There are no concerns with claustrophobia, radiation, or metallic contraindications. A reliable ultrasound approach to screen for NAFLD would be beneficial for early diagnosis. We hypothesize that a trained deep-learning model can accurately diagnose NAFLD by ultrasound.

Materials and Methods
Data (ultrasound images) for model training, validation, and testing in this institutional review board (IRB) approved study were extracted retrospectively from our institutional Picture Archiving and Communication System (PACS) via consecutive sampling. Informed consent was waived by our local ethics board as this study only utilized retrospective data. A binary classification system was created for labelling purposes: "Fatty Liver" and "Not Fatty Liver." Images were processed using Cloud AutoML Vision beta (Google LLC, Mountain View, CA) for model training and evaluation. This is a cloud-based freeware program available for beta testing. Per the product guidelines, at least 100–500 images per label are recommended.
Ultrasound images were originally created on a variety of scanners including a GE Logiq 9 and E9 (GE Healthcare), an IU22 and Epiq 7 (Philips), and an Aplio 500 and i800 (Canon). The "Fatty Liver" label consisted of ultrasound pictures from patients identified either by tissue histology via liver biopsy or after fat quantification by MRI. Fat quantification was performed using proton density fat fraction (PDFF) on one of two Optima MR450W 1.5T scanners using Ideal IQ (GE Healthcare). Patients with histological confirmation of NAFLD were selected from patients identified via PACS undergoing ultrasound-guided liver biopsy for a presumed diagnosis of NAFLD or NASH. As many images of each patient as considered representative and unique were extracted. Patients with an MRI fat quantification diagnosis of NAFLD were selected from patients identified via PACS undergoing MRI with fat quantification greater than 6.4% with a diagnostic ultrasound available within the prior 6 months [9]. A goal of 5 pictures per patient was set, with a total label goal of 100–500 images. Inclusion and exclusion criteria for the "Fatty Liver" label are further detailed in Table 1.
[...] hepatitis, or other acute pathologic findings. Importantly, patients with type II diabetes mellitus were excluded from the "Not Fatty Liver" label to limit the inclusion of sub-clinical or undiagnosed NAFLD. However, because patients within this label were not necessarily considered to be clinically at risk for NAFLD, most did not have confirmatory MRI fat quantification or tissue histology. Inclusion and exclusion criteria for the "Not Fatty Liver" label are further detailed in Table 2. A goal of 5 unique images per patient was set. Images were subsequently cropped manually by a non-radiologist to include only the ultrasound image and eliminate nondiagnostic image annotations.
Samples of cropped images within the "Fatty Liver" label and within the "Not Fatty Liver" label are provided in Figure 1. This set of images was considered the "Comprehensive Image" group. An independent board-certified radiologist (H.N, with 3 years of experience, who did not serve as a blinded reader) then reviewed all images and selected those considered technically adequate for diagnosis to produce the "Radiology Selected Image" group. Both image groups were independently uploaded to Cloud AutoML Vision beta for model creation, optimization, and analysis.
Cloud AutoML Vision beta allows for the creation of custom models trained on uploaded images using a convolutional neural network pre-trained through transfer learning. Of note, this product is within the beta launch stage. Image groups were sampled randomly by AutoML into a training set (roughly 80%), validation set (roughly 10%), and test set (roughly 10%); this was performed separately for both the "Comprehensive Image" group and "Radiology Selected Image" group. The training set is used for model training and the validation set is used for internal hyperparameter optimization. The final test set (to which the model is naïve) is used for evaluation of the model and for comparison against expert analysis.
The test set from the "Comprehensive Image" group was first used for blinded interpretation and assessment by three independent board-certified radiologists with no additional clinical information available. After at least one month (to limit recall bias), these same radiologists were presented the test set from the "Radiology Selected Image" group. The only interpretation asked of the radiologists during each assessment was whether they would make the diagnosis of hepatic steatosis. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and overall accuracy were subsequently calculated for the model as well as for the mean statistics of three blinded readers (L.N, A.L, and P.OK, each with at least 15 years of experience in ultrasound) for comparison against the gold standard label assignment: tissue histology or MRI fat quantification for the "Fatty Liver" label and the required lack of any clinical data suggesting NAFLD for the "Not Fatty Liver" label.
Demographic data (age, gender, and body mass index as calculated per the National Institutes of Health) were pooled per image group and compared among labels [10]. Independent sample t-tests and chi-square analyses were performed in SPSS v25 (IBM, Armonk, New York). Test metrics were calculated in Excel (Microsoft, Redmond, WA). Clopper-Pearson confidence intervals were calculated for non-mean test metrics. A p-value of 0.05 was designated as significant when utilized. Performance metrics are presented as means with 95% confidence intervals (CI).
Figure 2 diagrams the flow of enrolled patients and ultrasound images. Patients within the "Fatty Liver" label came from two sources from PACS: tissue histology (N = 33 patients, 102 pictures) and MRI fat quantification (N = 63 patients, 303 pictures), resulting in a total of 405 pictures. The "Not Fatty Liver" label consisted of a total of 100 patients with 500 ultrasound pictures. Thus, the "Comprehensive Image" group consists of a total of 905 pictures. From this, the "Radiology Selected Image" group consisted of a total of 649 images (292 within the "Fatty Liver" label and 357 from the "Not Fatty Liver" label).
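As a minimal illustration of the test metrics and Clopper-Pearson intervals named in the statistical methods, the Python sketch below computes each proportion together with an exact binomial interval. The counts used (31 true positives, 4 false negatives, 41 true negatives, 2 false positives) are an assumption: they are back-calculated from the reported "Comprehensive Image" model sensitivity, specificity, and 78-image test set, and are not stated as such in the text. The function names are hypothetical, and the study itself used Excel rather than this code.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target: float, iters: int = 60) -> float:
    """Bisection on [0, 1] for f(p) = target, where f increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) two-sided interval for k successes in n trials."""
    # P(X >= k) = 1 - P(X <= k - 1) rises with p; the lower bound sets it to alpha/2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # P(X <= k) falls with p; the upper bound sets it to alpha/2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, tuple[int, int]]:
    """Map each metric name to its (numerator, denominator) pair."""
    return {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "ppv": (tp, tp + fp),
        "npv": (tn, tn + fn),
        "accuracy": (tp + tn, tp + fn + tn + fp),
    }


if __name__ == "__main__":
    # Assumed counts, inferred from the reported percentages (not given in the text).
    for name, (k, n) in diagnostic_metrics(tp=31, fn=4, tn=41, fp=2).items():
        low, high = clopper_pearson(k, n)
        print(f"{name}: {k / n:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Under these assumed counts, the sensitivity interval comes out close to the reported 73.3–96.8%. The bisection solver is used only to keep the sketch free of third-party dependencies; a statistics library's beta quantile function would serve equally well.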

Results
Univariate analysis of patients within the "Comprehensive Image" group and "Radiology Selected Image" group as stratified per "Fatty Liver" label and "Not Fatty Liver" label is detailed in Table 3. Within both image groups, patients with the "Fatty Liver" label had a statistically significantly higher BMI compared with the "Not Fatty Liver" label (P < 0.0001 in both). The mean BMI of patients designated with the "Not Fatty Liver" label within both the "Comprehensive Image" group (28.3 ± 7.7) and "Radiology Selected Image" group (28.0 ± 7.6) fell within the National Institutes of Health BMI stratification as "overweight" (BMI 25–30) [10]. In contrast, the mean BMI of patients with the "Fatty Liver" label within both the "Comprehensive Image" group (34.5 ± 7.2) and "Radiology Selected Image" group (34.6 ± 6.7) fell within the National Institutes of Health BMI stratification as "obese" (BMI 30+) [10]. Additionally, the median hepatic steatosis grade by pathology for patients within the "Fatty Liver" label in both image groups was grade 2 moderate steatosis (33–66% of the examined histologic sample surface area visually determined to be involved by steatosis) [11]. The mean fat quantification percentage by PDFF MRI in patients with the "Fatty Liver" label within the "Comprehensive Image" group (17.3% ± 8.0) and "Radiology Selected Image" group (16.9% ± 7.8) both fell within the range of grade 1 (6.5–17.4%) mild hepatic steatosis but approached grade 2 (17.5–22.1%) moderate hepatic steatosis [9].
Randomized sampling of each image group was performed from within the Cloud AutoML Vision beta API for model training, validation, and testing. Within the "Comprehensive Image" group, 725 images were used for training, 102 images were used for validation, and 78 images were used for testing. Within the "Radiology Selected Image" group, 525 images were used for training, 59 images were used for validation, and 68 images were used for testing.
Furthermore, because these sets were populated randomly, the testing set for the "Comprehensive Image" group is different from the testing set in the "Radiology Selected Image" group, as are the corresponding training and validation sets. The model produced from the "Comprehensive Image" group performed with a sensitivity of 88.6% (95% CI = 73.3–96.8%) and a specificity of 95.3% (95% CI = 84.2–99.4%). [...] The radiology group performed with a mean sensitivity of 92.4% (95% CI = 86.9–97.9%), mean specificity of 91.9% (95% CI = 83.4–100%), mean PPV of 93.1% (95% CI = 86.1–100%), and mean NPV of 92.4% (95% CI = 87.8–97.0%). A calculated Fleiss kappa score for interobserver agreement demonstrates improved agreement at 0.784. Test metrics with a 95% CI are detailed in Table 4.
As a secondary analysis, test statistics for consensus read by the radiology group were calculated for each image group. That is, the more common diagnosis made by the three radiologists for each test image was selected as the final interpreting diagnosis and compared against the true label. Within the "Comprehensive Image" group, a sensitivity of 88.6% (95% CI = 73.3–96.8%), specificity of 93.0% (95% CI = 80.9–98.5%), PPV of 91.2% (95% CI = 71.5–96.9%), and NPV of 90.9% (95% CI = 79.9–96.2%) were reported (Table 4).
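The random sampling above was performed inside Cloud AutoML Vision, and its exact procedure is not described here. Purely as a generic sketch of what an image-level random split of roughly 80/10/10 looks like, the following Python function shuffles image identifiers and slices them into three sets. The function name, the seed, and the integer identifiers are hypothetical, and the resulting set sizes are not expected to reproduce the counts reported above.

```python
import random


def split_images(image_ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle image IDs and slice them into train, validation, and test lists."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # the remainder becomes the held-out test set
    return train, val, test


if __name__ == "__main__":
    # 905 placeholder IDs standing in for the "Comprehensive Image" group.
    train, val, test = split_images(range(905))
    print(len(train), len(val), len(test))
```

Because the split is made per image rather than per patient, images from one patient can land in more than one set.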
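For readers unfamiliar with the Fleiss kappa statistic cited above, the following Python sketch computes it from per-image reader labels using the standard formula (mean observed agreement corrected for chance agreement). The function name and the toy ratings are hypothetical and are not study data; the software used for this calculation in the study is not stated.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-image label lists (one label per reader).

    Every image must be rated by the same number of readers.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for row in ratings for label in row})
    # counts[i][j] = number of readers assigning image i to category j
    counts = [[row.count(c) for c in categories] for row in ratings]

    # Observed agreement per image, then averaged over images.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items

    # Chance agreement from the overall proportion of each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:  # only one category was ever used; kappa is undefined
        raise ValueError("kappa is undefined when all ratings are identical")
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Toy example: 4 images, 3 readers, 1 = steatosis, 0 = no steatosis.
    toy = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1]]
    print(round(fleiss_kappa(toy), 3))
```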
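The consensus read described above amounts to a per-image majority vote among the three readers. A minimal Python sketch of that step follows, assuming binary labels (1 for steatosis, 0 for none) and an odd number of readers so that a strict majority always exists; the function names and example reads are hypothetical.

```python
from collections import Counter


def consensus_read(reads):
    """Return the majority label for each image from per-image reader labels."""
    return [Counter(image_reads).most_common(1)[0][0] for image_reads in reads]


def sensitivity_specificity(predicted, truth):
    """Compute (sensitivity, specificity) for binary labels, 1 = disease present."""
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    fn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 0)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Toy reads for four images by three readers (not study data).
    reads = [(1, 1, 0), (0, 0, 1), (1, 1, 1), (0, 1, 1)]
    truth = [1, 0, 1, 0]
    consensus = consensus_read(reads)
    print(consensus, sensitivity_specificity(consensus, truth))
```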

Discussion
The early and accessible diagnosis of NAFLD is important, especially given its increasing prevalence and potential for irreversible hepatic injury. As a modality, ultrasound is well poised to meet this clinical need. To this end, machine learning may have the potential to accurately and automatically assist in making an NAFLD diagnosis on B-mode ultrasound.
The model produced by the "Comprehensive Image" group performed with a sensitivity of 88.6% and specificity of 95.3%, comparable to that of three board-certified, ultrasound-trained radiologists. Even after selection of only images considered diagnostic by an independent radiologist (the model produced by the "Radiology Selected Image" group), the model performance remained comparable. It is possible that by explicitly selecting for images considered diagnostic by a trained radiologist, the "Radiology Selected Image" group is now implicitly biased towards improving the ability of observers to detect NAFLD. However, despite this, the model still performed well, with an overall accuracy of 88.2%.
As a secondary analysis, test statistics utilizing consensus ultrasound interpretation were calculated (Table 4). As expected, all four parameters (sensitivity, specificity, PPV, and NPV) improved with consensus compared with independent mean test statistics in both image groups (though they remained comparable within a 95% confidence interval). This suggests that observer interpretation accuracy may be improved when pooled across multiple experienced radiologists. To that end, the potential for improvement in model performance still exists.
Machine learning for sonographic decision support in the diagnosis of NAFLD is an active area of research, with accuracies between 80–98% reported [12]. However, limitations in methodologies exist, including low label sizes and limited feature dimensionality. Alternative transfer learning neural network algorithms with manual hyperparameter optimization have been applied to the assessment of hepatic steatosis [13]. Cloud AutoML Vision beta and other automated hyperparameter optimization protocols provide the advantage of rapidly and automatically optimizing the model per the user's application.
Machine learning presents a notable challenge in the medical radiologic space. The "black box" of deep learning systems makes independent expert validation difficult without the use of test sets for performance evaluation. Additionally, pathologies can be radiologically subtle and require a volumetric comprehension of a series of images [14]. Ultrasound offers further challenges including variable pathology visualization (given operator dependency), significant variation in technical parameterization (such as gain, depth, focal zone, resolution, and instrumentation), and an even more complex volumetric comprehension across all three axes in real-time. While AI-based identification methods are relatively new to ultrasound as an imaging modality, studies are increasingly utilizing these approaches to improve or automate diagnosis [15,16]. Creating large and standardized image databases is an important step toward overcoming these obstacles by encompassing the wide variety of case examples.
Consequently, several limitations exist in this initial exploration. In the absence of feature extraction, it is possible that irrelevant features were recognized as important in designating test image labels. To that end, images were manually cropped (by a non-radiologist observer) and images did span a variety of ultrasound manufacturers with different technical parameters. We attempted to limit overfitting to irrelevant features by aiming for the higher end of the recommended size of 100–500 images per label, augmenting the sample size by including multiple different images per patient (divided among training, validation, and test sets). However, this does increase the risk of data leakage. Additionally, due to an increasing presence of NAFLD in the general population, it is possible that in the absence of chemical-shift MRI or liver biopsy, patients within the "Not Fatty Liver" group may have had subclinical or unrecognized NAFLD. For example, retrospective quality control found that 5 of the 100 patients within the "Not Fatty Liver" label later had a diagnosis of type II diabetes (perhaps not unexpected given the BMIs presented in Table 3). The described exclusion criteria were used to attempt to limit the chance of undiagnosed NAFLD in this population. Similarly, as patients were identified through retrospective chart review, the veracity of medical documentation is not always guaranteed. Finally, because NAFLD represents systemic hepatic disease, it is possible that within any single ultrasound image, variable degrees of hepatic steatosis may be present.
Despite these limitations, extensive future directions exist. Specifically, the image database and training set may be improved by standardizing the examination technique and technical specifications (such as the ultrasound machine and parameters). Additionally, identification of the relevant significant features used by the model for label discrimination may be used for clinical and research applications. Moving forward, it may be possible to create a "Fatty Liver Ultrasound Index", in which the responsible clinician is provided a score to interpret.
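One common way to reduce the data leakage risk noted above is to assign whole patients, rather than individual images, to the training, validation, and test sets. The Python sketch below illustrates that alternative under stated assumptions: it is not what this study or Cloud AutoML Vision did, and the record layout (patient identifier, image identifier), the function name, and the split fractions are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_patient(records, train_frac=0.8, val_frac=0.1, seed=0):
    """Split (patient_id, image_id) records so each patient sits in one set only."""
    by_patient = defaultdict(list)
    for patient_id, image_id in records:
        by_patient[patient_id].append(image_id)

    patients = sorted(by_patient)  # sort first so the seeded shuffle is reproducible
    random.Random(seed).shuffle(patients)
    n_train = int(len(patients) * train_frac)
    n_val = int(len(patients) * val_frac)
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # Expand each patient group back into its image records.
    return {
        name: [(p, img) for p in ids for img in by_patient[p]]
        for name, ids in groups.items()
    }


if __name__ == "__main__":
    # Toy data: 20 patients with 5 images each (not study data).
    records = [(f"patient{p}", f"patient{p}_img{i}") for p in range(20) for i in range(5)]
    sets = split_by_patient(records)
    print({name: len(items) for name, items in sets.items()})
```

The trade-off is that set sizes are controlled at the patient level, so image counts per set vary with the number of images each patient contributes.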