Proceedings Volume 10577

Medical Imaging 2018: Image Perception, Observer Performance, and Technology Assessment



Volume Details

Date Published: 11 May 2018
Contents: 9 Sessions, 48 Papers, 28 Presentations
Conference: SPIE Medical Imaging 2018
Volume Number: 10577

Table of Contents

  • Front Matter: Volume 10577
  • Keynote and Image Perception I
  • Image Perception II
  • Observer Performance Evaluation I
  • Technology Assessment
  • Model Observers I
  • Model Observers II
  • Observer Performance Evaluation II and Tribute to Art Burgess
  • Poster Session
Front Matter: Volume 10577
This PDF file contains the front matter associated with SPIE Proceedings Volume 10577, including the Title Page, Copyright information, Table of Contents, Introduction (if any), and Conference Committee listing.
Keynote and Image Perception I
Learning to see (Conference Presentation)
Richard B. Gunderman
Human beings are born with a remarkable visual apparatus, but even if all the parts – lens, retina, optic nerve, and so on – are present in working order, seeing remains at least in large part a learned skill. This is reflected in the fact that some people can see and understand things that others find meaningless or even fail to notice. One striking example is the radiology education of medical students and residents, who over the course of their training move from not knowing what they are looking at to quickly making complex diagnoses. In this session, we consider how seeing is learned and weigh the respective contributions of science, technology, and the arts in cultivating this remarkable human capacity.
Do radiographers base the diagnostic acceptability of a radiograph on anatomical structures?
Background: The document “European Guidelines on Quality Criteria for Diagnostic Radiographic Images” describes the visualisation of anatomical criteria with which a radiograph of diagnostic quality should comply. This research investigates the correlation between the evaluation of the anatomical structures presented in the European guidelines and the classification of radiographs for diagnostic acceptability. Methods: Sixteen radiographers classified 22 chest radiographs in terms of diagnostic acceptability using the RadLex categories, and scored the representation of five anatomical criteria on a scale from 1 to 5. All radiographs were visualised with ViewDex on a DICOM-calibrated display. Observers were recruited in Belgium and Ireland. An intraclass correlation coefficient was used to evaluate internal consistency for each observer group. A Mann–Whitney U test was applied to investigate differences in classification between the countries. The relationship with the evaluation of anatomical structures was investigated with ordinal logistic regression. Results: Both groups of observers performed with acceptable consistency. The Mann–Whitney U test showed a significant difference in classification between the two countries. For each country, the ordinal logistic regression indicated a weak correlation between the RadLex classification and the anatomical structure scores. Certain factors in the radiograph, possibly other than anatomical elements, must be substantially better before an observer will assign a higher RadLex score. Conclusion: The relationship between the evaluation of anatomical criteria and diagnostic acceptability is weak for both countries. When assigning a radiograph to a category of acceptability, other factors influence the decision.
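The country comparison above rests on the Mann–Whitney U statistic, which simply counts how often a score from one group exceeds a score from the other. A minimal stdlib sketch with hypothetical ordinal scores (not the study's data):

```python
from itertools import product

def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for sample x versus sample y:
    the count of pairs (xi, yj) with xi > yj, ties counted as 1/2."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi, yj in product(x, y))

# Hypothetical RadLex-style ordinal scores for two observer groups
group_a = [3, 4, 4, 5, 3]
group_b = [2, 3, 2, 4, 3]
u = mann_whitney_u(group_a, group_b)  # 20.0 out of 25 possible pairs
```

Dividing U by the number of pairs (here 20.0/25 = 0.8) gives the probability that a randomly chosen score from the first group outranks one from the second; large-sample p-values then follow from the normal approximation.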
A cognitive approach to determine the benefits of pairing radiologists in mammogram reading
Mammography screening in Europe and Australia is carried out by having two radiologists independently read each case and determine whether an actionable finding is present. If they disagree, a third radiologist, the arbitrator, reads the case and offers the final opinion. Currently, radiologists are paired based on scheduling convenience, with no thought given to whether a given pair of radiologists should really read cases together. Past research has shown that breast radiologists tend to commit the same mistakes time and again and tend to search mammograms in a particular way; pairing two radiologists who search a mammogram in a very similar manner, for example, may therefore not be a good idea. In this study, we used eye-position tracking to determine how radiologists searched a given set of cases. Using different cognitive models, we paired the radiologists and determined the effect of the pairing on performance using the area under the Receiver Operating Characteristic curve (ROC AUC). Our results suggest that some pairings are detrimental to performance and should be avoided.
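The pairing effect reported above is quantified with the ROC AUC; empirically, the AUC is the probability that a cancer case is rated above a normal case. A toy sketch with hypothetical ratings and a simple score-averaging pairing rule (the study's actual double-reading and arbitration scheme is more involved):

```python
def roc_auc(pos, neg):
    """Empirical ROC AUC: fraction of (positive, negative) pairs in which
    the positive case gets the higher score, ties counted as 1/2."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical suspicion ratings: each reader catches a different subset
reader_a_cancer, reader_b_cancer = [5, 2, 5, 2], [2, 5, 2, 5]
normals = [3, 3, 1, 1]  # both readers agree on the normals here

auc_a = roc_auc(reader_a_cancer, normals)                   # 0.75
auc_b = roc_auc(reader_b_cancer, normals)                   # 0.75
paired = [(a + b) / 2 for a, b in zip(reader_a_cancer, reader_b_cancer)]
auc_pair = roc_auc(paired, normals)                         # 1.0
```

Complementary error profiles are exactly what makes a pairing valuable; two readers with identical search behaviour would leave auc_pair at 0.75.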
Image Perception II
Restored low-dose digital breast tomosynthesis: a perception study
Lucas R. Borges, Predrag R. Bakic, Andrew D. A. Maidment, et al.
This work investigates the perception of noise in restored low-dose digital breast tomosynthesis (DBT) images. First, low-dose DBT projections were generated using a dose-reduction simulation algorithm, applied to a dataset of clinical images from the Hospital of the University of Pennsylvania. The low-dose projections were then denoised with a denoising pipeline developed specifically for DBT images. Denoised and noisy projections were combined to generate images with a signal-to-noise ratio comparable to the full-dose images. The quality of restored low-dose and full-dose projections was first compared in terms of an objective no-reference image quality metric previously validated for mammography. In a second analysis, regions of interest (ROIs) were selected from reconstructed full-dose and restored low-dose slices and displayed side by side on a high-resolution medical display. Five medical physics specialists were asked to choose the image containing less noise and less blur in a two-alternative forced-choice (2-AFC) experiment. The objective metric shows that, after the proposed restoration framework was applied, images with as little as 60% of the AEC dose yielded quality indices similar to images acquired at the full dose. The 2-AFC experiments showed that, with the denoising framework, a 30% reduction in dose was possible without any perceived difference in noise or blur. Note that this study evaluated the observers' perception of noise and blur and does not claim that the dose of DBT examinations can be reduced with no harm to the detection of cancer. Future work is necessary to make any claims regarding detection, localization, and characterization of lesions.
A database for assessment of effect of lossy compression on digital mammograms
With widespread use of screening digital mammography, efficient storage of the vast amounts of data has become a challenge. While lossless image compression causes no risk to the interpretation of the data, it does not allow for high compression rates. Lossy compression and the associated higher compression ratios are therefore more desirable. The U.S. Food and Drug Administration (FDA) currently interprets the Mammography Quality Standards Act as prohibiting lossy compression of digital mammograms for primary image interpretation, image retention, or transfer to the patient or her designated recipient. Previous work has used reader studies to determine proper usage criteria for evaluating lossy image compression in mammography, and utilized different measures and metrics to characterize medical image quality. The drawback of such studies is that they rely on a threshold on compression ratio as the fundamental criterion for preserving the quality of images. However, compression ratio is not a useful indicator of image quality. On the other hand, many objective image quality metrics (IQMs) have shown excellent performance for natural image content for consumer electronic applications. In this paper, we create a new synthetic mammogram database with several unique features. We compare and characterize the impact of image compression on several clinically relevant image attributes such as perceived contrast and mass appearance for different kinds of masses. We plan to use this database to develop a new objective IQM for measuring the quality of compressed mammographic images to help determine the allowed maximum compression for different kinds of breasts and masses in terms of visual and diagnostic quality.
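Since compression ratio alone says little about quality, objective full-reference IQMs compare the compressed image to the original pixel by pixel. PSNR is the simplest such metric, shown here only as an illustration (the authors aim to develop a more perceptually relevant IQM):

```python
import math

def psnr(reference, degraded, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images,
    given here as flat lists of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, degraded)) / len(reference)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

ref = [100, 120, 140, 160]
lossy = [101, 119, 141, 159]   # every pixel off by one grey level -> MSE = 1
quality_db = psnr(ref, lossy)  # about 48.1 dB
```

PSNR correlates poorly with perceived quality of textured content such as parenchyma, which is precisely why mammography-specific metrics are being sought.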
Analysis of visual search behaviour from experienced radiologists interpreting digital breast tomosynthesis (DBT) images: a pilot study
Digital breast tomosynthesis (DBT) has several advantages over traditional 2D mammography; however, the cost-effectiveness of introducing DBT into breast screening programmes is still under investigation. The DBT modality has been integrated into a regional breast screening programme in Italy for several years. The purpose of this study is to examine the visual search behaviour of experienced Italian DBT readers and summarise their visual inspection patterns. Seven experienced radiologists took part in the study, reading a set of DBT cases containing a mixture of normal and abnormal cases while their eye-movement data were recorded. They read the cases through a fixed procedure, starting with a 2D overview and then going through the DBT view of each breast. The experienced readers tended to perform a global-focal scan over the 2D view to detect an abnormality and then 'drilled' through the DBT slices to interpret the details of the feature. Reading speed was also investigated to see whether expert radiologists spent different lengths of time examining normal and abnormal cases; no significant difference in time was found. The eye-movement patterns revealed that experienced DBT readers covered more area on the 2D view, and fixated longer and with more dwells inside the areas of interest (AOIs) in the 3D view. It is hoped that understanding the visual search patterns of experienced DBT radiologists could help DBT trainees develop more efficient interpretation approaches.
A deep (learning) dive into visual search behaviour of breast radiologists
Visual search, the process of detecting and identifying objects using eye movements (saccades) and foveal vision, has been studied to identify the root causes of errors in the interpretation of mammography. The aim of this study is to model the visual search behaviour of radiologists and their interpretation of mammograms using deep learning approaches. Our model is based on a deep convolutional neural network, a biologically inspired multilayer architecture that simulates the visual cortex, and is reinforced with transfer learning techniques.

Eye-tracking data obtained from 8 radiologists (of varying experience levels in reading mammograms) reviewing 120 two-view digital mammography cases (59 cancers) were used to train the model, which was pre-trained on the ImageNet dataset for transfer learning. Areas of the mammogram that received direct (foveally fixated), indirect (peripherally fixated), or no (never fixated) visual attention were extracted from the radiologists' visual search maps (obtained with a head-mounted eye-tracking device). These areas, along with the radiologists' assessment of suspected malignancy (including the confidence of that assessment), were used to model: 1) the radiologists' decision; 2) the radiologists' confidence in that decision; and 3) the attentional level (i.e., foveal, peripheral, or none) received by an area of the mammogram. Our results indicate high accuracy and low misclassification in modelling these behaviours.
Comparing salience detection algorithms in mammograms
Kristina Landino, Murray Loew
Salience in imaging is defined as the extent to which an object in an image catches the eye of the viewer. Currently, many software packages exist that calculate salience using a wide range of models and implementations. Here we examine four types of salience programs: feature-based programs, convolutional neural networks, principal components analysis programs, and background subtraction programs. In feature-based programs, the software creates a series of maps for individual salience features (e.g., orientation and intensity), and then combines those individual feature maps into an overall map of salience for the entire picture [1] [2] [3]. In other models, convolutional neural networks act as a series of layers, each of which transforms the data and finds the most salient points in an image [6] [9] [10]. In principal components analysis programs, components corresponding to higher eigenvalues are used to separate the background from the salient objects. Lastly, in background subtraction, salient areas are found by comparing the object’s intensity distribution to the background distribution. In total, this paper compares 19 models, including our own algorithm, on a general database of images to determine each model’s accuracy when detecting salience. Additionally, as previous work has shown a correlation between salient points in a mammogram and the presence of a mass in a mammogram, we apply each of these state-of-the-art software packages to a database of mammograms to determine the accuracy of each program when detecting abnormalities in mammograms.
Satisfaction at last: evidence for the 'satisfaction' account for multiple-target search errors
Stephen H. Adamo, Matthew S. Cain, Stephen R. Mitroff
Multiple-target visual searches, where several targets may be present within a single search array, are susceptible to Subsequent Search Miss (SSM) errors: a reduction in second-target detection after a first target has been found (an effect previously called Satisfaction of Search). SSM errors occur in critical search settings (e.g., radiology and airport security screening), creating concerns for public safety. To eradicate SSM errors it is vital to understand their causes, and the current study investigated a key proposed mechanism: that searchers prematurely terminate their search after finding a first target. This mechanism, termed the satisfaction account, was proposed over 50 years ago, but there are no conclusive supporting data to date. "Satisfaction" has typically been assessed by comparing the total time spent on multiple-target trials to the time spent on single-target trials, or by examining whether search was terminated immediately after finding a first target. The current study investigated the satisfaction account by exploring variability in the time participants spent searching between finding a first target and self-terminating their search without finding a second target. This individual-differences approach revealed that accuracy on a multiple-target search task was related to how long participants searched after finding a first target. The relationship was highly significant, even when accounting for variation in participants' attentional vigilance. This study provides evidence for the previously elusive satisfaction account and adds to the growing understanding that SSM errors are a multifaceted problem.
Observer Performance Evaluation I
Model and human observer reproducibility for detecting microcalcifications in digital breast tomosynthesis images
Digital breast tomosynthesis (DBT) is a relatively new 3D breast imaging technique that allows better low-contrast lesion detection than 2D full-field digital mammography (FFDM). European guidelines for quality control in FFDM specify minimum and achievable threshold contrasts of small test inserts, determined from readings by human observers. Today, model observers are being developed to predict, and subsequently substitute for, human detectability readings. A similar performance test would be welcome for DBT. However, since such a performance estimate is based on an observer classification, it is important that the classification system be reliable in order to avoid misjudgements. The aim of this study was to assess human and model observer reliability by determining observer reproducibility when reading 5 datasets from 60 tomosynthesis series acquired under the same conditions. For this purpose, a 3D structured phantom with calcification cluster models was scanned on a Siemens Inspiration tomosynthesis system. VOIs were extracted from these acquisitions and read under a 4-alternative forced choice (4-AFC) paradigm by 6 human observers. A channelized Hotelling model observer using 8 Laguerre-Gauss (LG) channels was developed, including a scanning algorithm to detect the calcification clusters. An internal-noise method was used to better approximate the human reading results. Observer reproducibility was estimated by bootstrapping, with the SEM used as a figure of merit. The results show that the model observer is more reproducible for the smaller calcification sizes (maximum SEM of 5.81) than the human observers (maximum SEM of 13.57). For the larger clusters, both observers have similar reproducibility.
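The reproducibility figure of merit above, a bootstrap SEM, can be sketched in a few lines of stdlib Python (hypothetical 4-AFC trial outcomes, not the study's readings):

```python
import random
import statistics

def bootstrap_sem(outcomes, n_boot=2000, seed=0):
    """Standard error of the proportion correct, estimated by
    resampling the per-trial outcomes with replacement."""
    rng = random.Random(seed)
    n = len(outcomes)
    resampled = [sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot)]
    return statistics.stdev(resampled)

# Hypothetical 4-AFC outcomes: 1 = correct alternative chosen, 0 = miss
trials = [1] * 42 + [0] * 18       # 70% correct over 60 readings
sem = bootstrap_sem(trials)        # close to sqrt(0.7 * 0.3 / 60) ~ 0.059
```

For a simple proportion the bootstrap agrees with the binomial formula; its value is that the same resampling applies unchanged to more complex figures of merit such as 4-AFC detectability.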
Evaluation of search strategies for microcalcifications and masses in 3D images
Medical imaging is quickly evolving towards 3D modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and digital breast tomosynthesis (DBT). These 3D modalities add volumetric information but further increase the need for radiologists to search through the image data set. Although much is known about search strategies in 2D images, less is known about the functional consequences of different 3D search strategies. We instructed readers to use two different search strategies: drillers had their eye movements restricted to a few regions while they quickly scrolled through the image stack, while scanners explored the 2D slices through eye movements. We used real-time eye-position monitoring to ensure observers followed the drilling or scanning strategy while approximately preserving the percentage of the volumetric data covered by the useful field of view. We investigated search for two signals: a simulated microcalcification and a larger simulated mass. Results show an interaction between search strategy and lesion type. In particular, scanning provided significantly better detectability for microcalcifications, at the cost of roughly five times longer search times, while there was little change in detectability for the larger simulated masses. Analyses of eye movements support the hypothesis that the effectiveness of a search strategy in 3D imaging arises from the interaction between the fixational sampling of visual information and the signals' visibility in the visual periphery.
Comparison of microcalcification detectability in FFDM and DBT using a virtual clinical trial
Zhijin Li, Agnès Desolneux, Serge Muller, et al.
The ultimate way to assess the performance of imaging systems is a clinical trial. Because clinical trials are limited by cost and duration, several research groups are investigating the potential to replace them in part with virtual clinical trials (VCTs) as a more efficient alternative. In this paper, we propose a VCT design to compare microcalcification (μcalc) detection performance in full-field digital mammography (FFDM) and digital breast tomosynthesis (DBT). Digital breast phantoms with uniform and breast-texture-like backgrounds and digital μcalcs were created. The μcalcs had diameters ranging from 100μm to 600μm, and their attenuation properties were varied to be equivalent to 20% to 60% of the attenuation of aluminum at 22keV. FFDM and DBT image acquisitions according to the nominal topology of a commercial imaging system were simulated with a software x-ray imaging platform. Projection images were processed with commercial image processing and reconstruction algorithms. Microcalcification detection performance was estimated by an objective task-based assessment using channelized Hotelling observers (CHOs) with Laguerre-Gauss channels, and by a human observer. For DBT, single-slice (CHO3ss) and multi-slice (CHO3msa) model observers were considered. Model and human observers performed a lesion-known-statistically, location-known-exactly rating-scale detection task. The decision outcomes were used as input to a receiver operating characteristic analysis, with the area under the curve as the figure of merit. Using our VCT set-up, the performance of the CHO and the human observer appears to be fairly well linearly correlated. There is a trend that μcalc detection performance in DBT is higher than in FFDM.
Analyzing ROC curves using the effective set-size model
The Effective Set-Size model has been used to describe uncertainty in various signal-detection experiments. The model regards images as if they contained an effective number (M*) of searchable locations, where the observer treats each location as a location-known-exactly detection task with signals having average detectability d'. The model assumes a rational observer behaves as if searching an effective number of independent locations and following signal detection theory at each location; the location-known-exactly detectability (d') and the effective number of independent locations (M*) thus fully characterize search performance. In this model, the image rating in a single-response task is assumed to be the maximum response that the observer would assign across these locations. The model has been used by a number of other researchers and is well corroborated. We examine this model as a way of differentiating the imaging tasks that radiologists perform; tasks involving more searching or location uncertainty may have higher estimated M* values. In this work we applied the Effective Set-Size model to a number of medical imaging data sets, including radiologists reading screening and diagnostic mammography with and without computer-aided diagnosis (CAD), and breast tomosynthesis. We developed an algorithm to fit the model parameters using two-sample maximum-likelihood ordinal regression, similar to the classic binormal model. The resulting model ROC curves are rational and fit the observed data well. We find that the distributions of M* and d' differ significantly among these data sets, and differ between pairs of imaging systems within studies; for example, on average, tomosynthesis increased readers' d' values, while CAD reduced the M* parameters. We also demonstrate that the model parameters M* and d' are correlated. We conclude that the Effective Set-Size model may be a useful way of differentiating location uncertainty from diagnostic uncertainty in medical imaging tasks.
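Under the max-of-M* rating rule, the ROC coordinates have a closed form: a false positive occurs when the largest of M* noise responses exceeds the threshold, and a true positive when either the signal location or one of the other M* − 1 locations does. A sketch (these are the standard max-of-M formulas written from the model description, not taken from the paper):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def fpf(c, m_star):
    """P(max of m_star noise-only responses exceeds threshold c)."""
    return 1.0 - phi(c) ** m_star

def tpf(c, d_prime, m_star):
    """One location holds the signal (mean shifted by d'); the other
    m_star - 1 locations respond as noise."""
    return 1.0 - phi(c - d_prime) * phi(c) ** (m_star - 1)
```

With m_star = 1 this reduces to the classic binormal ROC, and with d_prime = 0 it collapses onto the chance line; increasing m_star at fixed d_prime degrades the curve, which is how the model separates location uncertainty from detectability.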
Efficiency gain of paired split-plot designs in MRMC ROC studies
The most widely used multi-reader multi-case ROC study design for comparing imaging modalities is the fully-crossed (FC) design: every reader reads every case of both modalities. In this work, we investigate paired split-plot (PSP) designs, which allow reduced cost and increased flexibility compared to the FC design. In the PSP design, patient images from two modalities are read by the same readers, so the readings are paired across modalities. However, within each modality, not every reader reads every case. Instead, both the readers and the cases are partitioned into a number of groups and each group of readers reads its own group of cases: a split-plot design. Using U-statistic-based variance analysis for the AUC (i.e., area under the ROC curve), we show analytically that, with a fixed number of readings per reader, substantial statistical efficiency can be gained by the PSP design compared to the FC design. Equivalently, we show that the PSP design can achieve the same statistical power as the FC design with a substantially reduced number of readings. However, the efficiency/power gain of the PSP design comes at the cost of collecting a larger number of truth-verified patient cases than the FC design. This means that one can trade off between different sources of cost and choose the least burdensome design. We demonstrate the advantages of the PSP design with a real-world reader study comparing full-field digital mammography with screen-film mammography.
Technology Assessment
Interaction of anatomic and quantum noise in DBT power spectrum
In x-ray breast images, anatomical variation has been characterized by the slope of a noise power spectrum (NPS) that follows an inverse power-law relationship. Prior literature has reported that this slope (β) changes with imaging modality (DBT vs. mammography) and with different reconstruction algorithms and filters for the same breast structures. In this paper, we assessed the relative contributions of anatomic and quantum noise to the estimated magnitude of β. This was achieved via simulations with varying levels of quantum noise and by examining the contributions of noise filters. The calculations were performed on simulated DBT images from anthropomorphic software breast phantoms under varying acquisition and reconstruction/filter parameters. Our results indicate that variations in β cannot be considered solely an indicator of reduced "anatomic noise", and hence of potentially improved detectability, in those images; the presence of quantum noise and view-aliasing artifacts in anatomical regions always lowered the value of β.
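The power-law slope β is typically estimated as the (negated) slope of a straight-line fit to the NPS on log-log axes. A stdlib sketch on a synthetic spectrum, including the flat quantum-noise floor whose effect the abstract describes (illustrative values, not the paper's data):

```python
import math

def fit_beta(freqs, nps):
    """Least-squares slope of log(NPS) versus log(f); returns beta for
    NPS(f) ~ k * f**(-beta)."""
    xs = [math.log(f) for f in freqs]
    ys = [math.log(p) for p in nps]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

freqs = [0.05 * i for i in range(1, 21)]          # spatial frequencies
anatomic = [2.0 * f ** -3.0 for f in freqs]       # pure power law, beta = 3
beta_clean = fit_beta(freqs, anatomic)            # recovers 3.0

with_quantum = [p + 0.5 for p in anatomic]        # add a flat noise floor
beta_noisy = fit_beta(freqs, with_quantum)        # fitted beta drops
```

The flat floor dominates the high-frequency tail and flattens the log-log plot, so the fitted β underestimates the anatomic slope, consistent with the finding that quantum noise always lowered β.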
Comparison of synthetic 2D images with planar and tomosynthesis imaging of the breast using a virtual clinical trial
Alistair Mackenzie, Sukhmanjit Kaur, Premkumar Elangovan, et al.
The aim was to measure the threshold diameter for detection of masses and calcifications in synthetic 2D images created from planes of digital breast tomosynthesis (DBT) of a mathematical breast phantom. The results were compared to those for 2D images and DBT. Simulated ill-defined masses and calcification clusters were inserted into mathematical breast models with a thickness of 53mm. The images were simulated as if acquired on a Siemens Inspiration X-ray system. Acquisitions of 2D and DBT images of the breast phantom at a mean glandular dose (MGD) of 1.6mGy were simulated using ray tracing with allowance for unsharpness and the addition of scatter and noise. The resultant images were processed using the manufacturer’s software to create 2D, DBT planes and synthetic 2D images. Image patches with or without the lesion were extracted. These patches were used in a 4-alternative forced choice study using 5 observers to measure the threshold diameter for each imaging mode. The threshold diameters of the masses and microcalcifications were 7.0mm, 6.3mm, 7.1mm and 4.9mm (masses) and 395μm, 211μm, 220μm, and 357μm (calcifications) for synthetic 2D, 2D (1.6mGy), 2D (1.1mGy) and DBT respectively. The threshold diameters were 10% (p=0.4) and 47% (p<0.0001) smaller for 2D images compared to synthetic 2D images for masses and calcification respectively at a MGD of 1.6mGy. At the same dose, the threshold diameter for small calcifications was larger for synthetic 2D images than 2D images, but no significant differences were found for masses between 2D and synthetic 2D.
Assessment of DBT acquisition parameters for 2D and 3D search tasks
A concern with using mathematical model observers to gauge medical image quality is whether and to what degree task simplifications can affect study outcomes. Researchers are interested in assessments based on clinically realistic tasks, but routinely implement simplified tasks to manage time and computation. The goal of this work is to examine how optimization of digital breast tomosynthesis (DBT) acquisition parameters can be influenced by the consideration of 2D or 3D search tasks. Localization ROC (LROC) observer studies were based on simulated image slices and volumes obtained from low- and medium-density digital breast phantoms containing 8-mm spherical masses. An analytic cone-beam projector used an acquisition arc of 60° while the number of angular projections varied from 3 to 51. Image volumes were reconstructed with the Feldkamp FBP algorithm and then postfiltered and thresholded to eliminate negative pixel values. A visual-search (VS) model observer was applied for both the 2D and 3D LROC studies. The observer used 2D spatial derivatives as features to find suspicious candidate locations in an image. The candidates were compared by means of a binary Hotelling discriminant. Preliminary results indicated substantially reduced performance with the 3D task, particularly for the more-dense cases.
Quantifying predictive capability of electronic health records for the most harmful breast cancer
Yirong Wu, Jun Fan, Peggy Peissig, et al.
Improved prediction of the "most harmful" breast cancers, those that cause the most substantive morbidity and mortality, would enable physicians to target more intense screening and preventive measures at the women who have the highest risk; however, such prediction models have rarely been developed. Electronic health records (EHRs) represent an underused data source with great research and clinical potential. Our goal was to quantify the value of EHR variables in predicting the risk of the "most harmful" breast cancer. From an existing personalized-medicine data repository, we identified 794 subjects who had breast cancer with primary non-benign tumors and an earliest diagnosis on or after 1/1/2004, including 395 "most harmful" and 399 "least harmful" breast cancer cases. For these subjects, we collected EHR data comprising 6 components: demographics, diagnoses, symptoms, procedures, medications, and laboratory results. We developed two regularized prediction models, Ridge Logistic Regression (Ridge-LR) and Lasso Logistic Regression (Lasso-LR), to predict the "most harmful" breast cancer one year in advance. The area under the ROC curve (AUC) was used to assess model performance. We observed AUCs of 0.818 and 0.839 for the Ridge-LR and Lasso-LR models, respectively. For both models, the predictive performance of the full set of EHR variables was significantly higher than that of each individual component (p<0.001). In conclusion, EHR variables can be used to predict the "most harmful" breast cancer, offering the possibility of personalizing care for the women at highest risk in clinical practice.
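Ridge-LR is ordinary logistic regression with an L2 penalty on the weights (Lasso-LR swaps in an L1 penalty, which drives many weights exactly to zero). A toy one-feature version fit by gradient descent, on hypothetical data (the study's models used far richer EHR feature sets):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_ridge_logreg(xs, ys, lam=0.1, lr=0.5, steps=2000):
    """Gradient descent on the L2-penalized logistic loss for a single
    feature plus intercept: a minimal stand-in for Ridge-LR."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n + lam * w
        gb = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Hypothetical single risk feature: higher value -> "most harmful" (label 1)
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_ridge_logreg(xs, ys)
preds = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in xs]  # matches ys
```

The penalty term lam * w shrinks the weight toward zero, trading a little training fit for better generalization; replacing it with a subgradient of lam * |w| would give the Lasso variant.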
Test data reuse for evaluation of adaptive machine learning algorithms: over-fitting to a fixed 'test' dataset and a potential solution
Alexej Gossmann, Aria Pezeshk, Berkman Sahiner
After the initial release of a machine learning algorithm, subsequently gathered data can be used to augment the training dataset in order to modify or fine-tune the algorithm. For an algorithm performance evaluation that generalizes to a targeted population of cases, test datasets randomly drawn from the targeted population are ideally used. To ensure that test results generalize to new data, the algorithm needs to be evaluated on new and independent test data each time a new performance evaluation is required. However, medical test datasets of sufficient quality are often hard to acquire, and it is tempting to utilize a previously used test dataset for a new performance evaluation. With extensive simulation studies, we illustrate how such a "naive" approach to test data reuse can inadvertently result in overfitting the algorithm to the test data, even when only a global performance metric is reported back from the test dataset. This overfitting leads to a loss of generalization and overly optimistic conclusions about algorithm performance. We investigate the use of the Thresholdout method of Dwork et al. (Ref. 1) to tackle this problem. Thresholdout allows repeated reuse of the same test dataset. It essentially reports a noisy version of the performance metric on the test data, and provides theoretical guarantees on how many times the test dataset can be accessed while ensuring that the reported answers generalize to the underlying distribution. With extensive simulation studies, we show that Thresholdout indeed substantially reduces the problem of overfitting to the test data under the simulated conditions, at the cost of a mild additional uncertainty in the reported test performance. We also extend some of the theoretical guarantees to the area under the ROC curve as the reported performance metric.
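The core of Thresholdout fits in a few lines: answer a query from the training set while it agrees with the holdout set to within a noisy threshold, and only otherwise reveal a noise-perturbed holdout value. A simplified sketch of the mechanism described by Dwork et al.; the parameter values here are illustrative, not the paper's:

```python
import random

def thresholdout(train_metric, holdout_metric, threshold=0.04,
                 sigma=0.005, rng=None):
    """Simplified Thresholdout: return the training metric while it tracks
    the holdout metric; otherwise return a noisy holdout value."""
    rng = rng or random.Random()
    if abs(train_metric - holdout_metric) < threshold + rng.gauss(0.0, sigma):
        return train_metric
    return holdout_metric + rng.gauss(0.0, sigma)

rng = random.Random(42)
honest = thresholdout(0.81, 0.80, rng=rng)   # agreement: train value passes through
overfit = thresholdout(0.93, 0.80, rng=rng)  # drift: noisy holdout value returned
```

Because the holdout set only ever leaks through this noisy channel, an adaptive analyst cannot chase its idiosyncrasies, which is the source of the method's reuse guarantees.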
Towards the use of computationally inserted lesions for mammographic CAD assessment
Zahra Ghanian, Aria Pezeshk, Nicholas Petrick, et al.
Computer-aided detection (CADe) devices used for breast cancer detection on mammograms are typically first developed and assessed for a specific “original” acquisition system, e.g., a specific image detector. When CADe developers are ready to apply their CADe device to a new mammographic acquisition system, they typically assess the CADe device with images acquired using the new system. Collecting large repositories of clinical images containing verified cancer locations and acquired by the new image acquisition system is costly and time consuming. Our goal is to develop a methodology to reduce the clinical data burden in the assessment of a CADe device for use with a different image acquisition system. We are developing an image blending technique that allows users to seamlessly insert lesions imaged using an original acquisition system into normal images or regions acquired with a new system. In this study, we investigated the insertion of microcalcification clusters imaged using an original acquisition system into normal images acquired with that same system utilizing our previously-developed image blending technique. We first performed a reader study to assess whether experienced observers could distinguish between computationally inserted and native clusters. For this purpose, we applied our insertion technique to clinical cases taken from the University of South Florida Digital Database for Screening Mammography (DDSM) and the Breast Cancer Digital Repository (BCDR). Regions of interest containing microcalcification clusters from one breast of a patient were inserted into the contralateral breast of the same patient. The reader study included 55 native clusters and their 55 inserted counterparts. Analysis of the reader ratings using receiver operating characteristic (ROC) methodology indicated that inserted clusters cannot be reliably distinguished from native clusters (area under the ROC curve, AUC=0.58±0.04). 
Furthermore, CADe sensitivity was evaluated on mammograms with native and inserted microcalcification clusters using a commercial CADe system. For this purpose, we used full field digital mammograms (FFDMs) from 68 clinical cases, acquired at the University of Michigan Health System. The average sensitivities for native and inserted clusters were equal, 85.3% (58/68). These results demonstrate the feasibility of using the inserted microcalcification clusters for assessing mammographic CAD devices.
Model Observers I
icon_mobile_dropdown
Correlation between model observers in uniform background and human observers in patient liver background for a low-contrast detection task in CT
The channelized Hotelling observer (CHO) has demonstrated strong correlation with the human observer (HO) in both single-slice and multi-slice viewing modes for low-contrast detection tasks with uniform backgrounds. However, it remains unknown whether the simplest single-slice CHO in a uniform background can be used to predict human observer performance in more realistic tasks that involve patient anatomical background and a multi-slice viewing mode. In this study, we aim to investigate the correlation between CHO performance in a uniform water background and human observer performance in a multi-slice viewing mode on patient liver backgrounds for a low-contrast lesion detection task. The human observer study was performed on CT images from 7 abdominal CT exams. A noise insertion tool was employed to synthesize CT scans at two additional dose levels, and a validated lesion insertion tool was used to numerically insert metastatic liver lesions of various sizes and contrasts into both phantom and patient images. We selected 12 of 72 possible experimental conditions to evaluate the correlation across radiation doses, lesion sizes, lesion contrasts, and reconstruction algorithms. CHOs in both single- and multi-slice viewing modes were strongly correlated with the HO: the corresponding Pearson's correlation coefficients were 0.989 (with 95% confidence interval (CI) [0.960, 0.997]) and 0.982 (with 95% CI [0.936, 0.995]) in single-slice and multi-slice viewing modes, respectively. This study therefore demonstrates the potential of the simplest single-slice CHO to assess image quality for more realistic, clinically relevant CT detection tasks.
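For readers unfamiliar with the CHO, its detectability index can be computed from channelized image data as in this minimal numpy sketch; the function name, channel matrix, and data layout here are generic illustrations, not the study's actual setup.

```python
import numpy as np

def cho_detectability(sig_imgs, bkg_imgs, channels):
    """Channelized Hotelling observer detectability index (illustrative).

    sig_imgs, bkg_imgs: (n_images, n_pixels) arrays of signal-present
    and signal-absent images; channels: (n_pixels, n_channels) matrix
    whose columns are the channel templates.
    """
    v_sig = sig_imgs @ channels          # channel outputs, signal present
    v_bkg = bkg_imgs @ channels          # channel outputs, signal absent
    delta = v_sig.mean(axis=0) - v_bkg.mean(axis=0)
    # pooled intra-class channel covariance
    S = 0.5 * (np.cov(v_sig, rowvar=False) + np.cov(v_bkg, rowvar=False))
    # d'^2 = delta^T S^{-1} delta
    return float(np.sqrt(delta @ np.linalg.solve(S, delta)))
```

Because only the small channel-space covariance is inverted, the estimate remains stable with far fewer images than a pixel-space Hotelling observer would need.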
Lesion detection performance of cone beam CT images with anatomical background noise: single-slice vs. multi-slice human and model observer study
We investigate lesion detectability and its trends for different noise structures in single-slice and multislice CBCT images with anatomical background noise. Anatomical background noise is modeled using a power law spectrum of breast anatomy. A spherical signal with a 2 mm diameter is used to model a lesion. CT projection data are generated by forward projection and reconstructed with the Feldkamp-Davis-Kress algorithm. To generate different noise structures, two types of reconstruction filters (Hanning and Ram-Lak weighted ramp filters) are used in the reconstruction, and the transverse and longitudinal planes of the reconstructed volume are used for detectability evaluation. To evaluate single-slice images, the central slice, which contains the maximum signal energy, is used. To evaluate multislice images, the central nine slices are used. Detectability is evaluated using human and model observer studies. For the model observer, a channelized Hotelling observer (CHO) with dense difference-of-Gaussian (D-DOG) channels is used. For all noise structures, detectability by a human observer is higher for multislice images than single-slice images, and the degree of detectability increase in multislice images depends on the noise structure. Variation in detectability across noise structures is reduced in multislice images, but detectability trends are not much different between single-slice and multislice images. The CHO with D-DOG channels predicts detectability by a human observer well for both single-slice and multislice images.
A practical method to evaluate personalized injected patient dose for cardiac perfusion SPECT imaging: the polar map as a numerical observer
P. Hendrik Pretorius, Michael A. King, Karen L. Johnson
The diversity in the patient population necessitates a more refined dose reduction approach in cardiac perfusion imaging. We have recently formulated a strategy to better calculate individual, personalized injected doses using the body mass index. The purpose of this study is to present a practical method to evaluate the efficacy of personalizing the injected dose employing the polar map methodology. Two hundred and fifty-two normally-read patient studies were used either to determine the personalized dose or to test the dose reduction strategy. Fifty of the test patient studies were altered by inserting perfusion defects in the LV wall. The original (full-dose) as well as the personalized-dose data were reconstructed using OSEM (ordered subsets expectation maximization) with attenuation, scatter, and spatial resolution compensation. The ROC results show that the personalized dose strategy does not adversely affect the detection of perfusion defects.
Parameter selection with the Hotelling observer in linear iterative image reconstruction for breast tomosynthesis
Sean D. Rose, Jacob Roth, Cole Zimmerman, et al.
In this work we investigate an efficient implementation of a region-of-interest (ROI) based Hotelling observer (HO) in the context of parameter optimization for detection of a rod signal at two orientations in linear iterative image reconstruction for DBT. Our preliminary results suggest that ROI-HO performance trends may be efficiently estimated by modeling only the 2D plane perpendicular to the detector and containing the X-ray source trajectory. In addition, the ROI-HO is seen to exhibit orientation dependent trends in detectability as a function of the regularization strength employed in reconstruction. To further investigate the ROI-HO performance in larger 3D system models, we present and validate an iterative methodology for calculating the ROI-HO. Lastly, we present a real data study investigating the correspondence between ROI-HO performance trends and signal conspicuity. Conspicuity of signals in real data reconstructions is seen to track well with trends in ROI-HO detectability. In particular, we observe orientation dependent conspicuity matching the orientation dependent detectability of the ROI-HO.
A deep learning model observer for use in alternative forced choice virtual clinical trials
M. Alnowami, G. Mills, M. Awis, et al.
Virtual clinical trials (VCTs) represent an alternative assessment paradigm that overcomes issues of dose, high cost and delay encountered in conventional clinical trials for breast cancer screening. However, fully utilizing the potential benefits of VCTs requires a machine-based observer that can rapidly and realistically process large numbers of experimental conditions. To address this, a Deep Learning Model Observer (DLMO) was developed and trained to identify lesion targets from normal tissue in small (200 x 200 pixel) image segments, as used in Alternative Forced Choice (AFC) studies. The proposed network consists of 5 convolutional layers with 2x2 kernels and ReLU (Rectified Linear Unit) activations, followed by max pooling with size equal to that of the final feature maps, and three dense layers. The class output weights from the final fully connected dense layer are used to consider sets of n images in an n-AFC paradigm and determine the image most likely to contain a target. To examine the DLMO's performance on clinical data, a training set of 2814 normal and 2814 biopsy-confirmed malignant mass targets was used. This produced a sensitivity of 0.90 and a specificity of 0.92 when presented with a test data set of 800 previously unseen clinical images. To examine the DLMO's minimum detectable contrast, a second dataset of 630 simulated backgrounds and 630 images with simulated lesion and spherical targets (4 mm and 6 mm diameter) produced contrast thresholds equivalent to or better than human observer performance for spherical targets, and comparable (12% difference) for lesion targets.
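The n-AFC decision rule the abstract describes (choose the alternative with the largest target-class score) is simple to state in code. In the sketch below, `score_fn` stands in for the trained network's lesion-class output and is purely an illustrative placeholder.

```python
import numpy as np

def afc_choice(scores):
    """Pick the alternative most likely to contain the target in an
    n-AFC trial, given one target-class score per image (e.g., a
    network's softmax output for the 'lesion' class)."""
    return int(np.argmax(scores))

def afc_proportion_correct(score_fn, trials, target_index=0):
    """Fraction of n-AFC trials in which the scored choice matches the
    known target position. `trials` is a list of lists of image arrays,
    with the target always at `target_index` (an illustrative convention)."""
    hits = sum(afc_choice([score_fn(img) for img in t]) == target_index
               for t in trials)
    return hits / len(trials)
```

Sweeping target contrast and finding where the proportion correct crosses a criterion level then yields the contrast threshold reported in studies like this one.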
Model Observers II
icon_mobile_dropdown
Towards a surround-aware numerical observer
Ali R. N. Avanaki, Kathryn S. Espig, Albert Xthona, et al.
Motivated by the fact that the visibility of an object is affected by its surrounding brightness, we design a surround-aware anthropomorphic numerical observer and conduct experiments to validate it. We derive the observer based on Barten's formula, which predicts the visibility of a sinusoidal pattern in a large, uniformly lit surround field. The following are the key steps, as well as assumptions, in the observer derivation. We deduce the effect of a ring-shaped surround from the predicted visibility thresholds for two large surrounds with different eccentricities from the target. Moreover, we theorize that the visibility of a small round object (akin to a microcalcification in a breast radiograph) is the same as that of a single cycle of a sinusoid. Assuming independent detection of sinusoid cycles, we calculate the visibility threshold for a single-cycle target from that of a multicycle sinusoidal pattern, which is predicted by Barten's formula. The validation experiments are set up to isolate the effects of surround luminance, its eccentricity from the target, and its size. Our experimental results indicate that a surround considerably different from the target in luminance hinders the target's visibility. Moreover, we observe that a larger surround, and one closer to the target, tends to have a greater impact. These observations are predicted by the proposed numerical observer. We also note that a dark surround seems to adversely affect the visibility of a bright target considerably more than a bright surround affects the visibility of a dark target. This asymmetry, however, cannot be predicted by the proposed observer.
Evaluation of a machine learning based model observer for x-ray CT
Felix K. Kopp, Marco Catalano, Daniela Pfeiffer, et al.
In the medical imaging domain, image quality assessment is usually carried out by human observers (HuO) performing a clinical task in reader studies. To overcome time-consuming reader studies, numerical model observers (MO) were introduced and are now widely used in the CT research community to predict the performance of HuOs. In recent years, machine learning based MOs showed promising results for SPECT. Therefore, we built a neural network, a so-called softmax regression model, as an MO for x-ray CT. Performance was evaluated by comparison to one of the most prevalent MOs, the channelized Hotelling observer (CHO). CT image data labeled with confidence ratings assessed in a reader study for a detection task, with signals of different sizes, different noise levels, and different reconstruction algorithms, were used to train and test the MOs. Data were acquired with a clinical CT scanner. For each of four different x-ray radiation exposures, there were 208 repeated scans of a Catphan phantom. The neural network based MO (NN-MO) as well as the CHO showed good agreement with the performance in the reader study.
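As a rough sketch of what a softmax regression MO looks like, here is a single-layer network trained on feature vectors with cross-entropy loss; the learning rate, iteration count, and initialization are illustrative choices, not the authors' configuration.

```python
import numpy as np

def train_softmax_regression(X, y, n_classes=2, lr=0.1, n_iter=500, seed=0):
    """Softmax (multinomial logistic) regression fit by gradient descent.

    X: (n_samples, n_features); y: integer class labels in [0, n_classes).
    Returns the weight matrix W and bias vector b."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.01, (X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                    # one-hot labels
    for _ in range(n_iter):
        z = X @ W + b
        z -= z.max(axis=1, keepdims=True)       # numerical stability
        p = np.exp(z); p /= p.sum(axis=1, keepdims=True)
        grad = (p - Y) / len(X)                 # cross-entropy gradient
        W -= lr * X.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict_proba(X, W, b):
    """Class probabilities under the fitted softmax model."""
    z = X @ W + b
    z -= z.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)
```

The signal-present probability output by such a model plays the same role as the CHO's scalar test statistic when computing detection performance.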
Observer templates in 2D and 3D localization tasks
In this study we examine search performance for 3D forced-localization tasks in Gaussian random textures in which subjects are able to freely scroll through the image as part of their search for the target. We also evaluate a 2D single-slice version of the same task for comparison. We analyze these experiments using both efficiency with respect to the Ideal Observer and the classification image technique, which directly estimates the weighting function used by observers for a task. We are particularly interested in whether subjects can efficiently integrate across multiple slices in depth as part of performing the localization task.

In the 3D tasks, the image display we use allows subjects to freely scroll through a volumetric image, and a localization response is made through a mouse-click on the image. The search region has a relatively modest size (approx. 8.8° visual angle). Localization responses are considered correct if they are close to the target center (within 6 voxels). The classification image methodology uses noise fields from the incorrect localizations to build an estimate of the weights used by the observer to perform the task. The basic idea is that incorrect localizations occur in regions of the image where the noise field matches the weighting profile, thereby eliciting a strong internal response.

The efficiency results indicate differences between 2D and 3D search tasks, with lower efficiency for the large target in the 3D task. The classification images suggest that this finding can be explained by a lack of spatial integration across slices.
Reducing the number of reconstructions needed for estimating channelized observer performance
Angel R. Pineda, Hope Miedema, Melissa Brenner, et al.
A challenge for task-based optimization arises in applications where each image reconstruction is time-consuming. Our goal is to reduce the number of reconstructions needed to estimate the area under the receiver operating characteristic curve (AUC) of the infinitely-trained optimal channelized linear observer. We explore the use of classifiers which either do not invert the channel covariance matrix or do feature selection. We also study the assumption that multiple low contrast signals in the same image of a non-linear reconstruction do not significantly change the estimate of the AUC. We compared the AUC of several classifiers (Hotelling, logistic regression, logistic regression using Firth bias reduction, and the least absolute shrinkage and selection operator (LASSO)) with a small number of observations, both for normal simulated data and for images from a total variation reconstruction in magnetic resonance imaging (MRI). We used 10 Laguerre-Gauss channels and the Mann-Whitney estimator for AUC. For this data, our results show that at small sample sizes, feature selection using the LASSO technique can decrease the bias of the AUC estimate at the cost of increased variance, and that at large sample sizes the differences between these classifiers are small. We also compared the use of multiple signals in a single reconstructed image to reduce the number of reconstructions in a total variation reconstruction for accelerated imaging in MRI. We found that AUC estimation using multiple low contrast signals in the same image produced AUC estimates similar to doing a single reconstruction per signal, leading to a 13x reduction in the number of reconstructions needed.
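The Mann-Whitney estimator used here for AUC is the fraction of correctly ordered (signal-present, signal-absent) rating pairs, with ties counted as half. A compact numpy version, applicable to the scalar outputs of any of the classifiers above, might look like:

```python
import numpy as np

def mann_whitney_auc(pos_scores, neg_scores):
    """Nonparametric (Mann-Whitney / Wilcoxon) AUC estimate: the fraction
    of (positive, negative) score pairs ranked correctly, ties counted 1/2."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    n_pairs = pos.size * neg.size
    return float(((pos > neg).sum() + 0.5 * (pos == neg).sum()) / n_pairs)
```

Being rank-based, this estimate is invariant to any monotonic transform of the classifier outputs, which is what makes it a fair way to compare the different classifiers.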
A resampling comparison of CHO's detectability index bias and uncertainty
Francesc Massanes, Alexandre Ba, François Bochud, et al.
Model observers have gained popularity as a surrogate approach for image quality assessment; they are often used for optimizing reconstruction algorithms. The most widespread model observer is the channelized Hotelling observer (CHO), which measures image quality through the detectability index (or the associated area under the receiver operating characteristic curve). In this work we explore different resampling methods used to estimate the CHO's performance and its uncertainty. Using data from the inter-laboratory comparison of the computation of the CHO model observer, we established a simulation framework to fully evaluate different resampling methods, namely leave-one-out and bootstrapping with replacement, for estimating the CHO's detectability index bias and uncertainty. For this particular study, we focus our experiments on datasets with few data samples: 200 normal and 200 abnormal images.
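Bootstrapping with replacement, one of the resampling methods compared, can be illustrated on observer ratings as follows. Note that for brevity this sketch resamples final scores, whereas the study resamples the image data fed to the CHO.

```python
import numpy as np

def bootstrap_auc_uncertainty(pos_scores, neg_scores, n_boot=1000, seed=0):
    """Bootstrap-with-replacement estimate of the mean and standard error
    of the Mann-Whitney AUC computed from rating data (illustrative)."""
    rng = np.random.default_rng(seed)
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    aucs = np.empty(n_boot)
    for b in range(n_boot):
        # resample each class with replacement
        p = rng.choice(pos, size=pos.size, replace=True)
        n = rng.choice(neg, size=neg.size, replace=True)
        aucs[b] = ((p[:, None] > n[None, :]).mean()
                   + 0.5 * (p[:, None] == n[None, :]).mean())
    return float(aucs.mean()), float(aucs.std(ddof=1))
```

Comparing the bootstrap mean against the full-sample AUC exposes the estimator's bias, while the spread of the bootstrap replicates gives its uncertainty.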
Observer Performance Evaluation II and Tribute to Art Burgess
icon_mobile_dropdown
Reader performance in visual assessment of breast density using visual analogue scales: Are some readers more predictive of breast cancer?
Millicent Rayner, Elaine F. Harkness, Philip Foden, et al.
Mammographic breast density is one of the strongest risk factors for breast cancer, and is used in risk prediction and for deciding appropriate imaging strategies. In the Predicting Risk Of Cancer At Screening (PROCAS) study, percent density estimated by two readers on Visual Analogue Scales (VAS) has shown a strong relationship with breast cancer risk when assessed against automated methods. However, this method suffers from reader variability. This study aimed to assess the performance of PROCAS readers using VAS, and to identify those most predictive of breast cancer. We selected the seven readers who had estimated density on over 6,500 women including at least 100 cancer cases, analysing their performance using multivariable logistic regression and Receiver Operating Characteristic (ROC) analysis. All seven readers showed statistically significant odds ratios (OR) for cancer risk according to VAS score after adjusting for classical risk factors. The OR was greatest for reader 18 at 1.026 (95% CI 1.018-1.034). Adjusted areas under the ROC curve (AUCs) were statistically significant for all readers, but greatest for reader 14 at 0.639. Further analysis of the VAS scores for these two readers showed reader 14 had higher sensitivity (78.0% versus 42.2%), whereas reader 18 had higher specificity (78.0% versus 46.0%). Our results demonstrate individual differences when assigning VAS scores; one reader better identified those at increased risk, whereas another better identified low-risk individuals. However, despite their different strengths, both readers showed similar predictive abilities overall. Standardised training for VAS may reduce reader variability and improve the consistency of VAS scoring.
Interactions of lesion detectability and size across single-slice DBT and 3D DBT
Miguel A. Lago, Craig K. Abbey, Bruno Barufaldi, et al.
Three-dimensional image modalities introduce a new paradigm for visual search, requiring visual exploration of a larger search space than 2D imaging modalities. The large number of slices in 3D volumes and the limited reading times make it difficult for radiologists to explore thoroughly by fixating with their high-resolution fovea on all regions of each slice. Thus, for 3D images, observers must rely much more on their visual periphery (points away from fixation) to process image information. We previously found a dissociation in signal detectability between 2D and 3D search tasks for small signals in synthetic textures evaluated with non-radiologist trained observers. Here, we extend our evaluation to more clinically realistic backgrounds and radiologist observers. We studied the detectability of simulated microcalcifications (MCALC) and masses (MASS) in Digital Breast Tomosynthesis (DBT) utilizing virtual breast phantoms. We compared the lesion detectability of 8 radiologists during free search in 3D DBT and a 2D single-slice DBT (the center slice of the 3D DBT). Our results show that the detectability of the microcalcification degrades significantly in 3D DBT with respect to the 2D single-slice DBT. On the other hand, the detectability of masses does not show this behavior and is not significantly different between the two modes. The large deterioration of the 3D detectability of microcalcifications relative to masses may be related to peripheral processing, given the high number of cases in which the microcalcification was missed and the high number of search errors. Together, the results extend previous findings with synthetic textures and highlight how search in 3D images is distinct from 2D search as a consequence of the interaction between search strategies and the visibility of signals in the visual periphery.
Lesion classification with a visual-search model observer
The objective of this work was to test the capabilities of visual-search (VS) model observers for target classification. In this paper, a localization and classification ROC study was conducted with simulated single-pinhole nuclear medicine images. The images featured two sizes of Gaussian targets in Gaussian lumpy backgrounds, with one target twice the size of the other. Pinhole size was a study variable. The VS observer performed both the localization and classification. Three human observers also participated in the study. The trends in localization performance as a function of pinhole size for the VS and average human results were in good agreement. For the classification task, the VS and human observers performed on par, but with substantial differences in how they were affected by pinhole size. The VS observer correctly classified the smaller target less often than the larger target even though both targets were correctly localized with the same frequency.
A citizen science approach to optimising computer aided detection (CAD) in mammography
Georgia V. Ionescu, Elaine F. Harkness, Johan Hulleman, et al.
Computer aided detection (CAD) systems assist medical experts during image interpretation. In mammography, CAD systems prompt suspicious regions, which helps medical experts to detect early signs of cancer. This is a challenging task: prompts may appear in regions that are actually normal, whilst genuine cancers may be missed. The effect prompting has on readers' performance is not fully known. In order to explore the effects of prompting errors, we have created an online game (Bat Hunt), designed for non-experts, that mirrors mammographic CAD; this allows us to explore a wider parameter space. Users are required to detect bats in images of flocks of birds, with image difficulty matched to the proportions of screening mammograms in different BI-RADS density categories. Twelve prompted conditions were investigated, along with unprompted detection. On average, players achieved a sensitivity of 0.33 for unprompted detection, and sensitivities of 0.75, 0.83, and 0.92 respectively for 70%, 80%, and 90% of targets prompted, regardless of CAD specificity. False prompts distract players from finding unprompted targets when they appear in the same image. Player performance decreases as the number of false prompts increases, and increases proportionally with prompting sensitivity. The median d' was lowest for the unprompted condition (1.08) and highest for 90% prompting sensitivity with 0.5 false prompts per image (d' = 4.48).
Can a limited double reading/second opinion of initially recalled breast ultrasound screening examinations improve radiologists' performances?
David Gur, Kimberly Harnist, Terri-Ann Gizienski, et al.
Interpretations of breast ultrasound screening examinations result in high recall rates and large inter-radiologist variability, frequently leading to "conservative" recommendations. Double reading of all breast ultrasound screening examinations is cost prohibitive, but double reading of only "initially recalled" cases may prove efficacious. We assessed changes in recommendations, if any, by providing a consensus second opinion in a limited subset of examinations initially recommended for recall. We performed a retrospective reader study with 197 ultrasound examinations (97 not recalled and 100 recalled clinically). First, we generated a consensus "second opinion" consisting of the majority vote of three independent readings of each case by experienced ultrasound interpreters. During the reader study that followed, if the reader recommended a "recall" and the "consensus second opinion" did not, a message to that effect was displayed and the reader was asked to re-review the exam and re-assess whether, knowing the second opinion, a re-rating of the case was warranted. We compared performance levels before and after the second opinion. The second opinion resulted in "no recall" recommendations for 141 cases in the entire set, including four cancer cases missed by all three readers. On average, radiologists received "warning" messages in 30 cases (range 15-50), or in ~15% of cases. Rating changes (downgrades to no recall) occurred in 36 of these cases. These changes resulted in a possible recall rate reduction of 28% in prompted cases, or a 14% overall recall reduction, while increasing the false negative rate by only one case missed by 2 readers (~1%).
Poster Session
icon_mobile_dropdown
Study of CT image texture using deep learning techniques
Sandeep Dutta, Jiahua Fan, David Chevalier
For CT imaging, reducing radiation dose while improving or maintaining image quality (IQ) is currently a very active research and development topic. Iterative reconstruction (IR) approaches have been suggested to offer a better IQ-to-dose ratio than conventional filtered back projection (FBP) reconstruction. However, it has been widely reported that CT image texture from IR often differs from that of FBP. Researchers have proposed different figures of merit to quantify the texture from different reconstruction methods, but the field still lacks a practical and robust method for texture description. This work applied deep learning methods to CT image texture study. Multiple dose scans of a 20 cm diameter cylindrical water phantom were performed on a Revolution CT scanner (GE Healthcare, Waukesha) and the images were reconstructed with FBP and four different IR reconstruction settings. The training images generated were randomly split (80:20) into training and validation sets. An independent test set of 256-512 images per class was collected with the same scan and reconstruction settings. Multiple deep learning (DL) networks with convolution, ReLU activation, max-pooling, fully-connected, global average pooling and softmax activation layers were investigated. The impact of different image patch sizes for training was investigated, and both original pixel data and normalized image data were evaluated. DL models were reliably able to classify CT image texture with accuracy up to ~99%. These results suggest that deep learning techniques can robustly characterize CT image texture, supporting the assessment of IR methods that may help lower radiation dose compared to FBP.
Assessment of computerized algorithms by comparing with human observers in binary classification tasks: a simulation study
It is generally recognized that recent advancements in computer vision, especially the development of deep convolutional neural networks, have substantially improved the performance of computerized algorithms in medical imaging for classification tasks such as cancer detection/diagnosis. These advancements underscore the importance of the question of how the computer algorithm's stand-alone performance compares with the performance of physicians. Current literature often uses descriptive statistics or a visual check of plots for the comparison, lacking quantitative and rigorous statistical inference. In this work, we developed a U-statistic based approach to estimate the variance of the performance difference between an algorithm and a group of human observers in a binary classification task. The performance metric considered in this work is percent correct (PC), e.g., sensitivity or specificity. Our variance estimation treats both human observers and patient cases as random samples and accounts for both sources of variability, thereby allowing the conclusion to be generalizable to both the patient and the physician populations. Moreover, we investigated a z-statistic method based on our variance estimator for hypothesis testing. Our simulation results show that our variance estimator for the PC performance difference is unbiased. The normal approximation method using our variance estimator for hypothesis testing appears useful for large sample sizes.
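The paper's variance estimator is U-statistic based; as an illustrative substitute that likewise treats both readers and cases as random, the percent-correct (PC) difference and a two-way bootstrap of its standard error can be sketched as follows.

```python
import numpy as np

def pc_difference(correct_alg, correct_readers):
    """Percent-correct difference between an algorithm and the average of
    a reader group on the same cases. correct_alg: (n_cases,) 0/1 scores;
    correct_readers: (n_readers, n_cases) 0/1 scores."""
    correct_alg = np.asarray(correct_alg, dtype=float)
    correct_readers = np.asarray(correct_readers, dtype=float)
    return float(correct_alg.mean() - correct_readers.mean())

def two_way_bootstrap_se(correct_alg, correct_readers, n_boot=2000, seed=0):
    """Standard error of the PC difference with both readers and cases
    treated as random, via a two-way bootstrap (an illustrative substitute
    for the paper's U-statistic variance estimator)."""
    rng = np.random.default_rng(seed)
    correct_alg = np.asarray(correct_alg, dtype=float)
    correct_readers = np.asarray(correct_readers, dtype=float)
    n_readers, n_cases = correct_readers.shape
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        c = rng.integers(0, n_cases, n_cases)      # resample cases
        r = rng.integers(0, n_readers, n_readers)  # resample readers
        diffs[b] = (correct_alg[c].mean()
                    - correct_readers[np.ix_(r, c)].mean())
    return float(diffs.std(ddof=1))
```

Because cases and readers are resampled jointly, the resulting uncertainty generalizes across both populations, which is the point the abstract makes about its U-statistic estimator.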
Development of a validation instrument in myocardial perfusion imaging: results of first flow experiments
Marije E. Kamphuis, Gert Jan Pelgrim, Marcel J. W. Greuter, et al.
Institutional diagnostic workflows regarding coronary artery disease (CAD) may differ greatly. Myocardial perfusion imaging (MPI) is a commonly used diagnostic method in CAD, whereby multiple modalities are deployed to assess relative or absolute myocardial blood flow (MBF) (e.g. with SPECT, PET, MR, CT, or combinations). In line with proper clinical decision-making, it is essential to assess institutional MPI test validity by comparing MBF assessment against a ground truth. Our research focuses on developing such a validation instrument/method for MPI by simulating controlled myocardial perfusion in a phantom flow setup. A first step in the process of method development and validation was to specify basic requirements for the phantom flow setup. First tests in CT-MPI were aimed at gaining experience in clinical testing, verifying to which extent the set requirements are met, and evaluating the steps needed to further improve the accuracy and reproducibility of measurements. The myocardium was simulated as a static cylinder and placed in a controllable pulsatile flow circuit, with flow sensors used as reference. First flow experiments were performed for different stroke volumes (20-35 mL/stroke). After contrast injection, dynamic MPI-CT scans (SOMATOM Force, Siemens) were obtained to investigate the relation between first-pass measured and computed flow. We observed a moderate correlation; hence, the required accuracy and reproducibility levels were not met. However, we have gained new insights into factors regarding the measurement setup and MBF computation process that might affect instrument validation, which we will incorporate in future flow setup design and testing.
Ischemic stroke enhancement in computed tomography scans using a computational approach
Allan F. F. Alves, Ana L. M. Pavan, Rachid Jennane, et al.
In this work, a novel approach was proposed to enhance the visual perception of ischemic stroke in computed tomography scans. Through different image processing techniques, we enabled less experienced physicians to reliably detect early signs of stroke. A set of 40 retrospective CT scans of patients was used, divided into two groups: 25 cases of acute ischemic stroke and 15 normal cases used as a control group. All cases were obtained within 4 hours of symptom onset. Our approach was based on the variational decomposition model and three different segmentation methods. A test determined observers' performance in correctly diagnosing stroke cases. The Expectation Maximization method provided the best results among all observers. With the enhancement, the overall sensitivity of the observers' analysis increased from 64% to 79%, and the overall specificity increased from 67% to 78%. These results show the importance of a computational tool to assist neuroradiology decisions, especially in critical situations such as the diagnosis of ischemic stroke.
Projection space model observers based on marginal linear discriminants
Past studies with tomographic reconstructions have shown that visual-search (VS) model observers can be used to evaluate acquisition protocols in medical imaging. However, studies conducted directly in projection space could be more efficient. A localization ROC (LROC) study was conducted with two VS observers and sets of simulated CT projections and reconstructions generated from a clinically realistic 2D lumbar-spine phantom. The phantom was an axial slice through the L3 vertebra of a clinical CT. Simulated 1-cm circular lesions had a relative contrast of 1.5. The acquisitions contained from 15 to 512 parallel-beam projections over 180 degrees. Reconstructions were generated with backprojection (BP) and filtered BP (FBP). Both observers identified and compared suspicious candidate locations in an image by means of feature extraction. One observer used the lesion gradient while the other used the gradients for a set of approximate lesion profiles. Observer performance with projections and BP images was highly correlated as a function of the number of projections. FBP performance was lower but still correlated with projection and BP-image performance. VS observers may provide a novel means of optimizing CT acquisitions under clinically relevant tasks using projection data.
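For a uniform circular lesion like the 1-cm one described above, the ideal parallel-beam projection has a closed form: the line integral at detector offset s is the contrast times the chord length, 2c·sqrt(r² − s²), identical at every angle by symmetry. A small sketch (illustrative, not the paper's simulation code):

```python
import math

def disk_projection(radius, contrast, s):
    """Ideal parallel-beam projection of a uniform disk at detector offset s:
    line integral = contrast * chord length, zero outside the disk."""
    return 2 * contrast * math.sqrt(radius**2 - s**2) if abs(s) < radius else 0.0

# Profile across a 0.5 cm radius lesion of relative contrast 1.5,
# sampled every 0.1 cm from -0.6 cm to +0.6 cm
profile = [disk_projection(0.5, 1.5, s / 10) for s in range(-6, 7)]
```

Such analytic profiles are what a projection-space observer's lesion templates would be matched against before any reconstruction is performed.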
Characteristics of the group of radiologists that benefits the most using Breast Screen Reader Assessment Strategy (BREAST)
Purpose: To determine the impact of the Breast Screen Reader Assessment Strategy (BREAST) over time in improving radiologists’ breast cancer detection performance, and to identify the group of radiologists who benefit most from using BREAST as a training tool. Materials and Methods: Thirty-six radiologists who completed three case-sets offered by BREAST were included in this study. The case-sets were arranged in each radiologist’s chronological order of completion, and five performance measures available from BREAST (sensitivity, specificity, location sensitivity, receiver operating characteristic area under the curve (ROC AUC), and jackknife alternative free-response receiver operating characteristic (JAFROC) figure of merit (FOM)) were compared between case-sets to determine the level of improvement achieved. The radiologists were then grouped by their characteristics, and the performance measures were again compared between case-sets within each group. Paired t-tests or Wilcoxon signed-rank tests, with statistical significance set at p < 0.05, were used to compare the performance measures. Results: Significant improvement was demonstrated in radiologists’ case-set performance in terms of location sensitivity and JAFROC FOM over the years, and these improvements held irrespective of radiologist characteristics. In terms of ROC AUC, significant improvement was shown for radiologists who had been reading screening mammograms for more than 7 years and spent more than 9 hours per week reading mammograms. Conclusion: Engaging with case-sets appears to enhance radiologists’ performance, suggesting the important value of initiatives such as BREAST. However, such performance enhancement was not shown for everyone, highlighting the need to tailor the BREAST platform to benefit all radiologists.
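The paired comparisons above (the same readers scored on successive case-sets) reduce to a paired t statistic on per-reader differences. A minimal sketch with hypothetical AUC scores; in practice a library routine such as scipy.stats.ttest_rel would supply the p-value:

```python
import math

def paired_t(before, after):
    """Paired t statistic for per-reader scores on two case-sets.
    Compare against the t distribution with n - 1 degrees of freedom."""
    d = [b - a for a, b in zip(before, after)]
    n = len(d)
    mean = sum(d) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in d) / (n - 1))
    return mean / (sd / math.sqrt(n))

# Hypothetical per-reader AUCs on an early and a later case-set
t = paired_t([0.70, 0.72, 0.68, 0.75], [0.74, 0.75, 0.71, 0.78])
```

The Wilcoxon signed-rank alternative mentioned in the abstract would be preferred when the differences are clearly non-normal.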
Learning the ideal observer for SKE detection tasks by use of convolutional neural networks (Cum Laude Poster Award)
It has been advocated that task-based measures of image quality (IQ) should be employed to evaluate and optimize imaging systems. Task-based measures of IQ quantify the performance of an observer on a medically relevant task. The Bayesian Ideal Observer (IO), which employs complete statistical information of the object and noise, achieves the upper limit of the performance for a binary signal classification task. However, computing the IO performance is generally analytically intractable and can be computationally burdensome when Markov-chain Monte Carlo (MCMC) techniques are employed. In this paper, supervised learning with convolutional neural networks (CNNs) is employed to approximate the IO test statistics for a signal-known-exactly and background-known-exactly (SKE/BKE) binary detection task. The receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) are compared to those produced by the analytically computed IO. The advantages of the proposed supervised learning approach for approximating the IO are demonstrated.
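Once a trained network (or the analytic IO) produces a scalar test statistic per image, the empirical AUC reported in such comparisons is the Wilcoxon-Mann-Whitney statistic: the fraction of signal-present/signal-absent pairs the statistic orders correctly. A minimal sketch (the scores below are placeholders, not results from the paper):

```python
def empirical_auc(signal_scores, noise_scores):
    """Empirical AUC: probability that a signal-present test statistic exceeds
    a signal-absent one, counting ties as 1/2 (Wilcoxon-Mann-Whitney estimate)."""
    wins = sum((s > n) + 0.5 * (s == n)
               for s in signal_scores for n in noise_scores)
    return wins / (len(signal_scores) * len(noise_scores))
```

Comparing this quantity for the CNN's outputs against the MCMC-computed IO statistic is exactly the kind of validation the abstract describes.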
Blind CT image quality assessment via deep learning strategy: initial study
Computed tomography (CT) is one of the most important medical imaging modalities. CT images can be used to assist in the detection and diagnosis of lesions and to facilitate follow-up treatment. However, CT images are vulnerable to noise, which arises from two major intrinsic sources: X-ray photon statistics and the electronic noise background. It is therefore necessary to perform image quality assessment (IQA) on CT images before diagnosis and treatment. Most existing CT IQA methods are based on human observer studies, which are impractical in the clinic because they are complex and time-consuming. In this paper, we present a blind CT image quality assessment method based on a deep learning strategy. A database of 1500 CT images was constructed, containing 300 high-quality images and 1200 corresponding noisy images; specifically, each high-quality image was used to simulate corresponding noisy images at four different dose levels. The images were then scored by experienced radiologists on a five-point scale for the following attributes: image noise, artifacts, edge and structure, overall image quality, and tumor size and boundary estimation. We trained a network to learn the non-linear mapping from CT images to subjective evaluation scores, and then used the pre-trained model to predict a score for each test image. To evaluate the network's IQA performance, two correlation coefficients were computed: the Pearson Linear Correlation Coefficient (PLCC) and the Spearman Rank-Order Correlation Coefficient (SROCC). The experimental results demonstrate that the presented deep-learning-based IQA strategy can be used for CT image quality assessment.
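The two figures of merit named above can be sketched in a few lines; PLCC measures linear agreement between predicted and subjective scores, and SROCC is simply the PLCC of their ranks. In practice scipy.stats.pearsonr and scipy.stats.spearmanr would be used (the version below ignores rank ties for brevity):

```python
import math

def plcc(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

def srocc(x, y):
    """Spearman rank-order correlation: PLCC computed on ranks (ties ignored)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return plcc(ranks(x), ranks(y))
```

A high PLCC indicates the predicted scores track the radiologists' scores linearly, while a high SROCC indicates the model at least orders images correctly by quality.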
Can a totally different approach to soft tissue computer aided detection (CADe) result in affecting radiologists' decisions?
We tested whether a case-based CADe scheme, developed only on negatively interpreted screening mammograms, has predictive value for cancer detection during subsequent screening, and how this approach may affect radiologists’ performance when alerting them to a small subset (~15%) of exams on which radiologists tend to miss cancers. A series of six-parameter case-based CADe schemes was optimized using 200 negative mammograms (800 images; 100 women with breast cancer at subsequent screening and 100 women who remained negative), carefully matched by age and breast density. The CADe-alone schemes performed at AUC = 0.68 (+/- 0.01). Five radiologists and four residents interpreted the same cases, performing at AUC = 0.71 (experienced radiologists) and AUC = 0.61 (residents). With “CADe warnings” shown to the interpreters only if they did not recall one of the 24 highest-scoring CADe cases, the assisted performance of radiologists and residents was 0.71 and 0.63, respectively (p > 0.05). However, when the CADe-alone performance was raised to an AUC of 0.78 by artificially increasing the number of possible warnings from 16 to 24, radiologists’ performance improved significantly from an AUC of 0.68 to 0.72 (p < 0.05). In conclusion, the use of case-based information other than breast density could highlight a small fraction of women whose cancers are more likely to be missed by radiologists and detected only at subsequent screening, leading to an assisted approach that improves radiologists’ performance. To be effective, however, the performance of the CADe alone should be substantially higher (e.g. ΔAUC ≥ 0.07) than that of the unassisted radiologist.
Feasibility study of deep convolutional generative adversarial networks to generate mammography images
We conducted a feasibility study to generate mammography images using a deep convolutional generative adversarial network (DCGAN), which directly produces realistic images without a 3-D model or any complex rendering algorithm such as ray tracing. We trained the DCGAN on 2D breast mammography images that were generated from anatomical noise. The generated X-ray mammography images were successful in that they preserve reasonable quality and retain visual patterns similar to the training images; in particular, generated images share the distinctive structure of the training images. For quantitative evaluation, we compared the mean and variance of the beta values of generated images and observed that they are very similar to those of the training images. Although the general distribution of generated images matches that of the training images well, the DCGAN has several limitations. First, checkerboard-like artifacts are found in the generated images, a well-known issue of the deconvolution (transposed convolution) operation. Moreover, GAN training is often unstable and requires manual fine-tuning. To overcome these limitations, we plan to extend our idea to a conditional GAN approach to improve training stability, and to employ an auto-encoder to handle the artifacts. To validate the idea on real data, we will train the network on clinical images. We believe that our framework can easily be extended to generate other medical images.
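The checkerboard artifact mentioned above has a simple mechanical explanation: when a transposed convolution's kernel size is not a multiple of its stride, output pixels receive unequal numbers of kernel contributions. Counting those contributions for a 1-D case makes the alternating pattern visible (illustrative sketch, not the paper's network):

```python
def overlap_counts(n_in, kernel=3, stride=2):
    """Count how many kernel taps contribute to each output pixel of a 1D
    transposed convolution. Uneven counts (here kernel % stride != 0)
    produce the periodic intensity modulation seen as checkerboard artifacts."""
    n_out = (n_in - 1) * stride + kernel
    counts = [0] * n_out
    for i in range(n_in):          # each input pixel stamps the kernel...
        for k in range(kernel):    # ...onto `kernel` consecutive outputs
            counts[i * stride + k] += 1
    return counts

# kernel=3, stride=2: interior outputs alternate between 1 and 2 contributions
counts = overlap_counts(4)
```

Common remedies are choosing kernel sizes divisible by the stride, or replacing the transposed convolution with resize-then-convolve upsampling.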
Simulating magnetic resonance images based on a model of tumor growth incorporating microenvironment
Pamela R. Jackson, Andrea Hawkins-Daarud, Savannah C. Partridge, et al.
Glioblastoma (GBM), the most aggressive primary brain tumor, is primarily diagnosed and monitored using gadolinium-enhanced T1-weighted and T2-weighted (T2W) magnetic resonance imaging (MRI). Hyperintensity on T2W images is understood to correspond with vasogenic edema and infiltrating tumor cells. GBM’s inherent heterogeneity and the resulting non-specific MRI image features complicate assessing treatment response. To better understand treatment response, we propose creating a patient-specific untreated virtual imaging control (UVIC), which represents an individual tumor’s growth had it not been treated, for comparison with actual post-treatment images. We generated a T2W MRI UVIC by combining a patient-specific mathematical model of tumor growth with a multi-compartmental MRI signal equation. GBM growth was mathematically modeled using the previously developed Proliferation-Invasion-Hypoxia-Necrosis-Angiogenesis-Edema (PIHNA-E) model, which simulated tumor as composed of three cellular phenotypes (normoxic, hypoxic, and necrotic cells) interacting with a vasculature species, angiogenic factors, and extracellular fluid. Within the PIHNA-E model, both hypoxic and normoxic cells emitted angiogenic factors, which recruited additional vessels and caused the vessels to leak, allowing fluid, or edema, to escape into the extracellular space. The model’s outputs were spatial volume-fraction maps for each glioma cell type and for edema/extracellular space. The volume-fraction maps and corresponding T2 values were then incorporated into a multi-compartmental Bloch signal equation to create simulated T2W images. T2 values for individual compartments were estimated from the literature and a normal volunteer. T2 maps calculated from the simulated images had normal white matter, normal gray matter, and tumor tissue T2 values within the range of literature values.
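A simplified version of the multi-compartmental signal combination described above is a volume-fraction-weighted sum of mono-exponential T2 decays, S(TE) = Σᵢ vᵢ·exp(−TE/T2ᵢ). The sketch below uses that simplified form with hypothetical T2 values; the paper's full Bloch-equation treatment includes terms (proton density, T1, sequence parameters) omitted here:

```python
import math

def t2w_signal(volume_fractions, t2_values, te):
    """Simplified multi-compartment T2-weighted signal at echo time `te` (ms):
    S(TE) = sum_i v_i * exp(-TE / T2_i). Proton-density and T1 weighting
    are omitted for clarity."""
    return sum(v * math.exp(-te / t2)
               for v, t2 in zip(volume_fractions, t2_values))

# Hypothetical voxel: normoxic / hypoxic / necrotic cells and edema fluid,
# with made-up volume fractions and T2 values (ms)
signal = t2w_signal([0.3, 0.2, 0.1, 0.4], [80.0, 90.0, 120.0, 300.0], te=100.0)
```

Evaluating this per voxel over the model's volume-fraction maps yields a simulated T2W image whose contrast follows the edema and tumor compartments.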
Breast elastography: Identification of benign and malignant cancer based on absolute elastic modulus measurement using vibro-elastography
Junior Arroyo, Ana Cecilia Saavedra, Jorge Guerrero, et al.
Breast cancer is a public health problem with ~1.7 million new cases per year worldwide, and state-of-the-art screening techniques have several limitations. Ultrasound elastography comprises a set of techniques intended to facilitate the noninvasive diagnosis of cancer. Among these, Vibro-elastography is an ultrasound-based technique that employs external mechanical excitation to infer the elastic properties of soft tissue. In this paper, we evaluate the performance of Vibro-elastography in differentiating benign from malignant breast lesions. For this study, a group of 18 women with clinically confirmed tumors or suspected malignant breast lesions were invited to participate. For each volunteer, an elastogram was obtained, and the mean elasticity of the lesion and of the adjacent healthy tissue was calculated. After the acquisition, the volunteers underwent core-needle biopsy. The histopathological results, which ranged from benign to malignant lesions, were used to validate the Vibro-elastography assessment. Results indicate that the mean elasticity values of benign lesions, malignant lesions, and healthy breast tissue were 39.4 ± 12 kPa, 55.4 ± 7.02 kPa, and 23.91 ± 4.57 kPa, respectively. Classification between benign and malignant lesions was performed using a Support Vector Machine based on the measured lesion stiffness. An ROC curve was used to quantify the accuracy of the differentiation and to define a suitable stiffness cutoff, yielding an AUC of 0.90 and a cutoff value of 44.75 kPa. The results suggest that Vibro-elastography allows differentiation between benign and malignant lesions. Furthermore, the elasticity values obtained for benign, malignant, and healthy tissue are consistent with previous reports.
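A stiffness cutoff like the 44.75 kPa reported above is typically read off an ROC sweep; one standard rule is to maximize the Youden index (sensitivity + specificity − 1). A sketch with hypothetical elasticity values (the paper used an SVM, not this threshold rule):

```python
def youden_cutoff(benign, malignant):
    """Pick the stiffness threshold maximizing sensitivity + specificity - 1
    (Youden index), classifying values >= threshold as malignant."""
    best_t, best_j = None, -1.0
    for t in sorted(benign + malignant):       # candidate thresholds
        sens = sum(m >= t for m in malignant) / len(malignant)
        spec = sum(b < t for b in benign) / len(benign)
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Hypothetical lesion stiffness measurements (kPa)
cutoff, j = youden_cutoff([30, 35, 40], [50, 55, 60])
```

With overlapping stiffness distributions, as in real data, the maximal Youden index falls below 1 and the chosen cutoff trades sensitivity against specificity.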
Local-search based prediction of medical image registration error
Görkem Saygili
Medical image registration is a crucial task in many medical imaging applications. Hence, a considerable amount of work has been published recently that aims to predict the error of a registration without any human effort. If available, these error predictions can be fed back to the registration algorithm to further improve its performance. Recent methods generally start by extracting image-based and deformation-based features, then apply feature pooling, and finally train a Random Forest (RF) regressor to predict the real registration error. Image-based features can be calculated after a single registration but provide limited accuracy, whereas deformation-based features, such as the variation of the deformation vector field, may require up to 20 registrations, which is considerably time-consuming. This paper proposes using features extracted from a local search algorithm as image-based features to estimate the error of a registration. The proposed method uses a local search algorithm to find corresponding voxels between registered image pairs and, based on the amount of shift and on stereo confidence measures, densely predicts the registration error in millimetres using an RF regressor. Compared to other algorithms in the literature, the proposed algorithm does not require multiple registrations, can be efficiently implemented on a Graphics Processing Unit (GPU), and still provides highly accurate error predictions even in the presence of large registration errors. Experimental results with real registrations on a public dataset indicate that substantially high accuracy is achieved by using features from the local search algorithm.
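The core of the local-search feature extraction described above is finding, for each location, the residual shift that best matches the two registered images; a large best-matching shift suggests a large local registration error. A 1-D, sum-of-squared-differences sketch (the paper operates on 3-D volumes with confidence measures; this is only an illustration):

```python
def best_shift(fixed, moving, center, radius, max_shift):
    """Find the residual shift (in samples) minimizing SSD between a patch of
    `fixed` around `center` and shifted patches of `moving` (1D illustration).
    Assumes all accessed indices stay within the lists' bounds."""
    def ssd(shift):
        return sum((fixed[center + o] - moving[center + o + shift]) ** 2
                   for o in range(-radius, radius + 1))
    return min(range(-max_shift, max_shift + 1), key=ssd)

# Hypothetical profiles: `moving` is `fixed` displaced by 2 samples,
# so the residual registration error at the peak is 2 samples
fixed = [0, 0, 0, 1, 2, 3, 2, 1, 0, 0, 0, 0]
moving = [0, 0, 0, 0, 0, 1, 2, 3, 2, 1, 0, 0]
shift = best_shift(fixed, moving, center=5, radius=2, max_shift=3)
```

Per-voxel shift magnitudes like this, together with matching-confidence measures, would then serve as input features to the RF regressor.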