Proceedings Volume 10952

Medical Imaging 2019: Image Perception, Observer Performance, and Technology Assessment

Purchase the printed version of this volume at proceedings.com or access the digital version at SPIE Digital Library.

Volume Details

Date Published: 17 June 2019
Contents: 9 Sessions, 39 Papers, 27 Presentations
Conference: SPIE Medical Imaging 2019
Volume Number: 10952

Table of Contents

  • Front Matter: Volume 10952
  • Image Perception
  • Model Observers I
  • Model Observers II
  • Technology Impact and Assessment
  • Deep Learning Applications
  • Observer Performance
  • Observer Performance in Breast Imaging
  • Poster Session
Front Matter: Volume 10952
Front Matter: Volume 10952
This PDF file contains the front matter associated with SPIE Proceedings Volume 10952, including the Title Page, Copyright information, Table of Contents, Author and Conference Committee lists.
Image Perception
Visual adaptation and the perception of radiological images (Conference Presentation)
The interpretation of medical images relies heavily on visual inspection by human observers. Many studies have explored how sensory and cognitive factors in visual processing influence how medical images are perceived and evaluated. But how do these images influence visual processing itself? The visual system is highly adaptable and constantly adjusting to changes in the visual environment. These adjustments recalibrate and optimize visual coding not only for simple properties of the world like the average light level, but also for complex features like the average blur or texture in a scene. Adaptation thus affects everything we see. The unique visual characteristics of radiological images suggest that they may hold the radiologist in unique states of adaptation. I will illustrate how this adaptation influences contrast sensitivity and the appearance of medical images. One proposed function of adaptation is to highlight novel information by “filtering out” the expected characteristics of scenes, and I will illustrate the implications of this by considering how adaptation may affect visual search for novel or suspicious features in medical images.
Does the strength of the gist signal predict the difficulty of breast cancer detection in usual presentation and reporting mechanisms?
Ziba Gandomkar, Ernest U. Ekpo, Sarah J. Lewis, et al.
This study measured the correlation between the magnitude of the presence of the abnormality gist and case difficulty based on standard presentation and reporting mechanisms for 80 cases. Half of the cases contained biopsy-proven cancer while the remainder were normal and confirmed to be cancer-free for at least two years of follow-up. In the gist experiment, seventeen breast radiologists and physicians gave an abnormality score on a scale from 0 (confident normal) to 100 (confident abnormal) to unilateral CC mammograms following a very brief, 500 millisecond presentation of the image. Independently, each mammogram was assessed by a separate sample of at least 40 radiologists using standard presentation and reporting mechanisms, with these readers asked to locate any cancers present. All readers reported at least 1000 cases annually. For each case and each category, the percentage of correct reports served as an objective measure of case difficulty (lower rate of correct report shows a more difficult case). For each of the 17 readers, the association between the abnormality scores from the gist study and detection rates from the earlier reports was examined using Spearman correlation. None of the coefficients were significantly different from zero (p<0.05). For the normal cases, the correlation coefficient between abnormality scores and detection rates for the 17 readers ranged from -0.262 to 0.258, and for cancer -0.180 to 0.309. The results suggest that the gist signal may indicate the presence of cancer, using mechanisms other than those employed in usual reporting, and might be exploited to improve breast cancer detection.
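The per-reader analysis described above rests on the Spearman rank correlation. The following is a minimal, standard-library sketch of that statistic, not the authors' code: the function names and the toy numbers in the usage note are hypothetical, and significance testing is left out.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

For one reader, `x` would hold that reader's abnormality scores (0 to 100) and `y` the per-case detection rates, for example `spearman_rho([12, 80, 45, 60], [0.9, 0.4, 0.7, 0.5])` with made-up values.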
Oculomotor behavior of radiologists reading digital breast tomosynthesis (DBT)
Nicholas M. D'Ardenne, Robert M. Nishikawa, Margarita L. Zuley, et al.
Digital breast tomosynthesis (DBT) is beginning to be used more frequently alongside full-field digital mammography (DM) in routine breast cancer screening. However, little is known about radiologists’ search strategies when reading DBT. This study aims to measure radiologists’ eye movements prior to testing search strategies intended to make DBT faster and/or more accurate. Twelve observers (board-certified breast radiologists or current women’s imaging fellows) were instructed to search for and report lesions as they would under normal clinical conditions. Observers were shown a single view of a single breast for each case and informed that this was an enriched study (10 positive cases out of 20). Eye tracking used an SMI RED250mobile Eye Tracker sampling at 250 Hz. Tracking error was below 0.5 deg. There was an increase in the detection rate/accuracy and a decrease in false positives with DBT compared to DM. Search time was longer in DBT than in DM, with a shorter saccade length shown in DM. Different observers were shown to have differing search patterns; however, it was not possible to show that any observer used a true driller or scanner technique. Results generally replicated previous studies; however, more work would be needed to draw conclusions as to optimal search strategies.
Model Observers I
Automatic strategy for CHO channel reduction in x-ray angiography systems
Daniel Gomez-Cardona, Shuai Leng, Christopher P. Favazza, et al.
Multiple efforts have been made in x-ray angiography to transition from traditional image quality metrics to mathematical observer models. Recent works have successfully implemented the channelized Hotelling observer (CHO) model for x-ray angiography systems. However, in these works the channel selection process is ambiguous and is limited to identifying a range of frequencies and other channel parameters that are believed to represent the most relevant features of the imaging tasks. This channel selection rationale can be sufficient for certain simple scenarios but might not be enough for more complex ones. In addition, it has been shown that, besides the well-known bias caused by a finite number of samples, there is another source of bias in the estimation of the detectability index in x-ray angiography. This source of bias has been attributed to nonrandom differences in noise between images acquired at different time points, also referred to as temporally variable nonstationary noise. This work proposes a task-specific automated method for optimal channel selection that also corrects for the influence of bias due to temporally variable nonstationary noise, particularly in x-ray angiography systems. The proposed method is computationally inexpensive, provides time-efficient selection of optimal channels, and helps minimize bias, all without significantly compromising the accuracy of the detectability index estimation. This method for channel optimization can be readily adapted to other imaging modalities.
Template models for forced-localization tasks
We have previously presented the results of a psychophysical study of forced localization performance in tasks that include ramp-spectrum noise as a model of CT acquisition noise. The study used efficiency analysis and the classification image technique to better understand the mechanisms of performance. In this work, we begin the process of developing models of observer performance for these tasks. Since the image statistics for these tasks are well defined, we can use the ensemble image mean and covariance to compute templates for several standard model observers (PWMF, NPW, NPWE, and four implementations of the CHO). The goal of this work is to compare these with the human observer classification images to see if they are using spatial weighting of images in similar ways to perform the tasks. The most direct comparison would be of a model-observer template to the average human-observer classification image. However it may be the case that the classification image estimation procedure introduces bias into the classification image. Additionally, classification images may have effects from internal noise. Thus we also explore a comparison of humans and models in which the model is elaborated with internal noise and used as a scanning linear filter to generate localization responses. A classification image is then estimated for the model and compared with human-observer classification images. We compare both performance and the classification-image profile to the human-subject data. We find very limited ability to fit both endpoints.
Autoencoder embedding of task-specific information
Task-based measures of image quality (IQ) quantify the ability of an observer to perform a specific task. Such measures are employed for assessing and optimizing medical imaging systems. Although the Bayesian ideal observer is optimal by definition, it is frequently both non-linear and intractable. In such cases, linear observers are commonly employed. However, the optimal linear observer, the Hotelling observer (HO), becomes intractable when considering large images. Channelized methods have become popular for reducing the dimensionality of image data. In this work, we propose a novel method for determining efficient channels by learning them with autoencoders (AEs). Autoencoders are neural networks that can be employed to learn concise representations of data, frequently for the purposes of reducing dimensionality. We trained several AEs to encode task-specific information by modifying the standard loss function and examined the effect of hidden layer size and the use of tied/untied weights on the resulting representation accuracy. Subsequently, HOs were applied to both the original images and the dimensionality-reduced versions of them produced by the AEs. It was demonstrated that, for a suitable specification of the AE, the performance of the HO was relatively unaffected by the encoding of the image. However, the computational cost of inverting the covariance matrix was greatly reduced when the HO was applied with the encoded data due to its reduced dimensionality. Our findings suggest that AEs may represent an attractive alternative to the use of heuristic channels for reducing the dimensionality of image data when seeking to accurately approximate the performance of the HO on signal detection tasks.
Learning the Hotelling observer for SKE detection tasks by use of supervised learning methods
Task-based measures of image quality (IQ) quantify the ability of an observer to perform a specific task. Such measures are commonly employed for assessing and optimizing medical imaging systems. In binary signal detection tasks, the Bayesian ideal observer (IO) sets an upper performance limit. However, the IO test statistic is generally intractable to compute when the log-likelihood ratio depends non-linearly on the measurement data. In such cases, the Hotelling observer (HO), which is the optimal linear observer, can be employed. However, traditional implementations of the HO require estimation and inversion of covariance matrices; for large images this can be computationally burdensome or even intractable. In this work, we describe a novel supervised learning-based method that employs artificial neural networks (ANNs) for estimating the HO test statistic and does not require estimation or inversion of covariance matrices. A signal-known-exactly and background-known-exactly (SKE/BKE) signal detection task is considered. The receiver operating characteristic (ROC) curve and Hotelling template corresponding to the proposed method are compared to the corresponding analytical solutions.
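For orientation, the analytical Hotelling observer that serves as the reference above can be written in its standard textbook form. The notation is assumed here and not taken from the abstract: $\mathbf{g}$ is the image vector, $\bar{\mathbf{g}}_0$ and $\bar{\mathbf{g}}_1$ are the signal-absent and signal-present class means, and $\mathbf{K}$ is the average of the two class covariance matrices.

```latex
\begin{align}
  \Delta\bar{\mathbf{g}} &= \bar{\mathbf{g}}_1 - \bar{\mathbf{g}}_0, \\
  \mathbf{w}_{\mathrm{HO}} &= \mathbf{K}^{-1}\,\Delta\bar{\mathbf{g}}, \\
  t(\mathbf{g}) &= \mathbf{w}_{\mathrm{HO}}^{\mathsf{T}}\,\mathbf{g}, \\
  \mathrm{SNR}_{\mathrm{HO}}^{2} &= \Delta\bar{\mathbf{g}}^{\mathsf{T}}\,\mathbf{K}^{-1}\,\Delta\bar{\mathbf{g}}.
\end{align}
```

The inverse $\mathbf{K}^{-1}$ is the step that becomes burdensome for large images, since $\mathbf{K}$ has one row and one column per pixel.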
Learning the ideal observer for joint detection and localization tasks by use of convolutional neural networks
In medical imaging, task-based measures of image quality (IQ) have been commonly employed to assess and optimize imaging systems. To evaluate task-based measures of IQ, the performance of an observer on a relevant task is quantified. For a binary signal detection task, the Bayesian Ideal Observer sets an upper performance limit in a sense that it maximizes the area under the receiver operating characteristic (ROC) curve (AUC). When a joint signal detection and localization (detection-localization) task is considered, the modified generalized likelihood ratio test (MGLRT) has been advocated as an optimal decision strategy to maximize the area under the localization ROC (LROC) curve (ALROC). However, analytical computation of likelihood ratios employed in the MGLRT is generally intractable. In this work, a supervised learning-based method that employs convolutional neural networks (CNNs) is developed and implemented for approximating the Ideal Observer that maximizes the area under the LROC curve for signal detection-localization tasks. A background-known-exactly (BKE) case was considered. The resulting LROC curve and ALROC value are compared to those produced by an analytical calculation.
Model Observers II
Laguerre-Gauss and sparse difference-of-Gaussians observer models for signal detection using constrained reconstruction in magnetic resonance imaging
Magnetic resonance imaging (MRI) data acquisition is sometimes accelerated by pseudo-random under-sampling of the frequency domain which is followed by constrained reconstruction. This approach to acceleration assumes a certain level of sparsity of the object being imaged. The sparsity is typically considered for the background anatomy but not explored in terms of a signal detection task. In this study we implement a 2.56x one dimensional acceleration in the acquisition using fully sampled low frequencies and randomly sampled high frequencies with a total variation reconstruction. A small and a large lesion were synthetically placed in a 3D MRI volume in non-overlapping regions. From 40 slices of this volume and 16 regions per slice, 640 sub-images with and without signals were generated to estimate the detection performance of lesions with anatomical variation. We compared the effect of this approach on signal detection using a channelized Hotelling observer approximating the ideal linear observer (with 10 Laguerre-Gauss channels) and one approximating a human observer (with sparse difference-of-Gaussians channels). The area under the receiver operating characteristic curve (AUC) was estimated using the Mann-Whitney statistic and the uncertainty of the estimate was assessed using a bootstrap distribution with 10,000 samples. We found that for these two tasks and model observers, total variation did not lead to a statistically significant improvement in detection performance and that the effect of regularization was larger for the Laguerre-Gauss model than for the sparse difference-of-Gaussians model.
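The AUC estimate and its bootstrap uncertainty mentioned above can be sketched as follows. This is an illustrative standard-library version, not the authors' implementation: the function names are hypothetical, and resampling the two classes independently with a percentile interval is an assumption about how such a bootstrap might be set up.

```python
import random


def mann_whitney_auc(signal_scores, noise_scores):
    """AUC as the fraction of (signal, noise) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for s in signal_scores:
        for n in noise_scores:
            if s > n:
                wins += 1.0
            elif s == n:
                wins += 0.5
    return wins / (len(signal_scores) * len(noise_scores))


def bootstrap_auc_interval(signal_scores, noise_scores,
                           n_boot=10000, alpha=0.05, seed=0):
    """Percentile interval for the AUC (assumed scheme: resample each class
    independently, with replacement, and recompute the AUC each time)."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_boot):
        s = rng.choices(signal_scores, k=len(signal_scores))
        n = rng.choices(noise_scores, k=len(noise_scores))
        aucs.append(mann_whitney_auc(s, n))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Here `signal_scores` and `noise_scores` stand for the model observer's test statistics on signal-present and signal-absent sub-images. The pairwise loop is quadratic in the number of images, which is acceptable at the scale of a few hundred sub-images per class but would call for a rank-based formula on much larger sets.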
Tests of projection and reconstruction domain equivalence for a feature-driven model observer
Nonlinear visual-search (VS) observers have shown an ability to model humans for realistic detection, localization, and classification tasks with tomographic reconstructions. As reconstruction studies test the joint effects of data acquisition and postprocessing, applying observers for similar tasks in the projection domain is also of interest. To help investigate what information a non-ideal model can provide in this role, we have developed an analytical method for assessing acquisition quality in the reconstruction domain. This approach is most useful for assessing acquisition quality based on lesion-localization tasks. Observer studies with simulated Ga-67 SPECT data were conducted to test our method. The data were acquired for different numbers of projection angles. The results illustrate how image reconstruction processes can improve on acquisition quality as measured for an anthropomorphic model observer.
New difference of Gaussian channel-sets for the channelized Hotelling observer?
Christiana Balta, Ioannis Sechopoulos, Ramona W. Bouwman, et al.
The channelized Hotelling observer (CHO) was investigated for its ability to predict human detection performance in order to assess clinical image quality objectively. The CHO applied three user-selectable difference-of-Gaussian (DoG) channels to the images. The choice of the parameter values that define the DoG channel-sets of the CHO was investigated. In order to select the optimal channels, the CHO performance was compared to that of humans who scored digital mammography (DM) images in 2-alternative forced choice experiments. Square regions of interest (ROIs) were extracted from DM images of an anthropomorphic breast phantom with and without calcification-like signals. Images at four dose levels were acquired and the resulting signal detectability was assessed using the CHO with various DoG channel parameters. It was found that varying these parameter values affects the correlation (r²) of the CHO with human observers for the detection task investigated. It appears that the DoG channel-sets need to be adapted to the frequency content of the signals and backgrounds present in the DM images.
A foveated channelized Hotelling search model predicts dissociations in human performance in 2D and 3D images
We developed a foveated search model based on the channelized Hotelling observer (FCHO) that processes each image in parallel with Gabor channels whose center frequencies decrease with retinal eccentricity (distance away from the fovea). The FCHO model is based on how radiologists read 3D volumes by scrolling through 2D slices. The model includes a search component that explores the image by making eye movements guided by peripheral processing and a slice scroll component that changes the current slice for the 3D images. We designed an experiment consisting of a free search in 2D and 3D noise filtered with power-law statistics matching mammograms. Two different targets were embedded in this background: one resembling a microcalcification and another one resembling a mass. Unlike traditional model observers (channelized Hotelling and non-prewhitening matched filter with an eye filter), the FCHO search model showed dissociations similar to those of human observers in both 2D and 3D search tasks. Detectability of a microcalcification-like target was significantly higher than for a mass-like target in 2D search but comparable or lower in 3D search. Together, our results suggest that 3D search tasks will require more computationally complex 3D search models that take into account the foveated properties of the visual system and eye movements to predict human observer performance in clinically relevant tasks.
Using transfer learning for a deep learning model observer
Recent developments in technology assessment and optimization methodology have seen an expansion in the use of Virtual Clinical Trials (VCT) as an alternative to conventional clinical trials. However, the ultimate value gained from VCTs relies on the speed and quality of results generated from the VCT pipeline. In many cases the end-point human observer represents a bottle-neck due to resource and time limitations. This motivates the development of a machine-based observer for key task-based assessment studies. Previous work using Deep Learning for detection and observer studies has shown significant promise, but requires large amounts of data for training. We therefore have built a model observer based on the VGG19 neural network architecture combined with transfer learning to successfully train a TLMO (Transfer Learning Model Observer) that can detect both screen-detected malignancies and simulated lesions in images of 303 x 303 pixels. Our results demonstrate a strong response for the detection of simulated lesions, 4mm in diameter, using the OPTIMAM VCT Toolbox, achieving a sensitivity of 0.78 and a specificity of 0.92. The model has also been tested using well-defined and ill-defined screen-detected masses where it achieved a sensitivity of 0.85 and a specificity of 0.83.
Technology Impact and Assessment
Estimating latent reader-performance variability using the Obuchowski-Rockette method
We describe how the Obuchowski-Rockette (OR) method of analysis for multi-reader diagnostic studies can be used to estimate the variability of latent reader-performance outcomes, such as the area under the ROC curve (AUC). For a specific reader the latent or true reader performance outcome can conceptually be thought of as the estimate that would result if the reader were to read a very large number of cases. We note that for the sample sizes used in typical diagnostic studies, the latent reader-performance outcome is equal to the observed outcome minus measurement error. An often-cited study that assesses the variability of various reader-performance outcomes, including the AUC, is the study by Craig Beam et al., “Variability in the Interpretation of Screening Mammograms by US Radiologists,” published in 1996. However, a problem with this type of study is that the variability estimates include measurement error. Thus this approach overestimates latent reader variability and gives variability estimates that are dependent on case sample size. The proposed method overcomes these problems. We illustrate the proposed method for 29 radiologists in Jordan, with each reading 60 chest computed tomography (CT) scans. Using the OR method we were able to estimate the middle 95% range for latent AUC values to be 0.07; i.e., we estimate that 95% of radiologists differ by less than 0.07 in their ability to successfully discriminate between a pair of diseased and non-diseased cases. In contrast, the estimate for the 95% range for the observed AUCs was 0.18. Thus we see how conventional methods of describing reader variability can greatly overstate the variability of the true abilities of the readers.
Adaptive sample size re-estimation in MRMC studies
Multi-reader multi-case (MRMC) studies are often used for the evaluation of medical imaging devices. Due to limited prior information, the sizing of such studies (i.e., sizing both readers and cases) is often inaccurate. It is therefore desirable to adaptively resize the study towards a target power after an interim analysis of the study data. The major statistical concern for sample size re-estimation based on the interim analysis is the inflation of type I error rate. We developed methods that, based upon the observed data at the interim analysis, simultaneously resize the study towards a target power and adaptively adjust the critical value for the final hypothesis testing to control the type I error rate. Our methodologies apply to commonly used study endpoints including the area under the ROC curve (AUC), sensitivity, and specificity. Simulation studies show our methods can boost the statistical power to a target value by resizing the study after an interim analysis while controlling the type I error rate at the nominal level. We have developed a freely available R software package for the design and analysis of adaptive MRMC studies.
Radiation-therapy-induced erythema: comparison of spectroscopic diffuse reflectance measurements and visual assessment
Ramy Abdlaty, Lilian Doerwald, Joseph Hayward, et al.
Surveillance and assessment of radiation-induced erythema is an important aspect of managing skin toxicity in patients treated with radiation therapy. After the early fractions of radiation are delivered, an inflammatory response and vascular dilation take place due to damage to basal cells in the skin’s epidermal layer. This process of skin reddening is known as erythema. The gold standard for assessing and grading erythema is visual assessment (VA) by an experienced clinician/radiotherapist using toxicity scoring tools. This method is limited by the assessor’s experience, visual acuity, and the subjectivity of qualitative scores. An alternative optical technique to VA is diffuse reflectance spectroscopy (DRS). This pilot study compares the performance of the two techniques in detecting radiation-therapy-induced erythema. The results indicate that DRS is capable of detecting skin erythema before an expert eye can do so.
Impact of patient photos on detection accuracy, decision confidence, and eye-tracking parameters in chest and abdomen images with tubes and lines
To minimize errors in imaging studies, a camera system was developed that acquires images of patients simultaneously with radiographic images. 37 chest/abdomen portable radiographs showing central lines, orogastric/nasogastric/endotracheal tubes with patient photographs were viewed by 6 radiologists while eye-position was recorded. They indicated whether each line/tube was present/absent and rated confidence. Images were shown in 3 conditions: radiograph only, small or large photograph with radiograph. There was greater accuracy in detecting nasogastric and orogastric tubes with photographs present. Central lines had the most false positives but showed reduction with photographs. Decision confidence for central lines did not differ by image format, but for all tubes confidence without a photograph was significantly lower than with. For total viewing time there was a significant difference with no photograph format having the lowest viewing time followed by radiograph + large and radiograph + small photograph. For total number of fixations there was a significant difference with no photograph having the lowest number followed by radiograph + large and radiograph + small photograph. There was a significant difference in number of times observers transferred viewing from radiograph to photograph, with large photographs having fewer cross-overs than small. Adding patient photographs to radiographic interpretation of chest and abdomen films can aid in the detection of tubes/lines. If photograph size is large enough, it takes an average of only 3 extra seconds to view compared to the radiograph alone and adds significant confidence to decisions.
Is there a safety-net effect with computer-aided detection (CAD)?
Ethan Du-Crow, Lucy Warren, Susan M. Astley, et al.
Computer-Aided Detection (CAD) systems are used to aid readers interpreting screening mammograms. An expert reader searches the image initially unaided, and then once again with the aid of CAD which prompts automatically detected suspicious regions. This could lead to a ‘safety-net’ effect, where the initial unaided search of the image is adversely affected by the fact that it is preliminary to an additional search with CAD, and may, therefore, be less thorough. To investigate the existence of such an effect, we created a visual search experiment for non-expert observers mirroring breast screening with CAD. Each observer searched 100 images for microcalcification clusters within synthetic images in both prompted and unprompted (no-CAD) conditions. Fifty-two participants were recruited for the study, 48 of whom had their eye movements tracked in real-time; four participants could not be accurately calibrated so only behavioural data was collected. In the CAD condition, before prompts were displayed, image coverage was significantly lower than coverage in the no-CAD condition (t(47)=5.48, p<0.001). Observer sensitivity was significantly greater for targets marked by CAD than the same targets in the no-CAD condition (t(51)=11.67, p<0.001). For targets not marked by CAD, there was no significant difference in observer sensitivity in the CAD condition compared to the same targets in the no-CAD condition (t(51)=0.88, p=0.382). These results suggest that the initial search may be influenced by the subsequent availability of CAD; if so, CAD efficacy studies should account for the effect when estimating benefit.
Deep Learning Applications
Correlation between a deep-learning-based model observer and human observer for a realistic lung nodule localization task in chest CT
Mathematical model observers (MOs) have become popular in task-based CT image quality assessment, since, once proven to be correlated with human observers (HOs), these MOs can be used to estimate HO performance. However, typical MO studies are limited to phantom data which only involve uniform background. In practice, anatomical background variability and tissue non-uniformity affect HO lesion detection performance. Recently, we have proposed a deep-learning-based MO (DL-MO). In this study, we aim to investigate the correlation between this DL-MO and HOs for a lung-nodule localization task in chest CT. Using a patient database that contains 50 lung cancer screening CT patient cases, 12 different experimental conditions were generated, including 4 radiation dose levels, 3 nodule sizes, 2 nodule types and 3 reconstruction types. These conditions were created by using a validated noise and lesion insertion tool. Four subspecialized radiologists performed the HO study for all 12 conditions individually in a randomized fashion. The DL-MO was trained and tested for the same dataset. The performance of DL-MO and HO was compared across all the experimental conditions. DL-MO performance was strongly correlated with HO performance (Pearson’s correlation coefficient: 0.988 with a 95% confidence interval of [0.894, 0.999]). These results demonstrate the potential to use the proposed DL-MO to predict HO performance for the task of lung nodule localization in chest CT.
Implementation of an ideal observer model using convolutional neural network for breast CT images
In this work, we propose a non-linear observer model based on a convolutional neural network and compare its performance with LG-CHO for a four-alternative forced-choice detection task using simulated breast CT images. In our network, each convolutional layer contained 3×3 filters and a leaky-ReLU as an activation function; unlike typical convolutional neural networks, neither a pooling layer nor zero padding was applied to the output of each convolutional layer. Network training was conducted using the ADAM optimizer with two design parameters (i.e., network depth and width). The optimal values of the design parameters were found by brute-force search, which spanned up to 30 for depth and 128 for the number of channels. To generate training and validation datasets, we generated anatomical noise images using a power law spectrum of breast anatomy. A 50% volume glandular fraction was assumed, and a 1 mm diameter signal was used for the detection task. The generated images were reconstructed using filtered back-projection with a fan beam CT geometry, and ramp and Hanning filters were used as apodization filters to generate different noise structures. To train our network, 125,000 signal-present images and 375,000 signal-absent images were reconstructed for each apodization filter. To measure detectability, we used percent correct with 4,000 images, generated independently from the training and validation datasets. Our results show that the proposed network composed of 30 layers and 64 channels provides higher detectability than LG-CHO. We believe that the improved detectability is achieved by the presence of the non-linear module (i.e., leaky-ReLU) in the network.
Learning stochastic object model from noisy imaging measurements using AmbientGANs
Weimin Zhou, Sayantan Bhadra, Frank Brooks, et al.
The objective optimization of image-derived statistics, including the test statistic of an observer for specific decision tasks, requires a characterization of all sources of variability in the measured data. To accomplish this, it is necessary to establish a stochastic object model (SOM) that describes the variability within a group of objects to-be imaged. In order for the SOM to be realistic, it is desirable to establish it by use of experimental image data, as opposed to establishing it in a non-data-driven manner. Deep learning methods that employ generative adversarial networks (GANs) hold promise for learning SOMs that can generate images that match distributions of training image data. However, because experimental data recorded by an imaging system represent noisy and indirect measurements of the object, conventional GANs cannot be directly employed for this task. Recently, an augmented GAN architecture named AmbientGAN was proposed that can characterize a distribution of images from noisy and indirect measurements of them and knowledge of the measurement operator. In this work, for the first time, we investigate AmbientGANs for establishing SOMs by use of noisy imaging measurements. A canonical tomographic imaging system that is described by a two-dimensional Radon transform model is investigated. The AmbientGAN is evaluated by performing binary signal detection tasks that employ the generated images and true images.
BI-RADS density categorization using deep neural networks
Ziba Gandomkar, Moayyad E. Suleiman, Delgermaa Demchig, et al.
The Breast Imaging and Reporting Data System (BI-RADS) density score is a qualitative measure and thus subject to inter- and intra-radiologist variability. In this study we investigated the possibility of fine-tuning a state-of-the-art deep neural network for (i) distinguishing fatty breasts (BI-RADS I and II) from dense ones (BI-RADS III and IV), (ii) classifying the low-risk group into BI-RADS I and II, and (iii) classifying the high-risk group into BI-RADS III and IV. To do so, 3813 images acquired from nine mammography units and three manufacturers were used to train an Inception-V3 network architecture. The network was pre-trained on the ImageNet data set and we trained it on our dataset using transfer learning. Before feeding the images into the input layer of Inception-V3, the breast tissue was segmented from the background and the pectoral muscle was excluded from the image in the mediolateral oblique view. Images were then cropped using the breast bounding box and resized to make them compatible with the input layer of the network. The performance of the network was evaluated on a blinded test set of 150 mammograms acquired from 14 mammography units provided by six manufacturers. The reference density value for these images was obtained based on the consensus of three radiologists. The network achieved an accuracy of 92.0% in high versus low risk classification. For the second and third classification tasks, the overall accuracy was 85.9% and 86.1%. When results from all three classifications were combined, the networks achieved an accuracy of 83.33% and a Cohen’s kappa of 0.775 (95% CI: 0.694-0.856) for four-point density categorization. The obtained results suggest that a deep learning-based computerized tool can be used for providing BI-RADS density scores.
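For readers unfamiliar with the agreement statistic reported here, the following sketch computes an unweighted Cohen's kappa from a confusion matrix. The abstract does not say whether a weighted variant was used, so the unweighted form is an assumption, and the 4×4 counts are hypothetical rather than the study's data.

```python
def cohens_kappa(confusion):
    """Unweighted Cohen's kappa.

    confusion[i][j] is the number of cases with reference category i
    and predicted category j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Observed agreement: fraction of cases on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / total ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical four-category density counts (rows: reference, columns: network).
example_confusion = [
    [20, 5, 0, 0],
    [5, 15, 5, 0],
    [0, 5, 15, 5],
    [0, 0, 5, 20],
]
kappa = cohens_kappa(example_confusion)
print(round(kappa, 3))
```

In this made-up table the raw accuracy is 70% while kappa is lower, because kappa discounts the agreement expected by chance.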
Mammographic breast density classification using a deep neural network: assessment based on inter-observer variability
N. Kaiser, A. Fieselmann, S. Vesal, et al.
Mammographic breast density is an important risk marker in breast cancer screening. The ACR BI-RADS guidelines (5th ed.) define four breast density categories that can be dichotomized by the two super-classes "dense" and "not dense". Due to the qualitative description of the categories, density assessment by radiologists is characterized by a high inter-observer variability. To quantify this variability, we compute the overall percentage agreement (OPA) and Cohen's kappa of 32 radiologists to the panel majority vote based on the two super-classes. Further, we analyze the OPA between individual radiologists and compare the performances to an automated assessment via a convolutional neural network (CNN). The data used for evaluation contains 600 breast cancer screening examinations with four views each. The CNN was designed to take all views of an examination as input and trained on a dataset with 7186 cases to output one of the two super-classes. The highest agreement to the panel majority vote (PMV) achieved by a single radiologist is 99%, the lowest score is 71% with a mean of 89%. The OPA of two individual radiologists ranges from a maximum of 97.5% to a minimum of 50.5% with a mean of 83%. Cohen's kappa values of radiologists to the PMV range from 0.97 to 0.47 with a mean of 0.77. The presented algorithm reaches an OPA to all 32 radiologists of 88% and a kappa of 0.75. Our results show that inter-observer variability for breast density assessment is high even if the problem is reduced to two categories and that our convolutional neural network can provide labelling comparable to an average radiologist. We also discuss how to deal with automated classification methods for subjective tasks.
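The two quantities used in this abstract, the panel majority vote and the overall percentage agreement, can be sketched as follows. This is an illustrative reading only: the tie-breaking rule (ties resolved as "dense") and the three-reader toy data are assumptions, since the abstract does not describe how ties among an even number of readers were handled.

```python
def panel_majority_vote(ratings_by_reader):
    """Majority vote per case over binary labels (1 = dense, 0 = not dense).

    ratings_by_reader: one list of labels per reader, all the same length.
    Ties are resolved as dense, which is an assumption of this sketch.
    """
    n_readers = len(ratings_by_reader)
    n_cases = len(ratings_by_reader[0])
    votes = []
    for case in range(n_cases):
        dense_votes = sum(reader[case] for reader in ratings_by_reader)
        votes.append(1 if 2 * dense_votes >= n_readers else 0)
    return votes


def overall_percent_agreement(labels_a, labels_b):
    """Percentage of cases on which two label sequences agree."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return 100.0 * matches / len(labels_a)


# Hypothetical ratings from three readers on four examinations.
readers = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
]
pmv = panel_majority_vote(readers)
opa_to_pmv = [overall_percent_agreement(r, pmv) for r in readers]
print(pmv, opa_to_pmv)
```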
Observer Performance
Development of methods to evaluate probability of reviewer’s assessment bias in Blinded Independent Central Review (BICR) imaging studies
Purpose: To develop novel monitoring methods in Blinded Independent Central Review (BICR) imaging trials in which two radiologist reviewers assess the same images. In this project we aimed to ‘flag’ any reviewer that might have an assessment bias compared to the assessments of other reviewers on a specific study. Methods: Retrospective data analysis using R programming scripts was used to evaluate discordant assessments between two reviewers. We use a binomial test to determine the probability that an estimated low adjudication agreement rate is statistically less than the expected rate for all reviewer discordant assessment pairs. Results: We determined that for five or more discordant cases we can calculate the probability that each individual reviewer might have a statistically significant probability of low adjudication agreement for each discordant pair of assessments. We then analyzed the assessment data for sixteen oncological BICR clinical trials. Conclusions: The basic methods described can ‘flag’ or ‘signal’ a potential assessment ‘bias’. Although we initially focused on studies following one published clinical trial criteria to evaluate solid tumor we have applied the methods to other oncological studies with different published criteria which also may require double radiological reviews.
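The binomial test described in this abstract can be illustrated with a lower-tail probability: given a number of discordant cases and an expected adjudication agreement rate, how likely is it to see this few agreements or fewer? The authors worked in R; the Python sketch below is not their script, and the expected rate of 0.5 is purely an illustrative assumption.

```python
from math import comb


def binomial_lower_tail(agreements, discordant_cases, expected_rate):
    """P(X <= agreements) for X ~ Binomial(discordant_cases, expected_rate)."""
    if not 0 <= agreements <= discordant_cases:
        raise ValueError("agreements must lie between 0 and discordant_cases")
    if not 0 <= expected_rate <= 1:
        raise ValueError("expected_rate must be a probability")
    return sum(
        comb(discordant_cases, i)
        * expected_rate ** i
        * (1 - expected_rate) ** (discordant_cases - i)
        for i in range(agreements + 1)
    )


# Hypothetical reviewer: the adjudicator never sided with them.
p_five_cases = binomial_lower_tail(0, 5, 0.5)
p_four_cases = binomial_lower_tail(0, 4, 0.5)
print(p_five_cases, p_four_cases)
```

Under the assumed rate of 0.5, the smallest attainable tail probability falls below 0.05 only once there are five discordant cases, which shows why a minimum case count matters for a test of this kind. A different expected rate would shift that threshold.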
Reader disagreement index: a better measure of overall review quality monitoring in an oncology trial compared to adjudication rate
Purpose: Blinded Independent Central Review (BICR) is highly recommended by regulatory authorities for oncology registration trials. The “adjudication rate” provided by the “Two Reviewers and Adjudicator Paradigm” of BICR has been part of reviewer performance metrics and trial efficacy. However, adjudication rate does not consider the adjudicator agreement or disagreement rate of a reviewer. We analyzed whether the Reader Disagreement Index (RDI) is a better measure than adjudication rate to monitor reviewer performance in BICR. Methods: BICR adjudication data from 3 different clinical trials including 10 board-certified radiologist reviewers using Response Evaluation Criteria in Solid Tumors (RECIST) 1.1 criteria were analyzed. RDI for each reviewer was calculated using the following formula: RDI (%) = (number of cases where the adjudicator disagreed with the given reader / total number of cases read) × 100. Reviewer adjudication rate and adjudicator agreement rate were calculated for each reviewer along with RDI. RDI was used to identify the discordant reviewer with the highest disagreement rate. Results: RDI identified the discordant reviewer in all 3 studies. Discordant reviewers identified using RDI were not the reviewers with the highest adjudication or lowest agreement rates. Adjudication rate identified the discordant reviewer in 1 of the 3 studies. The reviewer with the lowest adjudicator agreement could not have been identified as the discordant reviewer using only adjudication rate in monitoring reviewer performance. RDI is more robust in identifying a discordant reviewer who has neither the highest adjudication rate nor the lowest agreement rate. Conclusions: RDI is a more reliable measure of reviewer performance than adjudication rate and could be effectively used to monitor reviewer performance, as it combines both reviewer adjudication percentage and adjudication agreement percentage.
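A small sketch may help show how the RDI can single out a reviewer that the other two rates miss. The RDI line follows the formula given in the abstract; the definitions of adjudication rate (adjudicated cases over cases read) and adjudicator agreement rate (agreements over adjudicated cases) are assumptions of this sketch, and the three reviewers are hypothetical.

```python
def reviewer_metrics(cases_read, cases_adjudicated, adjudicator_agreed):
    """Return adjudication rate, adjudicator agreement rate and RDI, in percent."""
    if cases_read <= 0:
        raise ValueError("cases_read must be positive")
    if not 0 <= adjudicator_agreed <= cases_adjudicated <= cases_read:
        raise ValueError("inconsistent case counts")
    disagreed = cases_adjudicated - adjudicator_agreed
    agreement_rate = (
        100.0 * adjudicator_agreed / cases_adjudicated if cases_adjudicated else None
    )
    return {
        # Assumed definition: share of read cases that went to adjudication.
        "adjudication_rate": 100.0 * cases_adjudicated / cases_read,
        # Assumed definition: share of adjudicated cases won by this reviewer.
        "agreement_rate": agreement_rate,
        # Formula from the abstract: adjudicator disagreements over cases read.
        "rdi": 100.0 * disagreed / cases_read,
    }


# Hypothetical reviewers: (cases read, cases adjudicated, adjudicator agreed).
reviewers = {"A": (100, 40, 30), "B": (100, 30, 12), "C": (100, 20, 6)}
metrics = {name: reviewer_metrics(*counts) for name, counts in reviewers.items()}
highest_rdi = max(metrics, key=lambda name: metrics[name]["rdi"])
print(metrics, highest_rdi)
```

In this made-up data, reviewer B has the highest RDI even though A has the highest adjudication rate and C the lowest agreement rate, which mirrors the pattern the abstract reports.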
A 2-AFC study to validate artificially inserted microcalcification clusters in digital mammography
The validation of many studies in the field of medical imaging relies on the measurement of sensitivity and specificity of a given task. In breast cancer screening, it is common to validate studies by assessing the impact to the sensitivity and specificity of cancer detection through experiments such as N-alternative forced-choice (N-AFC) or receiver operating characteristics (ROC). In general, these experiments require large datasets of clinical mammograms containing a considerable number of cases with true positive findings. Nonetheless, in clinical practice only a small fraction of the patients (<1%) are actually diagnosed with breast cancer. One common approach to solve this data constraint is to insert lesions to images taken from healthy patients, thus creating a hybrid dataset with real and artificial data. In this work we investigate a simple method to perform the segmentation of microcalcification clusters from clinical cases, followed by the artificial insertion of the clusters into normal (true negative) mammograms. A 2-AFC human observers study was performed, where observers were asked to choose between artificially inserted and real clusters of microcalcifications. Artificially inserted clusters were selected 47.3% of the time, indicating that the readers were not able to distinguish between artificially inserted and real clusters (p = 0.65). A dataset containing 100 BI-RADS 1 clinical mammograms with artificially inserted clusters was visually evaluated by an experienced radiologist, who was asked to comment on the appearance and positioning of the clusters. The reader considered the appearance and location realistic in 98% of the images. Two cases had problems related to patient motion and breast segmentation. The codes and segmented lesions are available for download at: https://github.com/LAVI-USP/MCInsertionPackage.
The relationship between breast screening readers’ real-life performance and their associated performance on the PERFORMS scheme (Conference Presentation)
Leng Dong, Jacquie Jenkins, Eleanor Cornford, et al.
The Breast Screening Information System (BSIS) records the quality assurance results of all breast screening personnel in England. The PERFORMS self-assessment scheme also invites these individuals to take part annually by reporting a series of challenging breast screening cases, and provides feedback to help them improve their real-life screening performance. How PERFORMS data relate to actual screening performance was investigated using these two sets of data. In this study, 582 screeners consented to take part. Their performance over a three-year period was acquired from the BSIS database. Also, each individual’s comparative data were extracted from the PERFORMS database over the same time period and the relationship between the two sets of measures was examined. 533 participants’ data were successfully matched and validated. A Kendall’s tau-b correlation was run to determine the relationship between the PPV values calculated from real-life data (cancers detected / total recalls) over the past three years and the PERFORMS average PPV values over the same period. There was a strong, positive correlation between them, which was statistically significant (τb = .141, p < .01), confirming that PERFORMS data accurately reflect real-life screening performance. It can be concluded that the PERFORMS scheme could potentially be used to provide early indication of individual performance and help readers improve appropriately. More detailed analysis will also test the cancer detection rate, recall rate and discrepant cancers to see if more measures from real-life performance can be reflected by PERFORMS data.
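The statistic used in this abstract can be sketched in a few lines. PPV is taken as cancers detected divided by total recalls, as stated above, and Kendall's tau-b is computed with the standard pairwise definition that corrects for ties. The reader counts and PERFORMS values are hypothetical, and this simple quadratic-time version is an illustration, not the authors' analysis.

```python
from math import sqrt


def ppv(cancers_detected, total_recalls):
    """Positive predictive value of recall."""
    if total_recalls <= 0:
        raise ValueError("total_recalls must be positive")
    return cancers_detected / total_recalls


def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least two items")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: contributes to neither term
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denominator = sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    if denominator == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denominator


# Hypothetical readers: (cancers detected, total recalls) and PERFORMS PPV.
real_life = [ppv(c, r) for c, r in [(10, 100), (15, 100), (20, 100), (25, 100)]]
performs = [0.30, 0.50, 0.40, 0.60]
tau = kendall_tau_b(real_life, performs)
print(round(tau, 3))
```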
Blinding of the second reader in mammography screening: impact on behaviour and cancer detection
Background: The policy of the NHS Breast Cancer Screening Programme is for each woman’s mammograms to be examined by two separate readers, working independently. In practice, sometimes the second reader (reader 2) can see the decision of the first reader (reader 1). The National Breast Screening Service (NBSS) computer software automatically records whether the second reader can see the decision of the first reader or whether they are ‘blinded’. This study aimed to determine the effect of blinding the second reader on the recall rate and cancer detection rate of reader 2. Methods: Data were from eight screening centers based in the Midlands area in England participating in the 'Changing Case Order to Optimize Patterns of Performance in Screening (CO-OPS)' clinical trial. A three-level Markov Chain Monte Carlo multilevel model was fitted to determine the effect of blinding reader 2 on recall rate and cancer detection. Results: 207,595 women were included in the analysis, of whom 1,796 had cancer detected. Reader 2 was blinded to reader 1’s decisions for 54.5% (113,029/207,595) cases. If reader 2 is blinded, there is a high probability that they are more likely to recall than if they were not blinded for a prevalent case but less likely to recall an incident case. The interaction effects on reader 2’s cancer detection rate were not significant. Conclusion: If the second reader is not blinded to the decision of the first reader, they appear to be influenced by the first reader’s decision suggesting that reading is not independent.
Observer Performance in Breast Imaging
An observer study to assess the detection of calcification clusters using 2D mammography, digital breast tomosynthesis, and synthetic 2D imaging
The purpose of the study is to test the performance of the combination of digital breast tomosynthesis (DBT) and synthetic views in the detection of cancers presenting as calcifications, compared to the performance of planar mammography combined with DBT. A pilot study is presented. A set of 22 cases without cancer were collected from a Siemens Inspiration mammography system. Twenty-two simulated calcification clusters were inserted into the planar and DBT projections of 16 cases. For each case one breast and one view were used. The images were processed using Siemens proprietary software. Seven experienced mammography readers viewed the cases in three study arms: planar alone (ArmP), planar with DBT (ArmP&D) and synthetic 2D with DBT (ArmS&D). The observers marked the suspected location of the clusters and rated the likelihood of there being a suspicious calcification cluster for each case. A JAFROC figure of merit (FoM) was calculated for each study arm. The detection fractions of all cases were 46±16% (P and P&D) and 34±19% (S&D). For lesions marked for recall, the maximum detection rate was 19%. The FoMs were 0.48±0.15 (P) and 0.42±0.17 (P&D), but significantly lower (p≤0.003) for S&D (0.32±0.16). This pilot study demonstrated the feasibility of undertaking a larger study. The overall detection rates were lower (<50%) than optimal for a virtual clinical trial. We plan to increase the detection rate by using less subtle clusters in the final study. When using synthetic 2D images instead of planar images alongside DBT, the FoM was lower for subtle calcification clusters.
2D single-slice vs. 3D viewing of simulated tomosynthesis images of a small-scale breast tissue model
Christiana Balta, Ioannis Sechopoulos, Wouter J. H. Veldkamp, et al.
We investigate whether humans need to be shown the entire image stack (3D) or only the central slice (2D) of the lesion in breast tomosynthesis images in signal-known-exactly detection experiments. A directional small-scale breast tissue model based on random power-law noise was used. Assuming a breast tomosynthesis geometry, the tissue volumes were projected and reconstructed, forming volumes-of-interest (VOIs). Three different sizes of spheres with blurred edges were used to simulate lesions. The spheres were added to the VOIs to represent signal-present VOIs. Signal-present and signal-absent VOIs were presented during 2-alternative forced-choice experiments to 5 human observers in two modes: (i) 3D mode, in which all slices of the VOI were repeatedly displayed in ciné mode; and (ii) 2D mode, in which only the central slice of the reconstructed VOI (where the signal-present VOIs contained the center of the spherical lesion) was displayed. Percent correct (PC) of the detection performance of all observers was evaluated. No systematic significant differences were found in the PC between 3D and 2D image viewing for this type of background. We plan to investigate these further, along with the development of a model observer that correlates well with human performance in tomosynthesis.
Changes in breast density
L. M. Warren, M. D. Halling-Brown, L. S. Wilkinson, et al.
Purpose: To measure changes in breast density in a screening population. Method: Unprocessed mammograms were collected for 8,268 women (6,034 and 2,234 women with two and three sequential screening rounds respectively) with normal breasts (routine recall), from the OPTIMAM image database. The volumetric breast density (VBD), fibroglandular volume (FGV) and breast volume (BV) were determined and the changes between screening rounds calculated. Linear regression determined if the rate of change in these breast density measures varied significantly with age at initial screen. The women were split into four quartiles according to VBD in both screening rounds, and any changes in the quartile allocation of each woman determined. The VBD for these women was compared to our previously published data for women with screen-detected and interval cancers. Results: Averaged over all women, the percentage change in VBD, FGV and BV over 6 years was -11.2% (95% CI: -12.2% to -10.2%), -5.3% (95% CI: -6.1% to -4.5%) and 11.5% (95% CI: 10.4% to 12.6%) respectively. The percentage change per month of VBD, FGV and BV decreased significantly with age (p<0.0001). The percentage change in FGV was more strongly associated with the FGV at initial screen than with age. For the least and most dense quartiles (who would be of interest for risk stratification), the majority of women did not change quartile between screening rounds. VBD was higher for women who developed interval cancers. Conclusions: The average VBD decreased by 11% over six years. The majority (~80%) of women do not change quartile of VBD in six years.
Assessment of a quantitative mammographic imaging marker for breast cancer risk prediction
The purpose of this study is to assess the feasibility of applying a new quantitative mammographic imaging marker to predict short-term breast cancer risk. An image dataset involving 1,044 women was retrospectively assembled. Each woman had two sequential “current” and “prior” digital mammography screenings with a time interval of 12 to 18 months. All “prior” images were originally interpreted as negative by radiologists. In “current” screenings, 402 women were diagnosed with breast cancer and 642 remained negative. There is no significant difference in BI-RADS-based mammographic density ratings between three case groups (p > 0.6). A new computer-aided image processing scheme was applied to process negative mammograms acquired from the “prior” screenings and compute image features related to the bilateral mammographic density or tissue asymmetry between the left and right breasts. A group of 30 features related to GLCM texture features and the results generated by a conventional computer-aided detection scheme were extracted from both CC and MLO views. Using a leave-one-case-out cross-validation method, a support vector machine model was developed to produce a new quantitative imaging marker to predict the likelihood of a woman having mammography-detectable cancer in the next subsequent (“current”) screening. When applying the model to classify between 402 positive and 642 negative cases, the area under the ROC curve is 0.70±0.02 and the odds ratio is 6.93 with a 95% confidence interval of [4.80, 10.01]. This study demonstrated the feasibility of applying a quantitative imaging marker to predict short-term cancer risk, which aims to help establish a new paradigm of personalized breast cancer screening.
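To make the reported odds ratio and confidence interval easier to interpret, the sketch below shows one common way such figures are obtained from a 2×2 table, using the log-based (Woolf) interval. The abstract does not state how its odds ratio was derived, so the method, the function name, and the counts are all assumptions for illustration.

```python
from math import exp, log, sqrt


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    a: cancer cases scored high-risk     b: cancer cases scored low-risk
    c: negative cases scored high-risk   d: negative cases scored low-risk
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio) under the Woolf approximation.
    standard_error = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * standard_error)
    upper = exp(log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the study.
odds_ratio, ci_lower, ci_upper = odds_ratio_with_ci(20, 10, 10, 20)
print(odds_ratio, ci_lower, ci_upper)
```

Because the interval is symmetric on the log scale, the odds ratio is the geometric mean of its two limits, which gives a quick consistency check on reported values.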
Poster Session
Comparing senior residents performance to radiologists in lung cancer detection
Badera Al Mohammad, Stephen L. Hillis, Warren Reed, et al.
Background: Lung cancer, the leading cause of cancer death worldwide, can be survived if it is detected early through screening programs. However, implementing a large-scale lung cancer screening program may require a substantial increase in the number of qualified readers. This study investigates whether senior radiology residents could increase the pool of available readers in screening for lung cancer, by comparing their performance with that of board-certified radiologists. Methodology: Twenty board-certified radiologists and ten senior residents read sixty chest CT scans. Thirty cases had surgically or biopsy-proven lung cancer and the remaining thirty were cancer-free cases. The cancer cases were validated by four expert radiologists who located the malignant lung nodules. Reader performance was evaluated by calculating sensitivity, location sensitivity, specificity, area under the receiver-operating-characteristic curve (AUC) and sensitivity at fixed specificity = 0.8. Results: Readers had the following (radiologists, residents) pairs of values: sensitivity = (0.782, 0.687); location sensitivity = (0.702, 0.597); specificity = (0.8, 0.83); AUC = (0.844, 0.85) and sensitivity for fixed 0.8 specificity = (0.74, 0.73). Conclusion: Initial findings suggest that senior residents compare favorably with board-certified radiologists based on the similarity of the AUCs and the summary ROC curves in terms of the ability to discriminate between diseased and non-diseased patients. However, they have demonstrated significantly lower detection sensitivity than board-certified radiologists and may require additional training, considering the importance of having high sensitivity when screening for cancer.
Data transformations for variance stabilization in the statistical assessment of quantitative imaging biomarkers
Qi Gong, Qin Li, Marios A. Gavrielides, et al.
Variance stabilization is an important step in statistical assessment of quantitative imaging biomarkers (QIBs) to meet the equal variances requirements across different subgroups for many statistical tests. The objective of this study is to compare the commonly used Log transformation to the Box-Cox transformation for variance stabilization in the context of the assessment of a computed tomography (CT) lung nodule volume estimation QIB. Our investigation included the following: (1) We developed a model characterizing repeated measurements typically observed in CT lung nodule volume estimation. Given the model, we derived the parameter of the Box-Cox transformation that stabilizes the variance of the volume measurements across lung nodules. (2) We validated our approach using simulation data and examined factors that impact the performance of the transformations by comparing it to the standard Log transformation. The coefficient of variation for the standard deviation (CVstd) was used as the metric for quantifying the performance of transformations, with smaller CVstd indicating better variance stabilization. Results showed for both transformations, CVstd decreased with larger number of repeated measurements. For all simulated datasets, the Box-Cox transformation yielded smaller CVstd than the Log transformation. This suggests the Box-Cox transformation has better performance in variance stabilization for the estimation of lung nodule volume from CT data and can be a practical alternative for improved variance stabilization in the assessment of some types of QIBs. We are generating a guideline for determining when the Box-Cox might be a viable option to the Log transformation within a QIB assessment framework.
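The comparison described in this abstract can be sketched with a toy example. The Box-Cox transform is (x^λ − 1)/λ for λ ≠ 0 and log x for λ = 0, so the Log transformation is its λ = 0 special case. CVstd is read here as the standard deviation of the per-nodule standard deviations divided by their mean, which is an interpretation of the abstract's wording. The noise model (spread growing with the square root of volume), the value λ = 0.5, and the data are assumptions chosen for illustration and are not the authors' derived parameter.

```python
from math import log
from statistics import mean, stdev


def box_cox(x, lam):
    """Box-Cox transform of a positive value; lam = 0 gives the log transform."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    return log(x) if lam == 0 else (x ** lam - 1) / lam


def cv_std(groups, lam):
    """Coefficient of variation of per-group standard deviations after transform.

    groups: one list of repeated measurements per nodule.
    Smaller values indicate more uniform variance across nodules.
    """
    group_sds = [stdev([box_cox(x, lam) for x in group]) for group in groups]
    return stdev(group_sds) / mean(group_sds)


# Hypothetical repeated volume measurements (mm^3) for three nodules,
# built so that the spread scales with the square root of the volume.
measurements = [
    [90.0, 100.0, 110.0],
    [380.0, 400.0, 420.0],
    [1560.0, 1600.0, 1640.0],
]
cv_log = cv_std(measurements, 0)
cv_box_cox = cv_std(measurements, 0.5)
print(cv_log, cv_box_cox)
```

For data of this shape the λ = 0.5 transform gives a much smaller CVstd than the log transform. If the spread were instead proportional to the volume itself, the log transform would be the better stabilizer, which is why the parameter has to be matched to the measurement model.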
A case study regarding clinical performance evaluation method of medical device software for approval
Koji Shimizu, Gakuto Aoyama, Mizuho Nishio, et al.
We have developed a novel medical device software program that is intended to improve radiologists’ performance with regard to the detection of bone metastases on CT images. To establish the protocol of the clinical performance evaluation for regulatory approval, the feasibility of nonclinical study (NCS) and clinical study (CS) conforming to the principles of GCP was compared. There were two main issues with establishing the protocol in Japan. Firstly, retrospective case collection was not accepted for CS due to concerns regarding selection bias. However, it is difficult to prospectively collect cases for use in an observer study due to the wide variety of bone metastases cases required. Secondly, when setting the reference standard (RS) for bone metastases, the utilization of any patient information was not accepted for NCS due to concerns regarding an inappropriate RS setting. However, it was predicted that it would be difficult to set accurate RSs only using CT images. Consequently, following discussion with a Japanese regulatory body, the protocol was established as an NCS using retrospectively collected CT images. Furthermore, we agreed to use anonymized clinical images to set the RS and to rescue the test results by reviewing the RS after the observer study, hence ensuring reliability. An NCS was a feasible method of evaluating the novel medical device software while also complying with quality management systems. By using real-world clinical image databases, this clinical performance evaluation method may be both effective and efficient as a path to regulatory approval of certain medical device software.
In-vitro and in-vivo comparison of radiation dose estimates between state-of-the-art interventional fluoroscopy systems
The purpose of this study was to compare radiation dose estimates between state-of-the-art interventional fluoroscopy systems in vitro and in clinical cases. In vitro analysis included verifying system-reported air kerma rates (AKR; with a Radcal detector) and comparing AKR for simulated patient thicknesses (20-40 cm) for different dose modes and clinical protocols on Philips 'AlluraClarity' (different generations) and Siemens 'Artis Q' systems (n=4). After IRB approval, system-reported radiation dose estimates, i.e., cumulative air kerma (CAK) at the interventional reference point and kerma-area product (KAP), were extracted for interventional cases performed over a 16-month period from GE Centricity RIS and compared by procedure type. Next, CAK and KAP for patients with metastatic uveal melanoma undergoing repeat chemo/immuno-embolization (a potentially high radiation dose procedure) of the same liver lobe by the same physician on AlluraClarity and Artis Q were compared, accounting for differences in patient positioning, reference locations and digital acquisitions obtained from structured dose reports using DoseMetrix (Primordial) and CareAnalytics (Siemens). IBM's SPSS Statistics was used for parametric and non-parametric tests (with Bonferroni corrections for multiple comparisons). In vitro analysis showed significant differences (p < 0.05) in verified AKR (25-45 mGy/min lower with AlluraClarity for thicknesses of 30-40 cm). Clinical data analysis comprised 5113 cases; significant differences for CAK and KAP were seen for certain procedures (p < 0.05), with significantly lower values for AlluraClarity systems (differences in median: 34-61%). Subset analysis included 61 patients treated on both systems at different time points; accounting for differences in other parameters, CAK and KAP were significantly lower for AlluraClarity systems (p < 0.02; median for CAK lower by 44% and for KAP by 27%).
Radiation dose differences observed in vitro between the AlluraClarity and Artis Q systems were reflected in clinical cases (even for same patients undergoing the same procedure). When the differences were significant, AlluraClarity systems showed relatively lower radiation utilization.
Prostate Imaging Self-assessment and Mentoring (PRISM): a prototype self-assessment scheme
Eleni Michalopoulou, Alastair Gale, Yan Chen
Prostate cancer is the most common cancer in men and a leading cause of morbidity and mortality globally. To ensure that men receive an accurate prostate cancer diagnosis, we developed the PRISM App, a web-based self-assessment platform designed for clinicians to increase their confidence in the use of mpMRI before biopsy. The App, which provided participants with a prostate sector map, the anonymized patient’s clinical history and mpMRI technical information, was tested by three radiologists with different levels of mpMRI experience. Participants determined the number of lesions that were present in a set of twenty prostate mpMRI images, by marking and describing their location on the map. They were also asked to decide on the radiological classification, using a five-point Likert scale, and record the T-stage. Participants' screening performance was calculated by two sets of measures based on a) the expert’s opinion regarding whether a case should be recalled for further investigation or not and b) the known case pathology regarding whether malignancy was present or not. The results showed that two of the participants had specificity scores at ceiling (100%) whereas the third had a specificity score at the level of chance (50%), reflecting the small number of benign cases in the case set (n=6). Participants' comments regarding their experience using the App were positive, indicating that the PRISM scheme could be helpful in building confidence in reading mpMRI cases. Further testing with an appropriate number and variety of cases would be a key element in the success of the PRISM App.
Deep residual-network-based quality assessment for SD-OCT retinal images: preliminary study
Min Zhang, Jia Yang Wang, Lei Zhang, et al.
Optical coherence tomography (OCT) is widely used as an imaging technique for in vivo imaging of the human retina in clinical ophthalmology. For reliable clinical measurements, the quality of the OCT images needs to be sufficient. Hence, quality evaluation of OCT images is necessary. Although some quality assessment algorithms for OCT images have been proposed, their performance still needs to be improved. To the best of our knowledge, there is still no OCT image quality assessment algorithm based on a deep learning framework. To address the OCT image quality assessment issue, we propose an objective OCT image quality assessment (IQA) method using Residual Networks (ResNets) combined with support vector regression (SVR) in this paper. A dataset of 482 OCT images was constructed, and the image quality was scored by clinical experts. The deep residual network pre-trained on ImageNet was slightly revised and then fine-tuned to extract features from the OCT images. Then, the extracted features and their corresponding subjective rating scores were used to learn the non-linear map with SVR. To evaluate the performance of the proposed method, the correlation coefficients between the predicted score and the subjective rating score were used. The experimental results demonstrate that the proposed algorithm is highly efficient in OCT image quality assessment.
A statistical analysis of oral tagging in CT colonography and its impact on flat polyp detection and characterization
Marc J. Pomeroy, Matthew A. Barish, Perry J. Pickhardt, et al.
While computer-aided detection (CADe) and diagnosis (CADx) of colonic polyps via computed tomographic colonography (CTC) have made good progress for the subtypes of pedunculated and sessile polyps, the challenge remains for the subtypes of flat polyps and serrated adenomas. Oral fecal tagging has been widely used to increase the image contrast of colonic residues and fluids against the mucosal surface so that the colonic residues and fluids can be electronically cleansed for assessment of the entire mucosal surface. The tagging solution, containing barium, diatrizoate and/or iohexol, has frequently been observed to form an adherent coating on the polyps. This observation could provide additional useful information to relieve the challenges. This study aims to analyze the adherent coating performance of the oral tagging solution for the purpose of relieving the challenge for detection of flat polyps and characterization of serrated adenomas. A total of 334 polyps detected by CTC and confirmed by clinical colonoscopy (OC) with pathology were analyzed, among which 251 were tagged with a solution containing barium and diatrizoate (BD) and the remaining 83 with a tagging solution containing barium and iohexol (BL). An experienced radiologist scored the polyp coating performance on all the polyps. This study evaluates the tagging efficiency for different polyp morphologies and finds that the tagging rate for flat polyps is slightly higher for BL than for BD (78.9% vs. 73.0%). For the primary goal of differentiating hyperplastic polyps from serrated adenomas, we find that BD tags hyperplastic polyps and serrated adenomas more than BL (75.4% and 91.8% vs. 63.2% and 80.0%, respectively). Though we find a difference in coating thickness using the BD protocol, it is not statistically significant based on the data acquired.
Missed cancer and visual search of mammograms: what feature-based machine-learning can tell us that deep-convolution learning cannot
A significant amount of effort has been invested in improving the quality of breast imaging modalities (for example, mammography) to increase the accuracy of breast cancer detection. Despite that, about 4-34% of cancers are still missed during mammographic examination of the breast. This indicates the need to explore a) the features of the lesions that are missed, and b) whether the features of missed cancers contribute to why some cancers are not ‘looked at’ (search error) whereas others are ‘looked at’ but still not reported. In this visual search study, we perform feature analysis of all lesions that were missed by at least one participating radiologist. We focus on features extracted by means of Grey Level Co-occurrence Matrix properties, textural properties using Gabor filters, statistical information extraction using 2nd and higher-order (3rd and 4th) spectral analysis and also spatial-temporal attributes of radiologists’ visual search behaviour. We perform Analysis of Variance (ANOVA) on these features to explore the differences in features for cancers that were missed due to a) search, b) perception and c) decision-making errors. Using these features, we trained Support Vector Machine, Gradient Boosting and stochastic gradient descent classifiers to determine the type of missed cancer (search, perception and decision making). We compared these feature-based models with a model trained using a deep convolutional neural network that learns features by itself. We determined whether deep learning or traditional machine learning performs best in this task.