Proceedings Volume 6515

Medical Imaging 2007: Image Perception, Observer Performance, and Technology Assessment

View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 5 March 2007
Contents: 9 Sessions, 60 Papers, 0 Presentations
Conference: Medical Imaging 2007
Volume Number: 6515

Table of Contents

  • Front Matter: Volume 6515
  • ROC Methods
  • Harold Kundel Honorary Lecture and Image Perception
  • Image Display
  • Model Observers I
  • FROC, LROC, and Other Analyses
  • Model Observers II
  • Technology Assessment I
  • Technology Assessment II
Front Matter: Volume 6515
This PDF file contains the front matter associated with SPIE Proceedings Volume 6515, including the Title Page, Copyright information, Table of Contents, Introduction, and the Conference Committee listing.
ROC Methods
Describing three-class task performance: three-class linear discriminant analysis and three-class ROC analysis
Xin He, Eric C. Frey
Binary ROC analysis has solid decision-theoretic foundations and a close relationship to linear discriminant analysis (LDA). In particular, for the case of Gaussian equal covariance input data, the area under the ROC curve (AUC) value has a direct relationship to the Hotelling trace. Many attempts have been made to extend binary classification methods to multi-class. For example, Fukunaga extended binary LDA to obtain multi-class LDA, which uses the multi-class Hotelling trace as a figure-of-merit, and we have previously developed a three-class ROC analysis method. This work explores the relationship between conventional multi-class LDA and three-class ROC analysis. First, we developed a linear observer, the three-class Hotelling observer (3-HO). For Gaussian equal covariance data, the 3-HO provides equivalent performance to the three-class ideal observer and, under less strict conditions, maximizes the signal-to-noise ratio for classification of all pairs of the three classes simultaneously. The 3-HO templates are not the eigenvectors obtained from multi-class LDA. Second, we show that the three-class Hotelling trace, which is the figure-of-merit in the conventional three-class extension of LDA, has significant limitations. Third, we demonstrate that, under certain conditions, there is a linear relationship between the eigenvectors obtained from multi-class LDA and 3-HO templates. We conclude that the 3-HO based on decision theory has advantages both in its decision theoretic background and in the usefulness of its figure-of-merit. Additionally, there exists the possibility of interpreting the two linear features extracted by the conventional extension of LDA from a decision theoretic point of view.
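For reference, the conventional multi-class figure-of-merit discussed above is the Hotelling trace tr(S_w^-1 S_b), built from within-class and between-class scatter matrices. The sketch below is a standard textbook construction (not the authors' code) applied to toy Gaussian equal-covariance data; all data values are illustrative.

```python
import numpy as np

def hotelling_trace(class_samples):
    """Multi-class Hotelling trace tr(S_w^{-1} S_b), the figure-of-merit
    used in the conventional multi-class extension of LDA."""
    all_data = np.vstack(class_samples)
    grand_mean = all_data.mean(axis=0)
    dim = all_data.shape[1]
    s_w = np.zeros((dim, dim))   # within-class scatter
    s_b = np.zeros((dim, dim))   # between-class scatter
    for x in class_samples:
        mu = x.mean(axis=0)
        s_w += np.cov(x, rowvar=False) * (len(x) - 1)
        s_b += len(x) * np.outer(mu - grand_mean, mu - grand_mean)
    return float(np.trace(np.linalg.solve(s_w, s_b)))

# Toy example: three Gaussian classes with equal (identity) covariance.
rng = np.random.default_rng(0)
classes = [rng.multivariate_normal(m, np.eye(2), 500)
           for m in ([0, 0], [2, 0], [0, 2])]
print(hotelling_trace(classes))
```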
A utility-based performance metric for ROC analysis of N-class classification tasks
Darrin C. Edwards, Charles E. Metz
We have shown previously that an obvious generalization of the area under an ROC curve (AUC) cannot serve as a useful performance metric in classification tasks with more than two classes. We define a new performance metric, grounded in the concept of expected utility familiar from ideal observer decision theory, but which should not suffer from the issues of dimensionality and degeneracy inherent in the hypervolume under the ROC hypersurface in tasks with more than two classes. In the present work, we compare this performance metric with the traditional AUC metric in a variety of two-class tasks. Our numerical studies suggest that the behavior of the proposed performance metric is consistent with that of the AUC performance metric in a wide range of two-class classification tasks, while analytical investigation of three-class "near-guessing" observers supports our claim that the proposed performance metric is well-defined and positive in the limit as the observer's performance approaches that of the guessing observer.
Estimation ROC curves and their corresponding ideal observers
The LROC curve may be generalized in two ways. We can replace the location of the signal with an arbitrary set of parameters that we wish to estimate. We can also replace the binary function that determines whether an estimate is correct by a utility function that measures the usefulness of a particular estimate given the true parameter set. The expected utility for the true-positive detections may then be plotted versus the false-positive fraction as the detection threshold is varied to generate an estimation ROC curve (EROC). Suppose we run a 2AFC study where the observer must decide which image has the signal and then estimate the parameter set. Then the average value of the utility on those image pairs where the observer chooses the correct image is an estimate of the area under the EROC curve (AEROC). The ideal LROC observer may also be generalized to the ideal EROC observer, whose EROC curve lies above those of all other observers. When the utility function is non-negative the ideal EROC observer shares many properties with the ideal ROC observer, which can simplify the calculation of the ideal AEROC. When the utility function is concave the ideal EROC observer makes use of the posterior mean estimator. Other estimators that arise as special cases include maximum a posteriori estimators and maximum likelihood estimators. Multiple signals may be accommodated in this framework by making the number of signals one of the parameters in the set to be estimated.
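A minimal sketch of the 2AFC estimate described above, assuming (as in the LROC analogue) that pairs on which the observer picks the wrong image contribute zero utility to the average; the correctness flags and utility values are hypothetical.

```python
import numpy as np

def estimate_aeroc(chose_correct, utilities):
    """Estimate the area under the EROC curve from a 2AFC study: average
    the utility of the parameter estimate over all image pairs, counting
    zero utility for pairs where the observer chose the wrong image
    (assumption; see lead-in)."""
    chose_correct = np.asarray(chose_correct, dtype=float)
    utilities = np.asarray(utilities, dtype=float)
    return float(np.mean(chose_correct * utilities))

# Hypothetical data: 6 image pairs, 0/1 correctness flags, and the utility
# u(theta_hat, theta) of the estimate made on each pair.
correct = [1, 1, 0, 1, 0, 1]
utility = [0.9, 0.7, 0.4, 1.0, 0.2, 0.8]
print(estimate_aeroc(correct, utility))
```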
A bivariate binormal ROC methodology for comparing new methods to an existing standard for screening applications
Validating the use of new imaging technologies for screening large patient populations is an important and very challenging area of diagnostic imaging research. A particular concern in ROC studies evaluating screening technologies is the problem of verification bias, in which an independent verification of disease status is only available for a subpopulation of patients, typically those with positive results by a current screening standard. For example, in screening mammography, a study might evaluate a new approach using a sample of patients that have undergone needle biopsy following a standard mammogram and subsequent work-up. This case sampling approach provides accurate independent verification of ground truth and increases the prevalence of disease cases. However, the selection criteria will likely bias results of the study. In this work we present an initial exploration of an approach to correcting this bias within the parametric framework of binormal assumptions. We posit conditionally bivariate normal distributions on the latent decision variable for both the new methodology as well as the screening standard. In this case, verification bias can be seen as the effect of missing data from an operating point in the screening standard. We examine the magnitude of this bias in the setting of breast cancer screening with mammography, and we derive a maximum likelihood approach to estimating bias corrected ROC curves in this model.
Pooling MRMC forced-choice data
There are at least two sources of variance when estimating the performance of an imaging device: the doctors (readers) and the patients (cases). These sources of variability generate variances and covariances in the observer study data that can be addressed with multi-reader, multi-case (MRMC) variance analysis. Frequently, a fully-crossed study design is used to collect the data; every reader reads every case. For imaging devices used during in vivo procedures, however, a fully-crossed design is infeasible. Instead, each patient is diagnosed by only one doctor, a doctor-patient study design. Here we investigate percent correct (PC) under this doctor-patient study design. From a probabilistic foundation, we present the bias and variance of two statistics: pooled PC and reader-averaged PC. We also present variance estimates of these statistics and compare them to naive estimates. Finally, we run simulations to assess the statistics and the variance estimates. The two PC statistics have the same means but different variances. The variances depend on how patients are distributed among the readers and the amount of reader variability. Regarding the variance estimates, the MRMC estimates are unbiased, whereas the naive estimates bracket the true variance and can be extremely biased.
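The two percent-correct summaries compared in the study can be written in a few lines. The sketch below simply contrasts pooled PC (case-weighted) with reader-averaged PC (reader-weighted) for a doctor-patient design using hypothetical 0/1 correctness scores; it does not reproduce the paper's MRMC variance estimators.

```python
import numpy as np

def pooled_and_reader_averaged_pc(scores_per_reader):
    """Two percent-correct summaries for a doctor-patient design.
    scores_per_reader: list of 0/1 arrays, one per reader, where each
    entry indicates whether that reader's decision on one of their own
    cases was correct."""
    all_scores = np.concatenate(scores_per_reader)
    pooled_pc = all_scores.mean()                                     # weights by case
    reader_avg_pc = np.mean([s.mean() for s in scores_per_reader])    # weights by reader
    return pooled_pc, reader_avg_pc

# Hypothetical data: three readers with unequal caseloads.
reader_scores = [np.array([1, 1, 0, 1, 1, 1]),
                 np.array([1, 0, 1]),
                 np.array([0, 1, 1, 0])]
print(pooled_and_reader_averaged_pc(reader_scores))
```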
A proposal of the evaluation method of the effectiveness of CAD use by means of the bias with variance characteristic (BVC) analysis and the internal structure analysis of the ROC-curve
Toru Matsumoto, Akira Furukawa, Kanae Nisizawa, et al.
In this paper, we propose a methodology for evaluating whether the use of CAD is effective for any given reader or case, first analyzing the results of readers' judgments (0 or 1) by the technique known as analysis of bias-variance characteristics (BVC) [1, 2], then, by combining this with ROC analysis, elucidating the internal structure of the ROC curve. The mean and variance are first calculated for the situation when multiple readers examine a medical image for a single case without CAD and with CAD, and assign the values 0 and 1 to their judgment of whether abnormal findings are absent or present or whether the case is normal or abnormal. The mean of these values represents the degree of bias from the true diagnosis for the particular case, and the variance represents the spread of judgments between readers. When the relationship between the two parameters is examined for several cases with differing degrees of diagnostic difficulty, the mean (horizontal axis) and variance (vertical axis) show a bell-shaped relation. We have named this typical phenomenon arising when images are read the bias-variance characteristic (BVC) of diagnosis. The mean of the 0 and 1 judgments of multiple readers is regarded as a measure of the confidence level determined for the particular case. ROC curves were drawn by the usual methods for diagnoses made without CAD and with CAD. From the difference between the TPF obtained without CAD and with CAD for the same FPF on the ROC curve, we were able to quantify the number of cases, the total number of readers, and the total number of cases for which CAD support was beneficial. To demonstrate its usefulness, we applied this method to data obtained in a reading experiment that aimed to evaluate detection performance for abnormal findings and data obtained in a reading experiment that aimed to evaluate diagnostic discrimination performance for normal and abnormal cases. We analyzed the internal structure of the ROC curve produced when all cases were included, showed that there is a relationship between the degree of diagnostic difficulty of the case and the benefit of CAD support, and demonstrated that there are patients and readers for whom CAD is of benefit and those for whom it is not.
Harold Kundel Honorary Lecture and Image Perception
How to minimize perceptual error and maximize expertise in medical imaging
Harold L. Kundel
Visual perception is such an intimate part of human experience that we assume that it is entirely accurate. Yet, perception accounts for about half of the errors made by radiologists using adequate imaging technology. The true incidence of errors that directly affect patient well-being is not known, but it is probably at the lower end of the reported values of 3 to 25%. Errors in screening for lung and breast cancer are somewhat better characterized than errors in routine diagnosis. About 25% of cancers actually recorded on the images are missed, and cancer is falsely reported in about 5% of normal people. Radiologists must strive to decrease error not only because of the potential impact on patient care but also because substantial variation among observers undermines confidence in the reliability of imaging diagnosis. Observer variation also has a major impact on technology evaluation because the variation between observers is frequently greater than the difference between the technologies being evaluated. This has become particularly important in the evaluation of computer-aided diagnosis (CAD). Understanding the basic principles that govern the perception of medical images can provide a rational basis for making recommendations for minimizing perceptual error. It is convenient to organize thinking about perceptual error into five steps: 1) the initial acquisition of the image by the eye-brain (contrast and detail perception); 2) the organization of the retinal image into logical components to produce a literal perception (bottom-up, global, holistic); 3) conversion of the literal perception into a preferred perception by resolving ambiguities in the literal perception (top-down, simulation, synthesis); 4) selective visual scanning to acquire details that update the preferred perception; and 5) application of decision criteria to the preferred perception. The five steps are illustrated with examples from radiology, with suggestions for minimizing error. The role of perceptual learning in the development of expertise is also considered.
Visual search characteristics in mammography: malignant vs. benign breast masses
Mammography screening is the most widely utilized tool to screen for breast cancer. Radiologists read a mammogram using a two-pass strategy where the first pass is guided by salient features of the image (the so-called 'pop-out' elements), and the second pass is a systematic search. It is assumed that most breast masses that are reported by the radiologist are in fact detected during the first pass of this search strategy, and that the second pass is useful for the detection of microcalcification clusters. Furthermore, experiments in other visual domains have shown that observers are attracted faster to incongruous elements in a display than to normal (i.e., more expected) elements. In this sense, it can be argued that benign findings constitute more expected findings, because they encompass a large percentage of all abnormalities found on a mammogram. In this experiment we sought to determine whether the search for malignant masses was indeed faster than the search for benign masses. We also aimed to determine whether the observers' overall visual search behavior was different between benign and malignant cases, not only in terms of how long it took the observers to hit the location of the lesion, but also how long the observers took analyzing the case, how different the distribution of false positive responses were between the two types of cases, etc.
Evaluating user interfaces for stack mode viewing
M. Stella Atkins, Arthur E. Kirkpatrick, Adelle Knight, et al.
The goal of this research was to evaluate two different stack mode layouts for 3D medical images - a regular stack mode layout where just the topmost image was visible, and a new stack mode layout, which included the images just before and after the main image. We developed stripped-down user interfaces to test the techniques, and designed a look-alike radiology task using 3D artificial target stimuli implanted in the slices of medical image volumes. The task required searching for targets and identifying the range of slices containing the targets. Eight naive students participated, using a within-subjects design. We measured the response time and accuracy of subjects using the two layouts and tracked the eyegaze of several subjects while they performed the task. Eyegaze data were divided into fixations and saccades. Subjects were 19% slower with the new stack layout than the standard stack layout, but 5 of the 8 subjects preferred the new layout. Analysis of the eyegaze data showed that in the new technique, the context images on both sides were fixated once the target was found in the topmost image. We believe that the extra time was caused by the difficulty in controlling the rate of scrolling, causing overshooting. We surmise that providing some contextual detail, such as adjacent slices in the new stack mode layout, is helpful in reducing cognitive load for this radiology look-alike task.
Evaluation of perception performance in neck dissection planning using eye tracking and attention landscapes
Oliver Burgert, Veronika Örn, Boris M. Velichkovsky, et al.
Neck dissection is a surgical intervention in which cervical lymph node metastases are removed. Accurate surgical planning is of high importance because wrong judgment of the situation causes severe harm to the patient. Diagnostic perception of radiological images by a surgeon is an acquired skill that can be enhanced by training and experience. To improve accuracy in detecting pathological lymph nodes by newcomers and less experienced professionals, it is essential to understand how surgical experts solve relevant visual and recognition tasks. By using eye tracking, and especially the newly-developed attention landscape visualizations, it could be determined whether visualization options, for example 3D models instead of CT data, help increase the accuracy and speed of neck dissection planning. Thirteen ORL surgeons with different levels of expertise participated in this study. They inspected different visualizations of 3D models and original CT datasets of patients. Among other methods, we used scanpath analysis and attention landscapes to interpret the inspection strategies. It was possible to distinguish different patterns of visual exploratory activity. The experienced surgeons exhibited a higher concentration of attention on a limited number of areas of interest and made fewer saccadic eye movements, indicating better orientation.
Remote vs. head-mounted eye-tracking: a comparison using radiologists reading mammograms
Claudia Mello-Thoms, David Gur
Eye position monitoring has been used for decades in Radiology in order to determine how radiologists interpret medical images. Using these devices several discoveries about the perception/decision making process have been made, such as the importance of comparisons of perceived abnormalities with selected areas of the background, the likelihood that a true lesion will attract visual attention early in the reading process, and the finding that most misses attract prolonged visual dwell, often comparable to dwell in the location of reported lesions. However, eye position tracking is a cumbersome process, which often requires the observer to wear a helmet gear which contains the eye tracker per se and a magnetic head tracker, which allows for the computation of head position. Observers tend to complain of fatigue after wearing the gear for a prolonged time. Recently, with the advances made to remote eye-tracking, the use of head-mounted systems seemed destined to become a thing of the past. In this study we evaluated a remote eye tracking system, and compared it to a head-mounted system, as radiologists read a case set of one-view mammograms on a high-resolution display. We compared visual search parameters between the two systems, such as time to hit the location of the lesion for the first time, amount of dwell time in the location of the lesion, total time analyzing the image, etc. We also evaluated the observers' impressions of both systems, and what their perceptions were of the restrictions of each system.
Comparison of human observers and CDCOM software reading for CDMAM images
Nico Lanconelli, Stefano Rivetti, Paola Golinelli, et al.
Contrast-detail analysis is one of the most common ways of assessing the performance of an imaging system. Usually, the reading of phantoms, such as CDMAM, is done by human observers. The main drawbacks of this practice are inter-observer variability and the great amount of time needed. However, software programs are available for reading CDMAM images automatically. In this paper we present a comparison of human and software reading of CDMAM images coming from three different FFDM clinical units. Images were acquired at different exposures under the same conditions for the three systems. Once the software has completed the reading, the results are interpreted in the same way as for the human readings. CDCOM results are consistent with the human analysis for figures such as COR and IQF. On the other hand, we found some discrepancies along the CD curves obtained by human observers with respect to those estimated by the automated CDCOM analysis.
How much is enough? Factors affecting the optimal interpretation of breast screening mammograms
PERFORMS (Personal Performance in Mammographic Screening), a self-assessment scheme for film-readers, is undertaken as an educational tool by mammographers reading breast-screening films in the UK. The scheme has been running as a bi-annual exercise since its inception in 1991. In addition to completing the scheme each year, the majority of film-readers also choose to complete a questionnaire, administered as part of the scheme, indicating key aspects of their everyday reading practice. These key aspects include volume of cases read per week, time-on-task reading screening films, incidence and timing of break periods, as well as the typical number of film-reading sessions per week. Previous recommendations on best screening practice (notably the optimum time on task) were considered in the light of these film-readers' self-reports on a current PERFORMS case set. In addition, we looked at the performance accuracy of over 450 film-readers reading PERFORMS cases (60 difficult mammographic cases). Performance on measures akin to True Positive (Correct Recall Percentage) and True Negative (Correct Return to Screen Percentage) decisions was investigated. The data presented demonstrate that individual behaviours in real-life screening (namely volume of cases read per week and film-reading experience) affect film-reading accuracy on a test set of mammograms in terms of specificity and sensitivity. The consequences for best screening practice, in real life, are considered.
The dependency of pixel size on calcification reproducibility in digital mammography
Takao Kuwabara, Nobuyuki Iwasaki, Katsutoshi Yamane, et al.
The purpose of this study is to determine the relative effect of MTF, DQE, and pixel size on the shape of microcalcifications in mammography. Two original images were obtained by a) scanning the film that accompanies an RMI-156 phantom at a resolution of 25μm per pixel, b) creating an image with various shapes on a computer. Simulated images were then obtained by changing MTF, adding noise to simulate DQE effects, and changing the resolution of the original images. These images were visually evaluated to determine the recognition of the shape. In the evaluation of 400μm microcalcifications on the RMI-156 phantom, we found that shape recognition is maintained with a pixel size of 50μm or less regardless of MTF. However, at resolutions over 50μm, recognition was insufficient even when MTF was increased. Adding noise decreased visibility but did not affect shape recognition. The same results were obtained using computer-created shapes. The effect of pixel size on the recognition of the shape of microcalcifications was shown to be greater compared to MTF and DQE. It was also found that increasing MTF does not compensate for information lost because of enlarged pixel size.
Evaluation of the global effect of anatomical background on microcalcifications detection
Federica Zanca, Chantal Van Ongeval, Jurgen Jacobs, et al.
Purpose: (1) To validate a method for simulating microcalcifications in mammography; (2) to evaluate the effect of anatomical background on the visibility of (simulated) microcalcifications. Materials and methods: Microcalcifications were extracted from the raw data of specimens from a stereotactic vacuum needle biopsy. The sizes of the templates varied from 200 μm to 1350 μm and the peak contrast from 1.3% to 24%. Experienced breast imaging radiologists were asked to blindly evaluate images containing real and simulated lesions. Analysis was done using ROC methodology. The simulated lesions were used for the creation of composite image datasets: 408 microcalcifications were simulated into 161 ROIs of 59 digital mammograms having different anatomical backgrounds. Nine radiologists were asked to detect and rate them under free-search conditions. A free-response receiver operating characteristic (FROC) study was applied to find correlations between detectability and anatomical background. Results: (1) The calculated area under the ROC curve, Az, was 0.52 ± 0.04; simulated microcalcifications could not be distinguished from real ones. (2) In the anatomical background classified as Category 1 (fatty), the detection fraction was the lowest (0.48), while for Categories 2, 3, and 4 there was a gradual decrease (from 0.61 to 0.54) as the glandularity increased. The number of false positives was the highest for background Category 1 (24%), compared to the other three categories (16%). An 80% detectability was found for microcalcifications with a diameter > 400 μm and a peak contrast > 10%. Anatomic noise seems to limit the detectability of large low-contrast lesions having a diameter > 700 μm.
A software system for the simulation of chest lesions
John T. Ryan, Mark McEntee, Saoirse Barrett, et al.
We report on the development of a novel software tool for the simulation of chest lesions. This software tool was developed for use in our study to determine optimal ambient lighting conditions for chest radiology. This study involved 61 consultant radiologists from the American Board of Radiology. Because of its success, we intend to use the same tool for future studies. The software has two main functions: the simulation of lesions and the retrieval of information for ROC (Receiver Operating Characteristic) and JAFROC (Jackknife Free-Response ROC) analysis. The simulation layer operates by randomly selecting an image from a bank of reportedly normal chest x-rays. A random location is then generated for each lesion and checked against a reference lung-map. If the location is within the lung fields, as derived from the lung-map, a lesion is superimposed. Lesions are likewise randomly selected from a bank of manually created chest lesion images. A blending algorithm determines the intensity levels at which the lesion sits most naturally within the chest x-ray, as sketched below. The same software was used to run a study for all 61 radiologists. A sequence of images is displayed in random order; half of these images had simulated lesions, ranging from subtle to obvious, and half were normal. The operator then selects locations where he/she thinks lesions exist and grades each lesion accordingly. We found this software to be very effective in this study and intend to use the same principles for future studies.
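The following is a minimal, hypothetical sketch of the simulation loop just described (random lesion and location, lung-map check, blending). The alpha-blending rule and all parameter values are assumptions made for illustration; the paper does not specify its blending algorithm.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_lesion(chest_image, lung_map, lesion_bank, max_tries=100):
    """Draw a lesion template and a random location, accept the location
    only if it falls inside the lung fields, then alpha-blend the lesion
    so its intensity sits plausibly within the local background."""
    lesion = lesion_bank[rng.integers(len(lesion_bank))]
    h, w = lesion.shape
    for _ in range(max_tries):
        r = rng.integers(0, chest_image.shape[0] - h)
        c = rng.integers(0, chest_image.shape[1] - w)
        if lung_map[r:r + h, c:c + w].all():            # inside the lung fields?
            patch = chest_image[r:r + h, c:c + w]
            alpha = 0.3                                  # assumed blending weight
            blended = (1 - alpha) * patch + alpha * lesion * patch.mean()
            out = chest_image.copy()
            out[r:r + h, c:c + w] = blended
            return out, (r, c)
    return chest_image.copy(), None                      # no valid location found

# Toy data: a flat "chest image", a rectangular lung map, one Gaussian blob lesion.
img = np.full((256, 256), 0.5)
lungs = np.zeros((256, 256), dtype=bool)
lungs[40:220, 30:120] = True
y, x = np.mgrid[-8:9, -8:9]
blob = np.exp(-(x**2 + y**2) / 20.0)
sim, loc = simulate_lesion(img, lungs, [blob])
print(loc)
```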
Development of surgical lighting for enhanced color contrast
Maritoni Litorja, Steven W. Brown, Maria E. Nadal, et al.
The National Institute of Standards and Technology and the National Institutes of Health have started a collaborative study on the development of lighting that will provide enhanced, tissue-specific contrast with respect to its surroundings. In this paper we describe existing NIST technologies utilized for this project such as a computational model for color rendering and a new spectrally tunable lighting technology. We will also describe the calibration and validation procedure of a hyperspectral camera system. Finally, we show examples of imaged tissues under various lighting conditions.
Image Display
Influence of 8-bit versus 11-bit digital displays on observer performance and visual search: a multi-center evaluation
Monochrome monitors typically display 8 bits of data (256 shades of gray) at one time. This study determined whether monitors that can display a wider range of grayscale information (11-bit) can improve observer performance and decrease the use of window/level in detecting pulmonary nodules. Three sites participated using 8 and 11-bit displays from three manufacturers. At each site, six radiologists reviewed 100 DR chest images on both displays. There was no significant difference in ROC Az (F = 0.0374, p = 0.8491) as a function of 8 vs 11 bit-depth. Average Az across all observers was 0.8284 with 8 bits and 0.8253 with 11 bits. There was a significant difference in overall viewing time (F = 10.209, p = 0.0014) favoring the 11-bit displays. Window/level use did not differ significantly for the two types of displays. Eye position recording on a subset of images at one site showed that cumulative dwell times for each decision category were lower with the 11-bit than with the 8-bit display. T-tests for paired observations showed that the differences for the TP (t = 1.452, p = 0.1507), FN (t = 0.050, p = 0.9609), and FP (t = 0.042, p = 0.9676) decisions were not statistically significant. The difference for the TN decisions was statistically significant (t = 1.926, p = 0.05). 8-bit displays will not negatively impact diagnostic accuracy, but using 11-bit displays may improve workflow efficiency.
Ambient lighting: setting international standards for the viewing of softcopy chest images
Clinical radiological judgments are increasingly being made on softcopy LCD monitors. These monitors are found throughout the hospital environment in radiological reading rooms, outpatient clinics, and wards. This means that the ambient lighting where clinical judgments from images are made can vary widely. Inappropriate ambient lighting has several deleterious effects: monitor reflections reduce contrast, veiling glare adds brightness, and the dynamic range and detectability of low contrast objects are limited. Radiological images displayed on LCDs are more sensitive to the impact of inappropriate ambient lighting, and with these devices the problems described above are often more evident. The current work aims to provide data on optimum ambient lighting, based on lesions within chest images. The data provided may be used for the establishment of workable ambient lighting standards. Ambient lighting at 30 cm from the monitor was set at 480 lux (office lighting), 100 lux (WHO recommendations), 40 lux, and <10 lux. All monitors were calibrated to the DICOM part 14 GSDF. Sixty radiologists were presented with 30 chest images, 15 images having simulated nodular lesions of varying subtlety and size. Lesions were positioned in accordance with typical clinical presentation and were validated radiologically. Each image was presented for 30 seconds and viewers were asked to identify and score any visualized lesion from 1-4 to indicate their confidence level of detection. At the end of the session, sensitivity and specificity were calculated. Analysis of the data suggests that visualization of chest lesions is affected by inappropriate lighting, with chest radiologists demonstrating greater ambient lighting dependency. JAFROC analyses are currently being performed.
Grayscale standard display function on LCD color monitors
Denis De Monte, Carlo Casale, Luigi Albani, et al.
Currently, as a rule, digital medical systems use monochromatic Liquid Crystal Display (LCD) monitors to ensure an accurate reproduction of the Grayscale Standard Display Function (GSDF) as specified in the Digital Imaging and Communications in Medicine (DICOM) Standard. As a drawback, special panels need to be utilized in digital medical systems, while it would be preferable to use regular color panels, which are manufactured on a wide scale and are thus available at far lower prices. The proposed method introduces a temporal color dithering technique to accurately reproduce the GSDF on color monitors without losing monitor resolution. By exploiting the characteristics of the Human Visual System (HVS), the technique ensures that a satisfactory grayscale reproduction is achieved while minimizing perceivable flicker and undesired color artifacts. The algorithm has been implemented in the monitor using a low-cost Field Programmable Gate Array (FPGA). Quantitative evaluations of the luminance response on a 3 Mega-pixel color monitor have shown that compliance with the GSDF can be achieved with the accuracy level required by medical applications. At the same time, the measured color deviation is below the threshold perceivable by the human eye.
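By way of illustration only, temporal dithering amounts to alternating a pixel between adjacent displayable levels so that its time-averaged luminance lands on the desired GSDF value. The toy sketch below shows the idea for a single pixel and a hypothetical 8-bit panel response; the paper's actual technique additionally dithers the color subpixels and shapes the temporal pattern to suppress flicker and color artifacts, which is not modeled here.

```python
import numpy as np

def temporal_dither_sequence(target, lut_luminance, n_frames=4):
    """Return a short frame sequence of gray levels whose time-averaged
    luminance approximates a target that falls between two displayable
    levels of the (assumed) panel response lut_luminance."""
    lo = int(np.clip(np.searchsorted(lut_luminance, target) - 1,
                     0, lut_luminance.size - 2))
    hi = lo + 1
    frac = (target - lut_luminance[lo]) / (lut_luminance[hi] - lut_luminance[lo])
    n_hi = int(round(frac * n_frames))            # frames shown at the higher level
    frames = [hi] * n_hi + [lo] * (n_frames - n_hi)
    achieved = float(np.mean(lut_luminance[frames]))
    return frames, achieved

# Hypothetical 8-bit panel response from 0.5 to 400 cd/m^2.
lut = np.linspace(0.5, 400.0, 256)
print(temporal_dither_sequence(37.2, lut))
```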
A comparison between 8-bit and 10-bit luminance resolution when generating low-contrast sinusoidal test pattern on an LCD
Patrik Sund, Magnus Båth, Linda Ungsten, et al.
Radiological images are today mostly displayed on monitors, but much is still unknown regarding the interaction between monitor and viewer. Issues like monitor luminance range, calibration, contrast resolution and luminance distribution need to be addressed further. To perform vision research of high validity to the radiologists, test images should be presented on medical displays. One of the problems has been how to display low contrast patterns in a strictly controlled way. This paper demonstrates how to generate test patterns close to the detection limit on a medical grade display using subpixel modulation. Patterns are generated with both 8-bit and 10-bit monitor input. With this technique, up to 7162 luminance levels can be displayed and the average separation is approximately 0.08 of a JND (Just Noticeable Difference) on a display with a luminance range between 1 and 400 cd/m2. These patterns were used in a 2AFC detection task and the detection threshold was found to be 0.75 ± 0.02 of a JND when the adaptation level was the same as the target luminance (20 cd/m2). This is a reasonable result considering that the magnitude of a JND is based on the method of adjustment rather than on a detection task. When test patterns with a different luminance than the adaptation level (20 cd/m2) were displayed, the detection thresholds were 1.11 and 1.06 of a JND for target luminance values 1.8 and 350 cd/m2, respectively.
Visual image quality metrics for optimization of breast tomosynthesis acquisition technique
Breast tomosynthesis is currently an investigational imaging technique requiring optimization of its many combinations of data acquisition and image reconstruction parameters for optimum clinical use. In this study, the effects of several acquisition parameters on the visual conspicuity of diagnostic features were evaluated for three breast specimens using a visual discrimination model (VDM). Acquisition parameters included total exposure, number of views, full resolution and binning modes, and lag correction. The diagnostic features considered in these specimens were mass margins, microcalcifications, and mass spicules. Metrics of feature contrast were computed for each image by defining two regions containing the selected feature (Signal) and surrounding background (Noise), and then computing the difference in VDM channel metrics between Signal and Noise regions in units of just-noticeable differences (JNDs). Scans with 25 views and exposure levels comparable to a standard two-view mammography exam produced higher levels of feature contrast. The effects of binning and lag correction on feature contrast were found to be generally small and isolated, consistent with our visual assessments of the images. Binning produced a slight loss of spatial resolution which could be compensated in the reconstruction filter. These results suggest that good image quality can be achieved with the faster and therefore more clinically practical 25-view scans with binning, which can be performed in as little as 12.5 seconds. Further work will investigate other specimens as well as alternate figures of merit in order to help determine optimal acquisition and reconstruction parameters for clinical trials.
Do aging displays impact observer performance and visual search efficiency?
LCDs age, and as they do so the whitepoint shifts to a yellow hue. This changes the appearance of the displayed images. We examined whether this shift impacts the observer performance and visual search efficiency of radiologists interpreting images. Six radiologists viewed 50 DR chest images on three LCDs that had their whitepoint adjusted to simulate monitor age (new, 1 year old, 2.5 years old). They reported the presence or absence of nodules along with their confidence. Visual search was measured on a subset of 15 images using eye position recording techniques. The results indicate that there was no statistically significant difference in ROC performance due to monitor age (F = 0.4901, p = 0.6187). There were no statistically significant differences between the three monitors in terms of total viewing time (F = 0.056, p = 0.9452). Dwell times for each decision type did not differ significantly as a function of monitor age. The shift in whitepoint towards the yellow range (at least up to 2.5 years of age) does not impact the diagnostic accuracy or visual search efficiency of radiologists.
High luminance monochrome vs low luminance monochrome and color softcopy displays: observer performance and visual search efficiency
This study evaluated the potential clinical utility of a high-performance (3 Mega-pixel) color display compared with two monochrome displays--one of comparable luminance (250 cd/m2) and one of higher luminance (450 cd/m2). Six radiologists viewed 50 DR chest images, half with nodules and half without, once on each display. Eye position was recorded on a subset of images. There was no statistically significant difference in ROC Az performance as a function of monitor (F = 1.176, p = 0.3127), although there was a clear trend towards the monochrome 450 cd/m2 monitor being better than the monochrome 250 cd/m2 monitor, which was better than the color monitor. In terms of total viewing time, there were no statistically significant differences between the three monitors (F = 1.478, p = 0.2298). The dwell times associated with true and false positive decisions were shortest for the high luminance monochrome display, longer for the low luminance monochrome, and longest for the low luminance color display. Dwells for the false negative decisions were longest for the high luminance monochrome display, shorter for the low luminance monochrome, and shortest for the low luminance color display. The true negative dwells were not significantly different. The study suggests that high luminance displays may have an advantage in terms of diagnostic accuracy and visual search efficiency for detecting nodules in chest images compared to both monochrome and color lower luminance displays, although these differences may have little clinical impact because they are relatively small.
Model Observers I
Bias in Hotelling observer performance computed from finite data
An observer performing a detection task analyzes an image and produces a single number, a test statistic, for that image. This test statistic represents the observer's "confidence" that a signal (e.g., a tumor) is present. The linear observer that maximizes the test-statistic SNR is known as the Hotelling observer. Generally, computation of the Hotelling SNR, or Hotelling trace, requires the inverse of a large covariance matrix. Recent developments have resulted in methods for the estimation and inversion of these large covariance matrices with relatively small numbers of images. The estimation and inversion of these matrices is made possible by a covariance matrix decomposition that splits the full covariance matrix into an average detector-noise component and a background-variability component. Because the average detector-noise component is often diagonal and/or easily estimated, a full-rank, invertible covariance matrix can be produced with few images. We have studied the bias of estimates of the Hotelling trace using this decomposition for high-detector-noise and low-detector-noise situations. In extremely low-noise situations, this covariance decomposition may result in a significant bias. We will present a theoretical evaluation of the Hotelling-trace bias, as well as extensive simulation studies.
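A minimal sketch of the decomposition described above, with assumed toy data: the detector-noise component is taken to be a constant diagonal term, the background-variability component is estimated from a small set of sample images, and the Hotelling SNR is computed from the resulting full-rank covariance matrix. As the abstract notes, such estimates can be noticeably biased when the detector-noise term is very small.

```python
import numpy as np

def hotelling_snr(signal, bg_images, detector_noise_var):
    """Hotelling SNR using K = K_noise + K_background, with K_noise assumed
    diagonal (a constant detector-noise variance here) and K_background
    estimated from sample background images. The diagonal term makes K
    full rank even when the number of images is below the pixel count."""
    bg = np.asarray([im.ravel() for im in bg_images], dtype=float)
    k_bg = np.cov(bg, rowvar=False)                       # background variability
    k = k_bg + detector_noise_var * np.eye(bg.shape[1])   # add diagonal noise term
    s = np.asarray(signal, dtype=float).ravel()
    snr2 = s @ np.linalg.solve(k, s)                      # SNR^2 = s^T K^{-1} s
    return float(np.sqrt(snr2))

# Toy example: 8x8 images, 50 correlated backgrounds, a small Gaussian signal.
rng = np.random.default_rng(1)
backgrounds = rng.normal(0, 1, (50, 8, 8)).cumsum(axis=2) * 0.1   # crude correlation
y, x = np.mgrid[-4:4, -4:4]
sig = 0.2 * np.exp(-(x**2 + y**2) / 4.0)
print(hotelling_snr(sig, backgrounds, detector_noise_var=0.05))
```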
Adaptive Hotelling discriminant functions
Arthur Brème, Matthew A. Kupinski, Eric Clarkson, et al.
Any observer performing a detection task on an image produces a single number that represents the observer's confidence that a signal (e.g., a tumor) is present. A linear observer produces this test statistic using a linear template or a linear discriminant. The optimal linear discriminant is well known to be the Hotelling observer and uses both first- and second-order statistics of the image data. There are many situations where it is advantageous to consider discriminant functions that adapt themselves to some characteristics of the data. In these situations, the linear template is itself a function of the data and, thus, the observer is nonlinear. In this paper, we present an example adaptive Hotelling discriminant and compare the performance of this observer to that of the Hotelling observer and the Bayesian ideal observer. The task is to detect a signal that is embedded in one of a finite number of possible random backgrounds. Each random background is Gaussian but has different covariance properties. The observer uses the image data to determine which background type is present and then uses the template appropriate for that background. We show that the performance of this particular observer falls between that of the Hotelling and ideal observers.
Mass detection on real and synthetic mammograms: human observer templates and local statistics
Cyril Castella, Karen Kinkel, Francis R. Verdun, et al.
In this study we estimated human observer templates associated with the detection of a realistic mass signal superimposed on real and simulated but realistic synthetic mammographic backgrounds. Five trained naïve observers participated in two-alternative forced-choice (2-AFC) experiments in which they were asked to detect a spherical mass signal extracted from a mammographic phantom. This signal was superimposed on statistically stationary clustered lumpy backgrounds (CLB) in one instance, and on nonstationary real mammographic backgrounds in another. Human observer linear templates were estimated using a genetic algorithm. An additional 2-AFC experiment was conducted with twin noise in order to determine which local statistical properties of the real backgrounds influenced the ability of the human observers to detect the signal. Results show that the estimated linear templates are not significantly different for stationary and nonstationary backgrounds. The estimated performance of the linear template compared with the human observer is within 5% in terms of percent correct (Pc) for the 2-AFC task. Detection efficiency is significantly higher on nonstationary real backgrounds than on globally stationary synthetic CLB. Using the twin-noise experiment and a new method to relate image features to observers' trial-to-trial decisions, we found that the local statistical properties hindering or easing the detection task were the standard deviation and three features derived from the neighborhood gray-tone difference matrix: coarseness, contrast, and strength. These statistical features showed a relationship with human performance only when they were estimated within a sufficiently small area around the searched location. These findings emphasize that nonstationary backgrounds need to be described by their local statistics and not by global ones like the noise Wiener spectrum.
A contrast-sensitive channelized-Hotelling observer to predict human performance in a detection task using lumpy backgrounds and Gaussian signals
Previously, a non-prewhitening matched filter (NPWMF) incorporating a model for the contrast sensitivity of the human visual system was introduced by Badano et al. for modeling human performance in detection tasks with different viewing angles and white-noise backgrounds. But NPWMF observers do not perform well in detection tasks involving complex backgrounds, since they do not account for random backgrounds. A channelized-Hotelling observer (CHO) using difference-of-Gaussians (DOG) channels has been shown to track human performance well in detection tasks using lumpy backgrounds. In this work, a CHO with DOG channels incorporating the model of human contrast sensitivity was developed in a similar manner. We call this new observer a contrast-sensitive CHO (CS-CHO). The Barten model was the basis of our human contrast sensitivity model. The Barten model was multiplied by a scalar that was varied to control the thresholding effect of the contrast sensitivity on luminance-valued images and hence the performance-prediction ability of the CS-CHO. The performance of the CS-CHO was compared to the average human performance from the psychophysical study by Park et al., where the task was to detect a known Gaussian signal in non-Gaussian distributed lumpy backgrounds. Six different signal-intensity values were used in this study. We chose the free parameter of our model to match the mean human performance in the detection experiment at the strongest signal intensity. Then we compared the model to the human observers at the five other signal-intensity values in order to see whether the performance of the CS-CHO matched human performance. Our results indicate that the CS-CHO with the chosen scalar for the contrast sensitivity predicts human performance closely as a function of signal intensity.
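A bare-bones sketch of a channelized Hotelling observer with difference-of-Gaussians channels, run here on white-noise backgrounds with a Gaussian signal. The channel parameters and data are illustrative only, and the contrast-sensitivity weighting (the Barten-model component that distinguishes the CS-CHO) is omitted.

```python
import numpy as np

def dog_channels(img_size, n_channels=3, q=1.67, alpha=2.0, s0=0.06):
    """Radially symmetric difference-of-Gaussians channels defined on the
    discrete frequency grid (parameter values are illustrative only)."""
    f = np.fft.fftfreq(img_size)
    fx, fy = np.meshgrid(f, f)
    rho = np.sqrt(fx**2 + fy**2)
    chans = []
    for j in range(n_channels):
        sj = s0 * alpha**j
        chans.append(np.exp(-0.5 * (rho / (q * sj))**2)
                     - np.exp(-0.5 * (rho / sj)**2))
    return np.array([c.ravel() for c in chans])       # (n_channels, n_pixels)

def cho_snr(sig_present, sig_absent, channels_freq):
    """Channelized Hotelling SNR: project each image onto the channels via
    the Fourier domain, then apply the Hotelling formula in channel space."""
    def channelize(stack):
        ft = np.fft.fft2(stack).reshape(stack.shape[0], -1)
        return np.real(ft @ channels_freq.T)
    vp, va = channelize(sig_present), channelize(sig_absent)
    k = 0.5 * (np.cov(vp, rowvar=False) + np.cov(va, rowvar=False))
    dv = vp.mean(axis=0) - va.mean(axis=0)
    return float(np.sqrt(dv @ np.linalg.solve(k, dv)))

# Toy study: white-noise backgrounds with and without a Gaussian signal.
rng = np.random.default_rng(0)
n, size = 400, 32
backgrounds = rng.normal(0.0, 1.0, (n, size, size))
y, x = np.mgrid[-size // 2:size // 2, -size // 2:size // 2]
# ifftshift centers the signal at pixel (0, 0), where the zero-phase
# channel templates are centered.
signal = np.fft.ifftshift(0.4 * np.exp(-(x**2 + y**2) / 8.0))
channels = dog_channels(size)
print(cho_snr(backgrounds + signal, backgrounds, channels))
```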
Effect of slow display on stack-mode reading of volumetric image datasets using an anthropomorphic observer
Hongye Liang, Subok Park, Brandon D. Gallas, et al.
Active-matrix liquid crystal displays (LCDs) are becoming widely used in medical imaging applications. With the increasing volume of CT images to be interpreted per day, the ability to show a fast sequence of images in stack mode is desirable for a medical display. The slow temporal response of an LCD can compromise image quality/fidelity when the images are browsed in a fast sequence. In this paper, we report on the effect of the LCD response time at different image browsing speeds, based on the performance of a contrast-sensitive channelized-Hotelling observer. A correlated stack of simulated clustered lumpy background images with a signal present in some of the images was used. The effect of different browsing speeds was calculated with the LCD temporal response measurements established in our previous work. The image set was then analyzed by the model observer, which has been shown to predict human detection performance in non-Gaussian lumpy backgrounds. This allows us to quantify the effect of the slow temporal response of medical liquid crystal displays on the performance of the anthropomorphic observer. Slow temporal response of the display device greatly affects the lesion contrast and observer performance. This methodology, after validation with human observers, could be used to set limits on the rendering speed of large volumetric image datasets (from CT, MR, or tomosynthesis) read in stack mode.
Validation of closed-form compression noise statistics using model observers
Model observers have been used successfully to predict human observer performance and to evaluate image quality for detection tasks on various backgrounds in medical applications. This paper applies closed-form compression noise statistics in analytic form to model observers and to the derived channelized Hotelling observer (CHO) for decompressed images. The performance of the CHO on decompressed images is validated using the JPEG compression algorithm and lumpy background images. The results show that the derived CHO performance closely predicts its simulated performance.
FROC, LROC, and Other Analyses
FROC curves using a model of visual search
The purpose of this paper is to describe FROC (free-response receiver operating characteristic) curves predicted by a recent model of visual search. The model is characterized by three parameters (μ, λ and ν) which quantify perceived lesion signal-to-noise ratio, the average number of non-lesion locations per image considered for marking by the observer, and the probability that a lesion is considered for marking, respectively. An important characteristic of a search-model predicted FROC curve is that it is contained within the rectangle with corners at (0, 0) and (λ, ν). It is shown that λ and ν determine the x and y end-point coordinates of the FROC curve, respectively, and μ determines the sharpness of the transition from vertical slope at the origin to zero slope at (λ, ν). Two figures of merit (FOM) quantifying free-response performance are described. A FOM commonly used by CAD developers is the ordinate of the FROC curve at a specified abscissa. Another FOM, recently introduced by us, measures the ability of the observer to discriminate between normal and abnormal images. The latter is analogous to the Az measure widely used in ROC methodology. The search model is related to the initial detection and candidate analysis (IDCA) method of fitting FROC curves, but a key assumption, the shapes of the fitted curves, and the estimation methods are different. The search model yielded excellent fits to a designer-level and to a simulated clinical-level CAD data set. Available software implementing these ideas is expected to aid in the optimization of CAD algorithms.
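For concreteness, here is a small sketch of an FROC curve under one common parameterization of the search model, assuming unit-variance normal decision variables at lesion and non-lesion sites; the parameter values are arbitrary. Consistent with the description above, the curve ends at (λ, ν).

```python
import numpy as np
from scipy.stats import norm

def search_model_froc(mu, lam, nu, zeta=np.linspace(-5, 5, 201)):
    """FROC curve predicted by a visual-search model with non-lesion
    site ratings ~ N(0,1) at rate lam per image, and lesions considered
    with probability nu and rated ~ N(mu,1); zeta is the report threshold."""
    nlf = lam * norm.sf(zeta)         # non-lesion localization fraction (x-axis)
    llf = nu * norm.sf(zeta - mu)     # lesion localization fraction (y-axis)
    return nlf, llf

nlf, llf = search_model_froc(mu=2.0, lam=1.5, nu=0.9)
print(nlf[0], llf[0])                 # approaches the end point (1.5, 0.9)
```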
A non-intuitive aspect of Swensson’s LROC model
If the locations of abnormalities (targets) in an image are unknown, the evaluation of human observers' detection performance can be complex. Richard Swensson in 1996 developed a model that unified the various analysis approaches to this problem. For the LROC experiment, the model assumed that a false-positive report arises from the latent decision variable of the most suspicious non-target location of the target stimuli. The localization scoring was based on the same latent decision variable, i.e., when the latent decision variable at the non-target location was greater than the latent decision variable at the target location, the response was scored as a miss. Human observer reports vary, i.e., different locations have been identified during replications. A Monte Carlo model was developed to investigate this variation and identified a non-intuitive aspect of Swensson's LROC model. When the number of potentially suspicious locations was 1, the model performance was greater than apparently possible. For example, assume that the expected target latent decision variable is 1.0, and that both the target and non-target standard deviations are 1.0. The model predicts that the area under the ROC curve is 0.815, which implies da = 1.27. If the target latent decision variable were 0.0, then da = 0.61. The reason is that the number of latent decision variables in the model for the non-target stimuli is one, while the number of latent decision variables for the target stimuli is the maximum of two. The simulation indicated that the parameters of an LROC fit, when the number of suspicious locations is small or the observer performance is low, do not have the same intuitive meaning as the ROC parameters of an SKE task.
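A quick Monte Carlo reproduction of the scenario described above is straightforward; the sketch below assumes unit-variance normal latent decision variables and one suspicious non-target location per image, and should give an AUC near 0.815 for a target mean of 1.0 rather than the Phi(1/sqrt(2)) ~ 0.76 one might intuitively expect.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
mu = 1.0

# Signal-absent images: one suspicious non-target location, rated N(0,1).
absent = rng.normal(0.0, 1.0, n)

# Signal-present images under the model described above: the image rating
# is the larger of the target value N(mu,1) and one non-target value N(0,1).
target = rng.normal(mu, 1.0, n)
nontarget = rng.normal(0.0, 1.0, n)
present = np.maximum(target, nontarget)
correct_loc = target > nontarget      # localization scored on the same variables

# Empirical AUC: probability that a signal-present rating exceeds an
# independent signal-absent rating.
auc = np.mean(present > absent)
print(auc, correct_loc.mean())
```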
Approximating the test statistic distribution and ALROC in signal-detection tasks with signal location uncertainty
In medical imaging, signal detection is one of the most important tasks. It is especially important to study detection tasks with signal location uncertainty. One way to evaluate system performance on such tasks is to compute the area under the localization-receiver operating characteristic (LROC) curve. In an LROC study, detecting a signal includes two steps. The first step is to compute a test statistic to determine whether the signal is present or absent. If the signal is present, the second step is to identify the location of the signal. We use the test statistic which maximizes the area under the LROC curve (ALROC). We attempt to capture the distribution of this ideal LROC test statistic with signal-absent data using the extreme value distribution. Some simulated test statistics are shown along with extreme value distributions to illustrate how well our approximation captures the characteristics of the ideal LROC test statistic. We further derive an approximation to the ideal ALROC using the extreme value distribution and compare it to the direct simulation of the ALROC. Using a different approach by defining a parameterized probability density function of the data, we are able to derive another approximation to the ideal ALROC for weak signals from a power series expansion in signal amplitude.
A Bayesian interpretation of the "proper" binormal ROC model using a uniform prior distribution for the area under the curve
Richard M. Zur, Lorenzo L. Pesce, Yulei Jiang, et al.
Maximum likelihood estimation of receiver operating characteristic (ROC) curves using the "proper" binormal model can be interpreted in terms of Bayesian estimation as assuming a flat joint prior distribution on the c and da parameters. However, this is equivalent to assuming a non-flat prior distribution for the area under the curve (AUC) that peaks at AUC = 1.0. We hypothesize that this implicit prior on AUC biases the maximum likelihood estimate (MLE) of AUC. We propose a Bayesian implementation of the "proper" binormal ROC curve-fitting model with a prior distribution that is marginally flat on AUC and conditionally flat over c. This specifies a non-flat joint prior for c and da. We developed a Markov chain Monte Carlo (MCMC) algorithm to estimate the posterior distribution and the maximum a posteriori (MAP) estimate of AUC. We performed a simulation study using 500 draws of a small dataset (25 normal and 25 abnormal cases) with an underlying AUC value of 0.85. When the prior distribution was a flat joint prior on c and da, the MLE and MAP estimates agreed, suggesting that the MCMC algorithm worked correctly. When the prior distribution was marginally flat on AUC, the MAP estimate of AUC appeared to be biased low. However, the MAP estimate of AUC for perfectly separable degenerate datasets did not appear to be biased. Further work is needed to validate the algorithm and refine the prior assumptions.
Advanced system model for the prediction of the clinical task performance of radiographic systems
A flexible software tool was developed that combines predictive models for detector noise and blur with image simulation and an improved human observer model to predict the clinical task performance of existing and future radiographic systems. The model starts with high-fidelity images from a database and mathematical models of common disease features, which may be added to the images at desired contrast levels. These images are processed through the entire imaging chain, including capture, the detector, image processing, and hardcopy or softcopy display. The simulated images and the viewing conditions are passed to a human observer model, which calculates the detectability index d' of the signal (disease or target feature). The visual model incorporates a channelized Hotelling observer with a luminance-dependent contrast sensitivity function and two types of internal visual-system noise (intrinsic and image background-induced). It was optimized based on three independent human observer studies of target detection, and is able to predict d' over a wide range of viewing conditions, background complexities, and target spatial frequency content. A more intuitive metric of system performance, the Task-Specific Detective Efficiency (TSDE), is defined to indicate how much detector improvements would translate into better radiologist performance. The TSDE is calculated as the squared ratio of d' for a system with the actual detector to d' for a hypothetical system containing an ideal detector. A low TSDE, e.g., 5% for the detection of 0.1 mm microcalcifications in typical mammography systems, indicates that improvements in the detector characteristics are likely to translate into better detection performance. The TSDE for lung nodule detection is as high as 75%, even though the detective quantum efficiency (DQE) of the detector does not exceed 24%. Applications of the model to system optimization for flat-panel detectors, in mammography and dual-energy digital radiography, are discussed.
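As defined above, the TSDE is simply the squared ratio of the two detectability indices; a trivial sketch with made-up numbers:

```python
def task_specific_detective_efficiency(d_prime_actual, d_prime_ideal_detector):
    """TSDE: squared ratio of the detectability index obtained with the
    actual detector to that obtained with a hypothetical ideal detector."""
    return (d_prime_actual / d_prime_ideal_detector) ** 2

# Illustrative numbers only (not taken from the paper): a TSDE near 5%.
print(task_specific_detective_efficiency(1.1, 4.9))
```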
Model Observers II
Perception of dim targets on dark backgrounds in MRI
Some diagnostic tasks in MRI involve determining the presence of a faint feature (target) relative to a dark background. In MR images produced by taking pixel magnitudes it is well known that the contrast between faint features and dark backgrounds is reduced due to the Rician noise distribution. In an attempt to enhance detection we implemented three different MRI reconstruction algorithms: the normal magnitude, phase-corrected real, and a wavelet thresholding algorithm designed particularly for MRI noise suppression and contrast enhancement. To compare these reconstructions, we had volunteers perform a two-alternative forced choice (2AFC) signal detection task. The stimuli were produced from high-field head MRI images with synthetic thermal noise added to ensure realistic backgrounds. Circular targets were located in regions of the image that were dark, but next to bright anatomy. Images were processed using one of the three reconstruction techniques. In addition we compared a channelized Hotelling observer (CHO) to the human observers in this task. We measured the percentage correct in both the human and model observer experiments. Our results showed better performance with the use of magnitude or phase-corrected real images compared to the use of the wavelet algorithm. In particular, artifacts induced by the wavelet algorithm seem to distract some users and produce significant inter-subject variability. This contradicts predictions based only on SNR. The CHO matched the mean human results quite closely, demonstrating that this model observer may be used to simulate human response in MRI target detection tasks.
Evaluation of the channelized Hotelling observer for signal detection in 2D tomographic imaging
Signal detection by the channelized Hotelling (ch-Hotelling) observer is studied for tomographic applications by employing a small, tractable 2D model of a computed tomography (CT) system. The primary goal of this work is to develop a practical method for evaluating the ch-Hotelling observer that can generalize to larger 3D cone-beam CT systems. The use of the ch-Hotelling observer for evaluating tomographic image reconstruction algorithms is also demonstrated. For a realistic CT model, the ch-Hotelling observer can be a good approximation to the ideal observer. The ch-Hotelling observer is applied to both the projection data and the reconstructed images; the difference in signal-to-noise ratio for signal detection in these two domains provides a metric for evaluating the image reconstruction algorithm.
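As a rough illustration of the figure of merit used above, a channelized Hotelling SNR can be estimated from sample images under the usual definitions; a minimal sketch (array shapes and names are assumptions, not the authors' code):

    import numpy as np

    def cho_snr(signal_samples, noise_samples, channels):
        # signal_samples, noise_samples: (n_samples, n_pixels) flattened images
        # channels: (n_pixels, n_channels) channel matrix U
        v_sig = signal_samples @ channels          # channel outputs, signal present
        v_noi = noise_samples @ channels           # channel outputs, signal absent
        dv = v_sig.mean(axis=0) - v_noi.mean(axis=0)
        S = 0.5 * (np.cov(v_sig, rowvar=False) + np.cov(v_noi, rowvar=False))
        w = np.linalg.solve(S, dv)                 # Hotelling template in channel space
        return np.sqrt(dv @ w)                     # SNR, since SNR^2 = dv^T S^-1 dv

Computing this SNR once on the projection data and once on the reconstructed images gives the domain comparison described above.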
Perceptual difference model (Case-PDM) for evaluation of MR images: validation and calibration
There is an extraordinary number of fast MR imaging techniques, especially for parallel imaging. When one considers multiple reconstruction algorithms, reconstruction parameters, coil configurations, acceleration factors, noise levels, and multiple test images, one can easily create thousands of test images for image quality evaluation. We have found the perceptual difference model (Case-PDM) to be quite useful as a means of rapid quantitative image quality evaluation in such experiments, and have applied it to keyhole, spiral, SENSE, and GRAPPA applications. In this study, we compared human evaluation of MR images from multiple organs and from multiple image reconstruction algorithms to Case-PDM. We compared human DSCQS (Double Stimulus Continuous Quality Scale) scoring against Case-PDM measurements for three image types and three image reconstruction algorithms. We found that Case-PDM correlated linearly (r > 0.9) with human subject ratings over a very large range of image quality. We also compared Case-PDM to other image quality evaluation methods. Case-PDM generally performed better than NASA's DCTune, MITRE's IQM, Zhou Wang's NR models, and the mean square error (MSE) method, showing a higher Pearson correlation coefficient, a higher Spearman rank-order correlation, and a lower root-mean-squared error. All three models (Case-PDM, Sarnoff's IDM, and Zhou Wang's SSIM) performed very similarly in this experiment. To focus on high-quality reconstructions, we performed a two-alternative forced choice (2-AFC) experiment to determine the "just perceptible difference" between two images. We found that threshold Case-PDM scores changed little (0.6-1.8) across two image types and three degradation patterns, and Case-PDM results were much tighter than those of the other methods (IDM and MSE), showing a lower mean-to-standard-deviation ratio. We conclude that Case-PDM can correctly predict the ordering of image quality over a large range of image quality, and it can also be used to screen images that are "perceptually equal" to the original image. Although Case-PDM is a very useful tool for comparing similar raw images with similar processing, one should be careful when interpreting Case-PDM scores across MR images.
Markov chain Monte Carlo (MCMC) based ideal observer estimation using a parameterized phantom and a pre-calculated dataset
Xin He, Brian S. Caffo, Eric C. Frey
The ideal observer (IO) employs complete knowledge of the available data statistics and sets an upper limit on the observer performance on a binary classification task. Kupinski proposed an IO estimation method using Markov chain Monte Carlo (MCMC) techniques. In principle, this method can be generalized to any parameterized phantoms and simulated imaging systems. In practice, however, it can be computationally burdensome, because it requires sampling the object distribution and simulating the imaging process a large number of times during the MCMC estimation process. In this work we propose methods that allow application of MCMC techniques to cardiac SPECT imaging IO estimation using a parameterized torso phantom and an accurate analytical projection algorithm that models the SPECT image formation process. To accelerate the imaging simulation process and thus enable the MCMC IO estimation, we used a phantom model with discretized anatomical parameters and continuous uptake parameters. The imaging process simulation was modeled by pre-computing projections for each organ in the finite number of discretely-parameterized anatomic models and taking linear combinations of the organ projections based on sampling of the continuous organ uptake parameters. The proposed method greatly reduces the computational burden and makes MCMC IO estimation for cardiac SPECT imaging possible.
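The acceleration described above amounts to expressing each simulated projection as a linear combination of stored per-organ projections; a minimal illustration (array names are assumptions):

    import numpy as np

    def simulate_projection(organ_projections, organ_uptakes):
        # organ_projections: (n_organs, n_bins) projections pre-computed once for
        # one discretely sampled anatomy; organ_uptakes: (n_organs,) sampled
        # continuous activity values. The projection is linear in activity, so a
        # weighted sum replaces a full projection simulation inside the MCMC loop.
        return organ_uptakes @ organ_projections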
Task-based evaluation of practical lens designs for lens-coupled digital mammography systems
Liying Chen, Leslie D. Foo, Rebecca L. Cortesi, et al.
Recent developments in low-noise, large-area CCD detectors have renewed interest in radiographic systems that use a lens to couple light from a scintillation screen to a detector. The lenses for this application must have very large numerical apertures and high spatial resolution over the field of view (FOV). This paper expands on our earlier work by applying the principles of task-based assessment of image quality to the development of meaningful figures of merit for the lenses. The task considered in this study is detection of a lesion in a mammogram, and the figure of merit is the lesion detectability, expressed as a task-based signal-to-noise ratio (SNR), for a channelized Hotelling observer (CHO). As in the previous work, the statistical model accounts for the random structure in the breast, the statistical properties of the scintillation screen, the random coupling of light to the CCD, the detailed structure of the shift-variant lens point spread function (PSF), and the Poisson noise of the X-ray flux. The lenses considered range from F/0.9 to F/1.2, and all yield nominally the same spot size at a given field. Among the F/0.9 lenses, some were designed by conventional means for high resolution and some for high contrast, and the shapes of their PSFs differ considerably. The results show that excessively large lens numerical apertures do not improve the task-based SNR but dramatically increase the optics fabrication cost. Contrary to common wisdom, high-contrast designs have higher task-based SNRs than high-resolution designs when the signal is small. Additionally, we constructed a merit function that successfully tunes the lenses to perform equally well anywhere in the FOV.
Tools and methods for exposure control optimization in digital mammography in presence of texture
To accurately detect radiological signs of cancer, mammography requires the best possible image quality for a target patient dose. The application of automatic optimization of parameters (AOP) to digital systems has been improved recently. The metric used to derive this AOP was based on the expected CNR of calcium material in a uniform background. In this work, we use a new metric based on the detection performance of an a-contrario observer on lesions in simulated images. Breast images at various thicknesses and glandularity levels were simulated with flat and textured backgrounds. Various exposure spectra (Mo/Mo, Mo/Rh, and Rh/Rh anode/filter materials, with kVp ranging from 25 to 33 kV) were considered. The tube output was normalized to obtain comparable average glandular dose (AGD) values for each image of a given breast over the various acquisition techniques. Images were scored with the a-contrario observer, the performance criterion being the minimal lesion size needed to reach a given detection threshold. The optimal spectra are similar to those delivered by the AOP in both flat and textured backgrounds. The choice of the anode/filter combination appears to be more critical than kVp adjustments, in particular for thicker breasts. Our approach also yields an estimate of the detection variability due to the texture signal. We found that the anatomical structure variability cannot be overcome by beam quality optimization of the current system in the presence of complex backgrounds, which confirms the potential benefit of any imaging technology that reduces the variability of detection due to texture.
Combining a wavelet transform with a channelized Hotelling observer for tumor detection in 3D PET oncology imaging
Carole Lartizien, Sandrine Tomei, Voichita Maxim, et al.
This study evaluates new observer models for 3D whole-body positron emission tomography (PET) imaging based on a wavelet sub-band decomposition and compares them with the classical constant-Q CHO model. Our final goal is to develop an original method that performs guided detection of abnormal activity foci in PET oncology imaging based on these new observer models. This computer-aided diagnostic method would be of great benefit to clinicians for diagnostic purposes and to biologists for large-scale screening of rodent populations in molecular imaging. Method: We have previously shown good correlation of the channelized Hotelling observer (CHO) using a constant-Q model with human observer performance for 3D PET oncology imaging. We propose an alternate method that combines a CHO observer with a wavelet sub-band decomposition of the image and we compare it to the standard CHO implementation. This method performs an undecimated transform using a biorthogonal B-spline 4/4 wavelet basis to extract the feature set for input to the Hotelling observer. This work is based on simulated 3D PET images of an extended MCAT phantom with randomly located lesions. We compare three evaluation criteria: classification performance using the signal-to-noise ratio (SNR), computational efficiency, and visual quality of the derived 3D maps of the decision variable λ. The SNR is estimated on a series of test images for a variable number of training images for both observers. Results: The maximum SNR is higher with the constant-Q CHO observer, especially for targets located in the liver, and it is reached with a smaller number of training images. However, preliminary analysis indicates that the visual quality of the 3D maps of the decision variable λ is higher with the wavelet-based CHO, and the computation time to derive a 3D λ-map is about 350 times shorter than for the standard CHO. This suggests that the wavelet-CHO observer is a good candidate for use in our guided detection method.
Technology Assessment I
Performance analysis for computer-aided lung nodule detection on LIDC data
For more than a decade, computer-aided detection (CAD) of pulmonary nodules has been an active research area, and there are numerous publications dedicated to this topic. Most authors have created their own database with their own ground truth for validation, which makes it hard to compare the performance of different systems with each other. It is well known that the performance of a CAD system can differ significantly depending on the data on which it is tested and on the underlying ground truth. The Lung Image Database Consortium (LIDC) has recently released 93 publicly available lung images with ground truth lists from four different radiologists. This database will make it possible to compare the performance of different CAD algorithms. In this paper we take a first step toward using the LIDC data as a benchmark test. We present a CAD algorithm with a validation study on these data sets. The CAD performance was analyzed by means of multiple free-response receiver operating characteristic (FROC) curves for different lower thresholds of the nodule diameter. There are different ways to merge the ground truth lists of the four radiologists, and we discuss the performance of our CAD algorithm for several of these possibilities. For nodules with a volume-equivalent diameter ≥ 4 mm that were confirmed by all four radiologists, our CAD system shows a detection rate of 89% at a median false-positive rate of 2 findings per patient.
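For readers who want the mechanics of the FROC summary used above, a simplified sketch of turning per-candidate CAD scores into (false positives per scan, sensitivity) operating points (it assumes each true nodule is hit by at most one candidate; names are illustrative):

    import numpy as np

    def froc_points(scores, hits_nodule, n_nodules, n_scans):
        # scores: confidence of each CAD candidate; hits_nodule: True if the
        # candidate matches a ground-truth nodule. Sweeping the threshold from
        # high to low traces out the FROC curve.
        order = np.argsort(scores)[::-1]
        hits = np.asarray(hits_nodule, dtype=bool)[order]
        sensitivity = np.cumsum(hits) / n_nodules
        fps_per_scan = np.cumsum(~hits) / n_scans
        return fps_per_scan, sensitivity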
Effect of CAD on radiologists’ detection of lung nodules on thoracic CT scans: observer performance study
The purpose of this study was to evaluate the effect of computer-aided diagnosis (CAD) on radiologists' performance for the detection of lung nodules on thoracic CT scans. Our computer system was designed using an independent training set of 94 CT scans in our laboratory. The data set for the observer performance study consisted of 48 CT scans: twenty were collected from patient files at the University of Michigan, and 28 were provided by the Lung Image Database Consortium (LIDC). All scans were read by multiple experienced thoracic radiologists to determine the true nodule locations, defined as any region identified by one or more expert radiologists as containing a nodule larger than 3 mm in diameter. Eighteen CT examinations were nodule-free, while the remaining 30 CT examinations contained a total of 73 nodules with a median size of 5.5 mm (range 3.0-36.4 mm). Four other study radiologists read the CT scans first without and then with CAD, and provided likelihood-of-nodule ratings for suspicious regions. Two of the study radiologists were fellowship-trained in cardiothoracic radiology, and two were cardiothoracic radiology fellows. Free-response receiver operating characteristic (FROC) curves were used to compare the two reading conditions. The computer system had a sensitivity of 79% (58/73) with an average of 4.9 marks per normal scan (88/18). Jackknife alternative FROC (JAFROC) analysis indicated that the improvement with CAD was statistically significant (p = 0.03).
Reducing variability in the output of artificial neural networks through output calibration
Shalini Gupta, Wendy C. Kan, Tiffany C. Lin, et al.
In this study we developed an effective, novel method for reducing the variability in the outputs of different artificial neural network (ANN) configurations that have the same overall performance as measured by the area under their receiver operating characteristic (ROC) curves. This variability can lead to inaccuracies in the interpretation of results when the outputs are used as classification predictors. We extended a method, previously proposed to reduce variability in classifier performance across data sets from different institutions, to the outputs of ANN configurations. Our approach is based on histogram shaping of the outputs of all ANN configurations to resemble the output histogram of a baseline ANN configuration. We tested the effectiveness of the technique using synthetic data generated from two two-dimensional isotropic Gaussian distributions and 100 ANN configurations. The proposed output calibration technique significantly reduced the median standard deviation of the ANN outputs from 0.010 before calibration to 0.006 after calibration. The standard deviation of the sensitivity of the 100 ANN configurations at the same decision threshold decreased significantly from 0.005 before calibration to 0.003 after calibration. Similarly, the standard deviation of their specificity values decreased significantly from 0.016 before calibration to 0.006 after calibration.
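The histogram shaping described above can be approximated by quantile mapping of each network's outputs onto the baseline configuration's output distribution; a minimal sketch (a simplified stand-in, not the authors' implementation):

    import numpy as np

    def calibrate_to_baseline(outputs, baseline_outputs):
        # Map each output value to the baseline value at the same empirical
        # quantile, so the calibrated histogram resembles the baseline histogram.
        src = np.sort(outputs)
        dst = np.quantile(baseline_outputs, np.linspace(0.0, 1.0, src.size))
        return np.interp(outputs, src, dst)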
Technology assessment: observer study directly compares screen/film to CR mammography
Lynn Fletcher-Heath, Anne Richards, Susan Ryan-Kron
A new study supports and expands upon a previous report that computed radiography (CR) mammography offers image quality as good as, or better than, state-of-the-art screen/film mammography. The suitability of CR mammography is explored through qualitative and quantitative study components: feature comparison and cancer detection rates for each modality. Images were collected from 150 normal and 50 biopsy-confirmed subjects representing a range of breast and pathology types. Comparison views were collected without releasing compression, using automatic exposure control on Kodak MIN-R films, followed by CR. Digital images were displayed as both softcopy (S/C) and hardcopy (H/C) for the feature comparison, and as S/C for the cancer detection task. The qualitative assessment used preference scores from five board-certified radiologists obtained while viewing 100 screen/film-CR pairs from the cancer subjects for S/C and H/C CR output. Fifteen general image-quality features were rated, and up to 12 additional features were rated for each pair, based on the pathology present. Results demonstrate that CR is equivalent or preferred to conventional mammography for overall image quality (89% S/C, 95% H/C), image contrast (95% S/C, 98% H/C), sharpness (86% S/C, 93% H/C), and noise (94% S/C, 91% H/C). The quantitative objective was addressed by asking 10 board-certified radiologists to provide a BI-RADS™ score and probability of malignancy per breast for each modality of the 200 cases. At least 28 days passed between observations of the same case. Average sensitivity and specificity were 0.89 and 0.82 for CR and 0.91 and 0.82 for screen/film, respectively.
Evaluation of hardware in a small-animal SPECT system using reconstructed images
Evaluation of imaging hardware represents a vital component of system design. In small-animal SPECT imaging, this evaluation has become increasingly difficult with the emergence of multi-pinhole apertures and adaptive, or patient-specific, imaging. This paper describes two methods for hardware evaluation using reconstructed images. The first method is a rapid technique incorporating a system-specific, non-linear, three-dimensional point response. This point response is easily computed and offers qualitative insight into an aperture's resolution and artifact characteristics. The second method is an objective assessment of signal detection in lumpy backgrounds using the channelized Hotelling observer (CHO) with 3D Laguerre-Gauss and difference-of-Gaussian channels to calculate the area under the receiver operating characteristic curve (AUC). Previous work presented at this meeting described a unique small-animal SPECT system (M3R) capable of operating under a myriad of hardware configurations and ideally suited for image quality studies. Measured system matrices were collected for several hardware configurations of M3R. The data used to implement the two methods were then generated by passing simulated objects through the measured system matrices. The results of the two methods comprise a combination of qualitative and quantitative analysis that is well suited for hardware assessment.
Technology Assessment II
A method for analyzing contrast-detail curves
K. M. Ogden, W. Huda, K. Shah, et al.
The purpose of this study was to develop a concise way to summarize radiographic contrast-detail curves. We obtained experimental data that measured lesion detection in CT images of a 5-year-old anthropomorphic phantom. Five lesion diameters (2.5 to 12.5 mm) were investigated, and contrast-detail (CD) curves were generated at each of five tube current-exposure time product (mAs) values using two-alternative forced-choice (2-AFC) studies. A performance index for each CD curve was calculated as the area under the curve bounded by the maximum and minimum lesion sizes, normalized by the range of lesion sizes used. We denote this quantity, which is mathematically equal to the mean value of the CD curve, as the contrast-detail performance index (PCD). This quantity is inspired by the area under the curve (Az) that is used as a performance index in ROC studies, though there are important differences. PCD, like Az, reduces the dimensionality of the experimental results, simplifying interpretation of the data while discarding details of the respective curve (CD or ROC). Unlike Az, PCD decreases with increasing performance, and its range of values is not fixed as it is for Az (i.e., 0 < Az < 1). PCD is proportional to the average SNR for the lesions used in the 2-AFC experiments, and allows relative performance comparisons as experimental parameters are changed. For the CT data analyzed, the PCD values were 0.196, 0.166, 0.146, 0.132, and 0.121 at mAs values of 30, 50, 70, 100, and 140, respectively. This corresponds to an increase in performance (i.e., a decrease in required contrast) relative to the 30 mAs PCD value of 62%, 48%, 33%, and 18% for the 140, 100, 70, and 50 mAs data, respectively.
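Since PCD is defined above as the area under the CD curve normalized by the lesion-size range (equivalently, the mean value of the curve), it can be computed directly; a minimal sketch:

    import numpy as np

    def contrast_detail_performance_index(lesion_sizes, threshold_contrasts):
        # Area under the contrast-detail curve between the smallest and largest
        # lesion sizes, divided by the size range; lower values mean better
        # performance (less contrast needed for detection).
        sizes = np.asarray(lesion_sizes, dtype=float)
        contrasts = np.asarray(threshold_contrasts, dtype=float)
        return np.trapz(contrasts, sizes) / (sizes.max() - sizes.min())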
WorkstationJ: workstation emulation software for medical image perception and technology evaluation research
Kevin M. Schartz, Kevin S. Berbaum, Robert T. Caldwell, et al.
We developed image presentation software that mimics the functionality available in the clinic but also records time-stamped observer-display interactions and is readily deployable on diverse workstations, making it possible to collect comparable observer data at multiple sites. Commercial image presentation software for clinical use has limited application for research on image perception, ergonomics, computer aids, and informatics because it does not collect observer responses, or other information on observer-display interactions, in real time. It is also very difficult to collect observer data from multiple institutions unless the same commercial software is available at the different sites. Our software not only records observer reports of abnormalities and their locations, but also the inspection time until report, the inspection time for each computed radiograph and for each slice of tomographic studies, and the window/level and magnification settings used by the observer. The software is a modified version of the open-source ImageJ software available from the National Institutes of Health; our modifications involve changes to the base code and extensive new plugin code. Our free software is currently capable of displaying computed tomography and computed radiography images. The software is packaged as Java class files and can be used on Windows, Linux, or Mac systems. By deploying our software together with experiment-specific script files that administer experimental procedures and image file handling, multi-institutional studies can be conducted that increase reader and/or case sample sizes or add experimental conditions.
Automatic evaluation of uterine cervix segmentations
In this work we focus on the generation of reliable ground truth data for a large medical repository of digital cervicographic images (cervigrams) collected by the National Cancer Institute (NCI). This work is part of an ongoing effort conducted by NCI together with the National Library of Medicine (NLM) at the National Institutes of Health (NIH) to develop a web-based database of digitized cervix images in order to study the evolution of lesions related to cervical cancer. As part of this effort, NCI has gathered twenty experts to manually segment a set of 933 cervigrams into regions of medical and anatomical interest, yielding a set of images with multi-expert segmentations. The objectives of the current work are: 1) to generate multi-expert ground truth and assess the difficulty of segmenting an image, 2) to analyze observer variability in the multi-expert data, and 3) to utilize the multi-expert ground truth to evaluate automatic segmentation algorithms. The work is based on STAPLE (Simultaneous Truth and Performance Level Estimation), a well-known method for generating ground truth segmentation maps from multiple experts' observations. We have analyzed both intra- and inter-expert variability within the segmentation data. We propose novel measures of "segmentation complexity" by which we can automatically identify cervigrams that the experts found difficult to segment, based on their inter-observer variability. Finally, the results are used to assess our own automated algorithm for cervix boundary detection.
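For orientation, the core of STAPLE for a single binary label can be written as a short EM iteration over per-expert sensitivity and specificity; the sketch below is a simplified stand-in (fixed global prior, no spatial model), not the full algorithm used in the paper:

    import numpy as np

    def staple_binary(decisions, n_iter=50):
        # decisions: (n_experts, n_pixels) array of 0/1 labels.
        D = np.asarray(decisions, dtype=float)
        gamma = D.mean()                     # prior probability of foreground
        p = np.full(D.shape[0], 0.9)         # expert sensitivities
        q = np.full(D.shape[0], 0.9)         # expert specificities
        for _ in range(n_iter):
            # E-step: posterior probability that each pixel is truly foreground
            a = gamma * np.prod(np.where(D == 1, p[:, None], 1 - p[:, None]), axis=0)
            b = (1 - gamma) * np.prod(np.where(D == 1, 1 - q[:, None], q[:, None]), axis=0)
            W = a / (a + b)
            # M-step: re-estimate each expert's performance parameters
            p = (D * W).sum(axis=1) / W.sum()
            q = ((1 - D) * (1 - W)).sum(axis=1) / (1 - W).sum()
        return W, p, q

The returned W is the consensus probability map, while p and q summarize each expert's agreement with that consensus.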
Improved detection of coronary artery calcifications using dual energy subtraction radiography
R. S. Lazebnik M.D., P. B. Sachs, R. C. Gilkeson M.D.
PURPOSE: Detection of coronary artery calcifications (CAC) using conventional chest radiographs has a high positive predictive value but low sensitivity for coronary artery disease. We investigated the role of dual energy imaging in enhancing reader performance in the detection of CAC, which is indicative of atherosclerotic plaques. METHODS: A sample of 53 patients with CT-documented CAC and 23 patients without CT evidence of CAC was imaged using a dual energy protocol on an amorphous silicon flat-panel system (Revolution XR/d, GE Medical Systems). The acquisition sequence consisted of a 60 kVp ("low energy") exposure followed by a 120 kVp ("high energy") exposure with a time separation of 150 ms. Subsequent image processing yielded conventional PA and lateral radiographs and a subtracted PA "bone image". For all patients and both data sets, CAC were evaluated by two experienced board-certified thoracic radiologists using a Likert scale (1-5 score). RESULTS: Sensitivity for CAC detection using conventional radiographs was 34.0% and 56.6%, while specificity was 96.6% and 91.3%, for the two readers respectively. Using the "bone images", sensitivity was 92.4% and 83.0%, while specificity was 100% and 91.3%. For patients with verified CAC, the "bone images" resulted in at least a one-point Likert score increase in 73.6% and 54.7% of cases for the two readers. CONCLUSION: We conclude that, using dual energy technology, "bone images" may allow higher sensitivity in detecting CAC compared with conventional radiographs, without decreased specificity. We therefore believe our findings are useful in defining a role for dual energy subtraction radiography in improved detection of coronary artery disease.
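The abstract does not state how the subtracted "bone image" was formed; the standard dual-energy approach is a weighted log subtraction chosen to cancel soft-tissue contrast, sketched here purely for illustration (the weight w and the sign convention vary between systems and are assumptions):

    import numpy as np

    def dual_energy_bone_image(low_kvp, high_kvp, w=0.5, eps=1e-6):
        # Weighted log subtraction: when w matches the ratio of soft-tissue
        # attenuation at the two energies, soft tissue cancels and calcified
        # structures (bone, coronary calcifications) remain.
        return np.log(high_kvp + eps) - w * np.log(low_kvp + eps)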
Visual quality assessment of watermarked medical images
Increasing transmission of medical images across multiple user systems raises concerns for image security. Hiding watermark information in medical image data files is one solution for enhancing the security and privacy protection of the data. Medical image watermarking, however, is not a widely studied area, due in part to concerns that the degradation of image information may reduce viewer performance. Such concerns are addressed if the amount of information lost due to watermarking can be kept at minimal levels and below visual perception thresholds. This paper describes experiments in which three alternative visual quality metrics were used to assess the degradation caused by watermarking medical images. Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) medical images were watermarked using different methods, block-based Discrete Cosine Transform (DCT) and Discrete Wavelet Transform (DWT), with various embedding strengths. The visual degradation for each watermarking parameter setting was assessed using the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Measure (SSIM), and Steerable Visual Difference Predictor (SVDP) numerical metrics. The suitability of each of the three numerical metrics for visual quality assessment of medical image watermarking is noted. In addition, subjective test results from human observers are used to suggest visual degradation thresholds.
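Two of the three metrics named above have standard open implementations; a minimal sketch of computing PSNR and SSIM for an original/watermarked pair with scikit-image (the SVDP has no widely available implementation and is omitted):

    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def watermark_degradation(original, watermarked):
        # Higher PSNR and SSIM (closer to 1) indicate less visible degradation.
        data_range = float(original.max() - original.min())
        psnr = peak_signal_noise_ratio(original, watermarked, data_range=data_range)
        ssim = structural_similarity(original, watermarked, data_range=data_range)
        return psnr, ssim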
Evaluation of an interactive computer-aided diagnosis (ICAD) system for mammography: a pilot study
Bin Zheng, Gordon Abrams, Cynthia A. Britton M.D., et al.
In this pilot study, five radiologists detected suspicious mass regions depicted on mammograms acquired from 32 examinations. Among these, 24 examinations depicted subtle masses (12 malignant and 12 benign) and 8 were negative. Each observer interpreted a case in sequential order under three reading modes. In mode one, the observer interpreted images without viewing CAD-generated cues and provided two likelihood scores (for detection and classification) for each identified suspicious region. In mode two, CAD-cued results were provided and the observer could decide whether to make any changes in the previous ratings. In mode three, each observer was required to query at least one suspected region. Once a region was queried, the CAD scheme automatically segmented the mass region and computed a set of image features. Using a conditioned k-nearest neighbor (KNN) algorithm, six reference regions considered "the most similar" to the queried region were selected and displayed along with CAD-generated scores. Again, the observer had the option to change previous ratings. Experimental results were analyzed using the ROC method. The five observers marked a total of 271, 276, and 281 mass regions under the three reading modes, respectively. In mode two, observers marked 5 new suspected mass regions and did not change any previously rated detection or classification scores. In mode three, although observers queried 18 additional regions, 13 were discarded and 5 were marked with region-specific scores; the observers also changed the previous ratings of 28 mass regions marked during mode one. The areas under the ROC curves for individual readers ranged from 0.51 to 0.71 for mass detection (p = 0.67) and from 0.50 to 0.73 for mass classification (p = 0.43). This pilot study suggested that using ICAD could increase radiologists' confidence in their decision making. We also found that, because radiologists tend to accept a higher false-positive rate in a laboratory environment, once they made their detection decisions during the initial reading they were frequently reluctant to make changes during the subsequent modes. Hence, while simple and efficient operationally, the sequential reading mode may not be an optimal approach for evaluating the actual utility of ICAD.
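The conditioned KNN used above is specific to the authors' scheme, but the basic step of retrieving the six most similar reference regions from a feature library can be sketched generically with scikit-learn (the feature representation and the plain Euclidean metric are assumptions):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def similar_reference_regions(query_features, library_features, k=6):
        # Return indices of the k library regions whose feature vectors lie
        # closest to the queried region's features.
        nn = NearestNeighbors(n_neighbors=k).fit(library_features)
        _, indices = nn.kneighbors(np.atleast_2d(query_features))
        return indices[0]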
Observer evaluation of computer-aided detection: second reader versus concurrent reader scenario
Sophie Paquerault, Darin I. Wade, Nicholas Petrick, et al.
We compared the performance of computer-aided detection (CAD) used as a second reader with concurrent-use CAD. We designed a multi-reader multi-case (MRMC) observer study using fixed-size mammographic background images with fixed-intensity Gaussian signals added, in two experiments. A CAD system was developed to automatically detect these signals. The two experiments utilized signals of different contrast levels to assess the impact of CAD when the standalone CAD sensitivity was superior (low contrast) or equivalent (high contrast) to that of the average reader in the study. Seven readers participated in the study and were asked to review 100 images, identify signal locations, and rate each on a 100-point scale. A rating of 50 was used as a cutpoint to provide a binary classification of each candidate. Readers read the case set using CAD in both the second-reader and concurrent-reader scenarios. Results for the different signal intensities and reading paradigms were analyzed using the area under the free-response receiver operating characteristic (FROC) curve. Sensitivity and the average number of FPs/image were also determined. The results showed that CAD, used either as a second reader or as a concurrent reader, can increase reader sensitivity but with an increase in FPs. The study demonstrated that readers may benefit from concurrent CAD when CAD standalone performance outperforms average reader sensitivity. However, this trend was not observed when CAD performance was equivalent to the sensitivity of the average reader.
A new image-quality evaluation method for low-contrast resolution in computed tomography
Tsunemichi Akita, Naoyuki Yagi
The contrast-to-noise ratio (CNR) is often used as a physical evaluation parameter for low-contrast resolution in computed tomography (CT). However, CNR is not affected by the window conditions. This study proposes a new physical evaluation method for low-contrast resolution that takes into account changes in window conditions. The new parameter, called the gray-scale contrast-to-noise ratio (GSCNR), was assessed and compared with CNR. For each reconstructed image, the window width (WW) was varied from 100 to 400 in steps of 100 while keeping the window level (WL) fixed, and CNR and GSCNR were calculated. WL was then varied from 0 to 100 in steps of 20 while keeping WW fixed, and CNR and GSCNR were calculated again. CNR did not vary with WW, but it varied inversely with the standard deviation (SD) of the CT number (from 2.2 for an SD of 7 to 1.4 for an SD of 16). In contrast, GSCNR decreased with increasing WW for each SD. In addition, GSCNR did not vary with WL, but it varied inversely with SD. GSCNR was found to be a useful physical evaluation parameter and is also thought to be useful for optimizing the window conditions.
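The abstract does not give the GSCNR formula; one hypothetical stand-in that, unlike CNR, is sensitive to the display window is to compute a contrast-to-noise ratio on the windowed gray-scale values actually presented to the viewer (this is an assumption for illustration, not necessarily the authors' definition):

    import numpy as np

    def windowed_gray(ct_values, wl, ww, gray_max=255):
        # Map CT numbers to display gray levels for a given window level/width,
        # clipping to the displayable range.
        g = (ct_values - (wl - ww / 2.0)) / ww * gray_max
        return np.clip(g, 0, gray_max)

    def gray_scale_cnr(object_roi, background_roi, wl, ww):
        # CNR computed on displayed gray levels rather than raw CT numbers.
        o = windowed_gray(object_roi, wl, ww)
        b = windowed_gray(background_roi, wl, ww)
        return abs(o.mean() - b.mean()) / b.std()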
Performance evaluation of CT-automatic exposure control devices
Daniel Gutierrez, Sabine Schmidt, Alban Denys, et al.
Technological developments in computed tomography (CT) have led to an increase in its clinical utilization. To optimize patient dose and image quality, scanner manufacturers have introduced X-ray tube current modulation coupled to Automatic Exposure Control (AEC) devices. The purpose of this work was to assess the performance of the CT-AEC of three different MSCT manufacturers by means of two phantoms: a conical PMMA phantom to vary the thickness of the absorber in a monotonic way, and an anthropomorphic chest phantom to assess the response of the CT-AEC under more realistic conditions. Noise measurements were made by standard deviation assessments, and dose indicators (CTDIvol and DLP) were calculated. All scanners were able to compensate for thickness variation by an adaptation of tube current. Initial current adaptation lengths varied across systems in the range of 1 to 5 cm. With the anthropomorphic phantom, noticeable differences appeared in how rapidly the tube current adapted to a sudden change in X-ray attenuation, and non-intuitive behavior of the current evolution was noticed for some acquisitions. The xyz-modulation allowed the DLP of the acquisition to be reduced by 18% compared to the z-modulation. It was also shown that a homogeneous test object is not sufficient to characterize CT-AEC devices.
Multi-center MRI volume and linearity behavior
M. Barbu-McInnis, L. Molinelli, E. Durkin, et al.
Accurate longitudinal measurements are essential in understanding treatment effectiveness. Miscalibrated gradients, low acquisition bandwidth, or abnormally high B0 inhomogeneities may cause geometric distortions in MRI, which in turn may affect the imaging-based biomarkers used to understand disease progression. This work presents the behavior of several MRI sites over an average period of 12 months using the analysis of a volume and linearity phantom with known geometry. The phantom was scanned in the axial, coronal, and sagittal planes using a T2 FSE sequence. For each month's scan, the average phantom length was measured in the right/left, anterior/posterior, and superior/inferior directions. The distortion variation was measured in each gradient axis and orientation over time. Results show that some magnets exhibit a significant drift within the scanning period. Unless this type of distortion is taken into account, the assessment of treatment efficacy may be undermined by misleading and erroneous conclusions.
Clinical evaluation of new workflow-efficient image processing for digital radiography
An observer study was conducted on a randomly selected sample of 152 digital projection radiographs of varying body parts, obtained from four medical institutions, for the purpose of assessing a new workflow-efficient image-processing framework. Five rendering treatments were compared to measure the performance of the new processing algorithm against the control condition. A key feature of the new image processing is the capability of processing without specifying the exam. Randomized image pairs were presented at a softcopy workstation equipped with two diagnostic-quality flat-panel monitors. Five board-certified radiologists and one radiology resident independently reviewed each image pair, blinded to the specific processing used, and provided a diagnostic-quality rating using a subjective rank-order scale for each image. In addition, a relative preference rating was used to indicate rendering preference. Aggregate results indicate that the new fully automated processing is preferred (sign test for median = 0 (α = 0.05): p < 0.0001 preference in favor of the control).
Implementing a large-scale multicentric study for evaluation of lossy JPEG and JPEG2000 medical image compression: challenges and rewards
David Koff, Peter Bak, Andrew Volkening, et al.
At modest compression ratios, lossy compression schemes allow substantial image size reduction without a significant loss of visual information. This is a consequence of the coding engines' transforms (such as the Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT)) in combination with quantization and truncation operations, which all exploit the characteristics of the human visual system to achieve file-size reduction. The objective of our study was to determine levels of lossy compression that can be confidently used in diagnostic imaging. We conducted an extensive clinical evaluation using a standardized methodology incorporating two recognized evaluation techniques: Diagnostic Accuracy with Receiver Operating Characteristic (ROC) Analysis and Original-Revealed Forced Choice. Images covering 5 modalities and 7 anatomical regions were compressed at 3 different levels using the JPEG and JPEG 2000 compression algorithms. To enable radiologists across Canada to evaluate images for our study, we developed a dedicated software application synchronized to a centralized server, which allowed results to be reported in real time to the central database via the Internet. In order to obtain findings relevant to everyday clinical evaluation, images were not viewed under a strict laboratory environment; rather, they were read under typical viewing conditions that comply with current standards of practice. We present here the methodology and the specific technology developed for the purpose of this study, explain the specific problems encountered during the implementation, and give preliminary results. Our preliminary findings suggest that the most appropriate compression algorithm and compression ratios depend largely on the image specifics, including the modality and anatomical region studied.
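As a small illustration of the kind of lossy compression applied in the study, the snippet below compresses a grayscale image as JPEG at a chosen quality setting and reports the achieved compression ratio using Pillow (the study's actual DICOM pipeline and its JPEG 2000 encoder are not reproduced here):

    import io
    from PIL import Image

    def jpeg_compression_ratio(path, quality=75):
        # Compress in memory and compare against the uncompressed 8-bit size.
        img = Image.open(path).convert("L")
        uncompressed_bytes = img.width * img.height
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        return uncompressed_bytes / buf.getbuffer().nbytes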