Proceedings Volume 7263

Medical Imaging 2009: Image Perception, Observer Performance, and Technology Assessment

View the digital version of this volume at SPIE Digital Library.

Volume Details

Date Published: 27 February 2009
Contents: 10 Sessions, 65 Papers, 0 Presentations
Conference: SPIE Medical Imaging 2009
Volume Number: 7263

Table of Contents

  • Front Matter: Volume 7263
  • Novel Tools and Optimization
  • Image Display
  • Medical Imaging and Radiological Health: Contributions of Dr. Robert F. Wagner
  • Eye Tracking and Human Visual System
  • Model Observers
  • ROC
  • Observer Performance
  • Observer Performance in Mammography
  • Poster Session
Front Matter: Volume 7263
Front Matter: Volume 7263
This PDF file contains the front matter associated with SPIE Proceedings Volume 7263, including the Title Page, Copyright information, Table of Contents, Introduction (if any), and the Conference Committee listing.
Novel Tools and Optimization
Optimization of exposure index values for the antero-posterior pelvis and antero-posterior knee examination
M. L. Butler, L. Rainford, J. Last, et al.
Introduction: The American Association of Physicists in Medicine is currently standardizing the exposure index (EI) value. Recent studies have questioned whether the EI value offered by manufacturers is optimal. This work establishes optimum EIs for the antero-posterior (AP) projections of a pelvis and knee on a Carestream Health (Kodak) CR system and compares these with manufacturers' recommended EI values from a patient dose and image quality perspective. Methodology: Human cadavers were used to produce images of clinically relevant standards. Several exposures were taken to achieve various EI values, and corresponding entrance surface doses (ESD) were measured using thermoluminescent dosimeters. Image quality was assessed by 5 experienced clinicians using anatomical criteria judged against a reference image. Visualization of image-specific common abnormalities was also analyzed to establish diagnostic efficacy. Results: A rise in ESD consistent with increasing EI was shown for both examinations. Anatomic image quality was deemed acceptable at an EI of 1560 for the AP pelvis and 1590 for the AP knee. Compared with manufacturers' recommended values, a significant reduction in ESD (p=0.02) of 38% and 33% for the pelvis and knee respectively was noted. Initial pathological analysis suggests that diagnostic efficacy at lower EI values may be projection-specific. Conclusion: The data in this study emphasize the need for clinical centres to consider establishing their own EI guidelines rather than necessarily relying on manufacturers' recommendations. Normal and abnormal images must be used in this process.
Histopathology reconstruction on digital imagery
Wenjing Li, Rich W. Lieberman, Sixiang Nie, et al.
Diagnosing cervical cancer in a woman is a multi-step procedure involving examination of the cervix, possible biopsy and follow-up. It is open to subjective interpretation and highly dependent upon the skills of cytologists, colposcopists, and pathologists. In an effort to reduce the subjectiveness of the colposcopist-directed biopsy and to improve the diagnostic accuracy of colposcopy, we have developed new colposcopic imaging systems with accompanying computer aided diagnostic (CAD) techniques to guide a colposcopist in deciding if and where to biopsy. If the biopsy's histopathology, the identification of the disease state at the cellular and near-cellular level, is to be used as the gold standard for CAD, then the location of the histopathologic analysis must match exactly to the location of the biopsy tissue in the digital image. Otherwise, no matter how perfect the histopathology and the quality of the digital imagery, the two data sets cannot be matched and the true sensitivity and specificity of the CAD cannot be ascertained. We report here on new approaches to preserving, continuously, the location and orientation of a biopsy sample with respect to its location in the digital image of the cervix so as to preserve the exact spatial relationship throughout the mechanical aspects of the histopathologic analysis. This new approach will allow CAD to produce a linear diagnosis and pinpoint the location of the tissue under examination.
A novel teaching tool using dynamic cues improves visualisation of chest lesions by naive observers
Introduction: Dynamic cueing is an effective way of stimulating perception of regions of interest within radiological images. This study explores the impact of a novel teaching tool using dynamic cueing for lesion detection on plain chest radiographs. Materials and methods: Observer performance studies were carried out where 36 novices examined 30 chest images in random order. Half of these contained between one and three simulated pulmonary nodules. Three groups were investigated: A (control: no teaching tool), B (retested immediately after undergoing the teaching tool) and C (retested a week after undergoing the teaching tool). The teaching tool involved dynamically displaying the same images with and without lesions. Results were compared using Receiver Operating Characteristics (ROC), sensitivity and specificity analyses. Results: The second reading showed significantly greater area under the ROC curve (Az value) (p<0.0001) and higher sensitivity value (p=0.004) compared to the first reading for Group B. No differences between readings were demonstrated for groups A or C. When the magnitudes of the above changes were compared between Group B and the other two groups, greater changes in Az value for Group B were noted (B vs. A: p=0.0003, B vs. C: p=0.0005). For sensitivity, when Group B was compared to Group A, the magnitude of the change was significantly greater (p=0.0029), whereas when Group B was compared to Group C, the magnitude of the change approached significance (p=0.0768). Conclusions: The novel teaching tool improves identification of pulmonary nodular lesions on chest radiographs in the short term.
Validation of an improved abnormality insertion method for medical image perception investigations
Mark T. Madsen, Gregory R. Durst, Robert T. Caldwell, et al.
The ability to insert abnormalities in clinical tomographic images makes image perception studies with medical images practical. We describe a new insertion technique and its experimental validation that uses complementary image masks to select an abnormality from a library and place it at a desired location. The method was validated using a 4-alternative forced-choice experiment. For each case, four quadrants were simultaneously displayed consisting of 5 consecutive frames of a chest CT with a pulmonary nodule. One quadrant was unaltered, while the other 3 had the nodule from the unaltered quadrant artificially inserted. 26 different sets were generated and repeated with order scrambling for a total of 52 cases. The cases were viewed by radiology staff and residents who ranked each quadrant by realistic appearance. On average, the observers were able to correctly identify the unaltered quadrant in 42% of cases, and identify the unaltered quadrant both times it appeared in 25% of cases. Consensus, defined by a majority of readers, correctly identified the unaltered quadrant in only 29% of 52 cases. For repeats, the consensus observer successfully identified the unaltered quadrant only once. We conclude that the insertion method can be used to reliably place abnormalities in perception experiments.
Motion and display effects on perception of multiple coronary stents
Sheng Zhang, Craig K. Abbey, Arian Teymoorian, et al.
The placement of multiple coronary stents requires fine judgments of distance between a deployed stent and stent/guidewire assembly. The goal of this deployment is to achieve continuous and gapless coverage between them. However, making these judgments is difficult because of limited system resolution, noise, relatively low contrast of the deployed stent, and stent motion during the cardiac cycle. In this work, we extend our previous work by investigating a wider range of conditions associated with this task. The present studies consider the number of frames and the frame rate separately, and include stabilization of the stents as a way to quantify the performance effects of stent motion. We find that (1) stabilization reduces the uncertainty when detecting the gap size; (2) observer performance increases with the number of frames; (3) the effect of display frame rate is highly dependent on the motion of the target.
Image Display
The impact of faceplate surface characteristics on detection of pulmonary nodules
Introduction: To prevent specular reflections, many monitor faceplates have features such as tiny dimples on their surface to diffuse ambient light incident on the monitor; however, this "anti-glare" surface may also diffuse the image itself. The purpose of the study was to determine whether the surface characteristics of monitor faceplates influence the detection of pulmonary nodules under low and high ambient lighting conditions. Methods and Materials: Separate observer performance studies were conducted at each of two light levels (<1 lux and >250 lux). Twelve examining radiologists with the American Board of Radiology participated in the darker condition and eleven in the brighter condition. All observers read on both smooth "glare" and dimpled "anti-glare" faceplates in a single lighting condition. A counterbalanced methodology was utilized to minimise memory effects. In each reading, observers were presented with thirty chest images in random order, of which half contained a single simulated pulmonary nodule. They were asked to give their confidence that each image did or did not contain a nodule and to mark the suspicious location. ROC analysis was applied to the resultant data. Results: No statistically significant differences were seen in the trapezoidal area under the ROC curve (AUC), sensitivity, specificity or average time per case at either light level for chest specialists or radiologists from other specialities. Conclusion: The characteristics of the faceplate surfaces do not appear to affect detection of pulmonary nodules. Further work into other image types is being conducted.
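The trapezoidal AUC named in this abstract can be illustrated with a minimal sketch. The function names and the example ratings below are hypothetical and are not taken from the paper; the sketch only assumes that each image receives an ordinal confidence rating, with higher values meaning "more likely to contain a nodule".

```python
def roc_points(pos_ratings, neg_ratings):
    """Empirical ROC operating points, one per distinct rating threshold."""
    thresholds = sorted(set(pos_ratings) | set(neg_ratings), reverse=True)
    points = [(0.0, 0.0)]  # strictest possible criterion: nothing called positive
    for t in thresholds:
        tpf = sum(r >= t for r in pos_ratings) / len(pos_ratings)
        fpf = sum(r >= t for r in neg_ratings) / len(neg_ratings)
        points.append((fpf, tpf))
    return points  # the last point is always (1.0, 1.0)


def trapezoidal_auc(pos_ratings, neg_ratings):
    """Area under the empirical ROC curve using the trapezoidal rule."""
    pts = roc_points(pos_ratings, neg_ratings)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Hypothetical 5-point confidence ratings for nodule and nodule-free images.
with_nodule = [5, 4, 4, 3, 2]
without_nodule = [1, 2, 3, 3, 4]
print(round(trapezoidal_auc(with_nodule, without_nodule), 4))
```

The trapezoidal estimate makes no assumption about the shape of the underlying rating distributions, which is one reason it is commonly reported alongside fitted (for example binormal) estimates.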
Clinical validation of a medical grade color monitor for chest radiology
J. Jacobs, F. Zanca, J. Verschakelen, et al.
Until recently, the specifications of medical grade monochrome LCD monitors outperformed those of color LCD monitors. New generations of color LCD monitors, however, show specifications that are in many respects similar to those of monochrome monitors typically used in diagnostic workstations. The aim of the present study was to evaluate the impact of different medical grade monitors in terms of detection of simulated lung nodules in chest X-ray images. Specifically, we wanted to compare a new medical grade color monitor (Barco Coronis 6MP color) to a medical grade grayscale monitor (Barco Coronis 3MP monochrome) and a consumer color monitor (Philips 200VW 1.7MP color) by means of an observer performance experiment. Using the free-response acquisition data paradigm, seven radiologists were asked to detect and locate lung nodules (170 in total), simulated in half of the 200 chest X-ray images used in the experiment. The jackknife free-response receiver operating characteristic (JAFROC) analysis of the data showed a statistically significant difference between at least two monitors, F-value=3.77 and p-value=0.0481. The Figure of Merit values were 0.727, 0.723 and 0.697 for the new color LCD monitor, the medical grade monitor and the consumer color monitor respectively. There was no difference between the required reading times, but there was a difference between the mean calculated Euclidean distances between the position marked by the observers and the center of the simulated nodule, indicating better accuracy with both medical grade monitors. The present data suggest that the new generation of medical grade color monitors could be used as diagnostic workstations.
Spatial noise suppression for LCD displays
We deal with monochrome high-resolution LCD monitors used for displaying medical images. We discuss reducing near-pixel-sized components of fixed-pattern noise. This noise is composed of irregularities in the LCD pixel structure and fine background structures between pixels. We display a series of test images on the monitor with controlled LCD digital driving levels (DDLs). A calibrated CCD camera is used to magnify and capture a portion of the monitor for each image. The captured CCD digital values, or DSLs for digital sensor levels, are converted to luminance. This conversion is necessary because we employ a subsequent local-area processing step that relies on linearity of image-spread being in energy flux-density. Because we are working with two digital systems, the CCD camera and the LCD display, there is no continuous map between the CCD DSLs and LCD DDLs. We map the discrete LCD DDLs to a quantized luminance space. Once we have determined the target luminance values, we use an error-diffusion algorithm to select luminance values from the discrete addressable set for each pixel and then map those values to LCD DDLs. The result is a set of adjusted LCD DDLs that reduces the fixed-pattern noise in a locally averaged sense.
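The error-diffusion step can be sketched in one dimension. The abstract does not say which diffusion kernel the authors used, so the sketch below is a deliberate simplification that pushes each pixel's quantization error onto the next pixel in the same row; the function name and the luminance values are hypothetical.

```python
def diffuse_row(target_luminance, addressable_levels):
    """Pick one addressable luminance per pixel, carrying the error forward.

    target_luminance: desired luminance per pixel along one row.
    addressable_levels: discrete luminances the panel can actually produce.
    """
    levels = sorted(addressable_levels)
    chosen_row = []
    carried_error = 0.0
    for target in target_luminance:
        wanted = target + carried_error          # target plus unpaid error so far
        chosen = min(levels, key=lambda lv: abs(lv - wanted))
        chosen_row.append(chosen)
        carried_error = wanted - chosen          # pass the residual to the next pixel
    return chosen_row


# A uniform target halfway between two levels becomes an alternating pattern
# whose local average equals the target.
print(diffuse_row([0.5, 0.5, 0.5, 0.5], [0.0, 1.0]))  # [0.0, 1.0, 0.0, 1.0]
```

Because the residual is always carried forward, the sum of the chosen values differs from the sum of the targets only by the final carried error, which is what "in a locally averaged sense" amounts to in this simplified form. A two-dimensional kernel would spread the error over neighbouring rows as well.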
Performance evaluation of a full line of medical diagnostic displays and test of a web-based service for remote calibration and quality assurance
S. Busoni, G. Belli, A. Taddeucci, et al.
The main goal of this study is to compare the performance of a full line of LCD diagnostic display systems in terms of white point luminance, accuracy and stability over time, GSDF conformance and luminance uniformity, and to test a web-based service for remote calibration and QA in a large hospital. The display systems under test included 3MP and 5MP grayscale, 2MP and 6MP color and 5MP mammography LCD monitors, all manufactured by BARCO, for a total of 119 units. Measured performance was within the acceptance range proposed by the major international protocols in all cases and showed very good stability over time, except for a few cases. The web-based service for remote QA and calibration proved well suited for the management of a large scale medical facility, where high performance displays are in use and time saving QA programs and a central QA policy are both needed.
Impact of luminance distribution in the visual field on foveal contrast sensitivity in the context of mammographic softcopy reading
Dörte Apelt, Richard Rascher-Friesenhausen, Jan Klein, et al.
For quality control in mammographic softcopy reading (SCR), a number of recommendations exist. Among them is a room illuminance of 10 lx. Moreover, the use of masks on the image seems to be advantageous, due to a reduction of scattered light in the focus of view. Room illuminance affects the global luminance adaptation and the maximal monitor contrast; masking decreases the luminance in the central and near-peripheral region. We investigated the effects of masking and illuminance on foveal contrast sensitivity. A study with eight observers was conducted in the context of mammographic softcopy reading. Using Gabor patterns with varying spatial frequencies, orientations and contrast levels as stimuli and an orientation discrimination task, the intraobserver contrast sensitivity was determined for foveal vision. Tested illuminances for a non-masked image were 10, 30, 50 and 90 lx, and for a masked image 10 lx. Major findings are: (1) Masking does not lead to improved contrast sensitivity. Instead, all observers reported a strong fatigue effect during the presentation of the masked image. (2) Among the illuminances tested, only half of the observers showed the best contrast sensitivity at 10 lx. For the other observers, the best results were achieved at illuminance levels of 50 or 90 lx. The results can be used to appraise the effects of viewing conditions with the aim of drawing conclusions for mammographic SCR, and to initiate further studies.
High-quality remote interactive imaging in the operating theatre
Ian J. Grimstead, Nick J. Avis, Peter Ll. Evans, et al.
We present a high-quality display system that enables the remote access within an operating theatre of high-end medical imaging and surgical planning software. Currently, surgeons often use printouts from such software for reference during surgery; our system enables surgeons to access and review patient data in a sterile environment, viewing real-time renderings of MRI & CT data as required. Once calibrated, our system displays shades of grey in Operating Room lighting conditions (removing any gamma correction artefacts). Our system does not require any expensive display hardware, is unobtrusive to the remote workstation and works with any application without requiring additional software licenses. To extend the native 256 levels of grey supported by a standard LCD monitor, we have used the concept of "PseudoGrey" where slightly off-white shades of grey are used to extend the intensity range from 256 to 1,785 shades of grey. Remote access is facilitated by a customized version of UltraVNC, which corrects remote shades of grey for display in the Operating Room. The system is successfully deployed at Morriston Hospital, Swansea, UK, and is in daily use during Maxillofacial surgery. More formal user trials and quantitative assessments are being planned for the future.
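The figure of 1,785 shades equals 255 × 7, which is what you get if each of the 255 steps between adjacent 8-bit greys is divided into seven slightly tinted sub-steps. The sketch below illustrates that counting only and is not the authors' implementation. The Rec. 601 luma weights used to order the sub-steps are an assumption, and all names are hypothetical.

```python
from itertools import product

# Assumed luma weights (Rec. 601); the abstract does not state which were used.
LUMA_WEIGHTS = (0.299, 0.587, 0.114)


def luma(rgb):
    """Approximate perceived brightness of an (r, g, b) triple."""
    return sum(w * c for w, c in zip(LUMA_WEIGHTS, rgb))


def pseudogrey_table():
    """List near-grey RGB triples in order of increasing luma.

    Each base grey (g, g, g) for g in 0..254 is followed by the six triples
    that raise one or two channels by a single count; raising all three
    channels would simply be the next base grey, so it is excluded.
    """
    offsets = [o for o in product((0, 1), repeat=3) if o != (1, 1, 1)]
    offsets.sort(key=luma)  # order sub-steps by their brightness contribution
    return [(g + dr, g + dg, g + db)
            for g in range(255)
            for dr, dg, db in offsets]


table = pseudogrey_table()
print(len(table))   # 1785
print(table[:3])    # [(0, 0, 0), (0, 0, 1), (1, 0, 0)]
```

Whether pure white (255, 255, 255) is counted as an additional entry is a matter of convention; this sketch leaves it out so that the total matches the figure quoted in the abstract.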
Medical Imaging and Radiological Health: Contributions of Dr. Robert F. Wagner
Bob's first decade: in at the beginning
Arriving at the Bureau of Radiological Health in 1972, Bob Wagner was thrust into the Bureau's quandary over how to quantify the imaging benefit associated with the radiation dose cost of medical imaging procedures. In short order he had set up the framework for FDA imaging research for the next 36 years. Bob played a key role in these early years in assisting in the founding of the SPIE Medical Imaging series of meetings, in measuring and organizing round robin comparisons of imaging measurements of the fundamental physical quantities required for performance evaluation, and in developing the framework for how these measurements could be combined to provide meaningful assessment figures of merit. He worked assiduously to counter both those who claimed that radiology was an art not a science and those who made extravagant claims for the dose reduction/image quality benefits of their particular variety of image capture/image processing system. In the process he became one of the founding fathers and key participants in the medical image performance assessment community as represented today at SPIE Medical Imaging 2009.
Statistical ultrasonics: the influence of Robert F. Wagner
An important ongoing question for higher education is how to successfully mentor the next generation of scientists and engineers. It has been my privilege to have been mentored by one of the best, Dr Robert F. Wagner and his colleagues at the CDRH/FDA during the mid 1980s. Bob introduced many of us in medical ultrasonics to statistical imaging techniques. These ideas continue to broadly influence studies on adaptive aperture management (beamforming, speckle suppression, compounding), tissue characterization (texture features, Rayleigh/Rician statistics, scatterer size and number density estimators), and fundamental questions about how limitations of the human eye-brain system for extracting information from textured images can motivate image processing. He adapted the classical techniques of signal detection theory to coherent imaging systems that, for the first time in ultrasonics, related common engineering metrics for image quality to task-based clinical performance. This talk summarizes my wonderfully-exciting three years with Bob as I watched him explore topics in statistical image analysis that formed a rational basis for many of the signal processing techniques used in commercial systems today. It is a story of an exciting time in medical ultrasonics, and of how a sparkling personality guided and motivated the development of junior scientists who flocked around him in admiration and amazement.
NEQ: its progenitors and progeny
The historical evolution of the concept of noise-equivalent quanta (NEQ) and its application to task-based assessment of image quality is surveyed, with particular emphasis on the seminal contributions of Robert F. Wagner.
Performance-based assessment of reconstructed images
During the early 90s, I engaged in a productive and enjoyable collaboration with Robert Wagner and his colleague, Kyle Myers. We explored the ramifications of the principle that the quality of an image should be assessed on the basis of how well it facilitates the performance of appropriate visual tasks. We applied this principle to algorithms used to reconstruct scenes from incomplete and/or noisy projection data. For binary visual tasks, we used both the conventional disk detection and a new challenging task, inspired by the Rayleigh resolution criterion, of deciding whether an object was a blurred version of two dots or a bar. The results of human and machine observer tests were summarized with the detectability index based on the area under the ROC curve. We investigated a variety of reconstruction algorithms, including ART, with and without a nonnegativity constraint, and the MEMSYS3 algorithm. We concluded that the performance of the Rayleigh task was optimized when the strength of the prior was near MEMSYS's default "classic" value for both human and machine observers. A notable result was that the most-often-used metric of rms error in the reconstruction was not necessarily indicative of the value of a reconstructed image for the purpose of performing visual tasks.
Finding order in complexity: themes from the career of Dr. Robert F. Wagner
Over the course of his long and productive career, Dr. Robert F. Wagner built a framework for the evaluation of imaging systems based on a task-based, decision theoretic approach. His most recent contributions involved the consideration of the random effects associated with multiple readers of medical images and the logical extension of this work to the problem of the evaluation of multiple competing classifiers in statistical pattern recognition. This contemporary work expanded on familiar themes from Bob's many SPIE presentations in earlier years. It was driven by the need for practical solutions to current problems facing FDA's Center for Devices and Radiological Health and the medical imaging community regarding the assessment of new computer-aided diagnosis tools and Bob's unique ability to unify concepts across a range of disciplines as he gave order to increasingly complex problems in our field.
Eye Tracking and Human Visual System
Spatial frequency characteristics at image decision-point locations for observers with different radiological backgrounds in lung nodule detection
Aim: The goal of the study is to determine the spatial frequency characteristics at the image locations of observers' overt and covert decisions, and to find out whether there are similarities within observer groups defined either by the same level of radiological experience or by the same accuracy score. Background: The radiological task is described as a visual search and decision-making procedure involving visual perception and cognitive processing. Humans perceive the world through a number of spatial frequency channels, each sensitive to visual information carried by different spatial frequency ranges and orientations. Recent studies have shown that particular physical properties of local and global image-based elements are correlated with the performance and the level of experience of human observers in breast cancer and lung nodule detection. Neurological findings in visual perception were an inspiration for wavelet applications in vision research because the methodology tries to mimic the brain's processing algorithms. Methods: A wavelet analysis of a set of postero-anterior chest radiographs has been used to characterize the perceptual preferences of observers with different levels of experience in the radiological task. Psychophysical methodology has been applied to track eye movements over the image, and particular ROIs related to the observers' fixation clusters have been analysed in the space framed by Daubechies functions. Results: Significant differences have been found between the spatial frequency characteristics at the locations of different decisions.
Modeling decision-making in single- and multi-modal medical images
This research introduces a mode-specific model of visual saliency that can be used to highlight likely lesion locations and potential errors (false positives and false negatives) in single-mode PET and MRI images and multi-modal fused PET/MRI images. Fused-modality digital images are a relatively recent technological improvement in medical imaging; therefore, a novel component of this research is to characterize the perceptual response to these fused images. Three different fusion techniques were compared to single-mode displays in terms of observer error rates using synthetic human brain images generated from an anthropomorphic phantom. An eye-tracking experiment was performed with naïve (non-radiologist) observers who viewed the single- and multi-modal images. The eye-tracking data allowed the errors to be classified into four categories: false positives, search errors (false negatives never fixated), recognition errors (false negatives fixated less than 350 milliseconds), and decision errors (false negatives fixated greater than 350 milliseconds). A saliency model consisting of a set of differentially weighted low-level feature maps is derived from the known error and ground truth locations extracted from a subset of the test images for each modality. The saliency model shows that lesion and error locations attract visual attention according to low-level image features such as color, luminance, and texture.
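The four error categories can be written down as a small decision rule. The sketch below is illustrative only: the function and parameter names are hypothetical, and the abstract does not say how a dwell time of exactly 350 ms is handled, so assigning that boundary case to "decision error" is an assumption.

```python
RECOGNITION_LIMIT_MS = 350  # dwell-time boundary quoted in the abstract


def classify_error(lesion_present, observer_marked, dwell_ms):
    """Return the error category for one image location, or None if correct.

    lesion_present: ground truth for the location.
    observer_marked: whether the observer reported a lesion there.
    dwell_ms: cumulative fixation time on the location (0 if never fixated).
    """
    if observer_marked and not lesion_present:
        return "false positive"
    if lesion_present and not observer_marked:
        # False negatives are split by how long the lesion was looked at.
        if dwell_ms == 0:
            return "search error"
        if dwell_ms < RECOGNITION_LIMIT_MS:
            return "recognition error"
        return "decision error"  # assumption: exactly 350 ms lands here
    return None  # true positive or true negative


print(classify_error(True, False, 120))   # recognition error
print(classify_error(True, False, 900))   # decision error
```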
Radiology image perception and observer performance: How does expertise and clinical information alter interpretation? Stroke detection explored through eye-tracking
Historically, radiology research has been dominated by chest and breast screening. Few studies have examined complex interpretative tasks such as the reading of multidimensional brain CT or MRI scans. Additionally, no studies at the time of writing have used eye movement analysis to explore the interpretation of stroke images by readers ranging from novices through to experienced practitioners. Finally, there appears to be a lack of evidence on the clinical effects of radiology reports and their influence on image appraisal and clinical diagnosis. A computer-based, eye-tracking study was designed to assess diagnostic accuracy and interpretation in stroke CT and MR imagery. Eight predetermined clinical cases, five images per case, were presented to participants (novices, trainees, and radiologists; n=8). The presence or absence of abnormalities was rated on a five-point Likert scale and their locations reported. Half of the cases were accompanied by clinical information and half were not, to assess the impact of information on observer performance. Results highlight differences in visual search patterns amongst novice, trainee and expert observers; the most marked differences occurred between novice readers and experts. Experts spent more time in challenging areas of interest (AOI) than novices and trainees, and were more confident unless a lesion was large and obvious. The time to first AOI fixation differed by size, shape and clarity of lesion. 'Time to lesion' dropped significantly when recognition appeared to occur between slices. The influence of clinical information was minimal.
The holistic grail: possible implications of an initial mistake in the reading of digital mammograms
In 1967 Ulric Neisser, studying how laypeople examined pictures, hypothesized that image perception occurs in two stages, a pre-attentive stage in which the entire image is processed in parallel, where a 'holistic' view of what is being displayed is formed, and a secondary stage in which items or groups of items are examined by focal attention. Later, the proponents of Neisser's theory suggested that the pre-attentive stage may bias the selection of the areas that will be subjected to further analysis. This is easily seen in those dual interpretation figures; once one 'sees' the figure in a certain way, it is very hard to instruct the eye-brain system to let go of that perception and 'see' the figure in the alternative way. In medical image perception, Harold Kundel and Calvin Nodine proposed a model of medical image interpretation that is based upon Neisser's two stages, and have become so convinced of the influence of the 'holistic' view on the subsequent reading of the image that they have recently questioned the traditional framework that determines how lesions are found. In other words, as opposed to the traditional view of SEARCH THE IMAGE - DETECT A POSSIBLE FINDING - IDENTIFY THE FINDING - DECIDE WHAT TO DO ABOUT THE IMAGE, Kundel and Nodine have recently suggested a new framework: DETECT A POSSIBLE FINDING - IDENTIFY THE FINDING - SEARCH THE IMAGE - DECIDE WHAT TO DO ABOUT THE IMAGE. In light of this significant switch, we decided to investigate what happens when the 'holistic' view is incorrect.
Activity in the fusiform face area supports expert perception in radiologists and does not depend upon holistic processing of images
Stephen A. Engel, Erin M. Harley, Whitney B. Pope, et al.
Training in radiology dramatically changes observers' ability to process images, but the neural bases of this visual expertise remain unexplored. Prior imaging work has suggested that the fusiform face area (FFA), normally selectively responsive to faces, becomes responsive to images in observers' area of expertise. The FFA has been hypothesized to be important for "holistic" processing that integrates information across the entire image. Here, we report a cross-sectional study of radiologists that used functional magnetic resonance imaging to measure neural activity in first-year radiology residents, fourth-year radiology residents, and practicing radiologists as they detected abnormalities in chest radiographs. Across subjects, activity in the FFA correlated with visual expertise, measured as behavioral performance during scanning. To test whether processing in the FFA was holistic, we measured its responses both to intact radiographs and radiographs that had been divided into 25 square pieces whose locations were scrambled. Activity in the FFA was equal in magnitude for intact and scrambled images, and responses to both kinds of stimuli correlated reliably with expertise. These results suggest that the FFA is one of the cortical regions that provides the basis of expertise in radiology, but that its contribution is not holistic processing of images.
Visually lossless compression of breast biopsy virtual slides for telepathology
Jeffrey P. Johnson, Elizabeth A. Krupinski, John S. Nafziger, et al.
A major issue in telepathology is the extreme size of digitized slides, which require several gigabytes of storage and cause significant delays in image delivery to pathologists. We investigated the utility of a visual discrimination model (VDM) to predict bit rates for visually lossless JPEG2000 compression of breast biopsy virtual slides. Visually lossless bit rates were determined experimentally with human observers. VDM metrics computed for those bit rates were nearly constant, suggesting that VDMs could be used to achieve visually lossless image quality while providing about four times the data reduction of reversible compression.
Model Observers
Mass detection in breast tomosynthesis and digital mammography: a model observer study
C. Castella, M. Ruschin, M. P. Eckstein, et al.
In this study, we adapt and apply model observers within the framework of realistic detection tasks in breast tomosynthesis (BT). We use images consisting of realistic masses digitally embedded in real patient anatomical backgrounds, and we adapt specific model observers that have been previously applied to digital mammography (DM). We design alternative forced-choice (AFC) studies for DM and BT tasks in the signal known exactly but variable (SKEV) framework. We compare the performance of various linear model observers (a non-prewhitening matched filter with an eye filter and several channelized Hotelling observers (CHO)) against that of human observers. Good agreement in performance between human and model observers can be obtained when an appropriate internal noise level is adopted. Models achieve the same detection performance across BT and DM with about three times less projected signal intensity in BT than in DM (humans: 3.8), due to the anatomical noise reduction in BT. We suggest that, in the future, model observers can potentially be used as an objective tool for automating the optimization of BT acquisition parameters or reconstruction algorithms, or for narrowing a wide span of possible parameter combinations, without requiring human observer studies.
Optimization of medical imaging display systems: using the channelized Hotelling observer for detecting lung nodules: experimental study
Ljiljana Platisa, Ewout Vansteenkiste, Bart Goossens, et al.
Medical-imaging systems are designed to aid medical specialists in a specific task. Therefore, the physical parameters of a system need to optimize the task performance of a human observer. This requires measurements of human performance in a given task during the system optimization. Typically, psychophysical studies are conducted for this purpose. Numerical observer models have been successfully used to predict human performance in several detection tasks. In particular, the task of signal detection using a channelized Hotelling observer (CHO) in simulated images has been widely explored. However, few studies have addressed clinically acquired images that also contain anatomic noise. In this paper, we investigate the performance of a CHO in the task of detecting lung nodules in real radiographic images of the chest. To evaluate variability introduced by the limited available data, we employ a commonly used multi-reader multi-case (MRMC) study design, which accounts for both case and reader variability. Finally, we use the "one-shot" methods to estimate the MRMC variance of the area under the ROC curve (AUC). The obtained AUC compares well to those reported for a human observer study on a similar data set. Furthermore, the "one-shot" analysis implies a fairly consistent performance of the CHO, with the variance of AUC below 0.002. This indicates promising potential for numerical observers in optimization of medical imaging displays and encourages further investigation on the subject.
Using partial least squares to compute efficient channels for the Bayesian ideal observer
We define image quality by how accurately an observer, human or otherwise, can perform a given task, such as determining to which class an image belongs. For detection tasks, the Bayesian ideal observer is the best observer, in that it sets an upper bound for observer performance, summarized by the area under the receiver operating characteristic curve. However, the use of this observer is frequently infeasible because of unknown image statistics, whose estimation is computationally costly. As a result, a channelized ideal observer (CIO) was investigated to reduce the dimensionality of the data, yet approximate the performance of the ideal observer. Previously investigated channels include Laguerre Gauss (LG) channels and channels via the singular value decomposition of the given linear system (SVD). Though both types are highly efficient for the ideal observer, they nevertheless have the weakness that they may not be as efficient for general detection tasks involving complex/realistic images; the former is particular to the signal and background shape, and the latter is particular to the system operator. In this work, we attempt to develop channels that can be applied to a system with any signal and background type and without knowledge of any characteristics of the system. The method used is a partial least squares algorithm (PLS), in which channels are chosen to maximize the squared covariance between images and their classes. Preliminary results show that the CIO with PLS channels outperforms one with either the LG or SVD channels and very closely approximates ideal-observer performance.
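To make the selection criterion above concrete, here is a minimal sketch of how a first PLS channel could be computed for a two-class task. With a single class label, the unit-norm direction that maximizes the squared covariance between projected images and labels is proportional to the covariance between each pixel and the label. The function name and toy data are hypothetical; later channels would need a deflation step that is not shown, and this is not claimed to be the authors' implementation.

```python
import math

def first_pls_channel(images, labels):
    """Unit-norm channel maximizing squared covariance between projections and labels.

    images: list of flattened images (equal-length lists of floats)
    labels: list of class labels (0 = signal absent, 1 = signal present)
    """
    n, d = len(images), len(images[0])
    mean_img = [sum(img[k] for img in images) / n for k in range(d)]
    mean_lab = sum(labels) / n
    # Per-pixel (unnormalized) covariance with the class label.
    w = [
        sum((img[k] - mean_img[k]) * (lab - mean_lab) for img, lab in zip(images, labels))
        for k in range(d)
    ]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0.0:
        raise ValueError("no pixel covaries with the class label")
    return [v / norm for v in w]

# Hypothetical two-pixel images: pixel 0 carries the signal, pixel 1 is background variation.
images = [[1.0, 5.0], [1.0, -5.0], [0.0, 5.0], [0.0, -5.0]]
labels = [1, 1, 0, 0]
channel = first_pls_channel(images, labels)
```

In this toy case the channel puts all of its weight on the signal-bearing pixel and none on the background pixel, which is uncorrelated with the class.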
Tests of scanning model observers for myocardial SPECT imaging
H. C. Gifford, P. H. Pretorius, J. G. Brankov
Many researchers have tested and applied human-model observers as part of their evaluations of reconstruction methods for SPECT perfusion imaging. However, these model observers have generally been limited to signal-known-exactly (SKE) detection tasks. Our objective is to formulate and test scanning model observers that emulate humans in detection-localization tasks involving perfusion defects. Herein, we compare several models based on the channelized nonprewhitening (CNPW) observer. Simulated Tc-99m images of the heart with and without defects were created using a mathematical anthropomorphic phantom. Reconstructions were performed with an iterative algorithm and postsmoothed with a 3D Gaussian filter. Human and model-observer studies were conducted to assess the optimal number of iterations and the smoothing level of the filter. The human-observer study was a multiple-alternative forced-choice (MAFC) study with five defects. The CNPW observer performed the MAFC study, but also performed an SKE-but-variable (SKEV) study and a localization ROC (LROC) study. A separate LROC study applied an observer based on models of human search in mammograms. The amount of prior knowledge about the possible defects differed for these four model-observer studies. The trend was towards improved agreement with the human observers as prior knowledge decreased.
On ideal AFROC and FROC observers
Detection of multiple lesions (signals) in images is a medically important task, and Free-response Receiver Operating Characteristic (FROC) analysis and its variants, such as Alternative FROC (AFROC) analysis, are commonly used to quantify performance in such tasks. However, ideal observers that optimize FROC or AFROC performance metrics have not yet been formulated in the general case. If available, such ideal observers may turn out to be valuable for imaging system optimization and in the design of computer aided diagnosis (CAD) techniques for lesion detection in medical images. In this paper we derive ideal AFROC and FROC observers. They are ideal in that they maximize, amongst all decision strategies, the area under the associated AFROC or FROC curve. In addition, these ideal observers minimize Bayes risk for particular choices of cost constraints. Calculation of observer performance for these ideal observers is computationally quite complex. We can reduce this complexity by considering forms of these observers that use false positive reports derived from signal-absent images only. We present a performance comparison of our ideal AFROC observer versus that of a more conventional scan-statistic observer.
ROC
JAFROC analysis revisited: figure-of-merit considerations for human observer studies
Jackknife alternative free-response receiver operating characteristic (JAFROC) is a method for measuring human observer performance in localization tasks. JAFROC is being increasingly used to evaluate imaging modalities because it has been shown to have greater statistical power than conventional receiver operating characteristic (ROC) analysis, which neglects location information. JAFROC neglects the non-lesion localization marks ("false positives") on abnormal images. JAFROC1 is an alternative method that includes these marks. Both methods are lesion-centric in the sense that they assign equal importance to all lesions; an image with many lesions would tend to dominate the performance metric, and clinically less significant lesions are treated identically as more significant ones. In this paper weighted JAFROC and JAFROC1 analyses are described that treat each abnormal image (not each lesion) as a unit of measurement and account for different lesion clinical significances (weights). Lesion-centric and weighted methods were tested using a simulator that includes multiple-reader multiple-case multiple-modality location level correlations. For comparison, ROC analysis was also tested where the rating of the highest rated mark on an image was assumed to be its "ROC" rating. The testing involved random numbers of lesions per image, random weights, case-mixes (ratio of normal to abnormal images) and different correlation structures. We found that for either JAFROC or JAFROC1, both lesion-centric and weighted analyses had correct NH behavior and comparable statistical powers. For either lesion-centric or weighted analyses JAFROC1 yielded the highest power, followed by JAFROC and ROC yielded the least power, confirming a recent study using a less flexible single-reader dual-modality simulator. Provided the number of normal cases is not too small, JAFROC1 is the preferred method for analyzing human observer free-response data. 
For either JAFROC or JAFROC1, weighted analysis is preferable.
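As a rough illustration of the weighted figure of merit discussed above, the sketch below scores each lesion against the highest-rated mark on each normal image with a Wilcoxon-type kernel and weights lesions so that each abnormal image counts as one unit. The data layout, function names, and conventions (lesion weights summing to 1 per image, unmarked lesions and unmarked normal images rated minus infinity, ties counted as one half) are assumptions made for this example. The JAFROC1 variant, which also uses non-lesion marks on abnormal images, is not shown.

```python
NEG_INF = float("-inf")

def psi(fp_rating, lesion_rating):
    """Wilcoxon kernel: 1 if the lesion outranks the false positive, 0.5 on a tie, else 0."""
    if lesion_rating > fp_rating:
        return 1.0
    if lesion_rating == fp_rating:
        return 0.5
    return 0.0

def weighted_jafroc_fom(normal_fp, abnormal_lesions):
    """Weighted JAFROC-style figure of merit (illustrative sketch).

    normal_fp: one list of non-lesion mark ratings per normal image (may be empty)
    abnormal_lesions: one list of (rating or None, weight) pairs per abnormal image,
        with weights summing to 1 within each image; None means the lesion was not marked
    """
    total = 0.0
    for fps in normal_fp:
        fp_max = max(fps, default=NEG_INF)  # highest-rated mark on this normal image
        for lesions in abnormal_lesions:
            for rating, weight in lesions:
                r = NEG_INF if rating is None else rating
                total += weight * psi(fp_max, r)
    return total / (len(normal_fp) * len(abnormal_lesions))

# Hypothetical ratings: two normal images, two abnormal images.
normal_fp = [[], [2]]
abnormal_lesions = [[(3, 0.5), (None, 0.5)], [(1, 1.0)]]
fom = weighted_jafroc_fom(normal_fp, abnormal_lesions)
```

Setting every weight on an image to 1 divided by that image's lesion count gives equal importance to lesions within an image while still keeping the image, not the lesion, as the unit of measurement.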
Non-localization and localization ROC analyses using clinically based scoring
We are investigating the potential for differences in study conclusions when assessing the estimated impact of a computer-aided detection (CAD) system on readers' performance. The data utilized in this investigation were derived from a multi-reader multi-case observer study involving one hundred mammographic background images to which fixed-size and fixed-intensity Gaussian signals were added, generating low- and high-intensity signal sets. The study setting allowed CAD assessment in two situations: when CAD sensitivity was 1) superior to or 2) lower than that of the average reader. Seven readers were asked to review each set in the unaided and CAD-aided reading modes and to mark and rate their findings. Using these data, we studied the effect on study conclusions of three clinically based receiver operating characteristic (ROC) scoring definitions. These scoring definitions included both location-specific and non-location-specific rules. The results showed agreement in the estimated impact of CAD on the overall reader performance. In the study setting where CAD sensitivity is superior to that of the average reader, the mean difference in AUC between the CAD-aided read and unaided read was 0.049 (95% CI: -0.027; 0.130) for the image scoring definition that is based on non-location-specific rules, and 0.104 (95% CI: 0.036; 0.174) and 0.090 (95% CI: 0.031; 0.155) for image scoring definitions that are based on location-specific rules. The increases in AUC were statistically significant for the location-specific scoring definitions. It was further observed that the variance of these estimates was reduced when using the location-specific scoring definitions compared to that using a non-location-specific scoring definition. In the study setting where CAD sensitivity is equivalent to or lower than that of the average reader, the mean differences in AUC are slightly above 0.01 for all image scoring definitions. These increases in AUC were not statistically significant for any of the image scoring definitions.
The results on the variance analysis differed from those observed in the other study setting. This investigation furthers our understanding of the relationships between non-localization-specific and localization-specific ROC assessment methodologies and their relevance to clinical practice.
Comparison of ROC methods for partially paired data
Brandon D. Gallas, Lorenzo L. Pesce
In this work we investigate ROC methods that compare the difference in AUCs (area under the ROC curve) from two modalities given partially paired data. Such methods are needed to accommodate real-world situations, where not every case can be imaged or interpreted using both modalities. We compare variance estimation of the bivariate binormal-model based method ROCKIT of Metz et al., as well as several different non-parametric methods, including the bootstrap and U-statistics. This comparison explores different ROC curves, study designs (pairing structure of the data), sample sizes, case mix, and modality effect sizes.
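One of the non-parametric options mentioned above, the bootstrap, can be sketched for partially paired data as follows. Cases are resampled with replacement within each truth state, and each modality's empirical (Mann-Whitney) AUC uses only the resampled cases that have a score for that modality. The case layout (dictionaries with keys `truth`, `A`, `B`, where a missing score is `None`), the function names, and the toy scores are all hypothetical; this is a simplified illustration, not the procedure evaluated in the paper.

```python
import random

def auc(neg, pos):
    """Empirical (Mann-Whitney) AUC; ties count one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def modality_auc(cases, key):
    """AUC of one modality, using only cases that were read in that modality."""
    neg = [c[key] for c in cases if c["truth"] == 0 and c.get(key) is not None]
    pos = [c[key] for c in cases if c["truth"] == 1 and c.get(key) is not None]
    return auc(neg, pos) if neg and pos else None

def bootstrap_auc_difference(cases, n_boot=500, seed=0):
    """Return (AUC_A - AUC_B, bootstrap variance of that difference)."""
    rng = random.Random(seed)
    neg_cases = [c for c in cases if c["truth"] == 0]
    pos_cases = [c for c in cases if c["truth"] == 1]
    diffs = []
    for _ in range(n_boot):
        # Resample whole cases so that pairing between modalities is preserved.
        sample = rng.choices(neg_cases, k=len(neg_cases)) + rng.choices(pos_cases, k=len(pos_cases))
        a, b = modality_auc(sample, "A"), modality_auc(sample, "B")
        if a is not None and b is not None:  # skip replicates with an undefined AUC
            diffs.append(a - b)
    mean = sum(diffs) / len(diffs)
    var = sum((d - mean) ** 2 for d in diffs) / (len(diffs) - 1)
    return modality_auc(cases, "A") - modality_auc(cases, "B"), var

# Hypothetical partially paired data: some cases lack a score in one modality.
cases = [
    {"truth": 0, "A": 1, "B": 2}, {"truth": 0, "A": 2, "B": None}, {"truth": 0, "A": None, "B": 1},
    {"truth": 1, "A": 3, "B": 2}, {"truth": 1, "A": 4, "B": None}, {"truth": 1, "A": None, "B": 3},
]
diff, var = bootstrap_auc_difference(cases)
```

Resampling cases rather than individual scores is the design choice that lets the paired portion of the data contribute its correlation to the variance estimate.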
Comparing the performance of two observers using a novel utility-based performance metric for ROC analysis
Darrin C. Edwards, Charles E. Metz
We previously introduced a utility-based ROC performance metric, the "surface-averaged expected cost" (SAEC), to address difficulties which arise in generalizing the well-known area under the ROC curve (AUC) to classification tasks with more than two classes. In a two-class classification task, the SAEC can be shown explicitly to be twice the area above the conventional ROC curve (1-AUC) divided by the arclength along the ROC curve. In the present work, we show that in tasks comparing the performance of two observers whose behavior is described by the proper binormal model, our proposed performance metric is consistent with AUC in the qualitative sense of deciding which of the two observers is "better," and by how wide a margin.
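The two-class relationship stated above, SAEC equal to twice the area above the ROC curve divided by the curve's arclength, can be checked numerically with a short sketch. The version below uses an empirical, piecewise-linear ROC curve, which is an assumption made purely for illustration (the abstract works with the proper binormal model), and the function names are hypothetical.

```python
import math

def roc_points(neg_scores, pos_scores):
    """Empirical ROC operating points (FPF, TPF), running from (0, 0) to (1, 1)."""
    thresholds = sorted(set(neg_scores) | set(pos_scores), reverse=True)
    pts = [(0.0, 0.0)]
    for t in thresholds:
        fpf = sum(s >= t for s in neg_scores) / len(neg_scores)
        tpf = sum(s >= t for s in pos_scores) / len(pos_scores)
        pts.append((fpf, tpf))
    return pts

def saec_two_class(pts):
    """2 * (1 - AUC) / arclength for a piecewise-linear ROC curve."""
    segments = list(zip(pts, pts[1:]))
    auc = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in segments)
    arclength = sum(math.hypot(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in segments)
    return 2 * (1 - auc) / arclength

chance = saec_two_class([(0.0, 0.0), (1.0, 1.0)])        # guessing observer
perfect = saec_two_class(roc_points([1, 2], [3, 4]))      # perfectly separated scores
```

For the chance diagonal the AUC is 0.5 and the arclength is the square root of 2, so the metric evaluates to 1 divided by the square root of 2; for a perfect observer the area above the curve vanishes and the metric is 0. Lower values therefore indicate better performance, the opposite direction from AUC.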
Comparison of classifier performance estimators: a simulation study
Weijie Chen, Robert F. Wagner, Waleed A. Yousef, et al.
We aim to compare resampling-based estimators of the area under the ROC curve (AUC) of a classifier with a Monte Carlo simulation study. The comparison is in terms of bias, variance, and mean square error. We also examine the corresponding variance estimators of these AUC estimators. We compared three AUC estimators: the hold-out (HO) estimator, the leave-one-out cross validation (LOOCV) estimator, and the leave-pair-out bootstrap (LPOB) estimator. Each performance estimator has its own variability estimator. In our simulations, in terms of the mean square error, HO is always the worst and the ranking of the other two estimators depends on the interplay of sample size, dimensionality, and the population separability. In terms of estimator variability, the LPOB is the least variable estimator and the HO is the most variable estimator. The results also show that the estimation of the variance of LPOB using the influence function approach with a finite data set is unbiased or conservatively biased whereas the estimation of the variance of the LOOCV or the HO is downwardly (i.e., anti-conservatively) biased.
A theoretical treatment of the sources of variability in the output of pattern classifiers
Previously, several instances of variability in the output of pattern classifiers that have the same Receiver Operating Characteristic (ROC) curve have been observed. We present a theoretical framework for understanding some sources of this variability, which result in classifiers with monotonically related outputs. We restrict our analysis to pattern classifiers that discriminate between two linearly separable classes. We show that variability in the output of pattern classifiers can arise due to differences in the functional mappings between their inputs and outputs. We further identify some practical situations wherein such variability in the output of such pattern classifiers arises. These include situations in which there are differences in (a) the datasets employed for training and evaluation of classifiers, (b) the a priori probabilities of the two classes, or (c) the stochastic processes employed for training the different pattern classifiers. Previously, we proposed a technique based on the matching of the histograms of differently distributed classifier output to reduce the variability in their diagnostic performance and their output values. Here, we prove theoretically and demonstrate empirically on simulated data, that for monotonically related classifier outputs, this technique successfully learns the true monotonic transformation function that exists between different pattern classifier outputs.
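The histogram-matching idea above can be illustrated with a small sketch: an output of one classifier is mapped to the output of a reference classifier that sits at the same empirical quantile. When the two sets of outputs are monotonically related, this quantile mapping reproduces the underlying transformation at the sampled points. The function name, the cubic relationship, and the toy values are hypothetical and are not taken from the paper.

```python
import bisect

def histogram_match(source_outputs, reference_outputs):
    """Return a function mapping source-classifier outputs onto the reference scale."""
    src = sorted(source_outputs)
    ref = sorted(reference_outputs)

    def transform(x):
        # Number of source outputs <= x, i.e. the empirical CDF of x times len(src).
        rank = bisect.bisect_right(src, x)
        # Index of the reference output at the same quantile (integer ceiling, clamped).
        idx = (rank * len(ref) + len(src) - 1) // len(src) - 1
        return ref[min(len(ref) - 1, max(0, idx))]

    return transform

# Hypothetical outputs of two classifiers on the same cases, related by b = a**3.
a = [0.1, 0.4, 0.2, 0.9, 0.5]
b = [x ** 3 for x in a]
to_b_scale = histogram_match(a, b)
matched = [to_b_scale(x) for x in a]
```

Because a strictly increasing transformation preserves ranks, matching quantiles is enough to recover it without knowing its functional form.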
Observer Performance
Evaluation of chest tomosynthesis for the detection of pulmonary nodules: effect of clinical experience and comparison with chest radiography
Sara Zachrisson, Jenny Vikgren, Angelica Svalkvist, et al.
Chest tomosynthesis refers to the technique of collecting low-dose projections of the chest at different angles and using these projections to reconstruct section images of the chest. In this study, a comparison of chest tomosynthesis and chest radiography in the detection of pulmonary nodules was performed and the effect of clinical experience of chest tomosynthesis was evaluated. Three senior thoracic radiologists, with more than ten years of experience of chest radiology and 6 months of clinical experience of chest tomosynthesis, acted as observers in a jackknife free-response receiver operating characteristics (JAFROC-1) study, performed on 42 patients with and 47 patients without pulmonary nodules examined with both chest tomosynthesis and chest radiography. MDCT was used as reference and the total number of nodules found using MDCT was 131. To investigate the effect of additional clinical experience of chest tomosynthesis, a second reading session of the tomosynthesis images was performed one year after the initial one. The JAFROC-1 figure of merit (FOM) was used as the principal measure of detectability. In comparison with chest radiography, chest tomosynthesis performed significantly better with regard to detectability. The observer-averaged JAFROC-1 FOM was 0.61 for tomosynthesis and 0.40 for radiography, giving a statistically significant difference between the techniques of 0.21 (p<0.0001). The observer-averaged JAFROC-1 FOM of the second reading of the tomosynthesis cases was not significantly higher than that of the first reading, indicating no improvement in detectability due to additional clinical experience of tomosynthesis.
Correlation of emphysema score with perceived malignancy of pulmonary nodules: a multi-observer study using the LIDC-IDRI CT lung database
Presence of emphysema is recognized to be one of the single most significant risk factors in risk models for the prediction of lung cancer. Therefore, an automatically computed emphysema score would be a prime candidate as an additional numerical feature for computer aided diagnosis (CADx) for indeterminate pulmonary nodules. We have applied several histogram-based emphysema scores to 460 thoracic CT scans from the IDRI CT lung image database, and analyzed the emphysema scores in conjunction with 3000 nodule malignancy ratings of 1232 pulmonary nodules made by expert observers. Despite the emphysema being a known risk factor, we have not found any impact on the readers' malignancy rating of nodules found in a patient with higher emphysema score. We have also not found any correlation between the number of expert-detected nodules in a patient and his emphysema score, or the relative craniocaudal location of the nodules and their malignancy rating. The inter-observer agreement of the expert ratings was excellent on nodule diameter (as derived from manual delineations), good for calcification, and only modest for malignancy and shape descriptions such as spiculation, lobulation, margin, etc.
The effect of digitising film prior mammograms on radiologists' performance in breast screening: a JAFROC study
Sian Taylor-Phillips, Matthew G. Wallis, Alison Duncan, et al.
After the introduction of digital mammography the film mammograms from the previous screening round (the prior mammograms) can be displayed in a variety of ways. This paper investigates the performance of radiologists reading digital screening mammograms with the prior mammograms displayed either as film or in digitised format. A set of 162 cases was assembled, each with two view digital mammograms and two view film prior mammograms. Of these cases 66 were malignant as proven by biopsy, and the others were normal or benign. The film prior mammograms were digitised at 75μm. Eight participants, with four to seventeen years experience of reading screening mammograms, each read the mammograms twice; once with the digitised prior mammograms displayed on the digital workstation, and once with the film prior mammograms displayed on an adjacent multi-viewer. The two viewings were at least one month apart. Participants marked the location of abnormalities on a paper copy of the mammograms and rated the probability of malignancy of each abnormality. Participants were video-taped whilst reading the cases to enable analysis of gross eye movements for information regarding the level of use of the prior mammograms. JAFROC analysis showed no difference in performance between the conditions.
The impact of focal spot size on clinical images
The physical assessment of the spatial resolution produced by broad and fine focal spot sizes has been well established. There is, however, an evident lack of study into the impact of focal spot selection on clinical image quality. While the excessive use of the fine focus has an impact on tube life, the benefit of its use in radiological imaging should be investigated. Cadaver images were produced in order to compare the 0.8mm and 1.8mm focal spot sizes. The range of radiographic projections assessed included the medio-lateral ankle, antero-posterior (AP) knee, AP thoracic spine and horizontal beam lateral (HBL) lumbar spine. Five clinicians analysed the images using a 1-4 visual grading analysis score against a reference image to assess the visibility of specific anatomical criteria. A Mann-Whitney U statistical test was employed to assess the results. No statistically significant differences between the scores for the broad and fine focus images were found, although a non-significant higher score in image quality was shown for the fine compared with the broad focus images with large object to detector distance. No difference in image quality was shown for examinations traditionally produced with a fine focus. The study results question the widespread usage of fine foci for specific examinations, particularly for extremity examinations. Current practice based on international guidelines can lead to a reduced tube life and increased cost with little clinical benefit.
Contrast detail curves in head CT examinations
The purpose of this study was to generate contrast detail (CD) curves for low contrast mass lesions embedded in images obtained in head and neck CT examinations. Axial head and neck CT slice images were randomly chosen from patients at five different levels. All images were acquired at 120 kV, and reconstructed using a standard soft tissue reconstruction filter. For each head CT image, we measured detection of low contrast mass lesions using a 2 Alternate Forced Choice (2-AFC) experimental paradigm. In an AFC experiment, an observer identifies the lesion location in one of two regions of interest. After performing 128 sequential observations, it is possible to compute the lesion contrast corresponding to a 92% accuracy of lesion detection (i.e., I92%). Five lesion sizes were investigated ranging from 4 mm to 12.5 mm, with the experimental order randomized to eliminate learning curve as well as observer fatigue. Contrast detail curves were generated by plotting log[I92%] versus log[lesion size]. Experimental slopes ranged from ~ -0.1 to ~ -0.4. The slope of the CD curve was directly related to the complexity of the anatomical structure in the head CT image. As the apparent anatomical complexity increased, the slope of the corresponding CD curve was reduced. Results from our pilot study suggest that anatomical structure is of greater importance than quantum mottle, and that the type of anatomical background structure is an important determinant of lesion detection in CT imaging.
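The slope of a contrast detail curve as described above is the gradient of a straight-line fit of log[I92%] against log[lesion size]. A minimal sketch using an ordinary least-squares fit is given below; the lesion sizes and threshold contrasts are made-up values generated from an exact power law, and the function name is hypothetical.

```python
import math

def cd_slope(lesion_sizes_mm, threshold_contrasts):
    """Least-squares slope of log10(threshold contrast) versus log10(lesion size)."""
    xs = [math.log10(s) for s in lesion_sizes_mm]
    ys = [math.log10(c) for c in threshold_contrasts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical thresholds following I92% = 2.0 * size**(-0.3).
sizes = [4.0, 6.0, 8.0, 10.0, 12.5]
thresholds = [2.0 * s ** -0.3 for s in sizes]
slope = cd_slope(sizes, thresholds)
```

A slope near 0 means the threshold contrast barely improves as lesions get larger; more negative slopes mean that lesion size helps detection more.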
Observer Performance in Mammography
Analysis of probed regions in an interactive CAD system for the detection of masses in mammograms
M. Samulski, A. Hupse, C. Boetes, et al.
Most computer aided detection (CAD) systems for mammographic mass detection display all suspicious regions identified by computer algorithms and are mainly intended to avoid missing cancers due to perceptual oversights. Considering that interpretation failure is recognized to be a more common cause of missing cancers in screening than perceptual oversights, a dedicated mammographic CAD system has been developed that can be queried interactively for the presence of CAD prompts using a mouse click. To assess the potential benefit of using CAD in an interactive way, an observer study was conducted in which 4 radiologists and 6 non-radiologists evaluated 60 cases with and without CAD, to compare the detection performance of the unaided reader with that of the reader with CAD assistance. 20 cases had a malignant mass, and 40 were cancer-free. During the reading sessions we recorded time and probed locations which reveal information about the search strategy and detection process. The purpose of this study is to determine a relation between detection performance and time to first probe of the lesion and to investigate if longer reading times lead to more reports of malignant lesions in lesion-free areas. On average, 65.0% of the malignant lesions were found within 60 seconds and this percentage stabilizes after this period. Results suggest that longer reading time did not lead to more false positives. 74.6% of the reported true positive findings were hit by the first probe, and 93.2% were hit within 5 probes, which may suggest that many of the correctly reported malignant masses were perceived immediately after image onset.
Inter- and intra-observer variability in radiologists' assessment of mass similarity on mammograms
The purpose of this study was to compare the performances of two recently developed image retrieval methods for mammographic masses, and to investigate the inter- and intra-observer variability in radiologists' assessment of mass similarity. Method 1 retrieved masses that are similar to a query mass from a reference library based on radiologists' margin and shape descriptions and the mass size. Method 2 used computer-extracted features. Two MQSA radiologists participated in an observer study in which they rated the similarity between 100 query masses and the retrieved lesions based on margins, shape, and size. For each query mass, three masses retrieved using Method 1 and three masses retrieved using Method 2 were displayed in random order using a graphical user interface. A nine-point similarity rating scale was used, with a rating of 1 indicating lowest similarity. Each radiologist repeated the readings twice, separated by more than three months, so that intra-observer variability could be studied. Averaged over the two radiologists, two readings, and all masses, the mean similarity ratings were 5.59 and 5.57 for Methods 1 and 2, respectively. The difference between the two methods did not reach significance (p>0.20) for either radiologist. The intra-observer variability was significantly lower than the inter-observer variability, which may indicate that each radiologist has their own image similarity criteria, and that these criteria vary from radiologist to radiologist. The understanding of the trends in radiologists' assessment of mass similarity may guide the development of decision support systems that make use of mass similarity to aid radiologists in mammographic interpretation.
Analysis of double reading in an observer study
Previously we showed based on theoretical analysis that it is possible to attain greater diagnostic performance from appropriately combining the diagnostic opinions of two or more equally skilled readers. Such gain in performance is available from combining the readers' "latent decision variables" that are accessible through ROC analysis, but is generally ambiguous at best if the readers' binary decisions with regard to clinical actions (e.g., recall vs. annual screening mammogram) are combined. We now analyze the data of an observer study. In this observer study, ten radiologists interpreted 104 cases of mammograms containing clustered microcalcifications in a diagnostic-study setting to decide whether to recommend biopsy. They also reported diagnostic confidence on a quasi-continuous scale that the calcifications indicated malignancy. A previous analysis showed that combining the radiologists' binary decisions (biopsy vs. no biopsy) would change both sensitivity and specificity generally along the radiologists' single-reading, average, ROC curve but would not increase the diagnostic performance. Combining two radiologists' "latent decision variables" resulted in small increases in the ROC curves consistent with the theoretical predictions. However, the shapes of the single-reading ROC curves were inconsistent with the expectation of the clinical diagnostic-study setting because all benign cases in the observer study were difficult-to-diagnose cases (all cases clinically biopsied). The double-reading results would have been different, and gains in diagnostic performance possible, if the ROC curve shape more accurately resembled that of clinical practice. There is need to estimate the ROC curve of clinical practice.
Higher-order scene statistics of breast images
Craig K. Abbey, Jascha N. Sohl-Dickstein, Bruno A. Olshausen, et al.
Researchers studying human and computer vision have found description and construction of these systems greatly aided by analysis of the statistical properties of naturally occurring scenes. More specifically, it has been found that receptive fields with directional selectivity and bandwidth properties similar to mammalian visual systems are more closely matched to the statistics of natural scenes. It is argued that this allows for sparse representation of the independent components of natural images [Olshausen and Field, Nature, 1996]. These theories have important implications for medical image perception. For example, will a system that is designed to represent the independent components of natural scenes, where objects occlude one another and illumination is typically reflected, be appropriate for X-ray imaging, where features superimpose on one another and illumination is transmissive? In this research we begin to examine these issues by evaluating higher-order statistical properties of breast images from X-ray projection mammography (PM) and dedicated breast computed tomography (bCT). We evaluate kurtosis in responses of octave bandwidth Gabor filters applied to PM and to coronal slices of bCT scans. We find that kurtosis in PM rises and quickly saturates for filter center frequencies with an average value above 0.95. By contrast, kurtosis in bCT peaks near 0.20 cyc/mm with kurtosis of approximately 2. Our findings suggest that the human visual system may be tuned to represent breast tissue more effectively in bCT over a specific range of spatial frequencies.
Contrast sensitivity in mammographic softcopy reading
Dörte Apelt, Hans Strasburger, Richard Rascher-Friesenhausen, et al.
In mammographic softcopy reading, assessment of contrast resolution is mainly performed with phantoms, including detection tasks with a homogeneous image background. For tasks in visual perception a processing hierarchy is assumed, where detection tasks represent the base level. The results of investigations based on detection tasks might not allow predictions on the sensitivity for recognizing low-contrast patterns in a situation with complex images. We introduce the MCS method (Mammographic Contrast Sensitivity) for determining the contrast sensitivity function (CSF) in mammograms. Gabor patterns and digits are used as visual targets. The observers have to cope with an orientation discrimination task for the Gabor patterns and an identification task for the digits. The contrast thresholds are measured by a psychophysical staircase procedure at six spatial frequencies up to 16 cycles per degree. A study with eight observers was performed to show the applicability of the MCS method. The results of the observer study with several mammographic cases show that the approach is applicable independent of the chosen images. The results for Gabor pattern targets were different from those with digits, both in overall sensitivity and in the shape of the contrast sensitivity function. Sensitivity to pattern recognition is thus not reliably predicted from the Gabor CSF, and a more complex target like a digit or a character should be preferred. The measurement of a contrast sensitivity function does not take more than 4 minutes. The results can be used to appraise the effects of viewing conditions with an aim of drawing conclusions for mammographic softcopy reading.
Can the evaluation of a simple test object be used to predict the performance of a contrast-detail analysis in digital mammography?
H. Bosmans, K. Lemmens, J. Jacobs, et al.
The purpose was to find the correlation between a Figure of Merit (FoM) calculated from a new (simple) test object for Quality Control in digital mammography and CDMAM threshold thicknesses. The FoM included the signal difference to noise ratio, modulation transfer function of the complete system (including scatter and grid) and normalized noise power spectrum. The pre-programmed exposure settings for clinical work were used, as was done for the CDMAM acquisitions. The FoM is calculated from 2 images only (an image from the QC test object and an image of a corresponding homogeneous plate imaged with the same exposure settings). This FoM was evaluated in frequencies that match with the diameters of the gold disks in the CDMAM phantom. Computerized CDMAM analysis uses 16 images per system. The software program "cdcom" (www.euref.org) was used for the 4-AFC experiment. All matrices were averaged, smoothed with a Gaussian filter and psychometric curves were fitted through the correctly detected fractions to obtain the threshold thickness with a detectability of 62.5% for all diameters. Images have been acquired on 10 different systems (2 computed radiography (CR) systems, 6 direct radiography (DR) systems and 2 photon counting systems). The reproducibility of the QC metrics from images of the new phantom was assessed. The standard error on the mean of the FoM was for the highest frequency 8.1% for a CR system and 5.6% for a DR system. The main component in this error is due to the NNPS and the limited number of independent pixels used in this analysis. Parameters calculated from both phantoms are sensitive to variation in mean glandular dose levels. Present results show a weak correlation (R² = 0.60) between the FoM at 5 lp/mm and CDMAM threshold values for the 0.1 mm objects when all system data are pooled.
If evaluated for separate systems, the correlation holds promise for automated, periodic performance evaluations of digital mammography systems with the simplified phantom.
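The 62.5% detectability level quoted in this abstract is the midpoint between the 25% guess rate of a 4-AFC task and perfect detection. The sketch below shows this with an assumed logistic psychometric function on log thickness; the functional form, the slope parameter and the function names are illustrative assumptions, not the fit used by the authors or by cdcom.

```python
import math

def p_correct_4afc(thickness, threshold, slope):
    """Assumed logistic psychometric function with a 25% guess rate."""
    x = slope * (math.log(thickness) - math.log(threshold))
    return 0.25 + 0.75 / (1.0 + math.exp(-x))

def threshold_from_point(thickness, p_correct, slope):
    """Invert the assumed function: threshold implied by one (thickness, p) point."""
    x = math.log((p_correct - 0.25) / (1.0 - p_correct))
    return thickness * math.exp(-x / slope)
```

By construction, `p_correct_4afc(t, t, slope)` equals 0.625 for any positive slope, which is why the threshold thickness is read off at that level.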
Poster Session
Automated scoring method for the CDMAM phantom
M. Yip, W. Chukwu, E. Kottis, et al.
CDMAM phantoms are widely used in Europe to assess the performance of mammography systems utilising small, low-contrast disc details. However, the assessment of CDMAM images by human observers is slow and tedious. An automated method for scoring CDMAM images (CDCOM) is widely available to address this issue. We have developed an alternative automated tool for scoring CDMAM images, the Quantitative Assessment System (QAS), which similarly removes inter- and intra-observer variability. It also provides additional valuable information about the contrast and SNR of each gold disc within the image. The QAS scores CDMAM phantom images using a scanning algorithm. QAS scoring results were compared with human observers and with CDCOM. It was found that QAS was comparable with human observers in scoring, whereas CDCOM consistently scored a higher number of discs correctly in CDMAM images compared with QAS and human observers. QAS results have been used to analyse the effects of different digital mammography system modulation transfer functions (MTFs) on fine details for a number of systems in the form of contrast degradation factor (CDF) measurements. CDF curves for experimentally acquired CDMAM images were compared with those for simulated CDMAM images to assess the accuracy of contrast measurements.
Recognition of detail in mammography
Dörte Apelt, Hans Strasburger, Richard Rascher-Friesenhausen, et al.
In radiological practice the term recognition of detail is widely used. We examined how the term can be defined and interpreted, and how recognition of detail relates to radiological phantoms such as CDMAM (Contrast Detail Mammography). For tasks in visual perception a processing hierarchy can be assumed: the perception of a structure can occur at different processing levels, such as those required in detection, discrimination, identification and recognition, in ascending hierarchical order. It is not always possible to predict from results at one hierarchy level those at another level. If an observer detects a structure, it cannot be predicted whether the observer will be able to discriminate the structure from another, or whether he or she will even be able to interpret the structure. Furthermore, the perceptibility of a detail is influenced by surrounding or overlapping anatomical noise. The presence of noise elevates visual thresholds and may change the overall perceptual behavior with regard to the examined parameter. Thus, perceptibility of structures (details) is strongly bound to the type of perceptual task and the image background used. The term recognition of detail should not be liberally extended to evaluating performance parameters of a technical system. If the term is applied, it needs to be specified how detail is characterized and which perceptual task is used for operationalizing recognition.
Mammographic interpretation training in the UK: current difficulties and future outlook
In the UK, most mammographic interpretation training needs to be undertaken where there is a mammo-alternator or other suitable light box, which limits the times and places where training can take place. However, the gradual introduction of digital mammography is opening up new opportunities of providing such training without the restriction of current viewing devices. Whilst high-resolution monitors in appropriate viewing environments are de rigueur for actual reporting, the advantages of the digital image over film lie in the flexibility of training opportunity afforded, e.g. training whenever and wherever suits the individual. A previous study indicated the potential for reporting mammographic cases utilising handheld devices with suitable interaction techniques. In a pilot study, a group of mammographers (n=4) were questioned in semi-structured interviews in order to help establish the current UK film-readers' training profile. On the basis of the pilot study data, 109 Breast Screening Units (601 film readers) were approached to complete a structured questionnaire in order to establish the potential role of smaller computer devices in mammographic interpretation training (given the use of digital mammography). Subsequently, a study of radiologists' visual search behaviour in digital screening has begun. This has highlighted different image manipulations than found in structured experiments in this area and poses new challenges for visualising the inspection process. Overall the results indicate that using different display sizes for training is possible but is also a challenging task requiring novel interaction approaches.
Reading the lesson: eliciting requirements for a mammography training application
M. Hartswood, L. Blot, P. Taylor, et al.
Demonstrations of a prototype training tool were used to elicit requirements for an intelligent training system for screening mammography. The prototype allowed senior radiologists (mentors) to select cases from a distributed database of images to meet the specific training requirements of junior colleagues (trainees) and then provided automated feedback in response to trainees' attempts at interpretation. The tool was demonstrated to radiologists and radiographers working in the breast screening service at four evaluation sessions. Participants highlighted ease of selecting cases that can deliver specific learning objectives as important for delivering effective training. To usefully structure a large data set of training images we undertook a classification exercise of mentor-authored free-text 'learning points' attached to training cases obtained from two screening centres (n=333, n=129 respectively). We were able to adduce a hierarchy of abstract categories representing classes of lesson that groups of cases were intended to convey (e.g. Temporal change, Misleading juxtapositions, Position of lesion, Typical/Atypical presentation, and so on). In this paper we present the method used to devise this classification, the classification scheme itself, initial user feedback, and our plans to incorporate it into a software tool to aid case selection.
The relationship between real life breast screening and an annual self assessment scheme
Hazel J. Scott, Andrew Evans, Alastair G. Gale, et al.
Incidence of cancer in the UK NHS Breast Screening Programme (NHSBSP) is relatively low (approximately 7 per 1,000 cases screened). As such, feedback from cancers missed or interval cancers can be a relatively lengthy process (whereby a woman will not present for corroborating imaging for a further three years). Therefore, in order to monitor their radiological skill, all breast screening radiologists and technologists read a self-assessed, standard set of challenging mammographic images bi-yearly. This scheme, 'PERFORMS' (Personal Performance in Mammographic Screening), has been running since near the inception of the NHSBSP in 1991. Although PERFORMS has functioned as an educational tool for film-readers on the UKBSP for decades, its relation to real life screening in past years has proven to be somewhat equivocal (Cowley & Gale, 1999). The present study investigated the relationship between performance measures in real life and their equivalents on the PERFORMS self-assessment scheme, namely: Miss Rate (FN), Cases Arbitrated and Returned to Routine screening and Incorrect recall (FP), Specificity (TN) and Cancer Detection (TP). Over 40 individuals from one NHS region in the UK submitted their real life data for comparison with PERFORMS results from the same time frame. Data from this initial study were taken from the year 2005-2006 and compared with the relevant PERFORMS set of cases. Results indicated a significant positive correlation between PERFORMS performance measures and performance measures for real life. These results are discussed in the light of the legitimacy of self-assessment relative to film-reading skill (during real life clinical practice).
Developing and testing a multi-probe resonance electrical impedance spectroscopy system for detecting breast abnormalities
In our previous study, we reported on the development and preliminary testing of a prototype resonance electrical impedance spectroscopy (REIS) system with a pair of probes. Although our pilot study on 150 young women ranging from 30 to 50 years old indicated the feasibility of using REIS output sweep signals to classify between the women who had negative examinations and those who would ultimately be recommended for biopsy, the detection sensitivity was relatively low. To improve performance when using REIS technology, we recently developed a new multi-probe based REIS system. The system consists of a sensor module box that can be easily lifted along a vertical support device to fit women of different heights. Two user-selectable breast placement "cups" with different curvatures are included in the system. Seven probes are mounted on each of the cups on opposing sides of the sensor box. By rotating the sensor box, the technologist can select the detection sensor cup that better fits the breast size of the woman being examined. One probe is mounted in the cup center for direct contact with the nipple and the other six probes are uniformly distributed along an outside circle to enable contact with six points on the outer and inner breast skin surfaces. The outer probes are located at a distance of 60 mm from the center (nipple) probe. The system automatically monitors the quality of the contact between the breast surface and each of the seven probes, and data acquisition can only be initiated when adequate contact is confirmed. The measurement time for each breast is approximately 15 seconds, during which time the system records 121 REIS signal sweep outputs generated from 200 kHz to 800 kHz at 5 kHz increments for all preselected probe pairs. Currently we are measuring 6 pairs between the center probe and each of six probes located on the outer circle as well as two pairs between probe pairs on the outer circle.
This new REIS system has been installed in our clinical breast imaging facility. We are conducting a prospective study to assess performance when using this REIS system under an approved IRB protocol. Over 200 examinations have been conducted to date. Our experience showed that this new REIS system was easy to operate and the REIS examination was fast and considered "comfortable" by examinees, since the woman presses her breast into the cup herself without any need for forced breast compression, and all but a few highly sensitive women had no sensation of an electrical current during the measurement.
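The sweep figures quoted in this abstract are mutually consistent, as the short check below shows: stepping from 200 kHz to 800 kHz in 5 kHz increments gives 121 frequency points, and 6 center-to-outer pairs plus 2 outer-to-outer pairs give 8 probe pairs. The function name is hypothetical, and the identity of the two outer-circle pairs is not given in the abstract, so only the counts are modeled.

```python
def sweep_frequencies_khz(start=200, stop=800, step=5):
    """Frequency points of one sweep, endpoints included."""
    return list(range(start, stop + step, step))

CENTER_TO_OUTER_PAIRS = 6   # center probe paired with each outer probe
OUTER_TO_OUTER_PAIRS = 2    # which outer probes are paired is not stated

def samples_per_breast():
    """Total (probe pair, frequency) samples recorded for one breast."""
    n_pairs = CENTER_TO_OUTER_PAIRS + OUTER_TO_OUTER_PAIRS
    return n_pairs * len(sweep_frequencies_khz())
```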
ViewDEX 2.0: a Java-based DICOM-compatible software for observer performance studies
Markus Håkansson, Sune Svensson, Sara Zachrisson, et al.
ViewDEX (Viewer for Digital Evaluation of X-ray images) is a Java-based DICOM-compatible software tool for observer performance studies that can be used to display medical images with simultaneous registration of the observer's response. The current release, ViewDEX 2.0, is a development of ViewDEX 1.0, which was released in 2007. Both versions are designed to run in a Java environment and do not require any special installation. For example, the program can be located on a memory stick or stand-alone hard drive and be run from there. ViewDEX is managed and configured by editing property files, which are plain text files where users, tasks (questions, definitions, etc.) and functionality (WW/WL, PAN, ZOOM, etc.) are defined. ViewDEX reads all common DICOM image formats and the images can be stored in any location connected to the computer. ViewDEX 2.0 is designed so that the user can easily choose whether or not the questions presented to the observers are related to localization, enabling e.g. free-response ROC, standard ROC and visual grading studies, as well as combinations of these, to be conducted in a fast and efficient way. The software can also be used for benchmarking and for educational purposes. The results from each observer are saved in a log file, which can be exported for further analysis. The software is freely available for non-commercial purposes.
Ambient temperature variation affects radiological diagnostic performance
Mark McEntee, Selina Gafoor
No guidelines currently exist for optimum ambient temperature during radiology reporting. The objective of this study is to determine whether changes in ambient temperature affect performance during radiological detection tasks. Ambient temperatures and humidity were measured in 11 radiological reporting environments. Observers were then asked to assess CT images at 18°C, 21°C and 23°C. Thirty non-contrast cranial CT images, 15 with intracranial bleeds and 15 without, were used. A ROC analysis was performed. The shortest time taken to assess the images was recorded at 18°C (10.5 seconds per image, sd 4.07), which was significantly shorter than at 21°C (14.93 seconds per image, sd 3.87) (p ≤ 0.017). There is a trend of increasing sensitivity with decreasing temperature, with 18, 21 and 23°C resulting in sensitivity values of 0.52, 0.42 and 0.37 respectively, and with 18°C (0.52, sd 0.21) resulting in significantly higher sensitivity than 23°C (0.37, sd 0.14) (p ≤ 0.030).
Improving mouse pointing for radiology tasks
Yan Tan, Geoffrey Tien, Bruce Forster, et al.
Radiologists make their main analysis and diagnosis based on careful observation of medical images, although many automatic methods are under development. Radiologists typically use a scroll mouse to click on an image when they find something interesting, and they also use the mouse to navigate through the image slices in volumetric scans. Thus they perform many thousands of mouse clicks every day, causing wrist fatigue. This paper presents a method of improving mouse pointing performance by reducing the time taken to move the mouse to a target. We use a dynamic Control-to-Display (C-D) ratio for the mouse, adjusting the C-D ratio according to the current distance to the target. In theory this reduces the difficulty of the target selection, and also reduces the movement time. The results of a preliminary study demonstrate that the speed of pointing can be improved under certain conditions, particularly for small targets and for long distances to move. In addition, all participants reported that this mouse speed change reduces the difficulty of selecting a small target.
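The abstract does not give the mapping from target distance to C-D ratio. The sketch below is one hypothetical mapping, expressed as pointer gain (the inverse of the C-D ratio): high gain far from the target for fast travel, low gain near it for precise selection. It also includes the Shannon form of Fitts's index of difficulty, on the assumption that Fitts's law is the theory the abstract alludes to. All names and default values are illustrative.

```python
import math

def pointer_gain(distance_px, near_gain=0.5, far_gain=3.0,
                 near_px=20.0, far_px=400.0):
    """Hypothetical distance-dependent pointer gain (display motion per unit
    of mouse motion), linearly interpolated between two distances."""
    if distance_px <= near_px:
        return near_gain
    if distance_px >= far_px:
        return far_gain
    t = (distance_px - near_px) / (far_px - near_px)
    return near_gain + t * (far_gain - near_gain)

def fitts_index_of_difficulty(distance, width):
    """Shannon formulation of Fitts's index of difficulty, in bits."""
    return math.log2(distance / width + 1.0)
```

Lowering the gain near the target enlarges the target in motor space, which lowers the effective index of difficulty for small targets.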
Quality control evaluation of 37 liquid crystal displays used in diagnostic services
J. J. Morant, M. Chevalier, E. Guibelalde, et al.
The performance of 37 primary class liquid crystal display devices (2, 3 and 5 Mpixel matrix size) used in 9 different diagnostic services in Spain has been determined in terms of 13 quantitative and visual evaluations. The equipment had never been subjected to calibration or to QC tests since commissioning by vendors, between 2 and 18 months before measurements. Tests, using calibrated luminance meters and TG18 patterns, have evaluated ambient light conditions and other basic performance indicators, namely, display geometric distortion, artefacts, resolution and low-contrast visibility, contrast luminance response compliance with the DICOM standard, luminance extreme values and uniformity between pairs of monitors associated with the same workstation. The principal sources of non-compliance are failures to visualize low-contrast test objects (73% of displays), excessive differences with the DICOM contrast response standard (57%), and non-uniform response of monitor pairs (54%). Also, 43% of the LCDs were located in places with excessive illumination and presented specular reflections from faceplates. The analysis of ten 5 Mpixel displays, of possible use in mammography services, indicates performance similar to that of the other monitors, except for the ambient luminance (100% complying with recommendations) and larger non-compliance with the DICOM response standard (80%). No correlation between image quality indicators and monitor hours of operation was found.
Content-based versus semantic-based retrieval: an LIDC case study
Content based image retrieval is an active area of medical imaging research. One use of content based image retrieval (CBIR) is presentation of known, reference images similar to an unknown case. These comparison images may reduce the radiologist's uncertainty in interpreting that case. It is, therefore, important to present radiologists with systems whose computed-similarity results correspond to human perceived-similarity. In our previous work, we developed an open-source CBIR system that inputs a computed tomography (CT) image of a lung nodule as a query and retrieves similar lung nodule images based on content-based image features. In this paper, we extend our previous work by studying the relationships between the two types of retrieval, content-based and semantic-based, with the final goal of integrating them into a system that will take advantage of both retrieval approaches. Our preliminary results on the Lung Image Database Consortium (LIDC) dataset using four types of image features, seven radiologists' rated semantic characteristics and two simple similarity measures show that a substantial number of nodules identified as similar based on image features are also identified as similar based on semantic characteristics. Furthermore, by integrating the two types of features, the similarity retrieval improves with respect to certain nodule characteristics.
Reverse hierarchy theory and medical image perception
We are unsure about what information is extracted from an image to allow a decision about pathology to be made. Our knowledge of the interplay between top-down and bottom-up processing, local and global perception, and perceptual and cognitive processes is uncertain. However, recent research has emphasised the importance of the global or holistic look in medical image perception, in which recognition of abnormalities precedes search. Reverse Hierarchy Theory [1] is a useful general theory that helps to explain this. It also enables us to understand what information is extracted from an image and how this relates to expertise. Essentially the theory states that perceptual learning begins at high-level areas and progresses down to lower-level areas when better signal to noise is needed. So perceptual learning, defined as an improvement in sensory abilities after training, stems from a gradual top-down guided increase in usability of first high-level and then lower-level task-relevant information. Evaluation of the scan paths of groups of observers with different levels of expertise when undertaking a lung nodule perception task seems to be consistent with the theory. Experts' perception is generally immediate and holistic, suggesting high-level representations, whereas those with an intermediate level of expertise tend to be more variable in their scan paths. Interestingly, naïve observers have eye tracking metrics that are more similar to those of experts, suggesting they take a common sense approach using perceptual skills we all have, as they lack the experience needed to access the low-level information from the chest radiograph.
Selective evaluation of noise, blur, and aliasing artifacts in fast MRI reconstructions using a weighted perceptual difference model: Case-PDM
The perceptual difference model (Case-PDM) is being used to quantify image quality of fast, parallel MR acquisitions and reconstruction algorithms as compared to slower, full k-space, high quality reference images. To date, most perceptual difference models average image quality over a wide range of image degradations. Here, we create metrics weighted to different types of artifacts. The selective PDM is tuned using test images from an input reference image degraded by noise, blur, or aliasing. Using an objective function based on the computation of diffusivity and edges applied to the output perceptual difference map, cortex channels in the PDM are arranged in a matrix and weighted by a 2D Gaussian function to ensure maximal response to each artifact in turn. PDM scores were compared to human ratings across a large set of MR reconstruction test images of varying quality. Human ratings (i.e. overall, noise, blur, and aliasing ratings) were obtained from a modified Double Stimulus Continuous Quality Scale experiment. For 3 different image types (brain, cardiac, and phantom), averaged r values [PDM, noise-PDM, blur-PDM, aliasing-PDM] were [0.933±0.018, 0.938±0.015, 0.727±0.106, 0.500±0.193], which, with the possible exception of aliasing, compared favorably with the inter-subject correlations [0.936±0.028, 0.856±0.064, 0.539±0.230, 0.767±0.125], respectively. With continued fine tuning, we believe that the weighted Case-PDM score will be useful for selectively evaluating artifacts in fast MR imaging.
Impact of visual fatigue on observer performance
Our overall hypothesis is that current radiology practice produces oculomotor fatigue, reducing diagnostic accuracy. The goal of this study is to determine whether accommodative stability and diagnostic accuracy are reduced following digital radiology interpretation. We are collecting data at two points in time: once in the morning prior to diagnostic reading and once in the afternoon after reading. Subjects are completing surveys about their current physical status and the number of hours spent reading that day, along with the type of images read. We are measuring accommodation using the WAM-5500 Auto Refkeratometer. Subjects view bone images with subtle fractures and dislocations to determine if a fracture is present, locate it, and provide a rating of their decision confidence to be used in a ROC analysis of the data. Preliminary results confirm our previous findings that we can measure visual fatigue. Radiologists are less able to focus on a distinct point, especially at near distances, after a day of reading images on digital displays as opposed to before any reading takes place. The SOFI and SSQ measures also indicate that radiologists are more fatigued at the end of a day's reading as compared to before. The confidence ratings are being evaluated using ROC techniques. The results so far suggest a reduction in diagnostic accuracy with tired eyes. Preliminary data from measuring visual accommodation and observer performance support our hypothesis that radiologists suffer visual fatigue after a day reading diagnostic images from digital displays, reducing interpretation accuracy.
Effectiveness of computer aided detection for solitary pulmonary nodules
Jiayong Yan, Wenjie Li, Xiangying Du, et al.
This study investigates the incremental effect of using a high performance computer-aided detection (CAD) system in the detection of solitary pulmonary nodules in chest radiographs. The Kodak Chest CAD system was evaluated by a panel of six radiologists at different levels of experience. The observer study consisted of two independent phases: readings without CAD and readings with the assistance of CAD. The study was conducted over a set of chest radiographs comprising 150 cancer cases and 150 cancer-free cases. The actual sensitivity of the CAD system is 72% with 3.7 false positives per case. Receiver operating characteristic (ROC) analysis was used to assess the overall observer performance. The AUZ (area under the ROC curve) showed a significant improvement (P=0.0001) from 0.844 to 0.884 after using CAD. The ROC analysis was also applied to observer performance on nodules of different sizes and visibilities. The average AUZs improved from 0.798 to 0.835 (P=0.0003) for 5-10 mm nodules, 0.853 to 0.907 (P=0.001) for 10-15 mm nodules, 0.864 to 0.897 (P=0.051) for 15-20 mm nodules and 0.859 to 0.896 (P=0.0342) for 20-30 mm nodules, respectively. For different visibilities, the average AUZs improved from 0.886 to 0.915 (P=0.0337), 0.803 to 0.840 (P=0.063), 0.830 to 0.893 (P=0.0001), and 0.813 to 0.847 (P=0.152), for nodules that are clearly visible, hidden by ribs, partially overlapping with ribs, and overlapping with other structures, respectively. These results showed that observer performance could be greatly improved when the CAD system is employed as a second reader, especially for small nodules and nodules occluded by ribs.
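The area under the ROC curve reported in this abstract can be illustrated with a nonparametric estimate. The abstract does not say how the authors computed it (a parametric binormal fit is common in observer studies), so the Mann-Whitney form below is a swapped-in, simpler technique, and the function name is hypothetical.

```python
def auc_mann_whitney(positive_scores, negative_scores):
    """Nonparametric AUC: the probability that a randomly chosen positive
    case receives a higher confidence rating than a randomly chosen
    negative case, counting ties as one half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

For example, ratings of `[3, 4, 5]` for cancer cases against `[1, 2, 3]` for cancer-free cases win 8.5 of the 9 pairwise comparisons.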
Lung nodule detection in chest radiography: image components analysis
Tao Luo, Xuanqin Mou, Ying Yang, et al.
We aimed to evaluate the effect of different components of chest images on the performance of both human observers and the channelized Fisher-Hotelling (CFH) model in a nodule detection task. Irrelevant and relevant components were separated from clinical chest radiographs by employing Principal Component Analysis (PCA) methods. Human observer performance was evaluated in a two-alternative forced-choice (2AFC) experiment on original clinical images and on anatomical-structure-only images obtained by PCA methods. A channelized Fisher-Hotelling model with Laguerre-Gauss basis functions was evaluated to predict human performance. We show that the relevant component is the primary factor influencing nodule detection in chest radiography. There is an obvious difference in detectability between human observers and the CFH model for nodule detection in images containing only anatomical structure. The CFH model should therefore be used with care.
A study on the effect of CT imaging acquisition parameters on lung nodule image interpretation
Shirley J. Yu, Joseph S. Wantroba, Daniela S. Raicu, et al.
Most Computer-Aided Diagnosis (CAD) research studies are performed using a single type of Computed Tomography (CT) scanner and therefore do not take into account the effect of differences in the imaging acquisition scanner parameters. In this paper, we present a study on the effect of the CT parameters on the low-level image features automatically extracted from CT images for lung nodule interpretation. The study is an extension of our previous study where we showed that image features can be used to predict semantic characteristics of lung nodules such as margin, lobulation, spiculation, and texture. Using the Lung Image Database Consortium (LIDC) dataset, we propose to integrate the imaging acquisition parameters with the low-level image features to generate classification models for the nodules' semantic characteristics. Our preliminary results identify seven CT parameters (convolution kernel, reconstruction diameter, exposure, nodule location along the z-axis, distance source to patient, slice thickness, and kVp) as influential in producing classification rules for the LIDC semantic characteristics. Further post-processing analysis, which included running box plots and binning of values, identified four CT parameters: distance source to patient, kVp, nodule location, and rescale intercept. The identification of these parameters will make it possible to normalize the image features across different scanners and, in the long run, to generate automatic rules for lung nodule interpretation independently of the CT scanner type.
Scanning model observers to predict human performance in LROC studies of SPECT reconstruction using anatomical priors
We use scanning model observers to predict human performance in a lesion search/detection study. The observer's task is to locate gallium-avid tumors in simulated SPECT images of a digital phantom. The goal of our model is to predict the optimal prior strength β for human observers of smoothing priors incorporated into the reconstruction algorithm. These priors use varying amounts of anatomical knowledge. We present results from a scanning channelized non-prewhitening matched filter, and compare them with results from a human-observer study. Including a step to mimic the greyscale perceptual linearization used during the human-observer study improves the accuracy of the model. However, we find that for lesions close to an organ boundary even the improved model does not accurately predict human performance.
Singular value decomposition of pinhole SPECT systems
A single photon emission computed tomography (SPECT) imaging system can be modeled by a linear operator H that maps from object space to detector pixels in image space. The singular vectors and singular-value spectra of H provide useful tools for assessing system performance. The number of voxels used to discretize object space and the number of collection angles and pixels used to measure image space make the dimensions of the matrix H large. As a result, H must be stored sparsely, which renders several conventional singular value decomposition (SVD) methods impractical. We used an iterative power-method SVD algorithm (Lanczos) designed to operate on very large sparsely stored matrices to calculate the singular vectors and singular-value spectra for two small-animal pinhole SPECT imaging systems: FastSPECT II and M3R. The FastSPECT II system consisted of two rings of eight scintillation cameras each. The resulting dimensions of H were 68921 voxels by 97344 detector pixels. The M3R system is a four-camera system that was reconfigured to measure image space using a single scintillation camera. The resulting dimensions of H were 50864 voxels by 6241 detector pixels. In this paper we present results of the SVD of each system and discuss calculation of the measurement and null space for each system.
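The reason iterative methods suit a sparsely stored H is that they only need matrix-vector products with H and its transpose. The sketch below illustrates this with plain power iteration on HᵀH to find the largest singular value; it is a simpler relative of the Lanczos algorithm the authors used, not their implementation, and the coordinate-list storage and function names are assumptions.

```python
import math

def matvec(entries, x, n_rows):
    """y = H x for H stored as a list of (row, col, value) triples."""
    y = [0.0] * n_rows
    for r, c, v in entries:
        y[r] += v * x[c]
    return y

def rmatvec(entries, y, n_cols):
    """x = H^T y for the same sparse storage."""
    x = [0.0] * n_cols
    for r, c, v in entries:
        x[c] += v * y[r]
    return x

def top_singular_value(entries, n_rows, n_cols, iters=200):
    """Largest singular value of H by power iteration on H^T H."""
    x = [1.0 / math.sqrt(n_cols)] * n_cols  # unit-norm starting vector
    sigma = 0.0
    for _ in range(iters):
        y = matvec(entries, x, n_rows)
        sigma = math.sqrt(sum(v * v for v in y))  # ||H x|| with ||x|| = 1
        if sigma == 0.0:
            return 0.0
        z = rmatvec(entries, y, n_cols)
        norm = math.sqrt(sum(v * v for v in z))
        x = [v / norm for v in z]
    return sigma
```

Only the nonzero entries are ever touched, so memory and time per iteration scale with the number of nonzeros rather than with the full matrix dimensions. Lanczos-type methods build on the same products but recover many singular values at once.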
Is Grannum grading of the placenta reproducible?
Current ultrasound assessment of placental calcification relies on Grannum grading. The aim of this study was to assess if this method is reproducible by measuring inter- and intra-observer variation in grading placental images, under strictly controlled viewing conditions. Thirty placental images were acquired and digitally saved. Five experienced sonographers independently graded the images on two separate occasions. In order to eliminate any technological factors which could affect data reliability and consistency, all observers reviewed images at the same time. To optimise viewing conditions, ambient lighting was maintained between 25 and 40 lux, with monitors calibrated to the GSDF standard to ensure consistent brightness and contrast. Kappa (κ) analysis of the grades assigned was used to measure inter- and intra-observer reliability. Intra-observer agreement had a moderate mean κ-value of 0.55, with individual comparisons ranging from 0.30 to 0.86. Two images saved from the same patient, during the same scan, were each graded as I, II and III by the same observer. A mean κ-value of 0.30 (range from 0.13 to 0.55) indicated fair inter-observer agreement over the two occasions, and only one image was graded consistently the same by all five observers. The study findings confirmed the lack of reproducibility associated with Grannum grading of the placenta despite optimal viewing conditions and highlight the need for new methods of assessing placental health in order to improve neonatal outcomes. Alternative methods for quantifying placental calcification, such as a software-based technique and 3D ultrasound assessment, need to be explored.
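The agreement statistic used in this abstract can be sketched as follows. The abstract does not state whether unweighted or weighted kappa was used; the example assumes unweighted Cohen's kappa for two sets of grades, and the function name is hypothetical.

```python
from collections import Counter

def cohen_kappa(grades_a, grades_b):
    """Unweighted Cohen's kappa for two equal-length lists of grades.

    Kappa compares observed agreement with the agreement expected by
    chance from each rater's marginal grade frequencies. It is undefined
    (division by zero) if chance agreement is exactly 1.
    """
    n = len(grades_a)
    observed = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    count_a = Counter(grades_a)
    count_b = Counter(grades_b)
    expected = sum(count_a[g] * count_b[g] for g in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which is the scale on which the reported means of 0.55 and 0.30 are interpreted.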