Proceedings Volume 7865

Human Vision and Electronic Imaging XVI


Volume Details

Date Published: 1 February 2011
Contents: 11 Sessions, 43 Papers, 0 Presentations
Conference: IS&T/SPIE Electronic Imaging 2011
Volume Number: 7865

Table of Contents

  • Front Matter: Volume 7865
  • Keynote Session
  • Design, Composition, and Illumination
  • Extracting, Integrating, and Analyzing Features
  • Perceptual and Cognitive Challenges in the Visualization and Visual Analysis of Bioinformatics Data
  • Evaluating the Quality of the Stereoscopic Experience II: Joint Session with Conference 7863
  • Perceptual Approaches to Video Quality
  • Visual Attention, Saliency, and Quality I: Joint Session with Conference 7867
  • Visual Attention, Saliency, and Quality II: Joint Session with Conference 7867
  • Attention and Gaze in Constructing the Visual World
  • Interactive Paper Session
Front Matter: Volume 7865
This PDF file contains the front matter associated with SPIE Proceedings Volume 7865, including the Title Page, Copyright information, Table of Contents, and the Conference Committee listing.
Keynote Session
How 3D immersive visualization is changing medical diagnostics
Anton H. J. Koning
Originally, the only way to look inside the human body without opening it up was by means of two-dimensional (2D) images obtained using X-ray equipment. The fact that human anatomy is inherently three dimensional leads to ambiguities in interpretation and problems of occlusion. Three-dimensional (3D) imaging modalities such as CT, MRI and 3D ultrasound remove these drawbacks and are now part of routine medical care. While most hospitals 'have gone digital', meaning that the images are no longer printed on film, the images are still viewed on 2D screens. In this way, however, valuable depth information is lost, and some interactions become unnecessarily complex or even unfeasible. Using a virtual reality (VR) system to present volumetric data means that depth information is presented to the viewer and 3D interaction is made possible. At the Erasmus MC we have developed V-Scope, an immersive volume visualization system for visualizing a variety of (bio-)medical volumetric datasets, ranging from 3D ultrasound, via CT and MRI, to confocal microscopy, OPT and 3D electron-microscopy data. In this talk we address the advantages of such a system for both medical diagnostics and (bio)medical research.
Vision as a user interface
The egg-rolling behavior of the graylag goose is an often quoted example of a fixed-action pattern. The bird will even attempt to roll a brick back to its nest! Despite excellent visual acuity, it apparently takes a brick for an egg. Evolution optimizes utility, not veridicality. Yet textbooks take it for a fact that human vision evolved so as to approach veridical perception. How do humans manage to dodge the laws of evolution? I will show that they don't, but that human vision is an idiosyncratic user interface. By way of an example I consider the case of pictorial perception. Gleaning information from still images is an important human ability and is likely to remain so for the foreseeable future. I will discuss a number of instances of extreme non-veridicality and huge inter-observer variability. Despite their importance in applications (information dissemination, personnel selection, ...), such huge effects have remained undocumented in the literature, although they can be traced to artistic conventions. The reason appears to be that conventional psychophysics, by design, fails to address the qualitative, that is the meaningful, aspects of visual awareness, whereas this is the very target of the visual arts.
Design, Composition, and Illumination
What makes good image composition?
Ron Banner
Some people are born with an intuitive sense of good composition. They do not need to be taught composition, and their work is immediately perceived as good by other people. In an attempt to help others learn composition, art critics, scientists and psychologists have analyzed well-composed works in the hope of recognizing patterns and trends that anyone could employ to achieve similar results. Unfortunately, the identified patterns are by no means universal. Moreover, since a compositional rule is useful only as long as it enhances the idea that the artist is trying to express, there is no objective standard to judge whether a given composition is "good" or "bad". As a result, the study of composition seems to be full of contradictions. Nevertheless, there are several basic "low level" rules, supported by physiological studies of visual perception, that artists and photographers intuitively obey. Regardless of image content, a prerequisite for all good images is that their composition be balanced. In a balanced composition, factors such as shape, direction, location and color are determined in a way that is pleasant to the eye. An unbalanced composition looks accidental and transitory, and its elements show a tendency to change place or shape in order to reach a state that better reflects the total structure. Under these conditions, the artistic statement becomes incomprehensible and confusing.
A comparison of perceived lighting characteristics in simulations versus real-life setup
In the design of professional luminaires, improving visibility has always been a core target. Recently, it has become clearer that especially for consumer lighting, generating an appropriate atmosphere and pleasant feeling is of almost equal importance. In recent studies it has been shown that the perception of an atmosphere can be described by four variables: cosiness, liveliness, tenseness, and detachment. In this paper we compare the perception of these lighting characteristics when viewed in reality with the perception when viewing a simulated picture. Replacing reality by a picture on a computer screen, such as an LCD monitor, or a piece of paper introduces several differences. These include a reduced dynamic range, reduced maximum brightness, and quantization noise in the brightness levels, but also a different viewing angle and a different adaptation of the human visual system. Research has been done before comparing simulations with photographs, and simulations with reality. These studies have focused on 'physical variables', such as brightness and sharpness, but also on naturalness and realism. We focus on the accuracy of a simulation for predicting the actual goal of many luminaires: atmosphere creation. We investigate the correlation between perceptual characteristics of the atmosphere of a real-world scene and a simulated image of it. The results show that for all four tested atmosphere words, similar main effects and similar trends (over color temperature, fixtures, and intensities) can be found in both the real-life experiments and the simulation experiments. This implies that it is possible to use simulations on a screen or printout for the evaluation of atmosphere characteristics.
Investigating two features of aesthetic perception in consumer photographic images: clutter and center
A key aspect of image effectiveness is how well the image visually communicates the main subject. In consumer images, two important features that impact viewer appreciation of the main subject are the amount of clutter and the main subject placement within the image. Two subjective experiments were conducted to assess the relationship between aesthetic and technical quality and perception of clutter and image center. For each experiment, 30 participants evaluated the same 70 images on 0-to-100-point scales for aesthetic and technical quality. For the clutter experiment, participants also evaluated the images on 0-to-100-point scales for amount of clutter and main subject emphasis. For the center experiment, participants pointed directly onto the image to mark the center of interest. Results indicate that aesthetic quality, technical quality, amount of clutter, and main subject emphasis are strongly correlated. Based on 95% confidence ellipses and mean-shift clustering, expert main subject maps are consistent with observer identification of main subject location. Further, the distribution of the observer identification of the center of interest is related to the object class (e.g., person, scenery). Additional features related to image composition can be used to explain clusters formed by patterns of mean ratings.
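As a rough illustration of the clustering step just mentioned, the sketch below groups hypothetical center-of-interest marks with mean-shift; the coordinates and bandwidth are invented, not values from the study.

```python
# Sketch: cluster observers' center-of-interest marks with mean-shift,
# the clustering method named in the abstract. All values are invented.
import numpy as np
from sklearn.cluster import MeanShift

clicks = np.array([[312, 240], [305, 251], [298, 244],   # marks near the main subject
                   [518, 102], [522, 95]])               # marks on a secondary object
ms = MeanShift(bandwidth=40).fit(clicks)
print("cluster centers:", ms.cluster_centers_)           # candidate centers of interest
print("cluster sizes:", np.bincount(ms.labels_))
```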
Extracting, Integrating, and Analyzing Features
Analyzing near-infrared images for utility assessment
Neda Salamati, Zahra Sadeghipoor, Sabine Süsstrunk
Visual cognition is of significant importance in certain imaging applications, such as security and surveillance. In these applications, an important issue is to determine the cognition threshold, which is the maximum distortion level that can be applied to the images while still ensuring that enough information is conveyed to recognize the scene. The cognition task is usually studied with images that represent the scene in the visible part of the spectrum. In this paper, our goal is to evaluate the usefulness of another scene representation. To this end, we study the performance of near-infrared (NIR) images in cognition. Since surface reflections in the NIR part of the spectrum are material dependent, an object made of a specific material is more likely to have a uniform response in NIR images. Consequently, edges in NIR images are likely to correspond to the physical boundaries of objects, which are considered to be the most useful information for cognition. This feature of NIR images leads to the hypothesis that NIR is a better scene representation than the visible spectrum for cognition tasks. To test this hypothesis, we compared the cognition thresholds of NIR and visible images in a subjective study on 11 scenes. The images were compressed with different compression factors using JPEG2000 compression. The results of this subjective test show that recognizing 8 of the 11 scenes is significantly easier based on the NIR images when compared to their visible counterparts.
Appearance-based human gesture recognition using multimodal features for human computer interaction
Dan Luo, Hua Gao, Hazim Kemal Ekenel, et al.
The use of gesture as a natural interface plays a vital role in achieving intelligent Human Computer Interaction (HCI). Human gestures convey meaning through different components of visual action, such as hand motion, facial expression, and torso movement. So far, most work in the field of gesture recognition has focused on the manual component of gestures. In this paper, we present an appearance-based multimodal gesture recognition framework that combines different groups of features, such as facial expression features and hand motion features, extracted from image frames captured by a single web camera. We consider 12 classes of human gestures with facial expressions conveying neutral, negative and positive meanings, drawn from American Sign Language (ASL). We combine the features at two levels by employing two fusion strategies. At the feature level, an early feature combination is performed by concatenating and weighting different feature groups, and LDA is used to choose the most discriminative elements by projecting the features onto a discriminative expression space. The second strategy is applied at the decision level, where weighted decisions from the single modalities are fused in a later stage. A condensation-based algorithm is adopted for classification. We collected a data set with three to seven recording sessions and conducted experiments with both combination techniques. Experimental results showed that facial analysis improves hand gesture recognition, and that decision-level fusion performs better than feature-level fusion.
Adaptive user interfaces for relating high-level concepts to low-level photographic parameters
Edward Scott, Pubudu Madhawa Silva, Bryan Pardo, et al.
Common controls for photographic editing can be difficult to use and have a significant learning curve. Often, a user does not know a direct mapping from a high-level concept (such as "soft") to the available parameters or controls. In addition, many concepts are subjective in nature, and the appropriate mapping may vary from user to user. To overcome these problems, we propose a system that can quickly learn a mapping from a high-level subjective concept onto low-level image controls using machine learning techniques. To learn such a concept, the system shows the user a series of training images that are generated by modifying a seed image along different dimensions (e.g., color, sharpness), and collects the user ratings of how well each training image matches the concept. Since it is known precisely how each modified example differs from the original, the system can determine the correlation between the user ratings and the image parameters to generate a controller tailored to the concept for the given user. The end result, a personalized image controller, is applicable to a variety of concepts. We have demonstrated the utility of this approach to relate low-level parameters, such as color balance and sharpness, to simple concepts, such as "lightness" and "crispness," as well as more complex and subjective concepts, such as "pleasantness." We have also applied the proposed approach to relate subband statistics (variance) to perceived roughness of visual textures (from the CUReT database).
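A minimal sketch of the kind of learned mapping the abstract describes, assuming a simple ridge regression from known parameter offsets to user ratings; the control names, data, and regularizer are illustrative, not the authors' implementation.

```python
# Sketch: learn a user-specific direction in control space for a concept
# such as "crispness" from ratings of systematically modified images.
import numpy as np

# Rows: (color_balance, sharpness, contrast) offsets applied to the seed
# image; y: the user's 0-10 rating of how well each matches the concept.
X = np.array([[ 0.0,  0.5, 0.0],
              [ 0.0, -0.5, 0.0],
              [ 0.3,  0.0, 0.0],
              [-0.3,  0.0, 0.2]])
y = np.array([8.0, 2.0, 5.0, 4.0])

lam = 0.1  # ridge penalty keeps the fit stable with few ratings
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# w acts as the personalized controller: moving the low-level controls
# along w should strengthen the concept for this user.
print("concept direction:", w.round(3))
```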
Parametric quality assessment of synthesized textures
Darshan Siddalinga Swamy, Kellen J. Butler, Damon M. Chandler, et al.
In this paper, we present the results of a study designed to investigate the visual factors that contribute to the perceived quality of synthesized textures. A psychophysical experiment was performed in which subjects rated the quality of textures synthesized by a variety of modern texture-synthesis algorithms. The ratings were given in terms of how well each synthesized texture represented a sample of the same material from which the original texture was obtained. The results revealed that the most detrimental artifact was a lack of structural detail. Other pronounced artifacts included: (1) misalignment of the texture patterns; (2) blurring introduced in the texture patterns; and (3) repetition of the same patch (tiling). Based on these results, we present an analysis of the efficacy of various measurable parameters at predicting the ratings. We show that a linear combination of the parameters from a parametric texture-synthesis algorithm predicts the ratings better than traditional quality-assessment algorithms.
On the perception of band-limited phase distortion in natural scenes
It is widely believed that the phase spectrum of an image contributes much more to the image's visual appearance than the magnitude spectrum. Several researchers have also shown that this phase information can be computed indirectly from local magnitude information, a theory which is consistent with the physiological evidence that complex cells respond to local magnitude (and are insensitive to local phase). Recent studies have shown that tasks such as image recognition and categorization can be performed using only local magnitude information. These findings suggest that the human visual system (HVS) uses local magnitude to infer global phase (the image-wide phase spectrum) and thereby determine the image's appearance. However, from a signal-processing perspective, both local magnitude and local phase are related to global phase. Moreover, in terms of image quality, distorting the local phase can result in a severely degraded image. These latter facts suggest that the HVS uses both local magnitude and local phase to determine an image's appearance. We conducted an experiment to quantify the contributions of local magnitude and local phase toward image appearance as a function of spatial frequency. Hybrid images were created via a complex wavelet transform in which the low-frequency magnitude, low-frequency phase, high-frequency magnitude, and high-frequency phase were taken from 2-4 different images. Subjects were then asked to rate how much each of the 2-4 images contributed to the appearance of the hybrid image. We found that local magnitude is indeed an important factor for image appearance; however, local phase can play an equally important role, and in some cases local phase can dominate the image's appearance. We discuss the implications of these results in terms of image quality and visual coding.
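The band-wise recombination can be sketched as follows. The paper uses a complex wavelet transform; for brevity this illustration uses a global FFT with a single radial cutoff, and all parameter values are illustrative.

```python
# Sketch: hybrid image taking low-frequency magnitude, low-frequency phase,
# high-frequency magnitude, and high-frequency phase from (up to) four
# different source images.
import numpy as np

def hybrid(lo_mag, lo_ph, hi_mag, hi_ph, cutoff=0.08):
    specs = [np.fft.fft2(im) for im in (lo_mag, lo_ph, hi_mag, hi_ph)]
    h, w = lo_mag.shape
    r = np.hypot(np.fft.fftfreq(h)[:, None], np.fft.fftfreq(w)[None, :])
    low = r < cutoff                                  # radial low-pass mask
    mag = np.where(low, np.abs(specs[0]), np.abs(specs[2]))
    ph = np.where(low, np.angle(specs[1]), np.angle(specs[3]))
    return np.real(np.fft.ifft2(mag * np.exp(1j * ph)))

rng = np.random.default_rng(0)
imgs = [rng.random((64, 64)) for _ in range(4)]       # stand-in source images
out = hybrid(*imgs)
```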
Perceptual and Cognitive Challenges in the Visualization and Visual Analysis of Bioinformatics Data
Applying information visualization principles to biological network displays
We use the principles of information visualization to guide the design of systems that best meet the needs of a specific target group of users, namely biologists whose tasks involve the visual exploration of biological networks. For many biologists who explore networks of interacting proteins and genes, the topological structure of these node-link graphs is only one part of the story. The Cerebral system supports graph layout in a style inspired by hand-drawn pathway diagrams, where the location of a protein within the cell constrains its location within the drawing, and functional groups of proteins are visually apparent as clusters. It also supports exploration of expression data using linked views, to show multiple attributes at each node in the graph. The Pathline system attacks the problem of visually encoding the biologically interesting relationships between multiple pathways, multiple genes, and multiple species. We propose new methods based on the principle that spatial position is the most accurately perceived visual channel for all data types. The curvemap view is an alternative to heatmaps, and linearized pathways support comparison of quantitative data as a primary task while showing topological information at a secondary level.
Hypergraph visualization and enrichment statistics: how the EGAN paradigm facilitates organic discovery from big data
Jesse Paquette, Taku Tokuyasu
The EGAN software is a functional implementation of a simple yet powerful paradigm for exploration of large empirical data sets downstream from computational analysis. By focusing on systems-level analysis via enrichment statistics, EGAN enables a human domain expert to transform high-throughput analysis results into hypergraph visualizations: concept maps that leverage the expert's semantic understanding of metadata and relationships to produce insight.
Perceptual issues in the recovery and visualisation of integrated systems biology data
Tony Pridmore
The systems approach to biological research emphasises understanding of complete biological systems, rather than a reductionist focus on tightly defined component parts. Systems biology is naturally interdisciplinary; research groups active in this area typically contain experimental and theoretical biologists, mathematicians, statisticians, computer scientists and engineers. A wide range of tools are used to generate a variety of data types which must be integrated, presented to and analysed by researchers from any and all of the contributing disciplines. The goal here is to create predictive models of the system of interest; the models produced must also be analysed, and in the context of the data from which they were generated. Effective, integrated data and model visualisation methods are crucial if scientifically appropriate judgments are to be made. The Nottingham Centre for Plant Integrative Biology (CPIB) takes a systems approach to the study of the root of the model plant Arabidopsis thaliana. A rich mixture of data types, many extracted via automatic analysis of individual and time-ordered sequences of standard CCD and confocal laser microscope images, is used to create models of different aspects of the growth of the Arabidopsis root. This talk briefly reviews the data sets and flow of information within CPIB, and discusses issues raised by the need to interpret images of the Arabidopsis root and integrate and present the resulting data and models to an interdisciplinary audience.
Using cellular network diagrams to interpret large-scale datasets: past progress and future challenges
Peter D. Karp, Mario Latendresse, Suzanne Paley
Cellular networks are graphs of molecular interactions within the cell. Thanks to the confluence of genome sequencing and bioinformatics, scientists are now able to reconstruct cellular network models for more than 1,000 organisms. A variety of bioinformatics tools have been developed to support the visualization and navigation of cellular network data. Another important application is the use of cellular network diagrams to visualize and interpret large-scale datasets, such as gene-expression data. We present the Cellular Overview, a network visualization tool developed at SRI International (SRI) to support visualization, navigation, and interpretation of large-scale datasets on metabolic networks. Different variations of the diagram have been generated algorithmically for more than 1,000 organisms. We discuss the graphical design of the diagram and its interactive capabilities.
Visualizing large high-throughput datasets based on the cognitive representation of biological pathways
Axel Nagel, Marc Lohse, Anthony Bolger, et al.
The data explosion in the biological sciences has led to many novel challenges for the individual researcher, one of which is interpreting the sheer mass of data at hand. Typical high-throughput transcriptomic data sets can easily comprise hundreds of thousands of data points. It is thus necessary to provide tools to interactively visualize these data sets in a way that aids their interpretation. We have therefore developed the MAPMAN application. This application renders individual data points from different domains as different glyphs that are color coded to reflect underlying changes in the magnitude/abundance of the underlying data. To aid comprehension by biologist domain experts, the data are organized on meaningful pathway diagrams that the biologist has encountered numerous times. Using these representations together with a high-level organization thus helps one to quickly grasp the main outcome of such a high-throughput study and to decide on additional tasks that should be performed to explore the data.
Metadata Mapper: a web service for mapping data between independent visual analysis components, guided by perceptual rules
Bernice E. Rogowitz, Naim Matasci
The explosion of online scientific data from experiments, simulations, and observations has given rise to an avalanche of algorithmic, visualization and imaging methods. There has also been enormous growth in the introduction of tools that provide interactive interfaces for exploring these data dynamically. Most systems, however, do not support the real-time exploration of patterns and relationships across tools and do not provide guidance on which colors, colormaps or visual metaphors will be most effective. In this paper, we introduce a general architecture for sharing metadata between applications and a "Metadata Mapper" component that allows the analyst to decide how metadata from one component should be represented in another, guided by perceptual rules. This system is designed to support "brushing" [1], in which highlighting a region of interest in one application automatically highlights corresponding values in another, allowing the scientist to develop insights from multiple sources. Our work builds on the component-based iPlant Cyberinfrastructure [2] and provides a general approach to supporting interactive exploration across independent visualization and visual analysis components.
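A toy sketch of the brushing pattern described above: a shared bus relays a selection from any registered view to all the others. The class and method names are invented for illustration and are not the iPlant or Metadata Mapper API.

```python
# Sketch: linked "brushing" across independent visualization components.
class BrushBus:
    def __init__(self):
        self.views = []
    def register(self, view):
        self.views.append(view)
    def brush(self, source, ids):
        for v in self.views:
            if v is not source:
                v.highlight(ids)

class View:
    def __init__(self, name, bus):
        self.name, self.bus = name, bus
        bus.register(self)
    def select(self, ids):          # user drags over a region of interest
        self.bus.brush(self, ids)
    def highlight(self, ids):       # reaction in every linked component
        print(f"{self.name}: highlight {sorted(ids)}")

bus = BrushBus()
scatter, heatmap = View("scatterplot", bus), View("heatmap", bus)
scatter.select({3, 7, 11})          # -> heatmap re-highlights the same records
```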
Evaluating the Quality of the Stereoscopic Experience II: Joint Session with Conference 7863
Examination of 3D visual attention in stereoscopic video content
Quan Huynh-Thu, Luca Schiatti
Recent advances in video technology and digital cinema have made it possible to produce entertaining 3D stereoscopic content that can be viewed for an extended duration without necessarily causing extreme fatigue, visual strain and discomfort. Viewers naturally focus their attention on specific areas of interest in their visual field. Visual attention is an important aspect of perception, and understanding it is therefore important for the creation of 3D stereoscopic content. Most studies of visual attention have focused on still images or 2D video. Only very few studies have investigated eye movement patterns in 3D stereoscopic moving sequences, and how these may differ from viewing 2D video content. In this paper, we present and discuss the results of a subjective experiment in which we used an eye-tracking apparatus to record observers' gaze patterns. Participants were asked to watch the same set of video clips in a free-viewing task. Each clip was shown in both a 3D stereoscopic version and a 2D version. Our results indicate that the extent of areas of interest is not necessarily wider in 3D. We found a very strong content dependency in the difference of density and locations of fixations between 2D and 3D stereoscopic content. However, we found that saccades were overall faster and that fixation durations were overall shorter when observers viewed the 3D stereoscopic version.
Quantifying how the combination of blur and disparity affects the perceived depth
Junle Wang, Marcus Barkowsky, Vincent Ricordel, et al.
In this paper we study the influence of blur, a monocular depth cue, on the apparent depth of stereoscopic scenes. When 3D images are shown on a planar stereoscopic display, binocular disparity becomes a pre-eminent depth cue, but it simultaneously induces a conflict between accommodation and vergence, which is often considered a main cause of visual discomfort. If we limit this visual discomfort by decreasing the disparity, the apparent depth also decreases. We propose to decrease the (binocular) disparity of 3D presentations and to reinforce (monocular) cues to compensate for the loss of perceived depth and keep the apparent depth unaltered. We conducted a subjective experiment using a two-alternative forced choice task. Observers were required to identify the larger perceived depth in a pair of 3D images with/without blur. By fitting the results to a psychometric function, we obtained points of subjective equality in terms of disparity. We found that when blur is added to the background of the image, the viewer perceives larger depth compared to images without any blur in the background. The increase in perceived depth can be considered a function of the relative distance between the foreground and background, while it is insensitive to the distance between the viewer and the depth plane at which the blur is added.
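The analysis step can be sketched as follows: fit a cumulative-Gaussian psychometric function to the 2AFC choice proportions and read off the point of subjective equality. The disparity values and response proportions below are invented, not the experiment's data.

```python
# Sketch: estimate the point of subjective equality (PSE) from 2AFC data.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

disparity = np.array([10., 20., 30., 40., 50., 60.])        # arcmin, hypothetical
p_deeper = np.array([0.05, 0.20, 0.45, 0.70, 0.90, 0.97])   # proportion "deeper" responses

def psychometric(x, mu, sigma):
    return norm.cdf(x, loc=mu, scale=sigma)                  # cumulative Gaussian

(mu, sigma), _ = curve_fit(psychometric, disparity, p_deeper, p0=(35., 10.))
print(f"PSE = {mu:.1f} arcmin")   # disparity at which both stimuli appear equally deep
```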
Perceptual Approaches to Video Quality
Preferences for the balance between true image detail and noise
We designed a series of experiments to measure user preference for the noise-detail tradeoff, including tests of the assumption that all true image detail is preferred. We generated samples along the noise-detail tradeoff by designing a sequence of coring filters of increasing strength. A user study method was developed using a magnitude-estimation approach. In the first experiment, the coring filter sequence was applied to original video samples without any additional noise. We observed that the subjective quality score increases with coring strength, reaches a peak, and then decreases; thus users prefer slightly cored images to the originals. In the second experiment, the coring filter sequence was applied to video samples with additive noise of varying strength. We observed that the most preferred coring strength increases with the amount of noise in the image. The results from our experiments can be used to set parameters for various image/video post-processing and noise removal algorithms.
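The paper's actual filter sequence is not given in the abstract; the sketch below shows the generic soft-threshold form of a coring filter, in which increasing the strength removes progressively larger high-frequency amplitudes. All parameter values are illustrative.

```python
# Sketch: generic coring -- split off the high-pass band, soft-threshold
# small amplitudes (mostly noise), and recombine with the low-pass band.
import numpy as np
from scipy.ndimage import gaussian_filter

def core(frame, strength, sigma=2.0):
    low = gaussian_filter(frame, sigma)
    high = frame - low
    cored = np.sign(high) * np.maximum(np.abs(high) - strength, 0.0)
    return low + cored

noisy = np.random.default_rng(1).random((32, 32))
sequence = [core(noisy, s) for s in (0.0, 0.02, 0.05)]  # increasing coring strength
```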
Measurement of compression-induced temporal artifacts in subjective and objective video quality assessment
Claire Mantel, Patricia Ladret, Thomas Kunlin
Temporal pooling and temporal defects are the two differences between image and video quality assessment. Whereas temporal pooling has been the object of two recent studies, this paper focuses on the rarely addressed topic of compression-induced temporal artifacts, such as mosquito noise. To study temporal aspects in subjective quality assessment, we compared the perceived quality of two versions of a mosquito noise corrector: one purely spatial and the other spatio-temporal. We set up a paired-comparison experiment and chose videos whose compression mainly creates temporal artifacts. Results proved the existence of a purely temporal aspect in video quality perception. We investigate the correlation between subjective results from the experiment and three video metrics (VQM, MOVIE, VQEM), as well as two temporally-pooled image metrics (SSIM and PSNR). The SSIM and PSNR metrics find the corrected sequences of better quality than the compressed ones but do not distinguish spatial from spatio-temporal processing. Comparing those results with the VQM and MOVIE objective metrics shows that they do not account for this type of defect. A detailed study highlights that either they do not detect the defects or the response of their temporal component is masked by that of their spatial component.
Perceived contrast of electronically magnified video
Andrew M. Haun, Russell L. Woods, Eli Peli
It has been observed that electronic magnification of imagery results in a decrease in the apparent contrast of the magnified image relative to the original. The decrease in perceived contrast might be due to a combination of image blur and of sub-sampling the larger range of contrasts in the original image. In a series of experiments, we measured the effect on apparent contrast of magnification in two contexts: either the entire image was enlarged to fill a larger display area, or a portion of an image was enlarged to fill the same display area, both as a function of magnification power and of viewing distance (visibility of blur induced by magnification). We found a significant difference in the apparent contrast of magnified versus unmagnified video sequences. The effect on apparent contrast was found to increase with increasing magnification, and to decrease with increasing viewing distance (or with decreasing angular size). Across observers and conditions the reduction in perceived contrast was reliably in the range of 0.05 to 0.2 log units (89% to 63% of nominal contrast). These effects are generally consistent with expectations based on both the contrast statistics of natural images and the contrast sensitivity of the human visual system. It can be demonstrated that 1) local areas within larger images or videos will usually have lower physical contrast than the whole; and 2) visibility of 'missing content' (e.g. blur) in an image is interpreted as a decrease in contrast, and this visibility declines with viewing distance.
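The log-unit figures convert to linear contrast fractions directly:

```latex
\frac{C_{\text{perceived}}}{C_{\text{nominal}}} = 10^{-\Delta},
\qquad 10^{-0.05} \approx 0.89, \qquad 10^{-0.2} \approx 0.63 ,
```

so a reduction of 0.05 to 0.2 log units corresponds to the quoted 89% to 63% of nominal contrast.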
Estimating the impact of single and multiple freezes on video quality
S. van Kester, T. Xiao, R. E. Kooij, et al.
This paper studies the impact of video freezing on quality as experienced by users. Two types of freezes are investigated: first, a freeze where the image pauses and no frames are lost (frame halt); second, a freeze where the image pauses and then skips that part of the video (frame drop). Mean Opinion Scores (MOS) were measured in subjective tests. Video sequences of 20 seconds, covering four types of content, were displayed to a total of 23 test subjects. We conclude there is no difference in perceived quality between frame drops and frame halts, and therefore construct a single model for single freezes. According to this model, the acceptable freezing time (MOS > 3.5) is 0.36 seconds. Pastrana-Vidal et al. (2004) suggested a relationship between the probability of detection and the duration of the dropped frames. They also found that it is important to consider not only the duration of the freeze but also the number of freeze occurrences. Using their relationship between the total duration of the freeze and the number of occurrences, we propose a model for multiple freezes based upon our model for single freeze occurrences. A subjective test was designed to evaluate the performance of the multiple-freeze model, and good performance was found on these data, i.e., a correlation higher than 0.9.
The effects of scene characteristics, resolution, and compression on the ability to recognize objects in video
Joel Dumke, Carolyn G. Ford, Irena W. Stange
Public safety practitioners increasingly use video for object recognition tasks. These end users need guidance regarding how to identify the level of video quality necessary for their application. The quality of video used in public safety applications must be evaluated in terms of its usability for specific tasks performed by the end user. The Public Safety Communication Research (PSCR) project performed a subjective test as one of the first in a series to explore visual intelligibility in video: a user's ability to recognize an object in a video stream given various conditions. The test sought to measure the effects on visual intelligibility of three scene parameters (target size, scene motion, scene lighting), several compression rates, and two resolutions (VGA (640x480) and CIF (352x288)). Seven similarly sized objects were used as targets in nine sets of near-identical source scenes, where each set was created using a different combination of the parameters under study. Viewers were asked to identify the objects via multiple-choice questions. Objective measurements were performed on each of the scenes, and the ability of the measurements to predict visual intelligibility was studied.
Supplemental subjective testing to evaluate the performance of image and video quality estimators
Frank M. Ciaramello, Amy R. Reibman
The subjective tests used to evaluate image and video quality estimators (QEs) are expensive and time consuming. More problematic, the majority of subjective testing is not designed to find systematic weaknesses in the evaluated QEs. As a result, a motivated attacker can take advantage of these systematic weaknesses to gain unfair monetary advantage. In this paper, we draw on some lessons of software testing to propose additional testing procedures that target a specific QE under test. These procedures supplement, but do not replace, the traditional subjective testing procedures that are currently used. The goal is to motivate the design of objective QEs which are better able to accurately characterize human quality assessment.
On evaluation of video quality metrics: an HDR dataset for computer graphics applications
In this paper we propose a new dataset for evaluation of image/video quality metrics with emphasis on applications in computer graphics. The proposed dataset includes LDR-LDR, HDR-HDR, and HDR-LDR reference-test video pairs with various types of distortions. We also present an example evaluation of recent image and video quality metrics that were applied in the field of computer graphics. In this evaluation all video sequences were shown on an HDR display, and subjects were asked to mark the regions where they saw differences between test and reference videos. As a result, we capture not only the magnitude of distortions, but also their spatial distribution. This has two advantages: on one hand the local quality information is valuable for computer graphics applications, on the other hand the subjectively obtained distortion maps are easily comparable to the maps predicted by quality metrics.
Visual Attention, Saliency, and Quality I: Joint Session with Conference 7867
Interactions of visual attention and quality perception
Judith Redi, Hantao Liu, Rodolfo Zunino, et al.
Several attempts to integrate visual saliency information in quality metrics are described in the literature, albeit with contradictory results. The way saliency is integrated in quality metrics should reflect the mechanisms underlying the interaction between image quality assessment and visual attention. This interaction is actually two-fold: (1) image distortions can attract attention away from the Natural Scene Saliency (NSS), and (2) the quality assessment task in itself can affect the way people look at an image. A subjective study was performed to analyze the deviation in attention from NSS as a consequence of being asked to assess the quality of distorted images, and, in particular, whether, and if so how, this deviation depended on the kind and/or amount of distortion. Saliency maps were derived from eye-tracking data obtained while participants scored distorted images, and they were compared to the corresponding NSS, derived from eye-tracking data obtained while participants looked freely at high-quality images. The study revealed some structural differences between the NSS maps and those obtained during quality assessment of the distorted images. These differences were related to the quality level of the images: the lower the quality, the larger the deviation from the NSS. The main change was a shrinking of the region of interest, most evident at low quality. No evident role for the kind of distortion was found in the change in saliency. Especially at low quality, the quality assessment task seemed to prevail over natural attention, forcing it to deviate in order to better evaluate the impact of artifacts.
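A common way to build saliency maps of the kind described here is to accumulate fixation points and smooth with a Gaussian of roughly foveal extent; the sketch below follows that recipe with invented coordinates and kernel size, and a simple map-difference score stands in for the paper's comparison.

```python
# Sketch: fixation-density saliency map from eye-tracking data.
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(fixations, shape, sigma_px=30):
    m = np.zeros(shape)
    for x, y in fixations:
        m[int(y), int(x)] += 1.0          # accumulate fixation counts
    m = gaussian_filter(m, sigma_px)      # smooth to roughly foveal extent
    return m / m.max()

nss = fixation_map([(320, 240), (350, 250)], (480, 640))   # free viewing
task = fixation_map([(330, 240)], (480, 640))              # quality scoring
deviation = np.abs(nss - task).mean()     # crude deviation-from-NSS score
```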
Task dependence of visual attention on compressed videos: point of gaze statistics and analysis
We tracked the points-of-gaze of human observers as they viewed videos drawn from foreign films while engaged in two different tasks: (1) Quality Assessment and (2) Summarization. Each video was subjected to three possible distortion severities - no compression (pristine), low compression and high compression - using the H.264 compression standard. We have analyzed these eye-movement locations in detail. We extracted local statistical features around points-of-gaze and used them to answer the following questions: (1) Are there statistical differences in variances of points-of-gaze across videos between the two tasks?, (2) Does the variance in eye movements indicate a change in viewing strategy with change in distortion severity? (3) Are statistics at points-of-gaze different from those at random locations? (4) How do local low-level statistics vary across tasks? (5) How do point-of-gaze statistics vary across distortion severities within each task?
Measuring contour degradation in natural image utility assessment: methods and analysis
Guilherme O. Pinto, David M. Rouse, Sheila S. Hemami
Utility estimators predict the usefulness or utility of a distorted natural image when used as a surrogate for a reference image. They differ from quality estimators in that they should provide accurate estimates even when images are extremely visibly distorted relative to the original, yet are still sufficient for the task. Our group has previously proposed the Natural Image Contour Evaluation (NICE) utility estimator. NICE estimates perceived utility by comparing morphologically dilated binary edge maps of the reference and distorted images using the Hamming distance. This paper investigates perceptually inspired approaches to evaluating the degradation of image contours in natural images for utility estimation. First, the distance transform is evaluated as an alternative to the Hamming distance measure in NICE. Second, we introduce the image contour fidelity (ICF) computational model, which is compatible with any block-based quality estimator. The ICF pools weighted fidelity degradations across image blocks with weights based on the local contour strength of an image block, and allows quality estimators to be repurposed as utility estimators. The performances of these approaches were evaluated on the CU-Nantes and CU-ObserverCentric databases, which provide perceived utility scores for a collection of distorted images. While the distance transform provides an improvement over the Hamming distance, the ICF model shows greater promise. The performances of common fidelity estimators for utility estimation are substantially improved when they are used in the ICF computational model. This suggests that the utility estimation problem can be recast as a problem of fidelity estimation on image contours.
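A sketch of the NICE comparison step as the abstract states it: dilate binary edge maps of the reference and distorted images, then take the Hamming distance between them. The edge detector, threshold, and dilation radius are illustrative choices, not the authors' exact settings.

```python
# Sketch: NICE-style comparison of dilated edge maps.
import numpy as np
from scipy.ndimage import binary_dilation, gaussian_filter, sobel

def edge_map(img, frac=0.2):
    g = np.hypot(sobel(img, axis=0), sobel(img, axis=1))  # gradient magnitude
    return g > frac * g.max()                             # binary edge map

def nice_like(ref, dst, radius=2):
    se = np.ones((2 * radius + 1, 2 * radius + 1), bool)  # dilation structuring element
    return np.count_nonzero(binary_dilation(edge_map(ref), se)
                            ^ binary_dilation(edge_map(dst), se))  # Hamming distance

ref = np.random.default_rng(2).random((64, 64))
print(nice_like(ref, gaussian_filter(ref, 1.5)))  # larger = more contour degradation
```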
Visual Attention, Saliency, and Quality II: Joint Session with Conference 7867
Evolution of attention mechanisms for early visual processing
Thomas Müller, Alois Knoll
Early visual processing as a method to speed up computations on visual input data has long been discussed in the computer vision community. The general aim of such approaches is to filter non-relevant information out before it reaches the costly higher-level visual processing algorithms. By inserting this additional filter layer, the overall approach can be sped up without actually changing the visual processing methodology. Inspired by the layered architecture of the human visual processing apparatus, several approaches for early visual processing have recently been proposed. Most promising in this field is the extraction of a saliency map to determine regions of current attention in the visual field. Such saliency can be computed in a bottom-up manner, i.e. the theory claims that static regions of attention emerge from a certain color footprint, and dynamic regions of attention emerge from connected blobs of texture moving in a uniform way in the visual field. Top-down saliency effects are either unconscious, through inherent mechanisms like inhibition-of-return (i.e. within a period of time, the attention level paid to a certain region automatically decreases if the properties of that region do not change), or volitional, through cognitive feedback (e.g. if an object moves consistently in the visual field). These bottom-up and top-down saliency effects were implemented and evaluated in a previous computer vision system for the project JAST. In this paper an extension applying evolutionary processes is proposed. The prior vision system utilized multiple threads to analyze the regions of attention delivered by the early processing mechanism. Here, in addition, multiple saliency units, each with a different parameter set, are used to produce these regions of attention. The idea is to let the population of saliency units create regions of attention, evaluate the results with cognitive feedback, and finally apply the genetic mechanisms: mutation and cloning of the best performers and extinction of the worst performers with respect to the computation of regions of attention. A fitness function can be derived by evaluating whether relevant objects are found in the regions created. Various experiments show that the approach significantly speeds up visual processing, especially for robust real-time object recognition, compared to an approach not using saliency-based preprocessing. Furthermore, the evolutionary algorithm improves the overall quality of the preprocessing system, as it automatically and autonomously tunes the saliency parameters. The computational overhead produced by periodic clone/delete/mutation operations can be handled well within the real-time constraints of the experimental computer vision system. Nevertheless, limitations apply whenever the visual field does not contain significant saliency information for some time but the population still tries to tune the parameters; overfitting then prevents generalization, and the evolutionary process may have to be reset by manual intervention.
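The evolutionary loop described above reduces to a few lines. The parameter names and the stand-in fitness function below are invented; in the paper, fitness comes from cognitive feedback on whether relevant objects are found in the proposed regions.

```python
# Sketch: evolve a population of saliency units -- clone and mutate the
# best performers, let the worst go extinct.
import random

def fitness(params):                  # placeholder for cognitive feedback:
    return random.random()            # fraction of relevant objects found

population = [{"gain": random.uniform(0.5, 2.0),
               "decay": random.uniform(0.1, 0.9)} for _ in range(8)]

for generation in range(10):
    scored = sorted(population, key=fitness, reverse=True)
    best = scored[: len(scored) // 2]                 # survivors
    clones = [{k: v * random.gauss(1.0, 0.1) for k, v in p.items()}
              for p in best]                          # mutated clones of the best
    population = best + clones                        # worst half goes extinct
```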
Learned saliency transformations for gaze guidance
Eleonora Vig, Michael Dorr, Erhardt Barth
The saliency of an image or video region indicates how likely it is that the viewer of the image or video fixates that region due to its conspicuity. An intriguing question is how we can change the video region to make it more or less salient. Here, we address this problem by using a machine learning framework to learn from a large set of eye movements collected on real-world dynamic scenes how to alter the saliency level of the video locally. We derive saliency transformation rules by performing spatio-temporal contrast manipulations (on a spatio-temporal Laplacian pyramid) on the particular video region. Our goal is to improve visual communication by designing gaze-contingent interactive displays that change, in real time, the saliency distribution of the scene.
Attention and Gaze in Constructing the Visual World
Relationship between selective visual attention and visual consciousness
Naotsugu Tsuchiya, Christof Koch
The relationship between attention and consciousness is a close one, leading many scholars to conflate the two. However, recent research has slowly eroded the belief that selective attention and consciousness are so tightly entangled that they cannot be examined individually.
A gaze-contingent display to study contrast sensitivity under natural viewing conditions
Michael Dorr, Peter J. Bex
Contrast sensitivity has been extensively studied over the last decades and there are well-established models of early vision that were derived by presenting the visual system with synthetic stimuli such as sine-wave gratings near threshold contrasts. Natural scenes, however, contain a much wider distribution of orientations, spatial frequencies, and both luminance and contrast values. Furthermore, humans typically move their eyes two to three times per second under natural viewing conditions, but most laboratory experiments require subjects to maintain central fixation. We here describe a gaze-contingent display capable of performing real-time contrast modulations of video in retinal coordinates, thus allowing us to study contrast sensitivity when dynamically viewing dynamic scenes. Our system is based on a Laplacian pyramid for each frame that efficiently represents individual frequency bands. Each output pixel is then computed as a locally weighted sum of pyramid levels to introduce local contrast changes as a function of gaze. Our GPU implementation achieves real-time performance with more than 100 fps on high-resolution video (1920 by 1080 pixels) and a synthesis latency of only 1.5ms. Psychophysical data show that contrast sensitivity is greatly decreased in natural videos and under dynamic viewing conditions. Synthetic stimuli therefore only poorly characterize natural vision.
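A minimal sketch of the analysis/synthesis core, assuming a simple Gaussian-blur pyramid: each frame is rebuilt as a weighted sum of Laplacian levels, so attenuating one level's weight lowers contrast in that frequency band. The real system computes gaze-dependent per-pixel weights on the GPU; here a scalar weight per level stands in.

```python
# Sketch: Laplacian-pyramid analysis and weighted resynthesis of one frame.
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def laplacian_pyramid(img, levels=4):
    pyr, cur = [], img
    for _ in range(levels - 1):
        down = gaussian_filter(cur, 1.0)[::2, ::2]
        up = zoom(down, 2, order=1)[: cur.shape[0], : cur.shape[1]]
        pyr.append(cur - up)          # band-pass residual at this level
        cur = down
    pyr.append(cur)                   # low-pass residual
    return pyr

def synthesize(pyr, weights):
    out = pyr[-1] * weights[-1]
    for lev, w in zip(reversed(pyr[:-1]), reversed(weights[:-1])):
        out = zoom(out, 2, order=1)[: lev.shape[0], : lev.shape[1]] + w * lev
    return out

img = np.random.default_rng(3).random((64, 64))
frame = synthesize(laplacian_pyramid(img), [1.0, 1.0, 0.3, 1.0])  # attenuate one band
```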
Analyzing complex gaze behavior in the natural world
Jeff B. Pelz, Thomas B. Kinsman, Karen M. Evans
The history of eye-movement research extends back at least to 1794, when Erasmus Darwin (Charles' grandfather) published Zoonomia, including descriptions of eye movements due to self-motion. But research on eye movements was restricted to the laboratory for 200 years, until Michael Land built the first wearable eyetracker at the University of Sussex and published the seminal paper "Where we look when we steer" [1]. In the intervening centuries, we learned a tremendous amount about the mechanics of the oculomotor system and how it responds to isolated stimuli, but virtually nothing about how we actually use our eyes to explore, gather information, navigate, and communicate in the real world. Inspired by Land's work, we have been working to extend knowledge in these areas by developing hardware, algorithms, and software that have allowed researchers to ask questions about how we actually use vision in the real world. Central to that effort are new methods for analyzing the volumes of data that come from the experiments made possible by the new systems. We describe a number of recent experiments and SemantiCode, a new program that supports assisted coding of eye-movement data collected in unrestricted environments.
What your visual system sees where you are not looking
Ruth Rosenholtz
What is the representation in early vision? Considerable research has demonstrated that the representation is not equally faithful throughout the visual field; representation appears to be coarser in peripheral and unattended vision, perhaps as a strategy for dealing with an information bottleneck in visual processing. In the last few years, a convergence of evidence has suggested that in peripheral and unattended regions, the information available consists of local summary statistics. Given a rich set of these statistics, many attributes of a pattern may be perceived, yet precise location and configuration information is lost in favor of the statistical summary. This representation impacts a wide range of visual tasks, including peripheral identification, visual search, and visual cognition of complex displays. This paper discusses the implications for understanding visual perception, as well as for imaging applications such as information visualization.
Attention as a Bayesian inference process
Sharat Chikkerur, Thomas Serre, Cheston Tan, et al.
David Marr famously defined vision as "knowing what is where by seeing". In the framework described here, attention is the inference process that solves the visual recognition problem of what is where. The theory proposes a computational role for attention and leads to a model that performs well in recognition tasks and that predicts some of the main properties of attention at the level of psychophysics and physiology. We propose an algorithmic implementation, a Bayesian network, that can be mapped onto the basic functional anatomy of attention involving the ventral stream and the dorsal stream. This description integrates bottom-up, feature-based, and spatial (context-based) attentional mechanisms. We show that the Bayesian model predicts human eye fixations (considered a proxy for shifts of attention) in natural scenes well, and can improve accuracy in object recognition tasks involving cluttered real-world images. In both cases, we found that the proposed model predicts human performance better than existing bottom-up and top-down computational models.
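The core inference can be illustrated with a toy posterior over location; the numbers are invented, and the paper's full model is a Bayesian network spanning feature (ventral) and spatial (dorsal) variables rather than this single Bayes-rule step.

```python
# Sketch: attention as Bayesian inference over "where", given "what".
import numpy as np

locations = ["left", "center", "right"]
prior = np.array([0.2, 0.6, 0.2])        # spatial (context-based) prior
likelihood = np.array([0.1, 0.3, 0.9])   # P(observed feature | target at location)

posterior = prior * likelihood
posterior /= posterior.sum()             # normalize: Bayes' rule
attended = locations[int(np.argmax(posterior))]  # predicted shift of attention
print(dict(zip(locations, posterior.round(3))), "->", attended)
```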
Interactive Paper Session
Depth perception enhancement based on chromostereopsis
This study aims to promote the cubic effect by reproducing images with depth perception using chromostereopsis in human visual perception. From psychophysical experiments based on the theory that the cubic effect depends on the lightness of the background in the chromostereoptic effect and the chromostereoptic reversal effect, it was found that the luminous cubic effect differs depending on the lightness of the background and the hue combination of neighboring colors. The cubic-effect-enhancing algorithm derived from these experimental results classifies the input image into foreground, middle, and background layers according to depth. For each layer, the color factors identified in the psychophysical experiments are adaptively controlled to produce an enhanced cubic effect appropriate to the properties of human visual perception and the characteristics of the input image.
An evaluation of perceived color break-up on field-sequential color displays
Masamitsu Kobayashi, Akiko Yoshida, Yasuhiro Yoshida
Field-Sequential Color (FSC) displays have been discussed for a long time. Their main concept is to remove the color filter so as to increase the light transmittance of an LCD panel. However, FSC displays have a major problem: color break-up (CBU). CBU is difficult to quantify during saccadic eye movements because the phenomenon occurs as quickly as a flash, and there are individual variations in perceiving it. Some previous studies have assessed saccadic CBU but have not indicated the detection and allowance thresholds of target size in horizontal saccadic eye movements. We therefore conducted psychophysical experiments on an FSC display driven at sub-frame frequencies of 240 Hz to 1440 Hz (each frame consists of red, green, and blue sub-frames). We employed a simple stimulus, a static white bar of variable width, and gave ten subjects a fixed saccade length of 58.4 visual degrees in horizontal eye movements and a fixed target luminance of 15.25 cd/m2. We used the PEST method to find detection and allowance thresholds of white-bar width for saccadic CBU. This paper provides correlations between target sizes and sub-frame frequencies of an FSC display device, and proposes an easy method for evaluating perceived saccadic CBU on FSC displays.
Text detection: effect of size and eccentricity
Reading on electronic devices is becoming more important as the popularity of mobile devices, such as cell phones or PDAs, increases. In this study, we used the spatial summation paradigm to measure the spatial constraints on text detection. Four types of stimuli (real characters, non-characters, Jiagu characters, and scrambled lines) were used in the experiments. All characters we used had two components in a left-right configuration. A non-character was constructed by swapping the positions of the left and right components of a real character to render it unpronounceable. The Jiagu characters are ancient texts that have the same left-right configuration as modern Chinese characters but contain no familiar components. Thus, the non-characters keep the components while destroying the spatial configuration between them, and the Jiagu characters have no familiar components while keeping the spatial configuration intact. The detection thresholds for the same stimulus size and the same eccentricity were the same for all types of stimuli. When the text size is small, the detection threshold of a character decreases with increasing size, with a slope of -1/2 on log-log coordinates, up to a critical size, at all eccentricities and for all stimulus types. Sensitivity for all types of stimuli increased from peripheral to central vision. In conclusion, detectability is based on local feature analysis regardless of character type. The cortical magnification parameter, E2, is 0.82 degrees of visual angle. With this information, we can estimate the detectability of a character from its size and eccentricity.
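Written out, the size-threshold relations reported above take the standard form below; note that the proportional scaling of the critical size with eccentricity through E2 is the conventional cortical-magnification formulation implied by the abstract, not an equation quoted from the paper.

```latex
\log T = -\tfrac{1}{2}\,\log S + c \quad (S < S_c), \qquad
S_c(E) \propto 1 + \frac{E}{E_2}, \qquad E_2 = 0.82^{\circ},
```

with detection threshold T, stimulus size S, critical size S_c, and eccentricity E.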
Image enhancement of high digital magnification for patients with central vision loss
We have developed a mobile vision assistive device based on a head mounted display (HMD) with a video camera, which provides image magnification and contrast enhancement for patients with central field loss (CFL). Because the exposure level of the video camera is usually adjusted according to the overall luminance of the scene, the contrast of sub-images (to be magnified) may be low. We found that at high magnification levels, conventional histogram enhancement methods frequently result in over- or under-enhancement due to the irregular histogram distribution of sub-images. Furthermore, the histogram range of the sub-images may change dramatically when the camera moves, which may cause flickering. A piece-wise histogram stretching method based on a center-emphasized histogram is proposed and evaluated by observers. The center-emphasized histogram minimizes histogram fluctuation due to image changes near the image boundary when the camera moves slightly, which therefore reduces flickering after enhancement. The piece-wise histogram stretching function includes a gain turnaround point to deal with very low contrast images and reduce the possibility of over-enhancement. Six normally sighted subjects and a CFL patient were tested for their preferences among images enhanced by the conventional and proposed methods, as well as the original images. All subjects preferred the proposed enhancement method over the conventional method.
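A sketch of the stretching step with a gain turnaround, assuming a simple linear stretch about the histogram midpoint with a capped slope; the cap value and the range estimates are illustrative, not the implemented device parameters.

```python
# Sketch: contrast stretch with a gain turnaround -- the slope is capped
# for very low-contrast sub-images to avoid over-enhancement.
import numpy as np

def stretch(img, lo, hi, max_gain=4.0):
    span = max(hi - lo, 1e-6)
    gain = min(1.0 / span, max_gain)     # turnaround: cap the gain
    mid = 0.5 * (lo + hi)
    return np.clip(0.5 + (img - mid) * gain, 0.0, 1.0)

patch = 0.4 + 0.1 * np.random.default_rng(4).random((32, 32))  # low-contrast sub-image
enhanced = stretch(patch, patch.min(), patch.max())
```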
Quality versus intelligibility: studying human preferences for American Sign Language video
Frank M. Ciaramello, Sheila S. Hemami
Real-time videoconferencing using cellular devices provides natural communication to the Deaf community. For this application, compressed American Sign Language (ASL) video must be evaluated in terms of the intelligibility of the conversation and not in terms of the overall aesthetic quality of the video. This work presents a paired comparison experiment to determine the subjective preferences of ASL users in terms of the trade-off between intelligibility and quality when varying the proportion of the bitrate allocated explicitly to the regions of the video containing the signer. A rate-distortion optimization technique, which jointly optimizes a quality criteria and an intelligibility criteria according to a user-specified parameter, generates test video pairs for the subjective experiment. Experimental results suggest that at sufficiently high bitrates, all users prefer videos in which the non-signer regions in the video are encoded with some nominal rate. As the total encoding bitrate decreases, users generally prefer video in which a greater proportion of the rate is allocated to the signer. The specific operating points preferred in the quality-intelligibility trade-off vary with the demographics of the users.
Metaphor progress report: image recall and blending
This paper discusses a simulation of a model, presented three years ago at this conference, of a neuron as a micro-machine that performs metaphor by cognitive blending. The background of the model is given, the difficulties of building such a model are discussed, and a description of the simulation is given based on texture synthesis structures and texture patches, which are glued together using Formal Concept Analysis. Because of this, and because of the intertwining of hyperbolic and Euclidean geometry and local activation, an interesting fundamental connection between analogical processing and glial and neural processing is discovered.
Space perception in pictures
Andrea J. van Doorn, Johan Wagemans, Huib de Ridder, et al.
A picture" is a at object covered with pigments in a certain pattern. Human observers, when looking "into" a picture (photograph, painting, drawing, . . . say) often report to experience a three-dimensional "pictorial space." This space is a mental entity, apparently triggered by so called pictorial cues. The latter are sub-structures of color patterns that are pre-consciously designated by the observer as "cues," and that are often considered to play a crucial role in the construction of pictorial space. In the case of the visual arts these structures are often introduced by the artist with the intention to trigger certain experiences in prospective viewers, whereas in the case of photographs the intentionality is limited to the viewer. We have explored various methods to operationalize geometrical properties, typically relative to some observer perspective. Here perspective" is to be understood in a very general, not necessarily geometric sense, akin to Gombrich's beholder's share". Examples include pictorial depth, either in a metrical, or a mere ordinal sense. We nd that different observers tend to agree remarkably well on ordinal relations, but show dramatic differences in metrical relations.