Proceedings Volume 6806

Human Vision and Electronic Imaging XIII


View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 15 February 2008
Contents: 14 Sessions, 56 Papers, 0 Presentations
Conference: Electronic Imaging 2008
Volume Number: 6806

Table of Contents


All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 6806
  • Keynote Session: Celebrating 20 Years of HVEI I
  • Keynote Session: Celebrating 20 Years of HVEI II
  • Cortical Modeling and Representation
  • Perception and High Dynamic Range Displays
  • Vision and Graphics
  • Next-generation Interactive Environments
  • Visual Attention and Gaze
  • Visual Perception in the Detection and Tracking of Objects
  • Art, Aesthetics, and Perception
  • Image Statistics, Quality, and Compression
  • Higher Level Issues in Image Quality
  • Perception, Resolution, and Display
  • Interactive Paper Session
Front Matter: Volume 6806
Front Matter: Volume 6806
This PDF file contains the front matter associated with SPIE Proceedings Volume 6806, including the Title Page, Copyright information, Table of Contents, Introduction (if any), and the Conference Committee listing.
Keynote Session: Celebrating 20 Years of HVEI I
Image statistics and surface perception
Images have characteristic statistics that can be described in terms of the responses of wavelet or Gabor-like filters. There has been a great deal of interest in the fact that images have sparse (kurtotic) statistics in the wavelet domain, with implications for efficient image encoding in biological and artificial systems. If we set aside the issue of efficiency, we are still left with the problem of seeing. We have been studying the ways in which filter statistics can reveal useful information about surfaces, including albedo, shading, and gloss. We find that odd-order statistics such as skewness are quite useful in extracting information about reflectance and gloss, and we also find evidence that humans make use of this information. It is straightforward to compute skewness with physiological mechanisms.
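As a rough illustration of the statistic involved, the sketch below computes the skewness of a bandpass (difference-of-Gaussians) filter response in Python; the filter scales and the toy "highlight" image are our own stand-ins, not the authors' stimuli.

```python
# Illustrative sketch (not the authors' code): skewness of a bandpass
# filter response, the odd-order statistic linked above to gloss.
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.stats import skew

def subband_skewness(image, fine=1.0, coarse=2.0):
    """Skewness of a difference-of-Gaussians (bandpass) response."""
    response = gaussian_filter(image, fine) - gaussian_filter(image, coarse)
    return skew(response.ravel())

# A glossy highlight adds a long positive tail to the response
# distribution, so skewness rises with gloss in this toy example.
rng = np.random.default_rng(0)
matte = rng.uniform(0.4, 0.6, (128, 128))
glossy = matte.copy()
glossy[60:68, 60:68] += 1.0  # hypothetical specular highlight
print(subband_skewness(matte), subband_skewness(glossy))
```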
The perception of simulated materials
Numerically modeling the interaction of light with materials is an essential step in generating realistic synthetic images. While there have been many studies of how people perceive physical materials, very little work has been done that facilitates efficient numerical modeling. Perceptual experiments and guidelines are needed for material measurement, specification and rendering. For measurement, many devices and methods have been developed for capturing spectral, directional and spatial variations of light/material interactions, but no guidelines exist for the accuracy required. For specification, only very preliminary work has been done to find meaningful parameters for users to search for and to select materials in software systems. For rendering, insight is needed on the perceptual impact of material models when combined with global illumination methods.
Keynote Session: Celebrating 20 Years of HVEI II
Single-photon imaging inspired by human vision
Single-photon detectors are regarded as a key enabling technology in a wide range of medical, industrial, and military applications. However, the existing single-photon detectors that can operate at or near room temperature have poor efficiency and high noise. Interestingly, the counterparts of these devices in nature, namely the rod cells, have amazingly high efficiency and low noise. In particular, the noise performance of the rod cells is five to six orders of magnitude better than that of semiconductor-based single-photon detectors at room temperature. At the Bio-inspired Sensors and Optoelectronics Laboratory, we explored the origin of this superior noise performance, and designed and implemented a novel semiconductor device based on the underlying detection mechanism of the rod cells. Our device shows very promising properties, including orders-of-magnitude higher gain and lower noise compared with existing devices. More interestingly, the low operating voltage of the device combined with high gain uniformity should allow, for the first time, realization of large imaging arrays with a high internal gain. Such imagers would open new opportunities for novel applications such as quantum ghost imaging.
The appearance of images
Karen K. De Valois, Tatsuto Takeuchi, Thomas D. Wickens
What makes an image appear to be a veridical representation of a real scene? Knowing what is necessary to produce a "good" image also aids in the design of more efficient compression algorithms. We review our earlier work on video compression and demonstrate the substantial savings and excellent image quality produced by spatial low-pass filtering of most (but not all) of the individual frames. Currently, we work with still images. An example shows that simple filtering can produce unexpected changes in the perceptual interpretation of a complex scene. We describe and demonstrate a new compression method we are developing, based on the assumption that the fine structure in the amplitude domain (and perhaps in phase as well) can be of minimal importance in conveying the essence of a scene. We find that a complex image can be reproduced surprisingly well by compressing the entire spatial-frequency amplitude spectrum to a very small number of terms.
Cortical Modeling and Representation
Statistics of natural scenes and the cortical representation of color
G. A. Cecchi, A. R. Rao, Y. Xiao, et al.
In this paper we investigate the spatial correlational structure of orientation and color information in natural images. We compare these with the spatial correlation structure of optical recordings of macaque monkey primary visual cortex in response to oriented and color stimuli. We show that the correlation of orientation falls off rapidly with increasing distance. By using a color metric based on the a-b coordinates of the CIE-Lab color space, we show that color information, on the other hand, is more highly correlated over larger distances. We also show that orientation and color information are statistically independent in natural images. We perform a similar spatial correlation analysis of the cortical responses to orientation and color. We observe behavior similar to that of natural images, in that the correlation of orientation-specific responses falls off more rapidly than the correlation of color-specific responses. Our findings suggest that: (a) orientation and color information should be processed in separate channels, and (b) the organization of cortical color responses at a lower spatial frequency compared to orientation is a reflection of the statistical structure of the visual world.
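A minimal sketch of such a correlation-versus-distance analysis, assuming a per-pixel feature map (e.g., the CIE-Lab a or b channel, or a local orientation estimate) has already been computed:

```python
# Sketch (our construction, not the paper's code): Pearson-style
# correlation between a feature map and itself shifted by d pixels,
# for increasing horizontal offsets d.
import numpy as np

def correlation_vs_distance(feature_map, max_offset=32):
    f = feature_map - feature_map.mean()  # approximately zero-mean
    out = []
    for d in range(1, max_offset + 1):
        a, b = f[:, :-d].ravel(), f[:, d:].ravel()
        out.append(float(np.dot(a, b) /
                         (np.linalg.norm(a) * np.linalg.norm(b))))
    return out  # expected to fall off faster for orientation than color
```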
Combining MRI and VEP imaging to isolate the temporal response of visual cortical areas
The human brain has well over 30 cortical areas devoted to visual processing. Classical neuro-anatomical as well as fMRI studies have demonstrated that early visual areas have a retinotopic organization whereby adjacent locations in visual space are represented in adjacent areas of cortex within a visual area. At the 2006 Electronic Imaging meeting we presented a method using sprite graphics to obtain high-resolution retinotopic visual evoked potential responses using multi-focal m-sequence technology (mfVEP). We have used this method to record mfVEPs from up to 192 non-overlapping checkerboard stimulus patches, scaled such that each patch activates about 12 mm² of cortex in area V1 and even less in V2. This dense coverage enables us to incorporate cortical folding constraints, given by anatomical MRI and fMRI results from the same subject, to isolate the V1 and V2 temporal responses. Moreover, the method offers a simple means of validating the accuracy of the extracted V1 and V2 time functions by comparing the results between left and right hemispheres, which have unique folding patterns and are processed independently. Previous VEP studies have been contradictory as to which area responds first to visual stimuli. This new method accurately separates the signals from the two areas and demonstrates that both respond with essentially the same latency. A new method is introduced which describes better ways to isolate cortical areas using an empirically determined forward model. The method includes a novel steady-state mfVEP and complex SVD techniques. In addition, this evolving technology is put to use examining how stimulus attributes differentially impact the response in different cortical areas, in particular how fast nonlinear contrast processing occurs. This question is examined using both state-triggered kernel estimation (STKE) and m-sequence "conditioned kernels". The analysis indicates different contrast gain control processes in areas V1 and V2. Finally, we show that our m-sequence multi-focal stimuli have advantages for integrating EEG and MEG for improved dipole localization.
Mathematical modeling and the neuroscience of metaphor
We look at a characterization of metaphor from cognitive linguistics, extracting the salient features of metaphorical processing. We examine the neurobiology of dendrites, specifically spike timing-dependent plasticity (STDP), and the modulation of backpropagating action potentials (bAPs), to generate a neuropil-centric model of cortical processing based on signal timing and reverberation between regions. We show how this model supports the basic features of metaphorical processing previously extracted. Finally, we model this system using a combination of Euclidean, projective, and hyperbolic geometries, and show how the resulting model accounts for this processing and relates to other neural network models.
Perception and High Dynamic Range Displays
Separating the effects of glare from simultaneous contrast
Alessandro Rizzi, Marzia Pezzetti, John J. McCann
Appearance in High Dynamic Range images is controlled by intraocular glare and physiological spatial contrast. Increasing the number of high-luminance pixels in a display increases glare and reduces the dynamic range of luminances on the retina. Simultaneous contrast makes areas with higher glare-related luminances look darker. Previous experiments measured the range needed for the appearance of black in surrounds with a variable percentage of white pixels in the background. In these test targets it was 2.0 log units with 100% white pixels, 2.3 log units with 50% white pixels, 2.9 log units with 8% white pixels, and 5.5 log units with 0% white pixels. We want to calculate the intensity of veiling glare in these test scenes and relate retinal luminances to the magnitude estimates of appearance reported by observers. This paper uses a glare spread function to calculate the retinal luminances after intraocular scatter. By modeling the actual luminances on the retina we can compare them with appearance.
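The glare computation can be sketched as a convolution of the display luminance map with a glare spread function (GSF); the inverse-square kernel below is a crude placeholder, not the actual GSF used in the paper.

```python
# Hedged sketch: estimate retinal luminance after intraocular scatter
# by convolving the display luminance map with a glare spread function.
import numpy as np
from scipy.signal import fftconvolve

def retinal_luminance(display_lum, size=101, k=1e-3):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    r = np.hypot(x, y)
    gsf = np.where(r == 0, 1.0, k / r**2)  # stand-in scatter kernel
    gsf /= gsf.sum()                       # conserve total luminance
    return fftconvolve(display_lum, gsf, mode="same")
```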
Extending quality metrics to full luminance range images
Tunç O. Aydın, Rafal Mantiuk, Hans-Peter Seidel
Many quality metrics take gamma-corrected images as input and assume that pixel code values are scaled in a perceptually uniform manner. Although this is a valid assumption for darker displays operating in the luminance range typical of CRT displays (from 0.1 to 80 cd/m²), it is no longer true for much brighter LCD displays (typically up to 500 cd/m²), plasma displays (small regions up to 1000 cd/m²), and HDR displays (up to 3000 cd/m²). Distortions that are barely visible on dark displays become clearly noticeable when shown on much brighter displays. To estimate the quality of images shown on bright displays, we propose a straightforward extension to popular quality metrics, such as PSNR and SSIM, that makes them capable of handling all luminance levels visible to the human eye without altering their results for typical CRT display luminance levels. Such extended quality metrics can be used to estimate the quality of high dynamic range (HDR) images as well as account for display brightness.
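The shape of such an extension can be sketched as follows; the cube-root response curve is a placeholder for the metric's actual perceptually uniform luminance encoding, which the abstract does not spell out.

```python
# Sketch of the general idea: map absolute luminance to roughly uniform
# perceptual code values before computing PSNR, so the metric behaves
# consistently from dim CRT levels up to HDR levels.
import numpy as np

def pu_encode(lum_cd_m2):
    # Placeholder response curve: cube root mimics lightness compression.
    return 100.0 * np.cbrt(lum_cd_m2)

def pu_psnr(ref_lum, test_lum):
    r, t = pu_encode(ref_lum), pu_encode(test_lum)
    mse = np.mean((r - t) ** 2)
    peak = pu_encode(3000.0)  # e.g., an HDR display's peak luminance
    return 10.0 * np.log10(peak**2 / mse)
```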
Perception-based contrast enhancement model for complex images in high dynamic range
Contrast in image processing is typically scaled using a power function (gamma), whose exponent specifies the amount of the physical contrast change. While the exponent is normally constant for the whole image, we observe that such scaling leads to perceptual nonuniformity in the context of high dynamic range (HDR) images. This effect is mostly due to the lower contrast sensitivity of the human eye at low luminance levels. Such levels can be reproduced by an HDR display but cannot be reproduced by standard display technology. We conduct two perceptual experiments on a complex image: contrast scaling and contrast discrimination threshold, and we derive a model which relates changes of physical and perceived contrast at different luminance levels. We use the model to adjust the exponent value such that we obtain better perceptual uniformity of global and local contrast scaling in complex images.
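A toy version of the idea is sketched below; the luminance-dependent correction is invented for illustration, whereas the paper derives its model from the scaling and threshold experiments.

```python
# Toy sketch: make the effective contrast-scaling exponent depend on
# background luminance, weakening the scaling where contrast
# sensitivity is lower (dim HDR regions).
import numpy as np

def scale_contrast(lum, background, gamma):
    # Hypothetical correction factor in (0, 1): approaches 1 at high
    # luminance (full scaling), approaches 0 in very dim regions.
    correction = background / (background + 1.0)
    gamma_eff = 1.0 + (gamma - 1.0) * correction
    contrast = lum / background           # simple ratio contrast
    return background * contrast ** gamma_eff
```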
Vision and Graphics
Perceived quality assessment of polygonal meshes using observer studies: a new extended protocol
Samuel Silva, Beatriz Sousa Santos, Joaquim Madeira, et al.
The complexity of a polygonal mesh is usually reduced by applying a simplification method, resulting in a similar mesh having fewer vertices and faces. Although several such methods have been developed, only a few observer studies have been reported comparing the perceived quality of the simplified meshes, and it is not yet clear how the choice of a given method, and the level of simplification achieved, influence the quality of the resulting mesh as perceived by the final users. Similar issues occur regarding other mesh processing methods such as smoothing. Mesh quality indices are the obvious, less costly alternative to user studies, but it is also not clear how they relate to perceived quality, and which indices best describe user behavior. This paper describes ongoing work concerning the evaluation of perceived quality of polygonal meshes using observer studies, while looking for a quality index which estimates user performance. In particular, given results obtained in previous studies, a new experimental protocol was designed and a study involving 55 users was carried out, which allowed its validation, as well as further insight regarding mesh quality as perceived by human observers.
Dimensionality of visual complexity in computer graphics scenes
How do human observers perceive visual complexity in images? This problem is especially relevant for computer graphics, where a better understanding of visual complexity can aid in the development of more advanced rendering algorithms. In this paper, we describe a study of the dimensionality of visual complexity in computer graphics scenes. We conducted an experiment in which subjects judged the relative complexity of 21 high-resolution scenes rendered with photorealistic methods. Scenes were gathered from web archives and varied in theme, number and layout of objects, material properties, and lighting. We analyzed the data using multidimensional scaling of the pooled subject responses. This analysis embedded the stimulus images in a two-dimensional space, with axes that roughly corresponded to "numerosity" and "material/lighting complexity". In a follow-up analysis, we derived a one-dimensional complexity ordering of the stimulus images. We compared this ordering with several computable complexity metrics, such as scene polygon count and JPEG compression size, and found them to be only weakly correlated with it. Understanding the differences between these measures can lead to the design of more efficient rendering algorithms in computer graphics.
Next-generation Interactive Environments
Beyond image quality: designing engaging interactions with digital products
Huib de Ridder, Marco C. Rozendaal
Ubiquitous computing (or Ambient Intelligence) promises a world in which information is available anytime, anywhere, and with which humans can interact in a natural, multimodal way. In such a world, perceptual image quality remains an important criterion, since most information will be displayed visually, but other criteria such as enjoyment, fun, engagement, and hedonic quality are emerging. This paper deals with engagement: the intrinsically enjoyable readiness to put more effort into exploring and/or using a product than strictly required, thus attracting and keeping the user's attention for a longer period of time. The impact of the experienced richness of an interface, in terms of both its visual appearance and the degree of possible manipulation, was investigated in a series of experiments employing game-like user interfaces. This resulted in the extension of an existing conceptual framework relating engagement to richness by means of two intermediating variables, namely experienced challenge and sense of control. Predictions from this revised framework are evaluated against the results of an earlier experiment assessing the ergonomic and hedonic qualities of interactive media. The test material consisted of interactive CD-ROMs containing presentations of three companies for future customers.
Impact of sound on image-evoked emotions
In two experiments the effect of sound on visual information was investigated. Experiment 1 investigated how the visual appearance of products with an expensive design or with an inexpensive design affected the experience of sound recordings of these products. Recordings and pictures were systematically interchanged: for example, the visual image of an expensive design was combined with a recording of the sound of an inexpensive and of an expensive design. It was found that product appearance did not affect the judgments of luxury, pleasantness, quality, and ease-of-use, but that the experience of the sound dominated over the visual experience. In Experiment 2, pictures from the International Affective Picture System (IAPS) were combined with frequency-modulated tones that varied in sensory pleasantness through manipulation of the amount of roughness. The combinations of sounds and pictures were rated on the valence and arousal dimensions of the circumplex model of core affect. It was found that the sounds only negatively affected the experience of the pictures on the valence dimension. The arousal level was not affected by the sounds. Both experiments show that sound can affect the perception and experience of pictures.
The impact of interactive manipulation on the recognition of objects
A new application for VR has emerged: product development, in which several stakeholders (from engineers to end users) use the same VR for development and communication purposes. These stakeholders vary considerably in several characteristics, which imposes potential constraints on the VR. The current paper discusses the influence of three types of exploration of objects (i.e., none, passive, active) on one of these characteristics: the ability to form mental representations, or visuo-spatial ability (VSA). Through an experiment we found that all users benefit from exploring objects. Moreover, people with low VSA (e.g., end users) benefit from an interactive exploration of objects, as opposed to people with a medium or high VSA (e.g., engineers), who are not sensitive to the type of exploration. Hence, for VR environments in which multiple stakeholders participate (e.g., for product development), differences among their cognitive abilities (e.g., VSA) have to be taken into account to enable efficient usage of VR.
Virtual hand: a 3D tactile interface to virtual environments
Bernice E. Rogowitz, Paul Borrel
We introduce a novel system that allows users to experience the sensation of touch in a computer graphics environment. In this system, the user places his/her hand on an array of pins, which is moved about space on a 6 degree-of-freedom robot arm. The surface of the pins defines a surface in the virtual world. This "virtual hand" can move about the virtual world. When the virtual hand encounters an object in the virtual world, the heights of the pins are adjusted so that they represent the object's shape, surface, and texture. A control system integrates pin and robot arm motions to transmit information about objects in the computer graphics world to the user. It also allows the user to edit, change and move the virtual objects, shapes and textures. This system provides a general framework for touching, manipulating, and modifying objects in a 3-D computer graphics environment, which may be useful in a wide range of applications, including computer games, computer aided design systems, and immersive virtual worlds.
Touch, tools, and telepresence: embodiment in mediated environments
We tend to think of our body image as fixed. However, human brains appear to support highly negotiable body images. As a result, our brains show a remarkable flexibility in incorporating non-biological elements (tools and technologies) into the body image, provided reliable, real-time intersensory correlations can be established, and artifacts can be plausibly mapped onto an already existing body image representation. A particularly interesting and relevant phenomenon in this respect is a recently reported crossmodal perceptual illusion known as the rubber-hand illusion (RHI). When a person is watching a fake hand being stroked and tapped in precise synchrony with his or her own unseen hand, the person will, within a few minutes of stimulation, start experiencing the fake hand as an actual part of his or her own body. In this paper, we will review recent work on the RHI and argue that such experimental transformation of the intimate ties between body morphology, proprioception and self-perception enhances our fundamental understanding of the phenomenal experience of self. Moreover, it will enable us to significantly improve the design of interactive media, including the design of avatars in virtual environments and digital games, as well as a range of human-like telerobotic devices.
Augmented reality in surgical procedures
E. Samset, D. Schmalstieg, J. Vander Sloten, et al.
Minimally invasive therapy (MIT) is one of the most important trends in modern medicine. It includes a wide range of therapies in videoscopic surgery and interventional radiology and is performed through small incisions. It reduces hospital stay time by allowing faster recovery and offers substantially improved cost-effectiveness for the hospital and society. However, the introduction of MIT has also led to new problems. The manipulation of structures within the body through small incisions reduces dexterity and tactile feedback. It requires a different approach than conventional surgical procedures, since eye-hand coordination is not based on direct vision, but more predominantly on image guidance via endoscopes or radiological imaging modalities. ARIS*ER is a multidisciplinary consortium developing a new generation of decision support tools for MIT by augmenting visual and sensorial feedback. We will present tools based on novel concepts in visualization, robotics, and haptics, providing tailored solutions for a range of clinical applications. Examples from radio-frequency ablation of liver tumors, laparoscopic liver surgery, and minimally invasive cardiac surgery will be presented. Demonstrators were developed with the aim of providing a seamless workflow for the clinical user conducting image-guided therapy.
Context-based pixelization model for the artificial retina using saliency map and skin color detection algorithm
S. M. Jin, I. B. Lee, J. M. Han, et al.
A key problem of artificial visual prostheses is low resolution due to the limited number of electrodes. Various methods, such as edge detection and contrast enhancement, have been studied as solutions to the low-resolution problem, and these methods have been applied to face or object recognition in close-up images. In this paper, we propose a region-of-interest detection method using a context-based model, which is appropriate for real situations. The visually salient region is detected by combining a saliency map with color information. In the experiment, to evaluate the proposed model, gaze was estimated using an eye tracker while subjects watched the original image and two types of 10 × 10 pixelized images, produced by the conventional and the saliency-based method, respectively. The gaze for each pixelized image was compared with the gaze for the original image. The experiment showed that gaze under the proposed context-based model correlates much more strongly with the gaze for the original image than that of the conventional model.
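The final pixelization step can be illustrated by block-averaging the detected region of interest down to the electrode grid; the saliency and skin-color stages that select that region are assumed to have run already.

```python
# Sketch: reduce a grayscale region of interest to a 10 x 10 electrode
# grid by block averaging (a stand-in for the full pipeline).
import numpy as np

def pixelize(gray_roi, grid=(10, 10)):
    gh, gw = grid
    h, w = gray_roi.shape
    crop = gray_roi[:h - h % gh, :w - w % gw]   # make divisible
    blocks = crop.reshape(gh, crop.shape[0] // gh,
                          gw, crop.shape[1] // gw)
    return blocks.mean(axis=(1, 3))             # one value per electrode
```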
Visual Attention and Gaze
Natural systems analysis
Wilson S. Geisler, Jeffrey S. Perry, Almon D. Ing
The environments we live in and the tasks we perform in those environments have shaped the design of our visual systems through evolution and experience. This is an obvious statement, but it implies three fundamental components of research we must have if we are going to gain a deep understanding of biological vision systems: (a) a rigorous science devoted to understanding natural environments and tasks, (b) mathematical and computational analysis of how to use such knowledge of the environment to perform natural tasks, and (c) experiments that allow rigorous measurement of behavioral and neural responses, either in natural tasks or in artificial tasks that capture the essence of natural tasks. This approach is illustrated with two example studies that combine measurements of natural scene statistics, derivation of Bayesian ideal observers that exploit those statistics, and psychophysical experiments that compare human and ideal performance in naturalistic tasks.
Hyperspectral image visualization based on a human visual model
Hyperspectral image data can provide very fine spectral resolution with more than 200 bands, yet displaying such rich information on a tristimulus monitor presents a challenge for visualization techniques. This study developed a visualization technique that takes advantage of both the consistent natural appearance of a true-color image and the feature separation of a PCA image, based on a biologically inspired visual attention model. The key part is to extract the informative regions in the scene. The model takes into account human contrast sensitivity functions and generates a topographic saliency map for both images. This is accomplished using a set of linear "center-surround" operations simulating visual receptive fields as the difference between fine and coarse scales. A difference map between the saliency map of the true-color image and that of the PCA image is derived and used as a mask on the true-color image to select a small number of interesting locations where the PCA image has more salient features than are available in the visible bands. The resulting representations preserve hue for vegetation, water, roads, etc., while the selected attentional locations may be analyzed by more advanced algorithms.
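The masking step described can be sketched as below, with a toy center-surround (difference-of-Gaussians) response standing in for the full contrast-sensitivity-based attention model.

```python
# Sketch: where the PCA image is more salient than the true-color
# image, reveal it; elsewhere keep the natural true-color rendering.
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround_saliency(gray, fine=2.0, coarse=8.0):
    return np.abs(gaussian_filter(gray, fine) - gaussian_filter(gray, coarse))

def fused_visualization(true_color, pca_rgb, threshold=0.02):
    s_true = center_surround_saliency(true_color.mean(axis=2))
    s_pca = center_surround_saliency(pca_rgb.mean(axis=2))
    mask = (s_pca - s_true) > threshold   # PCA shows extra features here
    out = true_color.copy()
    out[mask] = pca_rgb[mask]
    return out
```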
Dynamic visual attention: motion direction versus motion magnitude
A. Bur, P. Wurtz, R. M. Müri, et al.
Defined as an attentive process in the context of visual sequences, dynamic visual attention refers to the selection of the most informative parts of a video sequence. This paper investigates the contribution of motion to dynamic visual attention, and specifically compares computer models designed with the motion component expressed either as the speed magnitude or as the speed vector. Several computer models, including static features (color, intensity, and orientation) and motion features (magnitude and vector), are considered. Qualitative and quantitative evaluations are performed by comparing the computer model output with human saliency maps obtained experimentally from eye movement recordings. Model suitability is evaluated in various situations (synthetic and real sequences, acquired with fixed and moving cameras), showing the advantages and drawbacks of each method as well as its preferred domain of application.
Motion saliency outweighs other low-level features while watching videos
Dwarikanath Mahapatra, Stefan Winkler, Shih-Cheng Yen
The importance of motion in attracting attention is well known. While watching videos, where motion is prevalent, how do we quantify the regions that are motion salient? In this paper, we investigate the role of motion in attention and compare it with the influence of other low-level features like image orientation and intensity. We propose a framework for motion saliency. In particular, we integrate motion vector information with spatial and temporal coherency to generate a motion attention map. The results show that our model achieves good performance in identifying regions that are moving and salient. We also find motion to have greater influence on saliency than other low-level features when watching videos.
Automatic video summarization driven by a spatio-temporal attention model
According to the literature, automatic video summarization techniques can be classified into two categories according to the nature of the output: "video skims", which are generated using portions of the original video, and "key-frame sets", which correspond to images selected from the original video that have significant semantic content. The difference between these two categories is reduced when we consider automatic procedures. Most published approaches are based on the image signal and use either pixel characterization, histogram techniques, or block-based image decomposition. However, few of them integrate properties of the Human Visual System (HVS). In this paper, we propose to extract key-frames for video summarization by studying the variations of salient information between two consecutive frames. For each frame, a saliency map is produced, simulating human visual attention by a bottom-up (signal-dependent) approach. This approach includes three parallel channels for processing three early visual features: intensity, color, and temporal contrast. For each channel, the variation of the salient information between two consecutive frames is computed. These outputs are then combined to produce the global saliency variation, which determines the key-frames. Psychophysical experiments have been defined and conducted to analyze the relevance of the proposed key-frame extraction algorithm.
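A minimal sketch of the resulting key-frame rule, assuming the per-frame saliency maps from the three-channel model have already been computed:

```python
# Sketch: flag a frame as a key-frame when the change in salient
# information relative to the previous frame exceeds a threshold.
import numpy as np

def keyframe_indices(saliency_maps, threshold):
    """saliency_maps: list of 2-D arrays, one per frame."""
    keys = [0]  # always keep the first frame
    for i in range(1, len(saliency_maps)):
        variation = np.mean(np.abs(saliency_maps[i] - saliency_maps[i - 1]))
        if variation > threshold:
            keys.append(i)
    return keys
```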
Visual Perception in the Detection and Tracking of Objects
Inhibitory surround and grouping effects in human and computational multiple object tracking
Ozgur Yilmaz, Sadiye Guler, Haluk Ogmen
Multiple Object Tracking (MOT) experiments show that human observers can track up to five moving targets among several moving distractors over several seconds. We extended these studies by designing modified MOT experiments to investigate the spatio-temporal characteristics of human visuo-cognitive mechanisms for tracking, and applied the findings and insights obtained from these experiments to the design of computational multiple object tracking algorithms. Recent studies indicate that attention both enhances the neural activity of relevant information and suppresses irrelevant visual information in the surround. Results of our experiments suggest that the suppressive surround of attention extends up to 4 deg from the target stimulus, and that it takes at least 100 ms to build up. We suggest that when the attentional windows corresponding to separate target regions are spatially close, they can be grouped to form a single attentional window to avoid interference originating from the suppressive surrounds. The grouping experiment results indicate that attentional windows are grouped into a single one when the distance between them is less than 1.5 deg. A preliminary implementation of the suppressive surround concept in our computational video object tracker resulted in fewer unnecessary object merges in computational video tracking experiments.
Quantifying the perceived interest of objects in images: effects of size, location, blur, and contrast
Vamsi Kadiyala, Srivani Pinneli, Eric C. Larson, et al.
This paper presents the results of two psychophysical experiments designed to investigate the effects of size, location, blur, and contrast on the perceived visual interest of objects within images. In the first experiment, digital compositing was used to create images containing objects (humans, animals, and non-living objects) which varied in controlled increments of size, location, blur, and contrast. Ratings of perceived interest were then measured for each object. We found that: (1) as object size increases, perceived interest increases but exhibits diminished gains for larger sizes; (2) as an object moves from the center of the image toward the image's edge, perceived interest decreases nearly linearly with distance; (3) blurring imposes a substantial initial decrease in perceived interest, but this drop is relatively lessened for highly blurred objects; (4) as an object's RMS contrast is increased, perceived interest increases nearly linearly. Furthermore, these trends were quite similar for all three categories (human, animal, non-living object). To determine whether these data can predict the perceived interest of objects in real, non-composited images, a second experiment was performed in which subjects rated the visual interest of each of 562 objects in 150 images. Based on these results, an algorithm is presented which, given a segmented image, attempts to generate an object-level interest map.
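The reported trends suggest a scoring function with the following shape; every coefficient below is hypothetical and only mimics the qualitative findings, not the paper's fitted algorithm.

```python
# Hypothetical object-interest score combining the four measured trends.
import numpy as np

def interest_score(size_frac, dist_frac, blur_level, rms_contrast):
    s_size = 1.0 - np.exp(-4.0 * size_frac)   # diminishing gains with size
    s_loc = max(0.0, 1.0 - dist_frac)         # near-linear falloff from center
    s_blur = np.exp(-2.0 * blur_level)        # steep initial blur penalty
    s_con = min(1.0, rms_contrast / 0.5)      # near-linear rise with contrast
    return s_size * s_loc * s_blur * s_con
```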
The pupil dilation response to visual detection
Claudio M. Privitera, Laura W. Renninger, Thom Carney, et al.
The pupil dilation reflex is mediated by inhibition of the parasympathetic Edinger-Westphal oculomotor complex and sympathetic activity. It has long been documented that emotional and sensory events elicit a pupillary reflex dilation. Is the pupil response a reliable marker of a visual detection event? In two experiments where viewers were asked to report the presence of a visual target during rapid serial visual presentation (RSVP), pupil dilation was significantly associated with target detection. The amplitude of the dilation depended on the frequency of targets and the time of the detection. Larger dilations were associated with trials having fewer targets and with targets viewed earlier during the trial. We also found that dilation was strongly influenced by the visual task.
The influence of image compression on target acquisition
O. Hadar, E. Goldberg, E. Topchik
With the increased use of multimedia technologies, image compression has become increasingly popular. Compression decreases the high demands on storage capacity and transmission bandwidth. However, when compressing an image some of the information is lost, since compression smoothes high frequencies, thereby distorting small details. This issue is crucial, especially in military, surveillance, and medical systems. When planning such systems, the image compression quality must be considered, as well as how it affects the performance of the mission carried out by the user. Our goal is to examine the behavior of the human eye during image scanning and to quantify the effect of image compression on observer tasks such as target acquisition. For this task, we used the standard JPEG2000 codec to compress the images at different compression ratios, ranging from 10% (the highest compression) to 100% (the original image). It was found that animation images were more influenced by compression than thermal images. In general, as the compression ratio increased, the ability to acquire the targets decreased.
Adapting images to observers
Kyle C. McDermott, Igor Juricevic, George Bebis, et al.
Adaptation exerts a continuous influence on visual coding, altering both sensitivity and appearance whenever there is a change in the patterns of stimulation the observer is exposed to. These adaptive changes are thought to improve visual performance by optimizing both discrimination and recognition, but may take substantial time to fully adjust the observer to a new stimulus context. Here we explore the advantages of instead adapting the image to the observer, obviating the need for sensitivity changes within the observer. Adaptation in color vision adjusts to both the average color and luminance and to the variations in color and luminance within the scene. We modeled these adjustments as gain changes in the cones and in multiple post-receptoral mechanisms tuned to stimulus contrasts along different color-luminance directions. Responses within these mechanisms were computed for a range of different environments, based on images sampled from a range of natural outdoor settings. Images were then adapted for different environments by scaling the responses so that for each mechanism the average response equaled the response to a reference environment. Transforming images in this way can increase the discriminability of different colors and the salience of novel colors. It also provides a way to simulate how the world might look to an observer in different environments or to different observers in the same environment. Such images thus provide a novel tool for exploring color appearance and the perceptual and functional consequences of adaptation.
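The adaptation step reduces to a per-mechanism gain change, sketched here under the simplifying assumption that mechanism responses are available as an array:

```python
# Sketch: scale each mechanism's response so its average in the current
# environment matches its average in a reference environment, in the
# spirit of von Kries adaptation.
import numpy as np

def adapt_responses(responses, env_means, ref_means):
    """responses: (n_mechanisms, n_pixels) array of mechanism outputs."""
    gains = ref_means / env_means        # one gain per mechanism
    return responses * gains[:, None]    # rescaled responses to re-render
```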
Art, Aesthetics, and Perception
Perceptual rendering of HDR in painting and photography
Pictures can be drawn by hand, or imaged by optical means. Over time, pictures have changed from being rare and unique to ubiquitous and common. They have changed from treasures to transients. This paper summarizes many picture technologies, and discusses their dynamic range, their color and tone-scale rendering, and their spatial image processing. High Dynamic Range (HDR) image capture and display has long been an interest for artists and photographers. The discipline of reproducing scenes with a high range of luminances has a five-century history that includes painting, photography, electronic imaging, and image processing. HDR images render high-range scene information into low-range reproductions. This paper studies the artistic techniques and scientific issues that control HDR image capture and reproduction. Both the artist and the scientist synthesize HDR reproductions with spatial image processing. The artist paints, or dodges and burns, the image he visualizes based on his own visual processing. The scientist, using algorithms that mimic vision, calculates perceptually correct renditions with inaccurate reproductions of scene radiances. The paper will discuss artists' techniques used in both painting and photography for HDR compression. It will also describe how optical veiling glare severely limits the range of luminance that can be captured and seen. The improvement in quality in digital HDR reproductions, as in HDR in art, depends on the spatial rendering of details in the highlights and shadows.
The art of non-photographic imaging
As computer graphics continues to progress towards photo-realism, one branch of computer graphics has begun to consider non-photorealistic rendering [1, 2] as a dedicated research domain. Digital imaging has long been tied to photography, both digital and analog, and has long been focused on achieving, maintaining, demonstrating, or characterizing photographic quality. There has generally been limited effort in the area of non-photographic imaging. This paper proposes that art or the artistic process can be used to inspire additional directions in imaging. More specifically, a number of examples, both personal and from established artists, will be used to demonstrate a range of non-photographic imaging techniques. The discussion broadly covers experimental imaging processes, use of text, multiple image constructions, and algorithmic cartooning.
Aesthetics versus utility in electronic imaging
It is often believed that modern viewers of visually presented information need to be pleased or kept focused by feeding them several types of input simultaneously: the primary information plus what are regarded as embellishments, such as figuratively structured instead of plain uniform backgrounds of folders and slides in presentations. However, there are many cases in which the utility or efficiency of transmission of the presented information and the aesthetic aspects of the presentation are opposed. Examples for static images are: color combinations of foreground and background in text and figures, such as graphs, that impede legibility; the use of low-contrast secondary information in the form of figures or text in the same plane as the intended primary information; and gloss, causing specular reflection and sometimes glare, applied to the bezels of visual displays or to the face of the display itself. Aesthetically intended aspects of dynamic images, such as flashing parts, may even cause health hazards, for example photosensitive seizures. Being aware of the possible opposition of utility and attractiveness means that a sensible choice can be made for the relative strengths of the information-bearing and aesthetic factors, including a 'strength zero' of the latter, if need be.
Image Statistics, Quality, and Compression
On the performance of human visual system based image quality assessment metric using wavelet domain
Most efficient objective image or video quality metrics are based on properties and models of the Human Visual System (HVS). This paper deals with two major issues related to the HVS properties used in such metrics when applied in the DWT domain: subband decomposition and the masking effect. The multi-channel behavior of the HVS can be emulated by applying a perceptual subband decomposition. Ideally, this can be performed in the Fourier domain, but that requires too much computation for many applications. A spatial transform such as the DWT is a good alternative to reduce the computational effort, but the correspondence between the perceptual subbands and the usual wavelet ones is not straightforward. Advantages and limitations of the DWT are discussed and compared with models based on a DFT. Visual masking is a sensitive issue. Several models exist in the literature. The simplest models can only predict visibility thresholds for very simple cues, while for natural images one should consider more complex approaches such as entropy masking. The main issue lies in finding a revealing measure of the surround influences and an adaptation: should we use spatial activity, entropy, the type of texture, etc.? In this paper, different visual masking models using the DWT are discussed and compared.
Using gaze information to improve image difference metrics
We have used image difference metrics to measure the quality of a set of images in order to determine how well they predict perceived image difference. We carried out a psychophysical experiment with 25 observers along with a recording of the observers' gaze positions. The image difference metrics used were CIELAB ΔEab, S-CIELAB, the hue angle algorithm, iCAM, and SSIM. A frequency map from the eye-tracker data was applied as a weighting to the image difference metrics. The results indicate an improvement in the correlation between the predicted image difference and the perceived image difference.
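The weighting itself is straightforward; a sketch assuming a per-pixel difference map (e.g., a ΔEab map) and a fixation-frequency map of the same size:

```python
# Sketch: pool a per-pixel difference map under a gaze-derived weight
# map, so differences where observers actually looked count more.
import numpy as np

def gaze_weighted_difference(delta_e_map, fixation_map):
    w = fixation_map / fixation_map.sum()   # normalize to a weight map
    return float(np.sum(delta_e_map * w))
```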
The effect of lightness scaling on the perceived color quality of compressed digital videos
Chin Chye Koh, John M. Foley, Sanjit K. Mitra
In this work, we studied how video compression and lightness scaling interact to affect overall video quality and color quality attributes. We examined three subjective attributes: perceived color preference, perceived color naturalness, and overall annoyance, as digital videos were subjected to compression and lightness scaling. Psychophysical experiments were carried out in which naïve subjects made numerical judgments of the three subjective attributes. We found that preference and naturalness scores are concave down functions of mean lightness with an associated maximum, while annoyance scores are concave up with an associated minimum. As compression increases, both preference and naturalness scores decrease and vary less with mean lightness. Maximum preference, naturalness, and annoyance scores generally occur at similar mean lightness values. Preference, naturalness, and annoyance scores for individual videos are approximated relatively well by Gaussian functions of mean lightness. Preference and naturalness scores decrease, while annoyance scores increase, as an S-shaped function of the logarithm of the total squared error. A three-parameter model is shown to provide a good description of how each attribute depends on lightness and compression for individual videos. Model parameters vary with video content.
Image group compression using texture databases
An image compression approach capable of exploiting redundancies in groups of images is introduced. The approach is based on image segmentation, texture analysis, and texture synthesis. The proposed algorithm extracts textured regions from an image and merges them with similar texture data from other images, in order to take advantage of textural re-occurrences between the images. The texture extraction is done by taking overlapping rectangular texture parameter samples from the input image(s) and using a clustering algorithm to merge them into spatially connected regions, resulting in a polygonal texture map. The textures of that map are then analysed by extracting various features from the texture regions. Using a metric defined on these features, the textures are merged with entries from a central database, which consists of all the textures in all the images of the image collection, so that for each image only a polygonal segmentation map and references into this texture database need to be stored. Decoding (decompression) works by extracting the polygonal texture map and filling the map regions with patterns generated by texture synthesis based on the texture feature vectors from the database.
Image mapping using local and global statistics
We describe a set of techniques for mapping one image to another based on the statistics of a training set. We apply these techniques to the problems of image denoising and superresolution, but they should also be useful for many vision problems where training data are available. Given a local feature vector computed from an input image patch, we learn to estimate a subband coefficient of the output image conditioned on the patch. This entails approximating a multidimensional function, which we make tractable by nested binning and linear regression within bins. This method performs as well as nearest neighbor techniques, but is much faster. After attaining this local (patch-based) estimate, we force the marginal subband histograms to match a set of target histograms, in the style of Heeger and Bergen [1]. The target histograms are themselves estimated from the training data. With the combined techniques, denoising performance is similar to state-of-the-art techniques in terms of PSNR, and is slightly superior in subjective quality. In the case of superresolution, our techniques produce higher subjective quality than the competing methods, allowing us to attain large increases in apparent resolution. Thus, for these two tasks, our method is very fast and very effective.
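A rough sketch of the bin-and-regress step, with a single flat quantizer standing in for the paper's nested binning:

```python
# Sketch: fit one affine regressor per feature bin, then predict a
# subband coefficient from whichever bin a new patch falls into.
import numpy as np

def fit_binned_regressors(features, targets, bin_ids, n_bins):
    models = []
    for b in range(n_bins):
        X, y = features[bin_ids == b], targets[bin_ids == b]
        if len(X) == 0:                 # empty bin: predict zero
            models.append(np.zeros(features.shape[1] + 1))
            continue
        X1 = np.hstack([X, np.ones((len(X), 1))])   # affine term
        w, *_ = np.linalg.lstsq(X1, y, rcond=None)
        models.append(w)
    return models

def predict(models, feature, bin_id):
    return float(np.append(feature, 1.0) @ models[bin_id])
```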
Analyzing the role of visual structure in the recognition of natural image content with multi-scale SSIM
David M. Rouse, Sheila S. Hemami
Natural images are meaningful to humans: the physical world exhibits statistical regularities that permit the human visual system (HVS) to infer useful interpretations. These regularities communicate the visual structure of the physical world and govern the statistics of images (image structure). A signal processing framework is sought to analyze image characteristics for a relationship with human interpretation. This work investigates the first step toward an objective visual information evaluation: predicting the recognition threshold of different image representations. Given an image sequence whose images begin as unrecognizable and are gradually refined to include more information according to some measure, the recognition threshold corresponds to the first image in the sequence in which an observer accurately identifies the content. Sequences are produced using two types of image representations: signal-based and visual structure preserving. Signal-based representations add information as dictated by conventional mathematical characterizations of images based on models of low-level HVS processing and use basis functions as the basic image components. Visual structure preserving representations add information to images attributed to visual structure and attempt to mimic higher-level HVS processing by considering the scene's objects as the basic image components. An experiment is conducted to identify the recognition threshold image. Several full-reference perceptual quality assessment algorithms are evaluated in terms of their ability to predict the recognition threshold of different image representations. The cross-correlation component of a modified version of the multi-scale structural similarity (MS-SSIM) metric, denoted MS-SSIM*, exhibits a better overall correlation with the signal-based and visual structure preserving representations' average recognition thresholds than the standard MS-SSIM cross-correlation component. These findings underscore the significance of visual structure in recognition and advocate a multi-scale image structure analysis for a rudimentary evaluation of visual information.
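For reference, the cross-correlation (structure) component that MS-SSIM* isolates looks like the following for a single window at one scale; the altered handling of the stabilizing constants in MS-SSIM* is not reproduced here.

```python
# Sketch: SSIM's cross-correlation component for one window.
import numpy as np

def structure_term(x, y, c3=1e-8):
    xm, ym = x - x.mean(), y - y.mean()
    sigma_xy = np.mean(xm * ym)
    return float((sigma_xy + c3) / (x.std() * y.std() + c3))
```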
A psychovisual experiment on the use of Gibbs potential for the quality assessment of geometrically distorted images
Angela D'Angelo, Mirco Pacitto, Mauro Barni
Human perception of image distortions has been widely explored in recent years; however, research has not dealt with distortions due to geometric operations. As a consequence, there is a lack of objective visual quality measures for this class of distortions. In this paper we propose a method of objectively assessing the perceptual quality of geometrically distorted images. Our approach is based on the theory of Markov Random Fields. The idea is that the potential function of the Markov Random Field describing the distortion gives an indication of the degradation of the distorted image. This work can be seen as a first step toward the definition of an objective metric for geometric distortions in images.
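Our reading of the idea can be sketched as a smoothness potential over a displacement field; the quadratic clique potential below is an assumption, chosen only to illustrate how a non-smooth warp yields a large Gibbs energy.

```python
# Sketch: Gibbs-style smoothness energy of a geometric distortion,
# represented as per-pixel displacement maps (dx, dy). Larger energy
# suggests a less smooth, more perceptible warp.
import numpy as np

def gibbs_potential(dx, dy):
    e = 0.0
    for d in (dx, dy):
        e += np.sum(np.diff(d, axis=0) ** 2)   # vertical cliques
        e += np.sum(np.diff(d, axis=1) ** 2)   # horizontal cliques
    return e
```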
Structure-preserving properties of bilevel image compression
Matthew G. Reyes, Xiaonan Zhao, David L. Neuhoff, et al.
We discuss a new approach for lossy compression of bilevel images based on Markov random fields (MRFs). The goal is to preserve key structural information about the image, and then reconstruct the smoothest image that is consistent with this information. The image is compressed by losslessly coding the pixels in a square grid of lines and adding bits when needed to preserve structural information. The decoder uses the MRF model to reconstruct the interior of each block bounded by the grid, based on the pixels on its boundary, plus the extra bits provided for certain blocks. The idea is that, as long as the key structural information is preserved, the smooth contours of the block reconstruction with the highest probability under the MRF provide acceptable results. We propose and consider objective criteria for both encoding and evaluating the quality and structure-preserving properties of the coded bilevel images. These include mean-squared error, MRF energy (smoothness), and connected components (topology). We show that overall, for comparable mean-squared error, the new approach provides perceptually superior reconstructions at lower encoding rates than existing lossy compression techniques.
Higher Level Issues in Image Quality
Subjective responses to constant and variable quality video
David S. Hands, Kennedy Cheng
This paper describes an experiment examining subjective ratings in response to variations in the reproduction quality of a video signal. Additionally, the test was designed to examine whether pricing affected subjective judgements. Test materials were created with either constant quality or variable quality, where quality was manipulated by varying the video frame rate. Subjects were required to provide both quality and acceptability ratings for each test sequence. Two levels of variable quality were created: one in which the quality varied between medium and high quality (low variability), the other in which it varied between low and high quality (high variability). Subjects were assigned to one of three price bands prior to beginning the test. The test found that, for sequences of equivalent average quality, subjects preferred constant quality to high variability. There was no difference in ratings between constant quality and low variability sequences. The results indicate that video encoding methods may take advantage of some variation in video quality provided the perceptual impact of the changes in quality is not marked.
Improving visual content accessibility for low-vision users in the MPEG-21 multimedia framework
Seungji Yang, Jongsoo Choi, Yong Man Ro, et al.
An image content adaptation for visually impaired people based on the MPEG-21 Digital Item Adaptation (DIA) standard is proposed. The content adaptation mainly considers the spatial contrast vision characteristics of users, represented by a contrast sensitivity function (CSF). There are three key contributions in this paper. First, the visual perception of users who have different spatial contrast vision abilities is simulated by incorporating the HVS model proposed by Pattanaik et al. Second, to measure spatial contrast vision, and thus realize personalized content adaptation depending on the severity of the visual ability of the individual user, the CSF is measured in a computer-based environment. The measured spatial contrast vision symptom and its severity are represented in an interoperable way by using an example of the extended description tool provided by the MPEG-21 DIA specification. Third, the content adaptation itself is also proposed, which is personalized in the sense that the adapted content is optimized to the given description of a particular symptom and its severity. To assess the effectiveness of the proposed methods, we performed a number of experiments targeting users with low vision and showed how to determine and describe the CSF parameters. Furthermore, a statistical experiment was performed to verify the effectiveness of the proposed adaptation process for users with the low vision symptom.
The colour preference control based on two-colour combinations
This paper proposes a framework of colour preference control to satisfy the consumer's colour-related emotions. A colour harmony algorithm based on two-colour combinations is developed for displaying images with several complementary colour pairs as the relationship of the two-colour combination. The colours of pixels belonging to complementary colour areas in HSV colour space are shifted toward the target hue colours, while the other pixels are left unchanged. With the developed technique, the dynamic emotions evoked by the proposed hue conversion can be improved, and the controlled output image shows improved colour emotions matching the preference of the human viewer. Psychophysical experiments are conducted to investigate the optimal model parameters that produce the most pleasing image to users with respect to colour emotions.
Effect of blackness level on visual impression of color images
Tetsuya Eda, Yoshiki Koike, Sakurako Matsushima, et al.
In this study, two experiments were conducted to clarify the relation between RGB values and perceived blackness. In the first experiment, we determined the average RGB values of black surface areas in the test stimuli at which observers begin to perceive the areas as 'black', and further the average RGB values at which observers perceive the areas as 'really black'. Results indicate that to realize a 'really black' surface, RGB values should be lower than those of the original image in some pictures. In the second experiment, we investigated how and to what degree the RGB values of black areas affect the visual impression of artistic pictures. Three dimensions, a "high-quality axis", a "mysterious axis", and a "feeling of material axis", were extracted by factor analysis. Results indicate that the Art students seem to be more sensitive in the evaluations along the "high-quality axis" and "mysterious axis" than the Engineering students, while the opposite tendency is shown in the evaluation along the "feeling of material axis".
Perception, Resolution, and Display
Adaptation of document images to display constraints
The variety of displays used to browse and view images has created a need to adapt an image representation to the constraints of the viewing environment. In this paper, various methods of adaptation to a small display are introduced, with a focus on document images. Compared to photographic images, document images pose an even greater challenge on small displays. If typical down-sampling of the image data is performed, we lose not only high-resolution data but also semantic information, such as the readability, recognizability, and distinguishability of features. We explore ways of controlling document information, such as readable text or distinguishable layout features, in different visualizations by applying content-dependent scaling methods. Readability is preserved in "SmartNails" via automatic content-dependent cropping, scaling, and pasting. Content-dependent iconification is proposed to keep the layout features of document images distinguishable. For multi-page documents, a rendering in the form of a video clip is proposed that performs content-dependent navigation through the image data under given display-size and time constraints.
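A greatly simplified sketch of content-dependent adaptation follows; the actual SmartNail pipeline (automatic cropping, scaling, and pasting) is far richer. Here we assume the region of interest is already known, crop to it, and refuse to scale below a hypothetical readability floor.

```python
# Minimal sketch (assumptions throughout): crop a document image to a given
# region of interest, then scale no lower than a readability floor before
# fitting the result to the display.
from PIL import Image

def adapt_to_display(doc, roi, display_size, min_scale=0.5):
    """doc: PIL image; roi: (left, top, right, bottom) box of important
    content; display_size: (w, h). min_scale is a hypothetical floor
    below which text is assumed unreadable."""
    cropped = doc.crop(roi)
    dw, dh = display_size
    scale = min(dw / cropped.width, dh / cropped.height)
    scale = max(scale, min_scale)        # never shrink past readability
    out = cropped.resize((int(cropped.width * scale),
                          int(cropped.height * scale)), Image.LANCZOS)
    return out.crop((0, 0, dw, dh))      # clip or pad to the display size
```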
Representative image thumbnails: automatic and manual
Ramin Samadani, Tim Mauer, David Berfanger, et al.
Image thumbnails are used in most imaging products and applications, where they allow a quick preview of the content of the underlying high-resolution images. The question "How would you best represent a high-resolution original image given a fixed number of thumbnail pixels?" is addressed using both automatically and manually generated thumbnails. Automatically generated thumbnails that preserve the image quality of the high-resolution originals are first reviewed and subjectively evaluated. These thumbnails allow interactive identification of image quality while letting viewers apply their own knowledge to select the desired subject matter. Images containing textures, however, are difficult for the automatic algorithm. Textured images are further studied by using photo editing to manually generate representative thumbnails. The automatic thumbnails are subjectively compared to standard (filter-and-subsample) thumbnails using clean, blurry, noisy, and textured images. Results from twenty subjects find the automatic thumbnails more representative of their originals for blurry images. In addition, as desired, there is little difference between the automatic and standard thumbnails for clean images. The noise component improves the results for noisy images but degrades them for textured images. Studying textured images further, the manual thumbnails were subjectively compared to standard thumbnails for four images. An evaluation using forty judgments found a bimodal distribution of preference between the standard and manual thumbnails, with some observers preferring manual thumbnails and others preferring standard thumbnails.
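One ingredient the abstract mentions, making the noise of the original visible in the thumbnail, can be sketched as follows. This is only an illustrative reading, not the published algorithm: the noise level is estimated from a high-pass residual of the original and matching noise is re-injected after downsampling.

```python
# Simplified sketch: estimate the original's noise level robustly from a
# high-pass residual, then add matching noise to a standard thumbnail so
# that a noisy original yields a visibly noisy preview.
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def noisy_thumbnail(gray, factor=8, rng=None):
    """gray: 2-D float array in 0..1; factor: integer downsampling factor."""
    rng = rng or np.random.default_rng(0)
    residual = gray - gaussian_filter(gray, sigma=1.0)   # high-pass part
    noise_std = 1.4826 * np.median(np.abs(residual))     # robust std estimate
    thumb = zoom(gaussian_filter(gray, sigma=factor / 2), 1 / factor)
    return np.clip(thumb + rng.normal(0, noise_std, thumb.shape), 0, 1)
```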
Influence of camera and in-scene motion on perceived video quality in MPEG-2 adaptive coding
Nele Van den Ende, Carmen Wijermans, Lydia Meesters, et al.
This paper describes an experiment on perceived video quality, aimed at a better understanding of whether a temporal or a spatial MPEG-2-based adaptation method should be used for video transmission over variable bandwidth. The research focused on how in-scene motion and camera motion relate to spatial and temporal distortions in video sequences. Participants were tested on their sensitivity to and appreciation of spatial and temporal distortions using a direct-comparison scaling paradigm. Footage was shot to create three scenes with systematic manipulation of in-scene motion and camera motion, producing twelve different video sequences. The results show a trend relating the two types of motion to the two types of distortion. The main result is that participants generally rated video sequences with spatial distortions as higher in quality than the same sequences with temporal distortions, even though the sequences with spatial distortions were coded at a lower overall bitrate.
A quality metric for use with frame-rate based bandwidth adaptation algorithms
Matthias Krause, Michael van Hartskamp, Emile Aarts
Despite the growth in capacity of wireless in-home networks, these networks often have insufficient capacity to support multiple simultaneous audio/video streams. The unpredictable behavior of these networks results in a drop in video quality for the end user. One method for reducing the claim of an individual A/V stream on network capacity is controlled frame dropping. However, controlled frame dropping will only be accepted if its effect on the quality that end users experience is minimized. In this paper, we define an objective quality metric for frame-dropping methods that determines when frame dropping is no longer effective. The metric, a fraction between 0 and 1, is related to the characteristics of the frame dropping. A quality level below 0.9 indicates that a detectable number of frames has been dropped; a quality level above 0.98 indicates that no significant frame drops occurred recently. The metric is validated with simulations.
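The abstract specifies the metric's range and the 0.9 and 0.98 decision thresholds but not its formula. The following stand-in, purely an assumption, computes quality as the fraction of scheduled frames actually delivered over a sliding window and interprets the result with the thresholds quoted above.

```python
# Hypothetical stand-in for the (unpublished-here) metric: quality over a
# sliding window as the fraction of scheduled frames actually delivered,
# interpreted with the two thresholds given in the abstract.
from collections import deque

class FrameDropQuality:
    def __init__(self, window=250):               # e.g. 10 s at 25 fps
        self.history = deque(maxlen=window)

    def record(self, delivered: bool):
        self.history.append(delivered)

    def quality(self) -> float:
        if not self.history:
            return 1.0
        return sum(self.history) / len(self.history)

    def verdict(self) -> str:
        q = self.quality()
        if q < 0.9:
            return "detectable frame drops"       # threshold from the paper
        if q > 0.98:
            return "no significant drops"
        return "borderline"
```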
Perceptual limit to display resolution of images as per visual acuity
Kenichiro Masaoka, Takahiro Niida, Miya Murakami, et al.
Achieving ultimate visual realism for natural images on a display requires a resolution high enough that artifacts due to finite image resolution are undetectable. An image resolution of 30 cycles/degree (cpd), or one pixel per arc-minute, is often used as the criterion for viewing conditions when assessing displayed image quality. The reasoning is that if the pixel size is smaller than the separable angle of normal (20/20) vision, the pixel structure is invisible and does not negatively affect image quality. However, it is not clear whether a resolution of 30 cpd is adequate to prevent visible artifacts, especially for observers with better than 20/20 vision. We conducted experiments to find the threshold resolution for natural images and its dependence on visual acuity. Three objects were used; each was presented 60 times at five resolutions (19.5, 26, 39, 52, or 78 cpd) next to the same image at a resolution of 156 cpd. Forty-five observers with visual acuity of 20/20 or better were asked, in a forced-choice task, to indicate which image of the pair appeared to have the higher resolution. The results show that the mean threshold resolution for 75% correct responses exceeded 30 cpd in every visual acuity group, increased with visual acuity, and reached a plateau of 40-50 cpd at -0.3 logMAR.
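The 30 cpd criterion is easy to make concrete with standard geometry: one cycle spans two pixels, so cycles per degree follow directly from viewing distance and pixel pitch, as in this worked example.

```python
# Worked example of the resolution criterion: cycles/degree from viewing
# distance and pixel pitch (one cycle = two pixels).
import math

def cycles_per_degree(viewing_distance_mm, pixel_pitch_mm):
    pixel_angle_deg = math.degrees(math.atan2(pixel_pitch_mm,
                                              viewing_distance_mm))
    return 1.0 / (2.0 * pixel_angle_deg)

# A 0.25 mm pitch viewed from 86 cm subtends about 1 arc-minute per pixel,
# i.e. roughly the conventional 30 cpd criterion.
print(round(cycles_per_degree(860, 0.25), 1))
```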
Interactive Paper Session
Unsupervised color image segmentation using a dynamic color gradient thresholding algorithm
We propose a novel algorithm for unsupervised segmentation of color images. The proposed approach uses a dynamic color gradient thresholding scheme to guide the region-growing process. Given a color image, a weighted vector-based color gradient map is generated. Seeds are identified, and a dynamic threshold is then used to perform reliable region growing on the weighted gradient map. Over-segmentation, if any, is addressed by a similarity-measure-based region-merging stage to produce the final segmented image. Comparative results demonstrate the effectiveness of this algorithm for color image segmentation.
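A highly simplified sketch of the gradient-guided growing idea follows; the published algorithm adds gradient weighting, proper seed selection, and the merging stage. Here a single region is grown from the lowest-gradient pixel while the admission threshold is progressively relaxed; the threshold schedule is an assumption.

```python
# Simplified sketch: grow one region from the lowest-gradient pixel,
# admitting neighbours as the gradient threshold is progressively relaxed.
import numpy as np
from scipy import ndimage

def grow_one_region(rgb, steps=(0.05, 0.1, 0.2, 0.4, 1.0)):
    """rgb: (H, W, 3) float array in 0..1. Returns a boolean region mask."""
    grad = np.zeros(rgb.shape[:2])
    for c in range(3):                       # per-channel gradient, summed
        gy, gx = np.gradient(rgb[..., c])
        grad += np.hypot(gy, gx)
    grad /= grad.max() or 1.0
    seed = np.unravel_index(np.argmin(grad), grad.shape)
    region = np.zeros(grad.shape, bool)
    region[seed] = True
    for t in steps:                          # dynamic threshold schedule
        region |= ndimage.binary_dilation(region, iterations=-1,
                                          mask=grad <= t)
    return region
```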
Comparison of eye tracking devices used on printed images
Eye tracking, as a quantitative method for collecting eye movement data, requires accurate knowledge of the eye position; eye movements can then provide indirect evidence about what the subject sees. In this study, two eye tracking devices were compared: a head-mounted eye tracking device (HED) and a remote eye tracking device (RED). The precision of both devices was evaluated in terms of gaze position accuracy and stability of the calibration. For the HED, we investigated how to register the data to real-world coordinates. This is needed because the coordinates collected by the HED are relative to the position of the subject's head, not relative to the actual stimuli as is the case for the RED. Results show that precision degrades over time for both devices. The precision of the RED is better than that of the HED, with a difference of around 10-16 pixels (5.584 mm). The distribution of gaze positions for both devices was expressed as the percentage of points of regard falling within areas defined by the viewing angle; for both devices, gaze position accuracy was 95-99% within a 1.5-2° viewing angle. The stability of the calibration was investigated at the end of the experiment; the result was not statistically significant, although the distribution of gaze positions was larger at the end of the experiment than at the beginning.
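The HED registration problem, mapping head-relative gaze onto stimulus coordinates, is commonly solved with a plane-to-plane homography estimated from fiducial markers seen by the scene camera. The abstract does not state the authors' method; the direct linear transform below is one standard possibility, offered only as an illustration.

```python
# One standard registration approach (assumed, not necessarily the paper's):
# estimate a homography from four fiducial correspondences via the direct
# linear transform, then map head-relative gaze into stimulus coordinates.
import numpy as np

def homography(src, dst):
    """src, dst: (4, 2) arrays of corresponding points. Returns 3x3 H."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.array(rows, float))
    return vt[-1].reshape(3, 3)              # null-space vector as H

def map_gaze(H, gaze_xy):
    p = H @ np.array([gaze_xy[0], gaze_xy[1], 1.0])
    return p[:2] / p[2]                      # back from homogeneous coords
```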
Evaluation of video quality models for multimedia
The Video Quality Experts Group (VQEG) is a group of experts from industry, academia, government, and standards organizations working in the field of video quality assessment. Over the last 10 years, VQEG has focused its efforts on the evaluation of objective video quality metrics for digital video. Objective video metrics are mathematical models that predict picture quality as perceived by an average observer. VQEG has completed validation tests of full-reference objective metrics for the Standard Definition Television (SDTV) format, from which two ITU Recommendations were produced. This standardization effort is of great relevance to the video industries because objective metrics can be used for quality control at various stages of the delivery chain. Currently, VQEG is undertaking several projects in parallel. The most mature project concerns the objective measurement of multimedia content and is probably the largest coordinated video quality testing effort ever undertaken. The project involves the collection of a very large database of subjective quality data: about 40 subjective assessment experiments and more than 160,000 opinion scores, which will be used to validate the proposed objective metrics. This paper describes the test plan for the project, its current status, and one of the multimedia subjective tests.
Designing caption production rules based on face, text, and motion detection
C. Chapdelaine, M. Beaulieu, L. Gagnon
Producing off-line captions for deaf and hearing-impaired people is a labor-intensive task that can require up to 18 hours of production per hour of film. Captions are placed manually close to the region of interest but must avoid masking human faces, text, or any moving objects that might be relevant to the story flow. Our goal is to use image processing techniques to shorten the off-line caption production process by automatically placing the captions on the proper consecutive frames. We implemented a computer-assisted captioning software tool which integrates detection of faces, text, and visual motion regions. Near-frontal faces are detected using a cascade of weak classifiers and tracked with a particle filter. Frames are then scanned to perform text spotting and build a region map suitable for text recognition. Finally, motion mapping is based on the Lucas-Kanade optical flow algorithm and provides MPEG-7 motion descriptors. The combined detected items are then fed to a rule-based algorithm that determines the best caption placement for the related sequences of frames. This paper focuses on the rules defined to assist human captioners and on the results of a user evaluation of this approach.
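The rule-based placement can be caricatured as a scoring problem. In the sketch below, the candidate boxes and the overlap-minimizing rule are assumptions for illustration; the paper's rules are richer and take the story flow into account.

```python
# Toy placement rule (assumed): among a few candidate caption boxes, pick
# the one overlapping the fewest detected faces, text regions, and motion
# regions accumulated over the shot.
def overlap(a, b):
    """Boxes as (x0, y0, x1, y1); returns the intersection area."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def place_caption(candidates, detections):
    """candidates: caption boxes (e.g. bottom-left/centre/right, top-centre);
    detections: boxes for faces, text, and motion over the shot."""
    return min(candidates,
               key=lambda c: sum(overlap(c, d) for d in detections))
```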
Human-centered content-based image retrieval
A breakthrough is needed to achieve substantial progress in the field of Content-Based Image Retrieval (CBIR). This breakthrough can be brought about by: 1) optimizing user-system interaction, 2) combining the wealth of techniques from text-based Information Retrieval with CBIR techniques, 3) exploiting human cognitive characteristics, especially human color processing, and 4) conducting benchmarks with users to evaluate new CBIR techniques. In this paper, these guidelines are illustrated by findings from our research over the last five years, which has led to the development of the online Multimedia for Art ReTrieval (M4ART) system: http://www.m4art.org. The M4ART system follows the guidelines on all four issues and is assessed on benchmarks using 5730 queries on a database of 30,000 images. M4ART can therefore be considered a first step into a new era of CBIR.
Extension of a human visual system model for display simulation
Cédric Marchessoux, Alexis Rombaut, Tom Kimpe, et al.
In the context of medical display validation, a simulation chain has been developed to facilitate display design and image quality validation. One important part is the human visual observer model used to quantify the perceived quality of the simulated images. For several years, multiple research groups have been modeling various aspects of human perception, integrating them into complete Human Visual System (HVS) models and developing visible image difference metrics. In our framework, the JNDmetrix is used; it reflects human subjective assessment of image or video fidelity. Nevertheless, the system is limited and not directly suitable for our simulations: it is restricted to 8-bit integer RGB images, and the model must account for display parameters such as gamma, black offset, and ambient light. It therefore needs to be extended. The solutions proposed to extend the HVS model are: precision enhancement to overcome the 8-bit limit, color space conversion between XYZ and RGB, and adaptation to the display parameters. The preprocessing does not introduce any perceived distortion, for example from the precision enhancement. With this extension, the model is used on a daily basis in the display simulation chain.
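The display-parameter adaptation suggests a forward display model placed in front of the HVS model. The sketch below, with assumed parameter names and values, maps code values to absolute luminance via gamma, black offset, and reflected ambient light; the actual extension also covers XYZ/RGB conversion and higher bit depths.

```python
# Sketch of the kind of forward display model the extension implies
# (parameter names and values are assumptions): code values to absolute
# luminance via gamma, black offset, and reflected ambient light.
import numpy as np

def code_to_luminance(code, bits=10, gamma=2.2,
                      l_max=500.0, l_black=0.5, l_ambient=0.3):
    """code: integer array of display code values; returns cd/m^2.
    l_black is the panel's black offset, l_ambient the reflected ambient."""
    v = code.astype(np.float64) / (2 ** bits - 1)
    return l_ambient + l_black + (l_max - l_black) * v ** gamma
```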