Proceedings Volume 9027

Imaging and Multimedia Analytics in a Web and Mobile World 2014


View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 4 March 2014
Contents: 8 Sessions, 23 Papers, 0 Presentations
Conference: IS&T/SPIE Electronic Imaging 2014
Volume Number: 9027

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 9027
  • Online Photo and Imaging Services
  • Text Recognition in Mobile Applications
  • Web and Social Media
  • Image, Video, and Multimedia Analytics I
  • Image, Video, and Multimedia Analytics II
  • Face/Human Body Recognition and Detection
  • Interactive Paper Session
Front Matter: Volume 9027
Front Matter: Volume 9027
This PDF file contains the front matter associated with SPIE Proceedings Volume 9027, including the Title Page, Copyright information, Table of Contents, and Conference Committee listing.
Online Photo and Imaging Services
Representing videos in tangible products
Reiner Fageth, Ralf Weiting
Videos can be taken with nearly every camera: digital point-and-shoot cameras, DSLRs, smartphones, and increasingly so-called action cameras mounted on sports devices. Last year's paper described a software implementation that incorporates videos into printed products by generating QR codes and extracting relevant pictures from the video stream. This year we present first data on what content is displayed and how users represent their videos in printed products, e.g. CEWE PHOTOBOOKS and greeting cards. We report the share of the different video formats used, the number of images extracted from a video in order to represent it, their positions in the book, and design strategies that differ from those of regular books.
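The QR-code step described above lends itself to a small illustration. Below is a minimal sketch, assuming the third-party `qrcode` Python package and a hypothetical hosting URL; it is not the authors' production pipeline.

```python
# Minimal sketch: generate a printable QR code that links a printed page
# to an online video. Assumes the third-party "qrcode" package
# (pip install qrcode[pil]); the URL is a hypothetical placeholder.
import qrcode

video_url = "https://example.com/videos/abc123"  # hypothetical hosted video
img = qrcode.make(video_url)                     # returns a PIL image
img.save("video_link_qr.png")                    # embed this in the page layout
```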
Aesthetic quality inference for online fashion shopping
Online fashion communities in which participants post photos of personal fashion items for viewing and possible purchase by others are becoming increasingly popular. Generally, these photos are taken with low-cost mobile phone cameras by individuals who have no training in photography. Photos of the products should have high aesthetic quality to improve the users' online shopping experience. In this work, we design features for aesthetic quality inference in the context of online fashion shopping. Psychophysical experiments are conducted to construct a database of aesthetic evaluations, specifically for photos from an online fashion shopping website. We then extract both generic low-level features and high-level image attributes to represent aesthetic quality. Using a support vector machine framework, we train a predictor of the aesthetic quality rating based on the feature vector. Experimental results validate the efficacy of our approach. Metadata such as the product type are also used to further improve the results.
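As a rough illustration of the training step, here is a minimal sketch of an SVM-based rating predictor, assuming feature vectors have already been extracted; the data below are synthetic stand-ins, not the paper's database.

```python
# Sketch of an SVM-based aesthetic rating predictor. X and y are
# placeholders for extracted feature vectors and observer ratings.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))        # 64-D feature vectors (placeholder)
y = rng.uniform(1.0, 5.0, size=200)   # aesthetic ratings (placeholder)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
model.fit(X_tr, y_tr)
print("predicted ratings:", model.predict(X_te[:3]))
```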
Full-color visibility model using CSF which varies spatially with local luminance
Alastair Reed, David Berfanger, Yang Bai, et al.
A full color visibility model has been developed that uses separate contrast sensitivity functions (CSFs) for contrast variations in the luminance and chrominance (red-green and blue-yellow) channels. The width of the CSF in each channel is varied spatially depending on the luminance of the local image content. The CSF is adjusted so that more blurring occurs as the luminance of the local region decreases. The difference in contrast between the blurred original and the marked image is measured using a color difference metric. This spatially varying CSF performed better than a fixed CSF in the visibility model, approximating subjective measurements of a set of test color patches ranked by human observers for watermark visibility. The CIEDE2000 color difference metric was also compared against CIEDE1976 (i.e., a Euclidean distance in CIELAB).
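For readers unfamiliar with the two metrics, the following sketch computes both on a pair of images using scikit-image; the input images are random placeholders standing in for the blurred original and the marked image.

```python
# Sketch comparing the CIEDE1976 and CIEDE2000 color-difference metrics.
import numpy as np
from skimage.color import rgb2lab, deltaE_cie76, deltaE_ciede2000

original = np.random.rand(64, 64, 3)   # stand-in for the blurred original
marked = np.clip(original + 0.01 * np.random.randn(64, 64, 3), 0, 1)

lab_o, lab_m = rgb2lab(original), rgb2lab(marked)
de76 = deltaE_cie76(lab_o, lab_m)        # Euclidean distance in CIELAB
de2000 = deltaE_ciede2000(lab_o, lab_m)  # perceptually weighted variant
print("mean dE76:", de76.mean(), "mean dE2000:", de2000.mean())
```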
Text Recognition in Mobile Applications
Text recognition and correction for automated data collection by mobile devices
Suleyman Ozarslan, P. Erhan Eren
Participatory sensing is an approach that allows individuals to use mobile devices such as mobile phones for data collection, analysis, and sharing. Data collection is the first and most important part of a participatory sensing system, but it is time consuming for the participants. In this paper, we discuss automatic data collection approaches for reducing the time required for collection and increasing the amount of collected data. In this context, we explore automated text recognition on images of store receipts captured by mobile phone cameras, and the correction of the recognized text. Accordingly, our first goal is to evaluate the performance of Optical Character Recognition (OCR) for data collection from store receipt images. Images captured by mobile phones exhibit some typical problems that common image processing methods cannot handle. Consequently, the second goal is to address these problems through our proposed Knowledge Based Correction (KBC) method used in support of the OCR, and to evaluate the KBC method with respect to the improvement in the accurate recognition rate. Results of the experiments show that the KBC method improves the accurate data recognition rate noticeably.
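A minimal sketch of the OCR step plus a simple dictionary-style correction pass is shown below. The paper's KBC method is more elaborate; this stand-in only snaps tokens to a hypothetical receipt vocabulary by string similarity.

```python
# OCR a receipt image, then correct tokens against a known vocabulary.
# Requires pytesseract and an installed Tesseract binary; the image path
# and the vocabulary are hypothetical.
import difflib
import pytesseract
from PIL import Image

VOCAB = ["TOTAL", "SUBTOTAL", "TAX", "CASH", "CHANGE"]  # hypothetical lexicon

def correct_token(token: str) -> str:
    match = difflib.get_close_matches(token.upper(), VOCAB, n=1, cutoff=0.8)
    return match[0] if match else token

text = pytesseract.image_to_string(Image.open("receipt.jpg"))
corrected = " ".join(correct_token(t) for t in text.split())
print(corrected)
```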
Text vectorization based on character recognition and character stroke modeling
Zhigang Fan, Bingfeng Zhou, Francis Tse, et al.
In this paper, a text vectorization method is proposed using OCR (Optical Character Recognition) and character stroke modeling. This is based on the observation that for a particular character, its font glyphs may have different shapes, but often share same stroke structures. Like many other methods, the proposed algorithm contains two procedures, dominant point determination and data fitting. The first one partitions the outlines into segments and second one fits a curve to each segment. In the proposed method, the dominant points are classified as “major” (specifying stroke structures) and “minor” (specifying serif shapes). A set of rules (parameters) are determined offline specifying for each character the number of major and minor dominant points and for each dominant point the detection and fitting parameters (projection directions, boundary conditions and smoothness). For minor points, multiple sets of parameters could be used for different fonts. During operation, OCR is performed and the parameters associated with the recognized character are selected. Both major and minor dominant points are detected as a maximization process as specified by the parameter set. For minor points, an additional step could be performed to test the competing hypothesis and detect degenerated cases.
Visual improvement for bad handwriting based on Monte-Carlo method
Cao Shi, Jianguo Xiao, Canhui Xu, et al.
A visual improvement algorithm based on Monte Carlo simulation is proposed in this paper to enhance the visual effect of bad handwriting. The improvement process uses a well-designed typeface to optimize the bad handwriting image. A series of linear operators for image transformation is defined to transform the typeface image so that it approaches the handwriting image, and the specific parameters of these linear operators are estimated by the Monte Carlo method. Visual improvement experiments illustrate that the proposed algorithm can effectively enhance the visual effect of a handwriting image while maintaining the original handwriting features, such as tilt, stroke order, and drawing direction. The proposed algorithm has considerable potential for application on tablet computers and the mobile Internet to improve the user experience of handwriting.
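The Monte Carlo parameter search can be sketched as follows: randomly sample transform parameters for the typeface glyph and keep the sample that best matches the handwriting image. The parameter ranges and images below are assumptions, not the paper's operators.

```python
# Monte Carlo search over simple affine parameters (scale + shear/tilt)
# to align a typeface glyph with a handwriting image.
import numpy as np
from scipy.ndimage import affine_transform

rng = np.random.default_rng(1)
typeface = rng.random((64, 64))      # stand-in glyph image
handwriting = rng.random((64, 64))   # stand-in handwriting image

best_err, best_params = np.inf, None
for _ in range(500):                 # Monte Carlo trials
    scale = rng.uniform(0.8, 1.2, size=2)
    shear = rng.uniform(-0.3, 0.3)   # crude tilt parameter
    A = np.array([[scale[0], shear], [0.0, scale[1]]])
    candidate = affine_transform(typeface, A)
    err = np.mean((candidate - handwriting) ** 2)
    if err < best_err:
        best_err, best_params = err, (scale, shear)
print("best MSE:", best_err, "params:", best_params)
```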
Image processing for drawing recognition
Rustem Feyzkhanov, Irina Zhelavskaya
The task of recognizing the edges of rectangular structures is well known. Still, almost all existing methods work with static images and impose no limit on running time. We propose applying homography estimation to the video stream obtained from a webcam, and we present an algorithm that can be successfully used for this kind of application. One of the main use cases is the recognition of drawings made by a person on a piece of paper in front of the webcam.
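A minimal sketch of per-frame homography estimation with OpenCV is shown below, assuming the four paper corners come from a rectangle detector that is not shown; the corner coordinates here are placeholders.

```python
# Estimate a homography per webcam frame and rectify the sheet of paper.
import cv2
import numpy as np

cap = cv2.VideoCapture(0)                      # default webcam
dst_corners = np.float32([[0, 0], [600, 0], [600, 800], [0, 800]])

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Placeholder: corners of the detected rectangle, ordered consistently.
    src_corners = np.float32([[100, 80], [520, 90], [540, 420], [90, 410]])
    H, _ = cv2.findHomography(src_corners, dst_corners)
    rectified = cv2.warpPerspective(frame, H, (600, 800))
    cv2.imshow("rectified", rectified)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
```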
Web and Social Media
A web-based video annotation system for crowdsourcing surveillance videos
Neeraj J. Gadgil, Khalid Tahboub, David Kirsh, et al.
Video surveillance systems are of great value for preventing threats and identifying/investigating criminal activities. Manual analysis of a huge amount of video data from several cameras over a long period of time often becomes impracticable. The use of automatic detection methods can be challenging when the video contains many objects with complex motion and occlusions. Crowdsourcing has been proposed as an effective method for utilizing human intelligence to perform such tasks. Our system provides a platform for the annotation of surveillance video in an organized and controlled way. One can monitor a surveillance system using a set of tools such as training modules, roles and labels, and task management. The system can be used in a real-time streaming mode to detect potential threats or as an investigative tool to analyze past events. Annotators can annotate video content assigned to them for suspicious activity or criminal acts. First responders are then able to view the collective annotations and receive email alerts about newly reported incidents. They can also keep track of the annotators' training performance, manage their activities, and reward their success. By providing this system, the process of video analysis is made more efficient.
A Markov chain model for image ranking system in social networks
Thi Thi Zin, Pyke Tin, Takashi Toriu, et al.
In today's world, many kinds of networks exist: social, technological, business, and so on. All of these networks are similar in terms of their distributions, continuously growing and expanding at large scale. Among them, social networks such as Facebook, Twitter, and Flickr provide a powerful abstraction of the structure and dynamics of diverse kinds of interpersonal connection and interaction. Generally, social network content is created and consumed under the influence of all the different social navigation paths that lead to it. Therefore, identifying important and user-relevant refined structures such as visual information or communities becomes a major factor in modern decision making. Moreover, traditional information ranking systems cannot succeed because they fail to take into account the properties of navigation paths driven by social connections. In this paper, we propose a novel image ranking system for social networks that uses social data relational graphs from a social media platform jointly with visual data to improve the relevance between returned images and user intentions (i.e., social relevance). Specifically, we propose a Markov chain based Social-Visual Ranking algorithm that takes social relevance into account. Through extensive experiments, we demonstrate the significance and effectiveness of the proposed social-visual ranking method.
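In the spirit of the method above, a Markov-chain ranking can be sketched as a stationary distribution over a combined social/visual graph. The toy matrices and the mixing weight `alpha` below are assumptions, not the paper's construction.

```python
# Rank items by the stationary distribution of a Markov chain built from
# a blend of social and visual affinity graphs (power iteration).
import numpy as np

def stationary(P: np.ndarray, iters: int = 100) -> np.ndarray:
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):              # power iteration
        pi = pi @ P
    return pi

social = np.array([[0, 1, 1], [1, 0, 0], [1, 1, 0]], dtype=float)
visual = np.array([[0, 0.9, 0.1], [0.9, 0, 0.5], [0.1, 0.5, 0]])

alpha = 0.5                              # social vs. visual mixing weight
W = alpha * social + (1 - alpha) * visual
P = W / W.sum(axis=1, keepdims=True)     # row-stochastic transitions
print("ranking scores:", stationary(P))
```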
Video quality assessment for web content mirroring
Ye He, Kevin Fei, Gustavo A. Fernandez, et al.
Due to increasing user expectations for the watching experience, moving high-quality web video streaming content from the small screens of mobile devices to the larger TV screen has become popular. It is crucial to develop video quality metrics to measure the quality change under various devices and network conditions. In this paper, we propose an automated scoring system to quantify user satisfaction. We compare the quality of local videos with that of the videos transmitted to a TV. Four video quality metrics, namely Image Quality, Rendering Quality, Freeze Time Ratio, and Rate of Freeze Events, are used to measure video quality change during web content mirroring. To measure image quality and rendering quality, we compare the matched frames between the source video and the destination video using barcode tools. Freeze time ratio and rate of freeze events are measured after extracting video timestamps. Several user studies are conducted to evaluate the impact of each objective video quality metric on the subjective user watching experience.
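The two temporal metrics can be computed directly from frame timestamps: a frame gap much longer than the nominal frame interval is treated as a freeze. The threshold factor below is an assumption, not the paper's value.

```python
# Freeze Time Ratio and Rate of Freeze Events from frame timestamps.
import numpy as np

def freeze_metrics(timestamps, fps=30.0, factor=2.0):
    gaps = np.diff(np.asarray(timestamps))
    nominal = 1.0 / fps
    frozen = gaps > factor * nominal                 # freeze events
    duration = timestamps[-1] - timestamps[0]
    freeze_time_ratio = gaps[frozen].sum() / duration
    rate_of_freeze_events = frozen.sum() / duration  # events per second
    return freeze_time_ratio, rate_of_freeze_events

ts = np.cumsum([1 / 30.0] * 100 + [0.5] + [1 / 30.0] * 100)  # one freeze
print(freeze_metrics(ts))
```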
Image, Video, and Multimedia Analytics I
Evolving background recovery in lecture videos
Efficient utilization of videos of lectures and presentations requires indexing based on the extracted background, which includes slides and/or handwritten notes. Since the background in such videos is constantly evolving, there is a need for special techniques for background recovery. The objective of this paper is a method for automatically extracting the evolving background in such videos. In contrast to general background subtraction techniques, which aim at extracting foreground objects, the goal here is to extract the background and complete it where the foreground is removed. Experimental results comparing the proposed approach to other known techniques demonstrate improved performance when using the proposed approach.
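As a point of comparison, a simple baseline for a slowly evolving background is a temporal median over a sliding window of frames. This is explicitly not the paper's method, only a minimal stand-in to make the problem concrete.

```python
# Temporal-median baseline for evolving background recovery.
import numpy as np

def sliding_median_background(frames: np.ndarray, window: int = 15):
    """frames: (T, H, W) grayscale stack; yields one background per frame."""
    for t in range(len(frames)):
        lo = max(0, t - window // 2)
        hi = min(len(frames), t + window // 2 + 1)
        yield np.median(frames[lo:hi], axis=0)

frames = np.random.rand(30, 48, 64)          # placeholder lecture frames
bg = list(sliding_median_background(frames))
print(len(bg), bg[0].shape)
```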
An HEVC compressed domain content-based video signature for copy detection and video retrieval
Khalid Tahboub, Neeraj J. Gadgil, Mary L. Comer, et al.
Video sharing platforms and social networks have been growing very rapidly over the past few years. The rapid increase in the amount of video content introduces many challenges in terms of copyright violation detection and video search and retrieval. Generating and matching content-based video signatures, or fingerprints, is an effective method to detect copies or "near-duplicate" videos. Video signatures should be robust to changes, caused by common signal processing operations, in the video features used to characterize the signature. Recent work has focused on generating video signatures in the uncompressed domain. However, decompression is a computationally intensive operation. In large video databases, it becomes advantageous to create robust signatures directly from the compressed domain. The High Efficiency Video Coding (HEVC) standard has recently been ratified as the latest video coding standard, and widespread adoption is anticipated. We propose a method in which a content-based video signature is generated directly from the HEVC-coded bitstream. Motion vectors from the HEVC-coded bitstream are used as the features. A robust hashing function based on projection onto random matrices is used to generate the hashing bits. A sequence of these bits serves as the signature for the video. Our experimental results show that the proposed method generates a signature robust to common signal processing techniques such as resolution scaling, brightness scaling, and compression.
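The hashing step can be sketched as projecting a motion-vector feature onto random directions and keeping the signs as hash bits. The feature dimensionality and bit count below are assumptions.

```python
# Sign-of-random-projection hashing of motion-vector features, with
# Hamming distance for matching.
import numpy as np

rng = np.random.default_rng(42)
R = rng.normal(size=(64, 128))          # fixed random projection matrix

def hash_bits(feature: np.ndarray) -> np.ndarray:
    return (feature @ R > 0).astype(np.uint8)    # 128 sign bits

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

f1 = rng.normal(size=64)                # motion-vector feature per segment
f2 = f1 + 0.05 * rng.normal(size=64)    # mildly distorted copy
print("Hamming distance:", hamming(hash_bits(f1), hash_bits(f2)))
```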
Image, Video, and Multimedia Analytics II
Technology survey on video face tracking
With the pervasiveness of monitoring cameras installed in public areas, schools, hospitals, workplaces, and homes, video analytics technologies for interpreting these video contents are becoming increasingly relevant to people's lives. Among such technologies, human face detection and tracking (and face identification in many cases) are particularly useful in various application scenarios. While plenty of research has been conducted on face tracking and many promising approaches have been proposed, there are still significant challenges in recognizing and tracking people in videos with uncontrolled capturing conditions, largely due to pose and illumination variations, as well as occlusions and cluttered backgrounds. It is especially complex to track and identify multiple people simultaneously in real time due to the large amount of computation involved. In this paper, we present a survey of the literature and software published or developed in recent years on face tracking. The survey covers the following topics: 1) mainstream and state-of-the-art face tracking methods, including features used to model the targets and metrics used for tracking; 2) face identification and face clustering from face sequences; and 3) software packages or demonstrations that are available for algorithm development or trial. A number of publicly available databases for face tracking are also introduced.
Textural discrimination in unconstrained environment
Fatema A. Albalooshi, Vijayan K. Asari
Object region segmentation for object detection and identification in images captured in a complex background environment is one of the most challenging tasks in image processing and computer vision, especially for objects that have non-homogeneous body textures. This paper presents an object segmentation technique for an unconstrained environment based on textural descriptors, which extracts the object region of interest from surrounding objects and backgrounds in order to get an accurate identification of the segmented area. The proposed segmentation method is developed on a texture-based analysis and employs the Seeded Region Growing (SRG) segmentation algorithm to accomplish the process. In our application of obtaining the region of a chosen object for further manipulation through data mining, human input is used to choose the object of interest, through which seed points are identified and employed. User selection of the object of interest could be achieved in different ways, one of which is a mouse-based point-and-click procedure. The proposed system therefore provides the user with the choice to select the object of interest that will be segmented out from other background regions and objects. It is important to note that texture information gives a better description of objects and plays an important role in the characterization of regions. In region growing segmentation, three key factors must be satisfied: the choice of similarity criteria, the selection of seed points, and the stopping rule. The choice of similarity criteria is accomplished through texture descriptors and connectivity properties. The selection of seed points is determined interactively by the user when choosing the object of interest. The definition of a stopping rule is achieved using a test for homogeneity and connectivity measures; a region therefore stops growing when there are no further pixels that satisfy the homogeneity and connectivity criteria. The segmentation region is iteratively grown by comparing all unallocated neighbouring pixels to that region. The difference between the seed pixels' mean intensity value and the region's textural descriptors is used as a measure of similarity, and the pixel with the smallest difference is allocated to the segmentation region. The seeded region growing factors change interactively according to the intensity levels of the chosen object of interest. The algorithm automatically computes segmentation thresholds based on local feature analysis: the system starts by measuring the intensity level of the selected object and accordingly adapts the growing and stopping rules of the segmented region. The proposed segmentation method has been tested on a relatively large variety of databases with different objects of varying textures. The experimental results show that this simple framework is capable of achieving high-quality performance, that it can better handle the problem of segmenting objects with non-homogeneous textural bodies, and that it correctly separates those objects from other objects and complex backgrounds. The framework can also be easily adapted to different applications by substituting suitable image feature definitions.
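A minimal seeded-region-growing sketch on intensity alone is shown below; the paper's version uses textural descriptors and adaptive thresholds, which are simplified here to a fixed intensity tolerance around the seed value.

```python
# Seeded region growing with 4-connectivity and a fixed intensity tolerance.
import numpy as np
from collections import deque

def seeded_region_growing(img, seed, tol=0.1):
    h, w = img.shape
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    seed_value = img[seed]
    q = deque([seed])
    while q:
        y, x = q.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # 4-connectivity
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not region[ny, nx]:
                if abs(img[ny, nx] - seed_value) < tol:    # similarity test
                    region[ny, nx] = True
                    q.append((ny, nx))
    return region

img = np.random.rand(32, 32)
print(seeded_region_growing(img, (16, 16)).sum(), "pixels in region")
```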
Image denoising with multiple layer block matching and 3D filtering
The Block Matching and 3-D Filtering (BM3D) algorithm is currently considered one of the most successful denoising algorithms. Despite its excellent results, BM3D still has room for improvement. Image details and sharp edges, such as text in document images, are challenging, as they usually do not produce sparse representations under the linear transformations, and various artifacts such as ringing and blurring can be introduced as a result. This paper proposes a Multiple Layer BM3D (MLBM3D) denoising algorithm. The basic idea is to decompose image patches that contain high-contrast details into multiple layers, each of which is then collaboratively filtered separately. The algorithm contains a Basic Estimation step and a Final Estimation step. The first (Basic Estimation) step is identical to the one in BM3D. In the second (Final Estimation) step, image groups are determined to be single-layer or multi-layer. A single-layer group is filtered in the same manner as in BM3D. For a multi-layer group, each image patch within the group is decomposed with the three-layer model. All the top layers in the group are stacked and collaboratively filtered, as are the bottom layers. The filtered top and bottom layers are re-assembled to form the estimate of the blocks. The final estimate of the image is obtained by aggregating the estimates of all blocks, both single-layer and multi-layer. The proposed algorithm shows excellent results, particularly for images containing high-contrast edges.
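The layer-decomposition idea can be illustrated by splitting a high-contrast patch into top and bottom layers around a threshold and re-assembling them; the paper's three-layer model and the collaborative filtering of stacked layers are not reproduced here.

```python
# Two-layer decomposition/re-assembly sketch for a high-contrast patch.
import numpy as np

def decompose(patch: np.ndarray, thresh: float):
    top = np.where(patch > thresh, patch, thresh)      # bright side, filled
    bottom = np.where(patch <= thresh, patch, thresh)  # dark side, filled
    return top, bottom

patch = np.where(np.random.rand(8, 8) > 0.5, 0.9, 0.1)  # text-like patch
top, bottom = decompose(patch, thresh=0.5)
# Each layer would be collaboratively filtered here, then re-assembled:
recombined = np.where(patch > 0.5, top, bottom)
assert np.allclose(recombined, patch)
```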
Compact binary hashing for music retrieval
With the huge volume of music clips available for protection, browsing, and indexing, there is increased attention to retrieving the information content of music archives. Music-similarity computation is an essential building block for browsing, retrieval, and indexing of digital music archives. In practice, as the number of songs available for searching and indexing increases, the storage cost in retrieval systems becomes a serious problem. This paper addresses the storage problem by extending the supervector concept with binary hashing. We utilize similarity-preserving binary embedding to generate a hash code from the supervector of each music clip. In particular, we compare the performance of various binary hashing methods for music retrieval tasks on the widely used genre dataset and an in-house singer dataset. Through the evaluation, we find an effective way of generating hash codes for music similarity estimation which improves retrieval performance.
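Once each clip's supervector has been hashed, retrieval reduces to ranking the archive by Hamming distance to the query code, as sketched below; the code length and database size are assumptions.

```python
# Rank an archive of binary codes by Hamming distance to a query code.
import numpy as np

rng = np.random.default_rng(7)
db_codes = rng.integers(0, 2, size=(1000, 64), dtype=np.uint8)  # archive
query = db_codes[123] ^ (rng.random(64) < 0.05)  # noisy copy of clip 123

dists = (db_codes != query).sum(axis=1)          # Hamming distances
print("top-5 matches:", np.argsort(dists)[:5])
```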
Face/Human Body Recognition and Detection
Efficient eye detection using HOG-PCA descriptor
Andreas Savakis, Riti Sharma, Mrityunjay Kumar
Eye detection is becoming increasingly important for mobile interfaces and human computer interaction. In this paper, we present an efficient eye detector based on HOG-PCA features obtained by performing Principal Component Analysis (PCA) on Histogram of Oriented Gradients (HOG). The Histogram of Oriented Gradients is a dense descriptor computed on overlapping blocks along a grid of cells over regions of interest. The HOG-PCA offers an efficient feature for eye detection by applying PCA on the HOG vectors extracted from image patches corresponding to a sliding window. The HOG-PCA descriptor significantly reduces feature dimensionality compared to the dimensionality of the original HOG feature or the eye image region. Additionally, we introduce the HOG-RP descriptor by utilizing Random Projections as an alternative to PCA for reducing the dimensionality of HOG features. We develop robust eye detectors by utilizing HOG-PCA and HOG-RP features of image patches to train a Support Vector Machine (SVM) classifier. Testing is performed on eye images extracted from the FERET and BioID databases.
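A minimal sketch of the HOG-PCA pipeline is shown below: HOG on fixed-size patches, PCA for dimensionality reduction, then a linear SVM. The patch size, PCA dimension, and synthetic labels are assumptions, not the paper's settings.

```python
# HOG features -> PCA reduction -> linear SVM eye/non-eye classifier.
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
patches = rng.random((100, 32, 32))            # stand-in eye/non-eye patches
labels = rng.integers(0, 2, size=100)          # 1 = eye, 0 = background

feats = np.array([hog(p, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                  for p in patches])
pca = PCA(n_components=32).fit(feats)          # HOG-PCA descriptor
clf = LinearSVC().fit(pca.transform(feats), labels)
print("train accuracy:", clf.score(pca.transform(feats), labels))
```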
Adaptive weighted local textural features for illumination, expression, and occlusion invariant face recognition
Biometric features such as fingerprints, iris patterns, and face features help to identify people and restrict access to secure areas by performing advanced pattern analysis and matching. Face recognition is one of the most promising biometric methodologies for human identification in a non-cooperative security environment. However, the recognition results obtained by face recognition systems are affected by several variations that may happen to the patterns in an unrestricted environment. As a result, several algorithms have been developed for extracting different facial features for face recognition. Due to the various possible challenges of data captured under different lighting conditions, viewing angles, facial expressions, and partial occlusions in natural environmental conditions, automatic facial recognition still remains a difficult issue that needs to be resolved. In this paper, we propose a novel approach to tackling some of these issues by analyzing local textural descriptions for facial feature representation. The textural information is extracted by an enhanced local binary pattern (ELBP) description of all the local regions of the face. The relationship of each pixel with respect to its neighborhood is extracted and employed to calculate the new representation. ELBP reconstructs a much better textural feature extraction vector from an original gray-level image under different lighting conditions. The dimensionality of the texture image is reduced by principal component analysis performed on each local face region. Each low-dimensional vector representing a local region is then weighted based on the significance of the sub-region. The weight of each sub-region is determined by the local variance estimate of the respective region, which represents the significance of the region. The final facial textural feature vector is obtained by concatenating the reduced-dimensional weighted sets of all the modules (sub-regions) of the face image. Experiments conducted on various popular face databases show promising performance of the proposed algorithm under varying lighting, expression, and partial occlusion conditions. Four databases were used for testing the performance of the proposed system: the Yale Face database, the Extended Yale Face database B, the Japanese Female Facial Expression database, and the CMU AMP Facial Expression database. The experimental results on all four databases show the effectiveness of the proposed system. The computational cost is also lower because of the simplified calculation steps. Research is ongoing to investigate the effectiveness of the proposed face recognition method under pose-varying conditions as well. It is envisaged that a multi-lane approach of trained frameworks at different pose bins, together with an appropriate voting strategy, would lead to a good recognition rate in such situations.
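In the spirit of the weighting scheme above, here is a sketch of region-wise LBP histograms weighted by local variance. The paper's ELBP operator is replaced by the standard uniform LBP from scikit-image, and the grid size is an assumption.

```python
# Variance-weighted, region-wise LBP descriptor for a face image.
import numpy as np
from skimage.feature import local_binary_pattern

def weighted_lbp_descriptor(face: np.ndarray, grid=(4, 4), P=8, R=1.0):
    lbp = local_binary_pattern(face, P, R, method="uniform")
    gh, gw = face.shape[0] // grid[0], face.shape[1] // grid[1]
    chunks = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            region = lbp[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            hist, _ = np.histogram(region, bins=P + 2, range=(0, P + 2))
            # Local variance of the raw region acts as a significance weight.
            weight = face[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw].var()
            chunks.append(weight * hist / max(hist.sum(), 1))
    return np.concatenate(chunks)

face = np.random.rand(64, 64)
print(weighted_lbp_descriptor(face).shape)   # (16 regions x 10 bins,)
```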
Research on the face pattern space division in images based on their different views
Different face views project different face topologies in 2D images. The unified processing of face images with smaller topological differences, corresponding to a smaller range of view angles, is more convenient, and vice versa. Thus many studies divide the entire face pattern space formed by multiview face images into many subspaces, each covering a small range of view angles. However, a large number of subspaces is computationally demanding, and different face processing algorithms take different strategies to handle view changes. Therefore, research on the proper division of the face pattern space is needed to ensure good performance. Unlike previous work, this paper proposes a theoretically optimal view-angle-range criterion for face pattern space division, derived from careful analysis of the structural differences of multiview faces and their influence on face processing algorithms. A face pattern space division method is then proposed. Finally, the paper uses the proposed criterion and method to divide the face pattern space for face detection and compares the result with other divisions. The results show that the proposed criterion and method satisfy the processing performance requirements with the minimum number of subspaces. This study can also help other research that needs to divide the pattern space of other objects based on their different views.
Interactive Paper Session
Agglomerative clustering using hybrid features for image categorization
Karina Damico, Roxanne L. Canosa
This research project describes an agglomerative image clustering technique that is used for the purpose of automating image categorization. The system is implemented in two stages: feature vector formation and feature space clustering. The features we selected are based on texture salience (Gabor filters and a binary pattern descriptor). Global properties are encoded via a hierarchical spatial pyramid, and local structure is encoded as a bit string retained via a set of histograms. The transform can be computed efficiently: it involves only 16 operations (8 comparisons and 8 additions) per 3x3 region. A disadvantage is that it is not invariant to rotation or scale changes; however, the spatial pyramid representing global structure helps to ameliorate this problem. An agglomerative clustering technique is implemented and evaluated based on ground-truth values and human subjective ratings.
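The 16-operation local transform can be sketched as follows: each pixel is compared with its 8 neighbors (8 comparisons) and the resulting bits are accumulated into a byte (8 additions of shifted bits). This is a generic sketch, not necessarily the paper's exact descriptor.

```python
# 3x3 binary pattern: 8 neighbor comparisons folded into one byte per pixel.
import numpy as np

def binary_pattern_3x3(img: np.ndarray) -> np.ndarray:
    c = img[1:-1, 1:-1]                       # interior pixels
    code = np.zeros_like(c, dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code += (nb > c).astype(np.uint8) << bit   # compare, then add
    return code

img = np.random.rand(16, 16)
print(binary_pattern_3x3(img).shape)   # (14, 14) codes in [0, 255]
```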
A comparison of histogram distance metrics for content-based image retrieval
Qianwen Zhang, Roxanne L. Canosa
The type of histogram distance metric selected for a CBIR query varies greatly and affects the accuracy of the retrieval results. This paper compares the retrieval results of a variety of commonly used CBIR distance metrics: the Euclidean distance, the Manhattan distance, the vector cosine angle distance, the histogram intersection distance, the χ2 distance, the Jensen-Shannon divergence, and the Earth Mover's distance. A training set of ground-truth labeled images is used to build a classifier for the CBIR system, where the images were obtained from three commonly used benchmarking datasets: the WANG dataset (http://savvash.blogspot.com/2008/12/benchmark-databases-for-cbir.html), the Corel Subset dataset (http://vision.stanford.edu/resources_links.html), and the CalTech dataset (http://www.vision.caltech.edu/htmlfiles/). To implement the CBIR system, we use the Tamura texture features of coarseness, contrast, and directionality. We create texture histograms of the training set and the query images, and then measure the difference between a randomly selected query and the corresponding retrieved image using a k-nearest-neighbors approach. Precision and recall are used to evaluate the retrieval performance of the system given a particular distance metric. Then, given the same query image, the distance metric is changed and the performance of the system is evaluated once again.
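Several of the compared distances are straightforward to compute for two normalized histograms, as sketched below (the Earth Mover's distance, which needs an optimization step, is omitted).

```python
# A few of the histogram distances compared above, for normalized h and g.
import numpy as np
from scipy.spatial.distance import jensenshannon

def chi2(h, g, eps=1e-12):
    return 0.5 * np.sum((h - g) ** 2 / (h + g + eps))

def intersection(h, g):          # similarity; 1 - value acts as a distance
    return np.minimum(h, g).sum()

h = np.random.dirichlet(np.ones(16))
g = np.random.dirichlet(np.ones(16))
print("Euclidean:", np.linalg.norm(h - g))
print("Manhattan:", np.abs(h - g).sum())
print("chi^2:", chi2(h, g))
print("intersection:", intersection(h, g))
print("Jensen-Shannon:", jensenshannon(h, g))
```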
Video salient event classification using audio features
Silvia Corchs, Gianluigi Ciocca, Massimiliano Fiori, et al.
The aim of this work is to detect events in video sequences that are salient with respect to the audio signal. In particular, we focus on the audio analysis of a video, with the goal of finding which features are significant for detecting audio-salient events. We extracted the audio tracks from videos of different sport events, and for each video we manually labeled the salient audio events using binary markings. For each frame, features in both the time and frequency domains are considered. These features are used to train different classifiers: Classification and Regression Trees, Support Vector Machines, and k-Nearest Neighbors. The classification performances are reported in terms of confusion matrices.
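A minimal sketch of the feature-extraction-plus-classifier pipeline is shown below, using a few common time/frequency features (energy, zero-crossing rate, spectral centroid) and a k-NN classifier; the specific features, frame sizes, and labels are assumptions, not the paper's exact set.

```python
# Frame an audio signal, extract simple time/frequency features, train k-NN.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def frame_features(x, sr, frame=1024, hop=512):
    feats = []
    for start in range(0, len(x) - frame, hop):
        w = x[start:start + frame]
        energy = np.mean(w ** 2)                        # time domain
        zcr = np.mean(np.abs(np.diff(np.sign(w)))) / 2  # zero-crossing rate
        spec = np.abs(np.fft.rfft(w))                   # frequency domain
        freqs = np.fft.rfftfreq(frame, 1 / sr)
        centroid = (freqs * spec).sum() / max(spec.sum(), 1e-12)
        feats.append([energy, zcr, centroid])
    return np.array(feats)

sr = 16000
x = np.random.randn(sr * 2)                      # placeholder audio track
X = frame_features(x, sr)
y = np.random.randint(0, 2, len(X))              # salient / non-salient labels
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("train accuracy:", clf.score(X, y))
```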