Proceedings Volume 6073

Multimedia Content Analysis, Management, and Retrieval 2006

View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 15 January 2006
Contents: 11 Sessions, 38 Papers, 0 Presentations
Conference: Electronic Imaging 2006
Volume Number: 6073

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Editorial Paper
  • Video Analysis I
  • Audio and Video Retrieval
  • Image Retrieval
  • Applications I
  • Image Classification
  • Special Session: Evaluating Video Summarization, Browsing, and Retrieval Techniques
  • Feature Extraction
  • Video Analysis II
  • Applications II
  • Poster Session
Editorial Paper
Multimedia content analysis, management and retrieval: trends and challenges
Recent advances in computing, communications, and storage technology have made multimedia data prevalent. Multimedia has enormous potential for improving processes in a wide range of fields, such as advertising and marketing, education and training, entertainment, medicine, surveillance, wearable computing, biometrics, and remote sensing. The rich content of multimedia data, built through the synergies of the information contained in different modalities, calls for new and innovative methods for modeling, processing, mining, organizing, and indexing this data for effective and efficient searching, retrieval, delivery, management, and sharing of multimedia content, as required by applications in the abovementioned fields. The objective of this paper is to present our views on the trends that should be followed when developing such methods, to elaborate on the related research challenges, and to introduce the new conference, Multimedia Content Analysis, Management and Retrieval, as a premier venue for presenting and discussing these methods with the scientific community. Starting from 2006, the conference will be held annually as part of the IS&T/SPIE Electronic Imaging event.
Video Analysis I
Blind summarization: content-adaptive video summarization using time-series analysis
Severe complexity constraints on consumer electronic devices motivate us to investigate general-purpose video summarization techniques that can apply a common hardware setup to multiple content genres. On the other hand, we know that high-quality summaries can only be produced with domain-specific processing. In this paper, we present a time-series-analysis-based video summarization technique that provides a general core to which we can add small content-specific extensions for each genre. The proposed time-series analysis technique consists of unsupervised clustering of samples taken through sliding windows from the time series of features obtained from the content. We classify content into two broad categories: scripted content such as news and drama, and unscripted content such as sports and surveillance. The summarization problem then reduces to either finding semantic boundaries in the scripted content or detecting highlights in the unscripted content. The proposed technique is essentially an event detection technique and is thus best suited to unscripted content; however, we also find applications to scripted content. We thoroughly examine the trade-off between content-neutral and content-specific processing for effective summarization across a number of genres, and find that our core technique enables us to minimize the complexity of the content-specific processing and to postpone it to the final stage. We achieve the best results with unscripted content such as sports and surveillance video, in terms of both summary quality and minimal content-specific processing. For other genres such as drama, we find that more content-specific processing is required. We also find that a judicious choice of key audio-visual object detectors enables us to minimize the complexity of the content-specific processing while maintaining its applicability to a broad range of genres. We will present a demonstration of our proposed technique at the conference.
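As a rough illustration of the core technique (a sketch under assumed parameters, not the authors' implementation), the following Python fragment clusters sliding-window samples of a feature time series with k-means and flags windows that fall in sparsely populated clusters as candidate events; the window length, cluster count, and rarity threshold are all illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    def candidate_events(series, win=30, step=10, k=4, rare_frac=0.1):
        # Sample the feature time series through sliding windows.
        starts = range(0, len(series) - win + 1, step)
        X = np.array([series[s:s + win] for s in starts])
        # Unsupervised clustering of the window samples.
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        # Windows in rare clusters are candidate events/highlights.
        counts = np.bincount(labels, minlength=k)
        rare = counts < rare_frac * len(labels)
        return [s for s, l in zip(starts, labels) if rare[l]]

    # Toy usage: a mostly flat audio-energy signal with one burst.
    sig = np.zeros(600)
    sig[400:430] = 5.0
    print(candidate_events(sig))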
Multilevel analysis of sports video sequences
Jungong Han, Dirk Farin, Peter H. N. de With
We propose a fully automatic and flexible framework for the analysis and summarization of tennis broadcast video sequences, using visual features and specific game-context knowledge. Our framework can analyze a tennis video sequence at three levels, providing a broad range of different analysis results. The proposed framework includes novel pixel-level and object-level tennis video processing algorithms, such as moving-player detection that takes both color and court (playing-field) information into account, and a player-position tracking algorithm based on a 3-D camera model. Additionally, we employ scene-level models for detecting events, like service, base-line rally and net-approach, based on a number of real-world visual features. The system can summarize three forms of information: (1) all court-view playing frames in a game, (2) the moving trajectory and real speed of each player, as well as the relative position between the player and the court, and (3) the semantic event segments in a game. The proposed framework is flexible in choosing the level of analysis that is desired. It is effective because the framework makes use of several visual cues obtained from the real-world domain to model important events like service, thereby increasing the accuracy of the scene-level analysis. The paper presents attractive experimental results highlighting the system's efficiency and analysis capabilities.
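A minimal sketch of one pixel-level ingredient of such a framework, detecting court-view frames by the fraction of pixels near an assumed dominant court colour, could look as follows; the hue band and coverage threshold are illustrative assumptions, not the authors' parameters.

    import numpy as np

    def is_court_view(frame_hsv, hue_band=(0.05, 0.12), min_coverage=0.5):
        # Fraction of pixels whose hue lies in the assumed court band
        # (e.g. the orange of a clay court) decides the frame type.
        hue = frame_hsv[..., 0]
        in_band = (hue >= hue_band[0]) & (hue <= hue_band[1])
        return in_band.mean() >= min_coverage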
Automated editing of medical training video via content analysis
Kevon Andrews, Daniel Ring, Anil Kokaram, et al.
Physicians in the early part of their training inevitably undertake a course in Anatomy. Unfortunately, the amount of training medical students receive with real bodies has decreased. This is further exacerbated by the increasing gap between the number of medical students and the resources available. Medical faculties worldwide are increasingly turning to video training sessions as a complement to practical sessions. This paper presents a number of automated content access and enhancement tools designed to alleviate the difficulty of editing these sessions. The system is being deployed at the Royal College of Surgeons in Ireland.
Audio and Video Retrieval
Statistical model and error analysis of a proposed audio fingerprinting algorithm
In this paper we present a statistical analysis of a particular audio fingerprinting method proposed by Haitsma et al. [1] Due to the excellent robustness and synchronisation properties of this particular fingerprinting method, we would like to examine its performance for varying values of the parameters involved in the computation and ascertain its capabilities. For this reason, we pursue a statistical model of the fingerprint (also known as a hash, message digest or label). Initially we follow the work of a previous attempt made by Doets and Lagendijk [2-4] to obtain such a statistical model. By reformulating the representation of the fingerprint as a quadratic form, we present a model in which the parameters derived by Doets and Lagendijk may be obtained more easily. Furthermore, our model allows further insight into certain aspects of the behaviour of the fingerprinting algorithm not previously examined. Using our model, we then analyse the probability of error (Pe) of the hash. We identify two particular error scenarios and obtain an expression for the probability of error in each case. We present three methods of varying accuracy to approximate Pe following Gaussian noise addition to the signal of interest. We then analyse the probability of error following desynchronisation of the signal at the input of the hashing system and provide an approximation to Pe for different parameters of the algorithm under varying degrees of desynchronisation.
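The referenced Haitsma et al. method derives a 32-bit sub-fingerprint per frame from the sign of band-energy differences taken along both frequency and time. A sketch of that extraction step, with logarithmically spaced bands standing in for the original Bark-scaled bands and all frame parameters chosen for illustration:

    import numpy as np

    def subfingerprints(signal, sr=5000, frame=2048, hop=64, bands=33):
        # Band edges in FFT bins; log spacing approximates the
        # Bark-scaled bands of the original method (an assumption).
        edges = (np.logspace(np.log10(300), np.log10(2000), bands + 1)
                 / sr * frame).astype(int)
        win = np.hanning(frame)
        n_frames = 1 + (len(signal) - frame) // hop
        E = np.empty((n_frames, bands))
        for n in range(n_frames):
            spec = np.abs(np.fft.rfft(win * signal[n*hop:n*hop+frame])) ** 2
            E[n] = [spec[edges[m]:edges[m+1]].sum() for m in range(bands)]
        # Bit(n, m) = sign of the time difference of the band difference.
        d = np.diff(E, axis=1)            # difference along frequency
        return (d[1:] - d[:-1]) > 0       # shape (n_frames - 1, 32)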
An application of weighted transducers to music information retrieval
In this paper, a methodology is proposed for retrieving music documents using a query-by-example paradigm. The basic idea is that a collection of music documents can be indexed by the set of melodic contours of its documents, and retrieval is carried out using an approximate matching between query and document contours. The approximate matching is based on the use of Weighted Transducers, which model the document contours and are used to compute their similarity with the query. The methodology has been evaluated on a collection of documents and with a set of audio queries.
Video scene retrieval with symbol sequence based on integrated audio and visual features
In this paper, we propose a method to retrieve scenes semantically similar to a query video from large-scale video databases at high speed. Our method uses audio features and the color histogram as the visual feature, because the audio signal is closely related to the semantic content of videos and color is an extensively used feature in content-based image retrieval systems. The feature vectors are extracted from video segments called packets, clustered in the feature vector space, and transformed into symbols that represent the cluster IDs. Consequently, a video is expressed as a symbol sequence based on audio and visual features. Quick retrieval of similar scenes can be realized by symbol sequence matching. We conduct experiments using audio, visual, and both features, and examine the effect of each feature on videos of various genres.
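A sketch of the two stages, quantizing per-packet feature vectors into cluster-ID symbols and matching scenes as symbol sequences, might look as follows; k-means and plain edit distance are stand-ins, since the paper's exact clustering and matching schemes are not given here.

    import numpy as np
    from sklearn.cluster import KMeans

    def to_symbols(packet_features, k=16):
        # Quantize per-packet audio-visual features into cluster IDs.
        km = KMeans(n_clusters=k, n_init=10).fit(packet_features)
        return km.labels_

    def edit_distance(a, b):
        # Levenshtein distance between two symbol sequences; a lower
        # value means more similar scenes.
        D = np.arange(len(b) + 1)
        for i, ca in enumerate(a, 1):
            prev, D[0] = D[0], i
            for j, cb in enumerate(b, 1):
                prev, D[j] = D[j], min(D[j] + 1, D[j - 1] + 1,
                                       prev + (ca != cb))
        return int(D[-1])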
Físchlár-DiamondTouch: collaborative video searching on a table
Alan F. Smeaton, Hyowon Lee, Colum Foley, et al.
In this paper we present Físchlár-DT, the system we have developed for participation in the interactive search task of the annual TRECVid benchmarking activity, specifically TRECVid 2005. Our back-end search engine uses a combination of text search, which operates over automatically speech-recognised text, and image search, which uses low-level image features matched against video keyframes. The two novel aspects of our work are that we are evaluating collaborative, team-based search among groups of users working together, and that we are using a novel touch-sensitive tabletop interface and interaction device known as the DiamondTouch to support this collaborative search. The paper summarises the back-end search system as well as presenting, in detail, the interface we have developed.
Image Retrieval
Mind the gap: another look at the problem of the semantic gap in image retrieval
Jonathon S. Hare, Paul H. Lewis, Peter G. B. Enser, et al.
This paper attempts to review and characterise the problem of the semantic gap in image retrieval and the attempts being made to bridge it. In particular, we draw from our own experience in user queries, automatic annotation and ontological techniques. The first section of the paper describes a characterisation of the semantic gap as a hierarchy between the raw media and full semantic understanding of the media's content. The second section discusses real users' queries with respect to the semantic gap. The final sections of the paper describe our own experience in attempting to bridge the semantic gap. In particular we discuss our work on auto-annotation and semantic-space models of image retrieval in order to bridge the gap from the bottom up, and the use of ontologies, which capture more semantics than keyword object labels alone, as a technique for bridging the gap from the top down.
Evaluation of strategies for multiple sphere queries with local image descriptors
Nouha Bouteldja, Valérie Gouet-Brunet, Michel Scholl
In this paper, we are interested in the fast retrieval, in a large collection of points in a high-dimensional space, of points close to a set of m query points (a multiple query): we want to efficiently find the sequence {A_i, i ∈ [1, m]}, where A_i is the set of points within a sphere of center query point p_i, i ∈ [1, m], and radius ε (a sphere query). It has been argued that beyond a rather small dimension (d ⩾ 10), for such sphere queries as well as for other similarity queries, sequentially scanning the collection of points is faster than traversing a tree structure indexing the collection (the so-called curse of dimensionality phenomenon). Our first contribution is to experimentally assess whether the curse of dimensionality is reached for various point distributions. We compare the performance of a single sphere query when the collection is indexed by a tree structure (an SR-tree in our experiments) to that of a sequential scan. The second objective of this paper is to propose and evaluate several algorithms for multiple queries in a collection of points indexed by a tree structure. We compare the performance of these algorithms to that of a naive one consisting of sequentially running the m queries. This study is applied to content-based image retrieval where images are described by local descriptors based on points of interest. Such descriptors have a relatively small dimension (8 to 30), justifying indexing the collection of points with a tree structure; similarity search with local descriptors implies multiple sphere queries that are usually expensive in time, justifying the proposal of new strategies.
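For reference, the sequential-scan baseline against which the tree-based strategies are compared is straightforward; a sketch:

    import numpy as np

    def multi_sphere_scan(points, queries, eps):
        # For each query point p_i, return the indices of the points
        # of the collection that lie within distance eps (the set A_i).
        results = []
        for p in queries:
            d2 = ((points - p) ** 2).sum(axis=1)
            results.append(np.nonzero(d2 <= eps * eps)[0])
        return results

    # Toy usage with 8-dimensional local descriptors.
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(10000, 8))
    qs = rng.normal(size=(5, 8))
    print([len(a) for a in multi_sphere_scan(pts, qs, eps=1.5)])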
2+2=5: painting by numbers
Colin C. Venters, Richard J. Hartley, William T. Hewitt
Is query by visual example an intuitive method for visual query formulation or merely a prototype framework for visual information retrieval research that cannot support the rich variety of visual search strategies required for effective image retrieval? This paper reports the results of an investigation that aimed to explore the usability of the query by paint method in supporting a range of information problems. While the results show that there was no significant difference, p>0.001, on all four measures of usability, query by paint was considered by this sample not to support visual query expression. It was also observed that the usability of the query method combined with the mental model of the information problem affected both visual query expression and retrieval results. This has important implications for the efficacy and utility of content-based image retrieval as a whole and there is an increasing need to examine the usefulness of query methods and retrieval features in context.
Applications I
Structuring continuous video recordings of everyday life using time-constrained clustering
As personal wearable devices become more powerful and ubiquitous, soon everyone will be capable of continuously recording video of everyday life. The archive of continuous recordings needs to be segmented into manageable units so that it can be efficiently browsed and indexed by video retrieval systems. Many researchers approach the problem with two-pass methods: segmenting the continuous recordings into chunks, followed by clustering the chunks. In this paper we propose a novel one-pass algorithm that accomplishes both tasks at the same time by imposing time constraints on the K-Means clustering algorithm. We evaluate the proposed algorithm on 62.5 hours of continuous recordings, and the experimental results show that the time-constrained clustering algorithm substantially outperforms the unconstrained version.
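One simple way to impose a time constraint on K-Means, offered as an approximation of the idea rather than the authors' formulation, is to append a weighted, normalised timestamp to each feature vector so that frames far apart in time resist being merged into the same cluster:

    import numpy as np
    from sklearn.cluster import KMeans

    def time_constrained_kmeans(features, timestamps, k, time_weight=1.0):
        # Normalise timestamps to [0, 1] and append them as an extra
        # feature dimension; larger time_weight enforces a stronger
        # preference for temporally contiguous clusters.
        t = np.asarray(timestamps, dtype=float).reshape(-1, 1)
        t = (t - t.min()) / (t.max() - t.min())
        X = np.hstack([features, time_weight * t])
        return KMeans(n_clusters=k, n_init=10).fit_predict(X)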
Practical life log video indexing based on content and context
Today, multimedia information plays an important role in daily life, and people can use imaging devices to capture their visual experiences. In this paper, we present our personal Life Log system, which records personal experiences in the form of wearable video and environmental data; in addition, an efficient retrieval system is demonstrated to recall the desired media. We summarize practical video indexing techniques based on Life Log content and context, detecting talking scenes using audio/visual cues and extracting semantic key frames from GPS data. Voice annotation is also demonstrated as a practical indexing method. Moreover, we apply body media sensors to record a continuous life style and use the body media data to index the semantic key frames. In the experiments, we demonstrate various video indexing results with their semantic contents and show Life Log visualizations for examining one's personal life effectively.
Multimedia for mobile environment: image enhanced navigation
Shantanu Gautam, Gabi Sarkis, Edwin Tjandranegara, et al.
As mobile systems (such as laptops and mobile telephones) continue to grow, navigation assistance and location-based services are becoming increasingly important. Existing technology allows mobile users to access Internet services (e.g. email and web surfing) and simple multimedia services (e.g. music and video clips), and to make telephone calls. However, the potential of advanced multimedia services has not been fully developed, especially multimedia for navigation or location-based services. At Purdue University, we are developing an image database, known as LAID, in which every image is annotated with its location, compass heading, acquisition time, and weather conditions. LAID can be used to study several types of navigation problems: A mobile user can take an image and transmit it to the LAID server. The server compares the image with the images stored in the database to determine where the user is located. We refer to this as the "forward" navigation problem. The second type of problem is to provide a "virtual tour on demand". A user inputs a starting and an ending address, and LAID retrieves the images along a route that connects the two addresses. This is a generalization of route planning. Our database currently contains over 20000 images and covers approximately 25% of the city of West Lafayette, Indiana.
Image Classification
Semantic classification of business images
Berna Erol, Jonathan J. Hull
Digital cameras are becoming increasingly common for capturing information in business settings. In this paper, we describe a novel method for classifying images into the following semantic classes: document, whiteboard, business card, slide, and regular images. Our method is based on combining low-level image features, such as text color, layout, and handwriting features, with high-level OCR output analysis. Several support vector machine classifiers are combined for multi-class classification of input images. The system yields 95% accuracy in classification.
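The abstract does not spell out how the binary SVMs are combined; a common choice, shown here as an assumption, is a one-vs-rest arrangement over the five classes:

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    CLASSES = ["document", "whiteboard", "business card", "slide", "regular"]

    def train_classifier(X, y):
        # X: one row per image of low-level (text color, layout,
        # handwriting) and OCR-derived features; y: indices into CLASSES.
        return OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)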
Region labelling using a point-based coherence criterion
Query By Visual Example (QBVE) has been widely exploited in image retrieval. Global visual similarity as well as points-of-interest matching have proven their efficiency when an example image/region is available. If the starting image is missing, the Query By Visual Thesaurus (QBVT) paradigm compensates for it by allowing the user to compose his mental query image through visual patches summarizing the region database. In this paper, we propose to enrich the paradigm of mental image search by constructing a reliable visual thesaurus of the regions by means of a new coherence criterion. Our criterion encapsulates the local distribution of detected points of interest within a region. It leads to semantic labelling of region categories using the spatial topology of the points. Our point-based criterion has been validated on a generic image database combining homogeneous regions as well as irregularly and fully textured patterns.
BlobContours: adapting Blobworld for supervised color- and texture-based image segmentation
Thomas Vogel, Dinh Quyen Nguyen, Jana Dittmann
Extracting features is the first and one of the most crucial steps in the image retrieval process. While the color features and the texture features of digital images can be extracted rather easily, the shape features and the layout features depend on reliable image segmentation. Unsupervised image segmentation, often used in image analysis, works on a merely syntactical basis. That is, what an unsupervised segmentation algorithm can segment is only regions, not objects. To obtain high-level objects, which is desirable in image retrieval, human assistance is needed. Supervised image segmentation schemes can improve the reliability of segmentation and segmentation refinement. In this paper we propose a novel interactive image segmentation technique that combines the reliability of a human expert with the precision of automated image segmentation. The iterative procedure can be considered a variation on the Blobworld algorithm introduced by Carson et al. of the EECS Department, University of California, Berkeley. Starting with an initial segmentation as provided by the Blobworld framework, our algorithm, named BlobContours, gradually updates it by recalculating every blob, based on the original features and the updated number of Gaussians. Since the original algorithm was not designed for interactive processing, we had to consider additional requirements for realizing a supervised segmentation scheme on the basis of Blobworld. Increasing the transparency of the algorithm by applying user-controlled iterative segmentation, providing different types of visualization for displaying the segmented image, and decreasing the computational time of segmentation are three major requirements, which are discussed in detail.
Special Session: Evaluating Video Summarization, Browsing, and Retrieval Techniques
Subjective assessment of consumer video summarization
Clifton Forlines, Kadir A. Peker, Ajay Divakaran
The immediate availability of a vast amount of multimedia content has created a growing need for improvements in the field of content analysis and summarization. While researchers have been rapidly making contributions and improvements to the field, we must never forget that content analysis and summarization themselves are not the user's goals. Users' primary interests fall into one of two categories; they normally either want to be entertained or want to be informed (or both). Summarization is therefore just another tool for improving the entertainment value or the information gathering value of the video watching experience. In this paper, we first explore the relationship between the viewer, the interface, and the summarization algorithms. Through an understanding of the user's goals and concerns, we present means for measuring the success of the summarization tools. Guidelines for the successful use of summarization in consumer video devices are also discussed.
Evaluation of automatic video summarization systems
Compact representations of video data, or video summaries, greatly enhance efficient video browsing. However, rigorous evaluation of video summaries generated by automatic summarization systems is a complicated process. In this paper we examine the summary evaluation problem. Text summarization is the oldest and most successful summarization domain. We show some parallels between these two domains and introduce methods and terminology. Finally, we present results for a comprehensive summary evaluation that we have performed.
Subjective evaluation criterion for selecting affective features and modeling highlights
Liyuan Xing, Hua Yu, Qingming Huang, et al.
In this paper, we propose a subjective evaluation criterion that serves as a guide for selecting affective features and modeling highlights. Firstly, a database of highlights ground truth is established, considering both the randomness of the data set and the preparation of the subjects. Secondly, commonly used affective features, including visual, audio and editing features, are extracted to express the highlights. Thirdly, the subjective evaluation criterion is proposed based on an analysis of the average-error method and the pairwise-comparisons method, and the rationality of this criterion in our specific application is explained with respect to three detailed issues. Finally, evaluation experiments are designed on tennis and table tennis as examples. Based on the experiments, we show that previous work on affective features and linear highlight models is effective. Furthermore, an affective accuracy of 82.0% (79.3%) is obtained fully automatically by computer, which is a remarkable highlight-ranking result. This result shows that the subjective evaluation criterion is well designed for selecting affective features and modeling highlights.
Evaluation and user studies with respect to video summarization and browsing
The Informedia group at Carnegie Mellon University has since 1994 been developing and evaluating surrogates, summary interfaces, and visualizations for accessing digital video collections containing thousands of documents, millions of shots, and terabytes of data. This paper surveys the Informedia user studies that have taken place through the years, reporting on how these studies can provide a user pull complementing the technology push as automated video processing advances. The merits of discount usability techniques for iterative improvement and evaluation are presented, as well as the structure of formal empirical investigations with end users that have ecological validity while addressing the human computer interaction metrics of efficiency, effectiveness, and satisfaction. The difficulties in evaluating video summarization and browsing interfaces are discussed. Lessons learned from Informedia user studies are reported with respect to video summarization and browsing, ranging from the simplest portrayal of a single thumbnail to represent video stories, to collections of thumbnails in storyboards, to playable video skims, to video collages with multiple synchronized information perspectives.
Feature Extraction
Semantic feature extraction with multidimensional hidden Markov model
Conventional block-based classification is based on the labeling of individual blocks of an image, disregarding any adjacency information. When analyzing a small region of an image, it is sometimes difficult even for a person to tell what the image is about. Hence, the drawback of context-free use of visual features is recognized up front. This paper studies a context-dependent classifier based on a two-dimensional Hidden Markov Model. In particular, we explore how the balance between structural information and content description affects the precision in a semantic feature extraction scenario. We train a set of semantic classes using the development video archive annotated by the TRECVid 2005 participants. To extract semantic features, the classes with maximum a posteriori probability are searched jointly for all blocks. Preliminary results indicate that the performance of the system can be increased by varying the block size.
Rotation and translation invariant feature extraction using angular projection in frequency domain
This paper presents a new approach to translation- and rotation-invariant texture feature extraction for image texture retrieval. For the rotation-invariant feature extraction, we introduce an angular projection along the angular frequency in the polar coordinate system. The translation- and rotation-invariant feature vector representing texture images is constructed from the averaged magnitude and the standard deviations of the magnitude of the Fourier transform spectrum obtained by the proposed angular projection. In order to easily implement the angular projection, the Radon transform is employed to obtain the Fourier transform spectrum of images in the polar coordinate system. Then, the angular projection is applied to extract the feature vector. We present experimental results showing the robustness against image rotation and the discriminatory capability for different texture images using the MPEG-7 data set. Our experimental results show that the proposed rotation- and translation-invariant feature vector gives effective retrieval performance for texture images with homogeneous, isotropic and local directionality.
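The Radon-based route to the polar spectrum rests on the projection-slice theorem: the 1-D Fourier transform of the projection at angle θ equals the slice of the 2-D spectrum along that angle. A sketch using scikit-image, with mean/std pooling over angles as the rotation-invariant feature (the pooling details here are an assumption):

    import numpy as np
    from skimage.transform import radon

    def rotation_invariant_features(image, n_angles=180):
        theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
        # Sinogram columns are projections at each angle.
        sino = radon(image, theta=theta)
        # 1-D FFT per projection gives the polar spectrum (slice theorem).
        mag = np.abs(np.fft.rfft(sino, axis=0))
        # Pooling across angles removes the rotation dependence.
        return np.concatenate([mag.mean(axis=1), mag.std(axis=1)])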
Invariant region descriptors for robust shot segmentation
Anjulan Arasanathan, Nishan Canagarajah
This paper describes a novel method for automatic shot cut detection, based on local region descriptors. We define a new similarity measure for temporally adjacent frames and demonstrate the advantages with accurate shot cut detection. Compared to previous approaches the proposed method is highly robust to camera and object motions and can withstand severe illumination changes and spatial editing. We apply the proposed method to different kinds of video sequences and demonstrate superior performance compared to existing methods.
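A sketch of the similarity measure's flavour, with OpenCV's ORB standing in for the paper's region descriptors (the actual descriptors and thresholds are not specified here):

    import cv2

    def frame_similarity(f1, f2, ratio=0.75):
        # Fraction of descriptors in grayscale frame f1 with a good
        # match in f2; a low value between temporally adjacent frames
        # suggests a shot cut.
        orb = cv2.ORB_create()
        _, d1 = orb.detectAndCompute(f1, None)
        _, d2 = orb.detectAndCompute(f2, None)
        if d1 is None or d2 is None:
            return 0.0
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
        pairs = matcher.knnMatch(d1, d2, k=2)
        good = [p for p in pairs
                if len(p) == 2 and p[0].distance < ratio * p[1].distance]
        return len(good) / max(len(d1), 1)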
Video Analysis II
A video processing method for convenient mobile reading of printed barcodes with camera phones
Christer Bäckström, Caj Södergård, Sture Udd
Efficient communication requires an appropriate choice and combination of media. The print media has succeeded in attracting audiences even in our electronic age because of its high usability. However, the limitations of print are self-evident. By finding ways of combining printed and electronic information into so-called hybrid media, the strengths of both media can be obtained. In hybrid media, paper functions as an interface to the web, integrating printed products into the connected digital world. This is a "reinvention" of printed matter, making it into a more communicative technology. Hybrid media means that printed products can be updated in real time. Multimedia clips, personalization and e-shopping can be added as part of the interactive medium. The concept of enhancing print with interactive features has been around for years. However, the technology has so far been too restricting - people don't want to be tied in front of their PCs reading newspapers. Our solution is communicative and totally mobile. A code on paper or electronic media constitutes the link to mobility.
Flexible surveillance system architecture for prototyping video content analysis algorithms
Many proposed video content analysis algorithms for surveillance applications are very computationally intensive, which limits their integration into a total system running on one processing unit (e.g. a PC). To build flexible, low-cost prototyping systems, a distributed system with scalable processing power is therefore required. This paper discusses requirements for surveillance systems, considering two example applications. From these requirements, specifications for a prototyping architecture are derived. An implementation of the proposed architecture is presented, enabling the mapping of multiple software modules onto a number of processing units (PCs). The architecture enables fast prototyping of new algorithms for complex surveillance applications without considering resource constraints.
Motion based parsing for video from observational psychology
Anil Kokaram, Erika Doyle, Daire Lennon, et al.
In psychology it is common to conduct studies involving the observation of humans undertaking some task. The sessions are typically recorded on video and used for subjective visual analysis. The subjective analysis is tedious and time consuming, not only because much useless video material is recorded but also because subjective measures of human behaviour are not necessarily repeatable. This paper presents tools using content-based video analysis that allow automated parsing of video from one such study involving dyslexia. The tools rely on implicit measures of human motion that can be generalised to other applications in the domain of human observation. Results comparing quantitative assessment of human motion with subjective assessment are also presented, illustrating that the system is a useful scientific tool.
Applications II
Occlusion costing for multimedia object layout in a constrained window
In this paper, we propose a novel method for applying image analysis techniques, such as saliency map generation and face detection, to the creation of compelling image layouts. The layouts are designed to maximize the use of available real estate by permitting images to partially occlude one another or extend beyond the boundaries of the window, while retaining the majority of visual interest within the photo and deliberately avoiding objectionable visual incongruities. Optimal layouts are chosen from a candidate set through the calculation of a cost function called the occlusion cost. The basic form of the occlusion cost is applied to candidate layout sets where the sizes of the images are fixed with respect to the window. The area-compensated form of the occlusion cost permits a more general solution by relaxing the fixed-size constraint, and allowing each image to scale with respect to both the frame and the other images. Finally, a number of results for laying out one or two images within a frame are presented.
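In spirit, the basic occlusion cost charges a layout for the salient content it hides; a minimal sketch, assuming each image's saliency map has been rendered into window coordinates along with a visibility mask for the candidate layout:

    import numpy as np

    def occlusion_cost(saliency_maps, visibility_masks):
        # saliency_maps[i]: saliency of image i in window coordinates.
        # visibility_masks[i]: True where image i remains visible
        # (inside the window and not covered by an image on top of it).
        cost = 0.0
        for sal, visible in zip(saliency_maps, visibility_masks):
            cost += sal[~visible].sum()   # salient content lost
        return cost

    # A layout search evaluates this over the candidate set and keeps
    # the lowest-cost layout.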
Using CART to segment road images
The 2005 DARPA Grand Challenge is a 132-mile race through the desert with autonomous robotic vehicles. Lasers mounted on the car roof provide a map of the road up to 20 meters ahead of the car, but the car needs to see further in order to go fast enough to win the race. Computer vision can extend that map of the road ahead, but desert road is notoriously similar to the surrounding desert. The CART algorithm (Classification and Regression Trees) provided a machine-learning boost to find road while at the same time measuring when that road could not be distinguished from the surrounding desert.
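As an illustration of the approach, not the race code, scikit-learn's DecisionTreeClassifier implements CART and can be trained on per-pixel features, with labels projected from the laser's short-range road map (the feature set and labeling scheme here are assumptions):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_road_classifier(pixel_features, labels, max_depth=8):
        # pixel_features: one row per pixel (e.g. RGB plus texture);
        # labels: 1 for road (from the laser map), 0 otherwise.
        return DecisionTreeClassifier(max_depth=max_depth).fit(
            pixel_features, labels)

    # Usage sketch: extend the road map into the far field.
    # road_mask = clf.predict(far_pixels).reshape(height, width)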
Poster Session
A tree-based paradigm for content-based video retrieval and management
H. Fang, Y. Yin, J. Jiang
As video databases become increasingly important for the full exploitation of multimedia resources, this paper describes our recent feasibility studies towards building a content-based, high-level video retrieval/management system. The study focuses on constructing a semantic tree structure via a combination of low-level image processing techniques and high-level interpretation of visual content. Specifically, two separate algorithms were developed to organise input videos in terms of two layers: the shot layer and the key-frame layer. While the shot layer is derived by a multi-featured shot-cut detection, the key-frame layer is extracted automatically by a genetic algorithm. This paves the way for applying pattern recognition techniques to analyse those key frames and thus extract high-level information to interpret the visual content or objects. Correspondingly, content-based video retrieval can be conducted in three stages. The first stage is to browse the digital video via the semantic tree at the structural level; the second stage is to match key frames in terms of low-level features, such as colour, object shape, and texture; and the third stage is to match high-level information, such as a conversation with an indoor background, or moving vehicles along a seaside road. Extensive experiments on shot cut detection and key frame extraction, enabling the tree structure to be constructed, are reported in this paper.
Tangible interactive system for document browsing and visualisation of multimedia data
Yuriy Rytsar, Sviatoslav Voloshynovskiy, Oleksiy Koval, et al.
In this paper we introduce and develop a framework for document interactive navigation in multimodal databases. First, we analyze the main open issues of existing multimodal interfaces and then discuss two applications that include interaction with documents in several human environments, i.e., the so-called smart rooms. Second, we propose a system set-up dedicated to the efficient navigation in the printed documents. This set-up is based on the fusion of data from several modalities that include images and text. Both modalities can be used as cover data for hidden indexes using data-hiding technologies as well as source data for robust visual hashing. The particularities of the proposed robust visual hashing are described in the paper. Finally, we address two practical applications of smart rooms for tourism and education and demonstrate the advantages of the proposed solution.
Performance evaluation of a contextual news story segmentation algorithm
The problem of semantic video structuring is vital for the automated management of large video collections. The goal is to automatically extract from the raw data the inner structure of a video collection, so that a whole new range of applications for browsing and searching video collections can be derived from this high-level segmentation. To reach this goal, we exploit techniques that consider the full spectrum of video content; it is fundamental to properly integrate technologies from the fields of computer vision, audio analysis, natural language processing and machine learning. In this paper, a multimodal feature vector providing a rich description of the audio, visual and text modalities is first constructed. Boosted Random Fields are then used to learn two types of relationships: between features and labels, and between labels associated with various modalities for improved consistency of the results. The parameters of this enhanced model are found iteratively by using two successive stages of boosting. We experimented on the TRECVid corpus and show results that validate the approach over existing studies.
Annotating 3D contents with MPEG-7 for reuse purposes
Ioan Marius Bilasco, Jérôme Gensel, Marlène Villanova-Oliver, et al.
The progress and continuous evolution of computer capacities, as well as the emergence of the X3D standard, have recently boosted the 3D domain. Still, 3D data management remains expensive. A lot of (human and material) resources are needed to properly create 3D objects. In order to speed up the construction of new 3D objects, the reuse of existing objects is to be considered. Associating some semantics with 3D contents becomes a major issue, especially for reusing such contents or pieces of content after having extracted them from existing 3D scenes. In this paper, we address this issue by proposing a generic semantic annotation model for 3D, called 3DSEAM (3D Semantics Annotation Model). 3DSEAM aims at indexing 3D contents considering visual, geometric, topologic and semantic aspects. We extend MPEG-7 in order to support the localisation of 3D objects. With this extension, MPEG-7 can be used to instantiate the 3DSEAM model. Thus, through specific 3D locators, 3DSEAM can attach visual, geometric and semantic features to 3D objects defined within X3D fragments. These features can be indexed and then combined in order to formulate complex queries.
Multimedia for Art ReTrieval (M4ART)
The prototype of an online Multimedia for Art ReTrieval (M4ART) system is introduced, which provides access to the digitized collection of the National Gallery of the Netherlands (the Rijksmuseum). The current online system of the Rijksmuseum is text-based and requires expert knowledge concerning the work searched for; otherwise it fails to retrieve it. M4ART extends this system with querying by an example image, which can be uploaded to the system or selected by browsing the collection. The global color distribution and (optionally) a set of texture features of the example image are extracted and compared with those of the images in the collection. Hence, the collection can be queried based on either text or content-based features. Moreover, the matching process of M4ART can be inspected. With the latter feature, M4ART not only integrates in one system the means for both experts and laypersons to inspect collections, but also lets the user understand how it works. These characteristics make M4ART a unique system for accessing, enhancing, and retrieving the knowledge available in digitized art collections.
Application of image visual characterization and soft feature selection in content-based image retrieval
Kambiz Jarrah, Matthew Kyan, Ivan Lee, et al.
Fourier descriptors (FFT) and Hu's seven moment invariants (HSMI) are among the most popular shape-based image descriptors and have been used in various applications, such as recognition, indexing, and retrieval. In this work, we propose to use the invariance properties of Hu's seven moment invariants, as shape feature descriptors, for relevance identification in content-based image retrieval (CBIR) systems. The purpose of relevance identification is to find a collection of images that are statistically similar to, or match with, an original query image from within a large visual database. An automatic relevance identification module in the search engine is structured around an unsupervised learning algorithm, the self-organizing tree map (SOTM). In this paper we also propose a new ranking function within the structure of the SOTM that exponentially ranks the retrieved images based on their similarity to the query image. Furthermore, we extend our studies to optimize the contribution of individual feature descriptors for enhancing the retrieval results. The proposed CBIR system is compatible with the different architectures of other CBIR systems in terms of its ability to adapt to different similarity matching algorithms for relevance identification purposes, whilst offering flexibility of choice for alternative optimization and weight estimation techniques. Experimental results demonstrate the satisfactory performance of the proposed CBIR system.
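The exact form of the exponential ranking function is not given in the abstract; as a hypothetical illustration, retrieved images could be scored by an exponential decay over their feature-space distance to the query:

    import numpy as np

    def exponential_rank(distances, sigma=1.0):
        # Score retrieved images by exponentially decaying similarity;
        # sigma controls how sharply the ranking falls off.
        scores = np.exp(-np.asarray(distances, dtype=float) / sigma)
        order = np.argsort(-scores)       # most similar first
        return order, scores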
Video shot retrieval using a kernel derived from a continuous HMM
In this paper, we propose a discriminative approach for the retrieval of video shots characterized by a sequential structure. The task of retrieving shots similar in content to a few positive example shots is close to a binary classification problem. Hence, this task can be solved by a discriminative learning approach. For a content-based retrieval task, the twin characteristics of rare positive example occurrence and a sequential structure in the positive examples make it attractive to use a learning approach based on a generative model like the HMM. To make use of the positive aspects of both discriminative and generative models, we derive Fisher and modified score kernels for a continuous HMM and incorporate them into an SVM classification framework. The training-set video shots are used to learn the SVM classifier. A test-set video shot is ranked based on its proximity to the positive-class side of the hyperplane. We evaluate the performance of the derived kernels by retrieving video shots of airplane takeoff. The retrieval performance using the derived kernels is found to be much better compared to linear and RBF kernels.
Moving camera moving object segmentation in an MPEG-2 compressed video sequence
Jinsong Wang, Nilesh Patel, William Grosky
In this paper, we address the problem of camera and object motion detection in the compressed domain. The estimation of camera motion and the segmentation of moving objects have been widely studied in a variety of contexts for video analysis, because they are capable of providing essential clues for interpreting the high-level semantic meaning of video sequences. A novel compressed-domain motion estimation and segmentation scheme is presented and applied in this paper. The proposed algorithm uses MPEG-2 compressed motion vectors to undergo a spatial and temporal interpolation over several adjacent frames. An iterative rejection scheme based upon the affine model is exploited to effect global camera motion detection. The foreground spatiotemporal objects are separated from the background by applying a temporal consistency check to the output of the iterative segmentation. This consistency check process helps conglomerate the resulting foreground blocks and weed out unqualified blocks. Illustrative examples are provided to demonstrate the efficacy of the proposed approach.
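A sketch of the iterative rejection idea: fit a six-parameter affine camera model to the macroblock motion vectors by least squares, discard vectors that disagree with it, and refit; the surviving outliers indicate foreground blocks (the threshold and iteration count are illustrative):

    import numpy as np

    def global_motion(positions, vectors, iters=5, thresh=1.0):
        # positions: (N, 2) macroblock centres; vectors: (N, 2) MPEG-2
        # motion vectors. Model: v = [[a1, a2], [a3, a4]] p + [b1, b2].
        keep = np.ones(len(positions), dtype=bool)
        params = np.zeros(6)
        for _ in range(iters):
            P, V = positions[keep], vectors[keep]
            A = np.zeros((2 * len(P), 6))
            A[0::2, 0:2] = P
            A[0::2, 4] = 1.0              # rows for v_x
            A[1::2, 2:4] = P
            A[1::2, 5] = 1.0              # rows for v_y
            params, *_ = np.linalg.lstsq(A, V.reshape(-1), rcond=None)
            pred = np.column_stack([positions @ params[0:2] + params[4],
                                    positions @ params[2:4] + params[5]])
            keep = np.linalg.norm(vectors - pred, axis=1) < thresh
        return params, ~keep   # affine parameters, foreground mask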
Visual object categorization with indefinite kernels in discriminant analysis framework
The major focus of this work is on the application of indefinite kernels in multimedia processing applications illustrated on the problem of content-based digital image analysis and retrieval. The term "indefinite" here relates to kernel functions associated with non-metric distance measures that are known in many applications to better capture perceptual similarity defining relations among higher level semantic concepts. This paper describes a kernel extension of distance-based discriminant analysis method whose formulation remains convex irrespective of the definiteness property of the underlying kernel. The presented method deploys indefinite kernels rendered as unrestricted linear combinations of hyperkernels to approach the problem of visual object categorization. The benefits of the proposed technique are demonstrated empirically on a real-world image data set, showing an improvement in categorization accuracy.