Proceedings Volume 9021

Document Recognition and Retrieval XXI


View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 2 February 2014
Contents: 9 Sessions, 29 Papers, 0 Presentations
Conference: IS&T/SPIE Electronic Imaging 2014
Volume Number: 9021

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 9021
  • Handwriting
  • Form Classification
  • Text Recognition
  • Handwritten Text Line Segmentation
  • Layout Analysis
  • Information Retrieval
  • Data Sets and Ground-Truthing
  • Interactive Paper Session
Front Matter: Volume 9021
This PDF file contains the front matter associated with SPIE Proceedings Volume 9021 including the Title Page, Copyright information, Table of Contents, Introduction, and Conference Committee listing.
Handwriting
Writer identification on historical Glagolitic documents
Stefan Fiel, Fabian Hollaus, Melanie Gau, et al.
This work aims at automatically identifying the scribes of historical Slavonic manuscripts. The quality of these ancient documents is partially degraded by faded ink and varying backgrounds. The writer identification method is based on Scale Invariant Feature Transform (SIFT) features. A visual vocabulary is used to describe the handwriting characteristics: the features are clustered with a Gaussian Mixture Model and encoded using the Fisher kernel. The approach was originally designed for grayscale images of modern handwriting, but, unlike modern documents, the historical manuscripts are partially corrupted by background clutter and water stains, so SIFT features are also found on the background. Since the method also performs well on binarized images of modern handwriting, it was additionally applied to binarized images of the ancient writings. Experiments show that this preprocessing step leads to a significant performance increase: the identification rate on binarized images is 98.9%, compared to 87.6% on grayscale images.
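The pipeline the abstract describes can be approximated in a few lines. The sketch below is a simplified stand-in, not the authors' implementation: it extracts SIFT descriptors with OpenCV, fits a diagonal-covariance Gaussian Mixture Model as the visual vocabulary, and encodes each page with a simplified Fisher vector (gradients with respect to the component means only). File names and the vocabulary size are illustrative assumptions.

```python
# Simplified sketch of SIFT + GMM + Fisher-vector encoding for writer
# identification; not the authors' code, file names are hypothetical.
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

def sift_descriptors(image_path):
    """Extract SIFT descriptors from a grayscale (or binarized) page image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def fisher_vector(desc, gmm):
    """Simplified Fisher vector: normalized gradient w.r.t. the GMM means."""
    q = gmm.predict_proba(desc)                       # (N, K) soft assignments
    diff = desc[:, None, :] - gmm.means_[None, :, :]  # (N, K, D)
    grad = (q[:, :, None] * diff / gmm.covariances_[None, :, :]).sum(axis=0)
    grad /= (desc.shape[0] * np.sqrt(gmm.weights_)[:, None])
    fv = grad.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))            # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)

# Vocabulary learned from training pages, then one Fisher vector per page;
# writers would be matched by nearest neighbour in this space.
train_desc = np.vstack([sift_descriptors(p) for p in ["page1.png", "page2.png"]])
gmm = GaussianMixture(n_components=32, covariance_type="diag").fit(train_desc)
query_fv = fisher_vector(sift_descriptors("query.png"), gmm)
```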
Probabilistic modeling of children's handwriting
Mukta Puri, Sargur N. Srihari, Lisa Hanson
There is little prior work on the analysis of children's handwriting, which can be useful in developing automatic evaluation systems and in quantifying handwriting individuality. We consider the statistical analysis of children's handwriting in early grades. Samples of handwriting of children in Grades 2-4 who were taught the Zaner-Bloser style were considered. The commonly occurring word "and", written in cursive style as well as hand-print, was extracted from extended writing. The samples were assigned feature values by human examiners using a truthing tool. The examiners looked at how the children constructed letter formations in their writing, noting similarities to and differences from the instructions taught in the handwriting copybook. These similarities and differences were measured using a feature-space distance measure. Results indicate that, with practice, the handwriting develops towards greater conformity with the class characteristics of the Zaner-Bloser copybook, which is the expected result. Bayesian networks were learnt from the data to enable answering various probabilistic queries, such as identifying students who may continue to produce letter formations as taught during lessons in school, identifying students who will develop different formations or variations of those letter formations, and estimating the number of different types of letter formations.
Variational dynamic background model for keyword spotting in handwritten documents
Gaurav Kumar, Safwan Wshah, Venu Govindaraju
We propose a Bayesian framework for keyword spotting in handwritten documents. This work extends our previous work, in which we proposed the dynamic background model (DBM) for keyword spotting, which takes into account local character-level scores and global word-level scores to learn a logistic regression classifier that separates keywords from non-keywords. In this work, we add a Bayesian layer on top of the DBM, called the variational dynamic background model (VDBM). The logistic regression classifier uses the sigmoid function to separate keywords from non-keywords. Because the sigmoid function is neither convex nor concave, exact inference in the VDBM becomes intractable, so an expectation maximization step is proposed for approximate inference. The advantages of the VDBM over the DBM are multi-fold. Firstly, being Bayesian, it prevents over-fitting of the data. Secondly, it provides better modeling of the data and improved prediction on unseen data. The VDBM is evaluated on the IAM dataset and the results show that it outperforms our prior work and other state-of-the-art line-based word spotting systems.
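For reference, the non-Bayesian baseline that the VDBM extends amounts to a logistic-regression (sigmoid) classifier over character-level and word-level scores. The sketch below is a minimal illustration under that assumption; the feature layout and values are made up and it is not the authors' DBM or VDBM code.

```python
# Minimal keyword/non-keyword classifier over recognition scores
# (illustrative stand-in for the DBM baseline, not the paper's model).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [mean character score, min character score, word-level score]
X_train = np.array([[0.91, 0.70, 0.88],   # keyword hits
                    [0.85, 0.62, 0.81],
                    [0.40, 0.10, 0.35],   # non-keywords
                    [0.55, 0.20, 0.48]])
y_train = np.array([1, 1, 0, 0])

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict_proba([[0.80, 0.55, 0.77]])[:, 1])  # P(keyword)
```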
Boosting bonsai trees for handwritten/printed text discrimination
Yann Ricquebourg, Christian Raymond, Baptiste Poirriez, et al.
Boosting over decision stumps has proved its efficiency in natural language processing, essentially with symbolic features, and its good properties (fast, with few non-critical parameters, and not prone to over-fitting) could be of great interest in the numeric world of pixel images. In this article we investigate the use of boosting over small decision trees for image classification, specifically for the discrimination of handwritten versus printed text. Experiments comparing it to the usual SVM-based classification show convincing results: very similar performance, but with faster predictions and a model that behaves far less like a black box. These promising results encourage the use of this classifier in more complex recognition tasks such as multiclass problems.
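A minimal sketch of boosting over small trees, assuming generic fixed-length pixel features and using scikit-learn's AdaBoost (version 1.2 or later) as a stand-in for the authors' boosting implementation; the features and labels below are dummy data.

```python
# Boosting over small ("bonsai") decision trees for a binary
# handwritten-vs-printed decision; illustrative only.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 64))          # e.g. 8x8 normalized pixel patches
y = rng.integers(0, 2, 200)        # 0 = printed, 1 = handwritten (dummy labels)

booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),  # small tree, not just a stump
    n_estimators=200,
)
booster.fit(X, y)
print(booster.predict(X[:5]))
```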
Form Classification
Form similarity via Levenshtein distance between ortho-filtered logarithmic ruling-gap ratios
Geometric invariants are combined with edit distance to compare the ruling configuration of noisy filled-out forms. It is shown that gap-ratios used as features capture most of the ruling information of even low-resolution and poorly scanned form images, and that the edit distance is tolerant of missed and spurious rulings. No preprocessing is required and the potentially time-consuming string operations are performed on a sparse representation of the detected rulings. Based on edit distance, 158 Arabic forms are classified into 15 groups with 89% accuracy. Since the method was developed for an application that precludes public dissemination of the data, it is illustrated on public-domain death certificates.
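The core idea, turning the ruling gaps of each form into a string of quantized logarithmic gap ratios and comparing strings by Levenshtein distance, can be sketched as follows. This is an illustrative approximation with hypothetical ruling positions and bin widths, not the published code.

```python
# Quantized log gap-ratio strings compared by edit distance (sketch).
import numpy as np

def gap_ratio_string(ruling_positions, bins=np.linspace(-1.0, 1.0, 21)):
    """Map log-ratios of consecutive ruling gaps to a symbol string."""
    gaps = np.diff(np.sort(ruling_positions)).astype(float)
    log_ratios = np.log(gaps[1:] / gaps[:-1])
    return "".join(chr(ord("a") + int(i)) for i in np.digitize(log_ratios, bins))

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    d = np.arange(len(b) + 1)
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return int(d[-1])

# Two noisy scans of the same form class should yield a small distance.
print(levenshtein(gap_ratio_string([10, 52, 95, 160, 210]),
                  gap_ratio_string([12, 54, 96, 158, 213])))
```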
Form classification and retrieval using bag of words with shape features of line structures
Florian Kleber, Markus Diem, Robert Sablatnig
In this paper a document form classification and retrieval method is proposed that uses a Bag of Words model together with newly introduced local shape features of form lines. In a preprocessing step the document is binarized and the form lines (solid and dotted) are detected. The shape features are based on the line information and describe local line structures, e.g. line endings, crossings, and boxes. The dominant line structures build a vocabulary for each form class. Based on this vocabulary, an occurrence histogram of the structures in a form document can be calculated for classification and retrieval. The proposed method has been tested on a set of 489 documents and 9 different form classes.
Text Recognition
Utilizing web data in identification and correction of OCR errors
In this paper, we report on our experiments for detection and correction of OCR errors with web data. More specifically, we utilize Google search to access the big data resources available to identify possible candidates for correction. We then use a combination of the Longest Common Subsequences (LCS) and Bayesian estimates to automatically pick the proper candidate. Our experimental results on a small set of historical newspaper data show a recall and precision of 51% and 100%, respectively. The work in this paper further provides a detailed classification and analysis of all errors. In particular, we point out the shortcomings of our approach in its ability to suggest proper candidates to correct the remaining errors.
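As an illustration of the candidate-selection step, the sketch below ranks web-derived correction candidates for a garbled OCR token by longest common subsequence length. It omits the Bayesian estimate and the Google search step, and the candidate lists are hypothetical; it is not the system described above.

```python
# Rank OCR correction candidates by LCS length (illustrative helper).
from functools import lru_cache

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    @lru_cache(maxsize=None)
    def rec(i: int, j: int) -> int:
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + rec(i + 1, j + 1)
        return max(rec(i + 1, j), rec(i, j + 1))
    return rec(0, 0)

def best_candidate(ocr_token: str, candidates: list[str]) -> str:
    """Pick the candidate sharing the longest subsequence with the OCR token."""
    return max(candidates, key=lambda c: lcs_length(ocr_token, c))

# e.g. candidates returned by a web search for the garbled token "govermnent"
print(best_candidate("govermnent", ["government", "governance", "garment"]))
```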
How well does multiple OCR error correction generalize?
William B. Lund, Eric K. Ringger, Daniel D. Walker
As the digitization of historical documents, such as newspapers, becomes more common, the archive patron's need for accurate digital text from those documents increases. Building on our earlier work, the contributions of this paper are: (1) demonstrating the applicability of novel methods for correcting optical character recognition (OCR) on disparate data sets, including a new synthetic training set; (2) enhancing the correction algorithm with novel features; and (3) assessing the data requirements of the correction learning method. First, we correct errors using conditional random fields (CRF) trained on synthetic training data sets in order to demonstrate the applicability of the methodology to unrelated test sets. Second, we show the strength of lexical features from the training sets on two unrelated test sets, yielding a relative reduction in word error rate (WER) on the test sets of 6.52%. New features capture the recurrence of hypothesis tokens and yield an additional relative reduction in WER of 2.30%. Further, we show that only 2.0% of the full training corpus of over 500,000 feature cases is needed to achieve correction results comparable to those using the entire training corpus, effectively reducing both the complexity of the training process and the learned correction model.
Video text localization using wavelet and shearlet transforms
Purnendu Banerjee, B. B. Chaudhuri
Text in video is useful and important for indexing and retrieving video documents efficiently and accurately. In this paper, we present a new method of text detection using a combined dictionary consisting of wavelets and a recently introduced transform called shearlets. Wavelets provide optimally sparse expansions for point-like structures, while shearlets provide optimally sparse expansions for curve-like structures. By combining these two features we compute a high-frequency sub-band that brightens the text part. K-means clustering is then used to obtain text pixels from the standard deviation (SD) of the combined wavelet and shearlet coefficients as well as the union of wavelet and shearlet features. Text parts are obtained by grouping neighboring regions based on geometric properties of the classified output frame of the unsupervised K-means classification. The proposed method, tested on a standard database as well as a newly collected one, is shown to be superior to some of the existing methods.
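A much-simplified sketch of the detection step, using only wavelet detail coefficients (PyWavelets) as a stand-in for the paper's combined wavelet and shearlet dictionary, with K-means on the block-wise standard deviation of the high-frequency response. The frame file name and block size are assumptions.

```python
# High-frequency sub-band via wavelet details, then K-means on block SD
# to separate candidate text blocks from background (illustrative only).
import cv2
import numpy as np
import pywt
from sklearn.cluster import KMeans

frame = cv2.imread("video_frame.png", cv2.IMREAD_GRAYSCALE).astype(float)

# Sum of absolute horizontal/vertical/diagonal detail coefficients.
_, (cH, cV, cD) = pywt.dwt2(frame, "haar")
high = np.abs(cH) + np.abs(cV) + np.abs(cD)

# Local standard deviation over small blocks as the per-block feature.
k = 8
h, w = (high.shape[0] // k) * k, (high.shape[1] // k) * k
blocks = high[:h, :w].reshape(h // k, k, w // k, k)
sd = blocks.std(axis=(1, 3)).reshape(-1, 1)

# Two clusters: candidate text blocks vs. background.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(sd)
text_mask = labels.reshape(h // k, w // k) == labels[sd.argmax()]
```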
Handwritten Text Line Segmentation
A Markov chain based line segmentation framework for handwritten character recognition
Yue Wu, Shengxin Zha, Huaigu Cao, et al.
In this paper, we present a novel text line segmentation framework following the divide-and-conquer paradigm: we iteratively identify and re-process regions of ambiguous line segmentation from an input document image until there is no ambiguity. To detect ambiguous line segmentation, we introduce the use of two complementary line descriptors, referred to as the underline and highlight line descriptors, and identify ambiguities when their patterns mismatch. As a result, we can easily identify already good line segmentations and largely simplify the original line segmentation problem by only reprocessing ambiguous regions. We evaluate the performance of the proposed line segmentation framework on the ICDAR 2009 handwritten document dataset, where it is close to the top-performing systems submitted to the competition. Moreover, the proposed method is robust against skew, noise, variable line heights and touching characters. The proposed idea can also be applied to other text analysis tasks such as word segmentation and page layout analysis.
Handwritten text segmentation using blurred image
In this paper, we present our new method for the segmentation of handwritten text pages into lines, which was submitted to the ICDAR 2013 handwritten segmentation competition. This method is based on two levels of perception of the image: a rough perception based on a blurred image, and a precise perception based on the presence of connected components. The combination of these two levels of perception makes it possible to deal with the difficulties of handwritten text segmentation: curvature, irregular slope and overlapping strokes. The analysis of the blurred image is effective on images with a high density of text, whereas the use of connected components makes it possible to connect the text lines in pages with a low text density. The combination of these two kinds of data is implemented with a grammatical description, which externalizes the knowledge linked to the page model. The page model contains a strategy of analysis that can be associated with an application goal; indeed, the text line segmentation depends on the kind of data that is analysed: homogeneous text pages, separated text blocks or unconstrained text. This method obtained a recognition rate of more than 98% in the last ICDAR 2013 competition.
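A rough sketch of the first, blurred level of perception only: heavy horizontal smoothing followed by a row projection profile whose bands indicate text lines. The grammar-driven combination with connected components is not reproduced, and the input file name, kernel size and threshold are assumptions.

```python
# Blurred-image line bands via anisotropic smoothing + row projection
# (simplified illustration, not the grammar-based system above).
import cv2
import numpy as np

page = cv2.imread("handwritten_page.png", cv2.IMREAD_GRAYSCALE)
ink = 255 - page                                  # ink as high values

# Anisotropic blur: smear strokes along the writing direction so each
# text line becomes one bright horizontal band.
blurred = cv2.blur(ink, (101, 9))

# Row projection profile of the blurred image.
profile = blurred.sum(axis=1).astype(float)
rows_with_text = profile > 0.3 * profile.max()

# Transitions in the profile mark where text bands start and end.
boundaries = np.flatnonzero(np.diff(rows_with_text.astype(int)))
print("row indices where text bands start/end:", boundaries)
```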
Layout Analysis
Optical music recognition on the International Music Score Library Project
Christopher Raphael, Rong Jin
A system is presented for optical recognition of music scores. The system processes a document page in three main phases. First it performs a hierarchical decomposition of the page, identifying systems, staves and measures. The second phase, which forms the heart of the system, interprets each measure found in the previous phase as a collection of non-overlapping symbols including both primitive symbols (clefs, rests, etc.) with fixed templates, and composite symbols (chords, beamed groups, etc.) constructed through grammatical composition of primitives (note heads, ledger lines, beams, etc.). This phase proceeds by first building separate top-down recognizers for the symbols of interest. Then, it resolves the inevitable overlap between the recognized symbols by exploring the possible assignment of overlapping regions, seeking globally optimal and grammatically consistent explanations. The third phase interprets the recognized symbols in terms of pitch and rhythm, focusing on the main challenge of rhythm. We present results that compare our system to the leading commercial OMR system using MIDI ground truth for piano music.
Document flow segmentation for business applications
The aim of this paper is to propose a supervised document flow segmentation approach applied to real-world heterogeneous documents. Our algorithm treats the flow of documents as couples of consecutive pages and studies the relationship that exists between them. First, sets of features are extracted from the pages, and we propose an approach to model a couple of pages as a single feature vector. This representation is provided to a binary classifier, which classifies the relationship as either segmentation or continuity. In the case of segmentation, we consider that we have a complete document and the analysis of the flow continues by starting a new document. In the case of continuity, the couple of pages is assigned to the same document and the analysis continues along the flow. If there is uncertainty about whether the relationship between the couple of pages should be classified as continuity or segmentation, a rejection is decided and the pages analyzed up to this point are considered a "fragment". This first classification already provides good results, approaching 90% on certain documents, which is high at this level of the system.
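The decision logic can be illustrated with a generic classifier standing in for the trained model; the feature vectors, reject thresholds and training data below are assumptions, not the paper's.

```python
# Page-couple classification with an uncertainty reject (illustrative sketch).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def decide(prob_segmentation: float, reject_band=(0.4, 0.6)) -> str:
    """Map P(segmentation) to a flow decision with an uncertainty reject."""
    if reject_band[0] < prob_segmentation < reject_band[1]:
        return "fragment"        # too uncertain: close the pages seen so far
    return "segmentation" if prob_segmentation >= reject_band[1] else "continuity"

# X: one feature vector per consecutive page couple; y: 1 = segmentation.
rng = np.random.default_rng(1)
X, y = rng.random((300, 20)), rng.integers(0, 2, 300)   # dummy training data
clf = RandomForestClassifier(n_estimators=100).fit(X, y)

for couple in rng.random((5, 20)):                      # stream of page couples
    p_seg = clf.predict_proba(couple.reshape(1, -1))[0, 1]
    print(decide(p_seg))
```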
LearnPos: a new tool for interactive learning positioning
The analysis of 2D structured documents often requires localizing data inside a document during the recognition process. In this paper we present LearnPos, a new generic tool, independent of any document recognition system, which models and evaluates positioning from a learning set of documents. Thanks to LearnPos, the user is helped in defining the physical structure of the document and can then concentrate on the definition of the logical structure of the documents. LearnPos is able to provide spatial information for both absolute and relative spatial relations, in interaction with the user. Our method can handle spatial relations composed of distinct zones and is able to suggest an appropriate order and point of view to minimize errors. We show that the resulting models can be successfully used for structured document recognition, while reducing the manual exploration of the document data set.
Document page structure learning for fixed-layout e-books using conditional random fields
Xin Tao, Zhi Tang, Canhui Xu
In this paper, a model is proposed to learn logical structure of fixed-layout document pages by combining support vector machine (SVM) and conditional random fields (CRF). Features related to each logical label and their dependencies are extracted from various original Portable Document Format (PDF) attributes. Both local evidence and contextual dependencies are integrated in the proposed model so as to achieve better logical labeling performance. With the merits of SVM as local discriminative classifier and CRF modeling contextual correlations of adjacent fragments, it is capable of resolving the ambiguities of semantic labels. The experimental results show that CRF based models with both tree and chain graph structures outperform the SVM model with an increase of macro-averaged F1 by about 10%.
Automatic comic page image understanding based on edge segment analysis
Dong Liu, Yongtao Wang, Zhi Tang, et al.
Comic page image understanding aims to analyse the layout of the comic page images by detecting the storyboards and identifying the reading order automatically. It is the key technique to produce the digital comic documents suitable for reading on mobile devices. In this paper, we propose a novel comic page image understanding method based on edge segment analysis. First, we propose an efficient edge point chaining method to extract Canny edge segments (i.e., contiguous chains of Canny edge points) from the input comic page image; second, we propose a top-down scheme to detect line segments within each obtained edge segment; third, we develop a novel method to detect the storyboards by selecting the border lines and further identify the reading order of these storyboards. The proposed method is performed on a data set consisting of 2000 comic page images from ten printed comic series. The experimental results demonstrate that the proposed method achieves satisfactory results on different comics and outperforms the existing methods.
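As a rough stand-in for the first stage only, the sketch below extracts Canny edges and generic Hough line segments with OpenCV, whereas the paper uses its own edge-point chaining and top-down line detection; the file name and thresholds are assumptions.

```python
# Canny edges + probabilistic Hough segments as storyboard border
# candidates (illustrative only, not the paper's detector).
import cv2
import numpy as np

page = cv2.imread("comic_page.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(page, 50, 150)

segments = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                           minLineLength=100, maxLineGap=5)

# Long, nearly horizontal or vertical segments are border-line candidates.
if segments is not None:
    for x1, y1, x2, y2 in segments[:, 0]:
        if abs(x1 - x2) < 5 or abs(y1 - y2) < 5:
            print("border candidate:", (x1, y1), (x2, y2))
```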
Information Retrieval
Scalable ranked retrieval using document images
Rajiv Jain, Douglas W. Oard, David Doermann
Despite the explosion of text on the Internet, hard copy documents that have been scanned as images still play a significant role for some tasks. The best method to perform ranked retrieval on a large corpus of document images, however, remains an open research question. The most common approach has been to perform text retrieval using terms generated by optical character recognition. This paper, by contrast, examines whether a scalable segmentation-free image retrieval algorithm, which matches sub-images containing text or graphical objects, can provide additional benefit in satisfying a user’s information needs on a large, real world dataset. Results on 7 million scanned pages from the CDIP v1.0 test collection show that content based image retrieval finds a substantial number of documents that text retrieval misses, and that when used as a basis for relevance feedback can yield improvements in retrieval effectiveness.
A contour-based shape descriptor for biomedical image classification and retrieval
Daekeun You, Sameer Antani, Dina Demner-Fushman, et al.
Contours, object blobs, and specific feature points are utilized to represent object shapes and extract shape descriptors that can then be used for object detection or image classification. In this research we develop a shape descriptor for biomedical image type (or, modality) classification. We adapt a feature extraction method used in optical character recognition (OCR) for character shape representation, and apply various image preprocessing methods to successfully adapt the method to our application. The proposed shape descriptor is applied to radiology images (e.g., MRI, CT, ultrasound, X-ray, etc.) to assess its usefulness for modality classification. In our experiment we compare our method with other visual descriptors such as CEDD, CLD, Tamura, and PHOG that extract color, texture, or shape information from images. The proposed method achieved the highest classification accuracy of 74.1% among all other individual descriptors in the test, and when combined with CSD (color structure descriptor) showed better performance (78.9%) than using the shape descriptor alone.
Semi-automated document image clustering and retrieval
Markus Diem, Florian Kleber, Stefan Fiel, et al.
In this paper a semi-automated approach to document image clustering and retrieval is presented to create links between different documents based on their content. Ideally, the initial bundling of shuffled document images can be reproduced to explore large document databases. Structural and textural features, which describe the visual similarity, are extracted and used by experts (e.g. registrars) to interactively cluster the documents with a manually defined feature subset (e.g. checked paper, handwritten). The methods presented allow for the analysis of heterogeneous documents that contain printed and handwritten text and support hierarchical clustering with different feature subsets in different layers.
Fast structural matching for document image retrieval through spatial databases
Hongxing Gao, Marçal Rusiñol, Dimosthenis Karatzas, et al.
The structure of document images plays a significant role in document analysis, and considerable efforts have been made towards extracting and understanding document structure, usually in the form of layout analysis approaches. In this paper, we first employ the Distance Transform based MSER (DTMSER) to efficiently extract stable document structural elements in terms of a dendrogram of key-regions. A fast structural matching method is then proposed to query the structure of a document (its dendrogram) using a spatial database, which facilitates the formulation of advanced spatial queries. The experiments demonstrate a significant improvement in a document retrieval scenario when compared to the use of typical Bag of Words (BoW) and pyramidal BoW descriptors.
Data Sets and Ground-Truthing
The Lehigh Steel Collection: a new open dataset for document recognition research
Barri Bruno, Daniel Lopresti
Document image analysis is a data-driven discipline. For a number of years, research was focused on small, homogeneous datasets such as the University of Washington corpus of scanned journal pages. More recently, library digitization efforts have raised many interesting problems with respect to historical documents and their recognition. In this paper, we present the Lehigh Steel Collection (LSC), a new open dataset we are currently assembling which will be, in many ways, unique to the field. LSC is an extremely large, heterogeneous set of documents dating from the 1960's through the 1990's relating to the wide-ranging research activities of Bethlehem Steel, a now-bankrupt company that was once the second-largest steel producer and the largest shipbuilder in the United States. As a result of the bankruptcy process and the disposition of the company's assets, an enormous quantity of documents (we estimate hundreds of thousands of pages) were left abandoned in buildings recently acquired by Lehigh University. Rather than see this history destroyed, we stepped in to preserve a portion of the collection via digitization. Here we provide an overview of LSC, including our efforts to collect and scan the documents, a preliminary characterization of what the collection contains, and our plans to make this data available to the research community for non-commercial purposes.
Interactive Paper Session
Two-stage approach to keyword spotting in handwritten documents
Mehdi Haji, Mohammad R. Ameri, Tien D. Bui, et al.
Separation of keywords from non-keywords is the main problem in keyword spotting systems which has traditionally been approached by simplistic methods, such as thresholding of recognition scores. In this paper, we analyze this problem from a machine learning perspective, and we study several standard machine learning algorithms specifically in the context of non-keyword rejection. We propose a two-stage approach to keyword spotting and provide a theoretical analysis of the performance of the system which gives insights on how to design the classifier in order to maximize the overall performance in terms of F-measure.
Extraction and labeling high-resolution images from PDF documents
Accuracy of content-based image retrieval is affected by image resolution, among other factors. Higher-resolution images enable the extraction of image features that more accurately represent the image content. In order to improve the relevance of search results for our biomedical image search engine, Open-I, we have developed techniques to extract and label high-resolution versions of figures from biomedical articles supplied in PDF format. Open-I uses the open-access subset of biomedical articles from the PubMed Central repository hosted by the National Library of Medicine. Articles are available in XML and in publisher-supplied PDF formats. As these PDF documents contain little or no metadata to identify the embedded images, the task includes labeling images according to their figure number in the article after they have been successfully extracted. For this purpose we use the labeled small-size images provided with the XML web version of the article. This paper describes the image extraction process and two alternative approaches to image labeling: one measures the similarity between two images based on the projection of image intensity onto the coordinate axes, and the other uses the normalized cross-correlation between the intensities of the two images. Using image identification based on intensity projection, we were able to achieve a precision of 92.84% and a recall of 82.18% in labeling the extracted images.
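The first of the two matching strategies can be sketched as follows: compare an extracted figure with each labeled thumbnail by correlating their intensity projections onto the coordinate axes. File names are hypothetical and this is not the Open-I production code.

```python
# Intensity-projection similarity for matching extracted figures to
# labeled thumbnails (illustrative sketch).
import cv2
import numpy as np

def projection_signature(path, size=(256, 256)):
    """Row and column intensity projections of a resized grayscale image."""
    img = cv2.resize(cv2.imread(path, cv2.IMREAD_GRAYSCALE), size).astype(float)
    return img.sum(axis=0), img.sum(axis=1)

def projection_similarity(path_a, path_b):
    """Mean Pearson correlation of the x- and y-axis projections."""
    (ax, ay), (bx, by) = projection_signature(path_a), projection_signature(path_b)
    return 0.5 * (np.corrcoef(ax, bx)[0, 1] + np.corrcoef(ay, by)[0, 1])

# Label the extracted figure with the best-matching web thumbnail.
thumbnails = ["fig1_thumb.png", "fig2_thumb.png", "fig3_thumb.png"]
scores = {t: projection_similarity("extracted_figure.png", t) for t in thumbnails}
print(max(scores, key=scores.get))
```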
Structure analysis for plane geometry figures
Tianxiao Feng, Xiaoqing Lu, Lu Liu, et al.
As there are increasing numbers of digital documents for educational purposes, we note that there is no retrieval application for mathematical plane geometry images. In this paper, we propose a method for retrieving plane geometry figures (PGFs), which often appear in geometry books and digital documents. First, detection algorithms are applied to detect common basic geometric shapes in a PGF image. Based on all basic shapes, we analyze the structural relationships between pairs of basic shapes and combine some of them into compound shapes to build the PGF descriptor. Afterwards, we apply a matching function to retrieve ranked candidate PGF images. The main contribution of the paper is a structure analysis method that better describes the spatial relationships in such images, which are composed of many overlapping shapes. Experimental results demonstrate that our analysis method and shape descriptor obtain good retrieval results with relatively high effectiveness and efficiency.
On-line signature verification method by Laplacian spectral analysis and dynamic time warping
Changting Li, Liangrui Peng, Changsong Liu, et al.
As smartphones and touch screens are more and more popular, on-line signature verification technology can be used as one of personal identification means for mobile computing. In this paper, a novel Laplacian Spectral Analysis (LSA) based on-line signature verification method is presented and an integration framework of LSA and Dynamic Time Warping (DTW) based methods for practical application is proposed. In LSA based method, a Laplacian matrix is constructed by regarding the on-line signature as a graph. The signature’s writing speed information is utilized in the Laplacian matrix of the graph. The eigenvalue spectrum of the Laplacian matrix is analyzed and used for signature verification. The framework to integrate LSA and DTW methods is further proposed. DTW is integrated at two stages. First, it is used to provide stroke matching results for the LSA method to construct the corresponding graph better. Second, the on-line signature verification results by DTW are fused with that of the LSA method. Experimental results on public signature database and practical signature data on mobile phones proved the effectiveness of the proposed method.
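One plausible reading of the LSA step, sketched under stated assumptions: a speed-weighted path graph over temporally adjacent signature sample points, whose Laplacian eigenvalue spectra are compared directly. The graph construction here may differ from the paper's exact formulation, and the data is synthetic.

```python
# Speed-weighted graph Laplacian spectrum for on-line signatures (sketch).
import numpy as np

def laplacian_spectrum(points, timestamps, k=20):
    """First k eigenvalues of a speed-weighted path-graph Laplacian."""
    pts, t = np.asarray(points, float), np.asarray(timestamps, float)
    speed = np.linalg.norm(np.diff(pts, axis=0), axis=1) / np.maximum(np.diff(t), 1e-6)
    n = len(pts)
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = speed        # link adjacent sample points
    L = np.diag(W.sum(axis=1)) - W                   # combinatorial Laplacian
    return np.sort(np.linalg.eigvalsh(L))[:k]

def spectrum_distance(sig_a, sig_b):
    """Euclidean distance between truncated Laplacian spectra."""
    return float(np.linalg.norm(laplacian_spectrum(*sig_a) - laplacian_spectrum(*sig_b)))

rng = np.random.default_rng(0)
sig = (rng.random((50, 2)), np.arange(50) * 0.01)    # dummy (x, y) samples + times
print(spectrum_distance(sig, sig))                   # 0.0 for identical signatures
```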
A slant removal technique for document page
Slant removal is a necessary preprocessing task in many document image processing systems. In this paper, we describe a technique for removing the slant from an entire page, avoiding the segmentation procedure. The presented technique can be combined with most existing slant removal algorithms. Experimental results are presented on two databases.
Robust binarization of degraded document images using heuristics
Jon Parker, Ophir Frieder, Gideon Frieder
Historically significant documents are often discovered with defects that make them difficult to read and analyze. This fact is particularly troublesome if the defects prevent software from performing an automated analysis. Image enhancement methods are used to remove or minimize document defects, improve software performance, and generally make images more legible. We describe an automated image enhancement method that is independent of the input page and requires no training data. The approach applies to color or greyscale images with handwritten script, typewritten text, images, and mixtures thereof. We evaluated the image enhancement method against the test images provided by the 2011 Document Image Binarization Contest (DIBCO). Our method outperforms all 2011 DIBCO entrants in terms of average F1 measure, doing so with a significantly lower variance than the top contest entrants. The capability of the proposed method is also illustrated using select images from a collection of historic documents stored at the Yad Vashem Holocaust Memorial in Israel.
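For context, the DIBCO-style F1 (F-measure) evaluation used above can be computed as below; the binarization here is plain Otsu thresholding as a placeholder, not the proposed enhancement method, and the file names are hypothetical.

```python
# Pixel-level precision/recall/F1 of a binarization against ground truth.
import cv2
import numpy as np

gt = cv2.imread("ground_truth.png", cv2.IMREAD_GRAYSCALE) < 128   # True = ink
img = cv2.imread("degraded_page.png", cv2.IMREAD_GRAYSCALE)
_, binarized = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
pred = binarized < 128

tp = np.logical_and(pred, gt).sum()
precision = tp / max(pred.sum(), 1)
recall = tp / max(gt.sum(), 1)
f1 = 2 * precision * recall / max(precision + recall, 1e-12)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```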
A machine learning based lecture video segmentation and indexing algorithm
Video segmentation and indexing are important steps in multimedia document understanding and information retrieval. This paper presents a novel machine learning based approach for the automatic structuring and indexing of lecture videos. By indexing video content, we can support both topic indexing and semantic querying of multimedia documents. Our proposed approach extracts features from video images and then uses these features to construct a model that labels video frames. Using this model, we are able to segment and index videos with an accuracy of 95% on our test collection.