Proceedings Volume 9402

Document Recognition and Retrieval XXII

Volume Details

Date Published: 14 January 2015
Contents: 8 Sessions, 23 Papers, 0 Presentations
Conference: SPIE/IS&T Electronic Imaging 2015
Volume Number: 9402

Table of Contents

  • Front Matter: Volume 9402
  • Document Layout Analysis and Understanding
  • Document Structure Semantics, Forms, and Tables
  • Text Analysis
  • Handwriting I
  • Quality and Compression
  • Graphics and Structure
  • Handwriting II
Front Matter: Volume 9402
This PDF file contains the front matter associated with SPIE Proceedings Volume 9402, including the Title Page, Copyright information, Table of Contents, Introduction (if any), and Conference Committee listing.
Document Layout Analysis and Understanding
Ground truth model, tool, and dataset for layout analysis of historical documents
Kai Chen, Mathias Seuret, Hao Wei, et al.
In this paper, we propose a new dataset and a ground-truthing methodology for layout analysis of historical documents with complex layouts. The dataset is based on a generic model for ground-truth representation of the complex layout structure of historical documents. To extract document contents uniformly, our model defines five types of regions of interest: page, text block, text line, decoration, and comment. Unconstrained polygons are used to outline the regions. A performance metric is proposed in order to evaluate various page segmentation methods based on this model. We have analysed four state-of-the-art ground-truthing tools: TRUVIZ, GEDI, WebGT, and Aletheia. From this analysis, we conceptualized and developed Divadia, a new tool that overcomes some of the drawbacks of these tools, aiming for simplicity and efficiency in the layout ground-truthing process on historical document images. With Divadia, we have created a new public dataset. This dataset contains 120 pages from three historical document image collections of different styles and is made freely available to the scientific community for historical document layout analysis research.
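For illustration only (this is not the authors' released format), a minimal Python encoding of such a ground-truth model could pair each of the five region types with an unconstrained polygon outline:

    from dataclasses import dataclass
    from enum import Enum
    from typing import List, Tuple

    class RegionType(Enum):
        PAGE = "page"
        TEXT_BLOCK = "text_block"
        TEXT_LINE = "text_line"
        DECORATION = "decoration"
        COMMENT = "comment"

    @dataclass
    class Region:
        """One ground-truth region: a type plus an unconstrained polygon outline."""
        kind: RegionType
        polygon: List[Tuple[float, float]]  # polygon vertices in image coordinates

    @dataclass
    class PageGroundTruth:
        image_path: str
        regions: List[Region]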
Use of SLIC superpixels for ancient document image enhancement and segmentation
Maroua Mehri, Nabil Sliti, Pierre Héroux, et al.
Designing reliable and fast segmentation algorithms for ancient documents has been a topic of major interest for many libraries and a prime research issue in the document analysis community. Thus, we propose in this article a fast ancient document enhancement and segmentation algorithm based on Simple Linear Iterative Clustering (SLIC) superpixels and Gabor descriptors in a multi-scale approach. Firstly, in order to obtain enhanced backgrounds of noisy ancient documents, a novel foreground/background segmentation algorithm based on SLIC superpixels is introduced. Once the SLIC technique has been carried out, the background and foreground superpixels are classified. Then, an enhanced, noise-free background is obtained by processing the background superpixels. Subsequently, Gabor descriptors are extracted only from the selected foreground superpixels of the enhanced gray-level ancient book document images, adopting a multi-scale approach. Finally, for ancient document image segmentation, a foreground superpixel clustering task is performed by partitioning the Gabor-based feature sets into compact and well-separated clusters in the feature space. The proposed algorithm does not assume any a priori information regarding document image content or structure and provides interesting results on a large corpus of ancient documents. Qualitative and numerical experiments demonstrate the enhancement and segmentation quality.
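A rough Python sketch of this kind of pipeline, assuming scikit-image and scikit-learn, a dark-ink-on-light-background heuristic for the foreground/background split, and illustrative Gabor frequencies (the paper's actual classification step and descriptor set are more elaborate):

    import numpy as np
    from skimage.color import rgb2gray
    from skimage.filters import gabor
    from skimage.segmentation import slic
    from sklearn.cluster import KMeans

    def enhance_and_segment(image, n_segments=1000, n_clusters=3):
        """image: RGB page image. Returns a content cluster per foreground superpixel."""
        gray = rgb2gray(image)
        labels = slic(image, n_segments=n_segments, compactness=10)
        ids = np.unique(labels)
        means = np.array([gray[labels == i].mean() for i in ids])
        # assume dark ink on a lighter background: darker superpixels = foreground
        fg_ids = ids[means < means.mean()]
        # multi-scale Gabor responses (real part), averaged per foreground superpixel
        responses = [gabor(gray, frequency=f)[0] for f in (0.1, 0.2, 0.4)]
        feats = np.array([[r[labels == i].mean() for r in responses] for i in fg_ids])
        clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
        return dict(zip(fg_ids.tolist(), clusters))  # superpixel id -> cluster label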
Software workflow for the automatic tagging of medieval manuscript images (SWATI)
Swati Chandna, Danah Tonne, Thomas Jejkal, et al.
Digital methods, tools and algorithms are gaining in importance for the analysis of digitized manuscript collections in the arts and humanities. One example is the BMBF-funded research project “eCodicology”, which aims to design, evaluate and optimize algorithms for the automatic identification of macro- and micro-structural layout features of medieval manuscripts. The main goal of this research project is to provide better insights into high-dimensional datasets of medieval manuscripts for humanities scholars. The heterogeneous nature and size of the humanities data, and the need to create a database of automatically extracted, reproducible features for better statistical and visual analysis, are the main challenges in designing a workflow for the arts and humanities. This paper presents a concept of a workflow for the automatic tagging of medieval manuscripts. As a starting point, the workflow uses medieval manuscripts digitized within the scope of the project “Virtual Scriptorium St. Matthias”. Firstly, these digitized manuscripts are ingested into a data repository. Secondly, specific algorithms are adapted or designed for the identification of macro- and micro-structural layout elements such as page size, writing space and number of lines. Lastly, a statistical analysis and scientific evaluation of the manuscript groups are performed. The workflow is designed generically to process large amounts of data automatically with any desired algorithm for feature extraction. As a result, a database of objectified and reproducible features is created which helps to analyze and visualize hidden relationships across around 170,000 pages. The workflow shows the potential of automatic image analysis by enabling the processing of a single page in less than a minute. Furthermore, accuracy tests of the workflow on a small set of manuscripts, with respect to features like page size and text areas, show that automatic and manual analysis are comparable. The use of a computer cluster will allow large amounts of data to be processed with high performance. The software framework itself will be integrated as a service into the DARIAH infrastructure to make it adaptable for a wider range of communities.
Document Structure Semantics, Forms, and Tables
Math expression retrieval using an inverted index over symbol pairs
David Stalnaker, Richard Zanibbi
We introduce a new method for indexing and retrieving mathematical expressions, and a new protocol for evaluating math formula retrieval systems. The Tangent search engine uses an inverted index over pairs of symbols in math expressions. Each key in the index is a pair of symbols along with their relative distance and vertical displacement within an expression. Matched expressions are ranked by the harmonic mean of the percentage of symbol pairs matched in the query and the percentage of symbol pairs matched in the candidate expression. We have found that our method is fast enough for real-time use and finds partial matches well, such as when subexpressions are rearranged (e.g. expressions moved from the left to the right of an equals sign) or when individual symbols (e.g. variables) differ from a query expression. In an experiment using expressions from English Wikipedia, student and faculty participants (N=20) found expressions returned by Tangent significantly more similar than those from a text-based retrieval system (Lucene) adapted for mathematical expressions. Participants provided similarity ratings using a 5-point Likert scale, evaluating expressions from both algorithms one at a time in a randomized order to avoid bias from the position of hits in search result lists. For the Lucene-based system, precision for the top 1 and 10 hits averaged 60% and 39% across queries, respectively, while for Tangent mean precision at 1 and 10 was 99% and 60%. A demonstration and source code are publicly available.
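A simplified Python sketch of the symbol-pair index and the harmonic-mean ranking, assuming each expression is given as (symbol, x, y) triples and using coarse rounding as a stand-in for Tangent's actual quantization of distance and displacement:

    from collections import defaultdict
    from itertools import combinations

    def pair_keys(symbols):
        """Quantized symbol-pair keys for one expression; symbols are (glyph, x, y)."""
        keys = set()
        for (a, ax, ay), (b, bx, by) in combinations(symbols, 2):
            keys.add((a, b, round(bx - ax), round(by - ay)))
        return keys

    def build_index(expressions):
        """expressions: dict expr_id -> list of (glyph, x, y). Returns inverted index."""
        index, pair_counts = defaultdict(set), {}
        for expr_id, symbols in expressions.items():
            keys = pair_keys(symbols)
            for k in keys:
                index[k].add(expr_id)
            pair_counts[expr_id] = len(keys)
        return index, pair_counts

    def search(query_symbols, index, pair_counts):
        """Rank candidates by the harmonic mean of matched-pair percentages."""
        qkeys = pair_keys(query_symbols)
        if not qkeys:
            return []
        matches = defaultdict(int)
        for k in qkeys:
            for expr_id in index.get(k, ()):
                matches[expr_id] += 1
        scores = {}
        for expr_id, m in matches.items():
            p_query = m / len(qkeys)
            p_cand = m / pair_counts[expr_id]
            scores[expr_id] = 2 * p_query * p_cand / (p_query + p_cand)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)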
Min-cut segmentation of cursive handwriting in tabular documents
Brian L. Davis, William A. Barrett, Scott D. Swingle
Handwritten tabular documents, such as census, birth, death and marriage records, contain a wealth of information vital to genealogical and related research. Much work has been done on segmenting freeform handwriting; however, segmentation of cursive handwriting in tabular documents is still an unsolved problem. Tabular documents present unique segmentation challenges caused by handwriting overlapping cell boundaries and other words, both horizontally and vertically, as ascenders and descenders reach into adjacent cells. This paper presents a method for segmenting handwriting in tabular documents using a min-cut/max-flow algorithm on a graph formed from a distance map and connected components of handwriting. Specifically, we focus on line, word and first-letter segmentation. Additionally, we include the angles of strokes of the handwriting as a third dimension of our graph to enable the resulting segments to share pixels of overlapping letters. Word segmentation accuracy is 89.5% when evaluated on lines from the dataset used in the ICDAR 2013 Handwriting Segmentation Contest. Accuracy is 92.6% for a specific application of segmenting first and last names from noisy census records. Accuracy for segmenting lines of names from noisy census records is 80.7%. The 3D graph cutting shows promise in segmenting overlapping letters, although highly convoluted or overlapping handwriting remains an ongoing challenge.
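A toy Python sketch of the min-cut step using networkx: edge capacities are made cheap where the distance map is large (far from ink), so the minimum cut prefers to run through background. The seed selection and the paper's 3D stroke-angle extension are not modeled here:

    import networkx as nx

    def split_line_mincut(dist_map, left_seeds, right_seeds):
        """dist_map: 2D array of distances to the nearest ink pixel.
        left_seeds / right_seeds: pixel coordinates known to belong to each segment."""
        h, w = dist_map.shape
        G = nx.DiGraph()
        for y in range(h):
            for x in range(w):
                for dy, dx in ((0, 1), (1, 0)):  # 4-connected neighbours
                    ny, nx_ = y + dy, x + dx
                    if ny < h and nx_ < w:
                        # cheap to cut through pixels far from ink
                        cap = 1.0 / (1.0 + min(dist_map[y, x], dist_map[ny, nx_]))
                        G.add_edge((y, x), (ny, nx_), capacity=cap)
                        G.add_edge((ny, nx_), (y, x), capacity=cap)
        # seed edges carry no 'capacity' attribute: networkx treats them as infinite
        for p in left_seeds:
            G.add_edge("SRC", p)
        for p in right_seeds:
            G.add_edge(p, "SINK")
        cut_value, (left, right) = nx.minimum_cut(G, "SRC", "SINK")
        return left - {"SRC"}, right - {"SINK"}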
Cross-reference identification within a PDF document
Sida Li, Liangcai Gao, Zhi Tang, et al.
Cross-references, such as footnotes, endnotes, figure/table captions, and references, are a common and useful type of page element that further explains the corresponding entities in a document. In this paper, we focus on cross-reference identification in PDF documents and present a robust method as a case study of identifying footnotes and figure references. The proposed method first extracts footnotes and figure captions, and then matches them with their corresponding references within a document. A number of novel features of a PDF document, namely page layout, font information, and lexical and linguistic features of cross-references, are utilized for the task. Clustering is adopted to handle features that are stable within one document but vary across different kinds of documents, so that the identification process adapts to the document type. In addition, the method leverages results from the matching process to provide feedback to the identification process and further improve accuracy. Preliminary experiments on real document sets show that the proposed method is promising for identifying cross-references in PDF documents.
Intelligent indexing: a semi-automated, trainable system for field labeling
Robert Clawson, William Barrett
We present Intelligent Indexing: a general, scalable, collaborative approach to indexing and transcription of non-machine-readable documents that exploits visual consensus and group labeling while harnessing human recognition and domain expertise. In our system, indexers work directly on the page and, with minimal context switching, can navigate the page, enter labels, and interact with the recognition engine. Interaction with the recognition engine occurs through preview windows that allow the indexer to quickly verify and correct recommendations. This interaction is far superior to conventional, tedious, inefficient post-correction and editing. Intelligent Indexing is a trainable system that improves over time and can provide benefit even without prior knowledge. A user study was performed to compare Intelligent Indexing to a basic, manual indexing system. Volunteers report that using Intelligent Indexing is less mentally fatiguing and more enjoyable than the manual indexing system. Their results also show that it significantly reduces (by 30.2%) the time required to index census records, while maintaining comparable accuracy. (A video demonstration is available at http://youtube.com/gqdVzEPnBEw.)
Text Analysis
Re-typograph phase I: a proof-of-concept for typeface parameter extraction from historical documents
Bart Lamiroy, Thomas Bouville, Julien Blégean, et al.
This paper reports on the first phase of an attempt to create a full retro-engineering pipeline that aims to construct a complete set of coherent typographic parameters defining the typefaces used in a printed homogeneous text. It should be stressed that this process cannot reasonably be expected to be fully automatic and that it is designed to include human interaction. Although font design is governed by a set of quite robust and formal geometric rulesets, it still heavily relies on subjective human interpretation. Furthermore, different parameters applied to the generic rulesets may actually result in quite similar and visually difficult to distinguish typefaces, making the retro-engineering an inverse problem that is ill conditioned once shape distortions (related to the printing and/or scanning process) come into play. This work is the first phase of a long iterative process, in which we will progressively study and assess the techniques from the state of the art that are most suited to our problem and investigate new directions when they prove to be not quite adequate. As a first step, this is more of a feasibility proof of concept that will allow us to clearly pinpoint the items that will require more in-depth research over the next iterations.
Clustering of Farsi sub-word images for whole-book recognition
Mohammad Reza Soheili, Ehsanollah Kabir, Didier Stricker
Redundancy of word and sub-word occurrences in large documents can be effectively utilized in an OCR system to improve recognition results. Most OCR systems employ language modeling techniques as a post-processing step; however, these techniques do not use the important pictorial information that exists in the text image. In the case of large-scale recognition of degraded documents, this information is even more valuable. In our previous work, we proposed a sub-word image clustering method for applications dealing with large printed documents. In our clustering method, the ideal case is when all equivalent sub-word images lie in one cluster. To overcome the issues of low print quality, the clustering method uses an image matching algorithm for measuring the distance between two sub-word images. The measured distance, together with a set of simple shape features, was used to cluster all sub-word images. In this paper, we analyze the effects of adding more shape features on processing time, purity of clustering, and the final recognition rate. Previously published experiments have shown the efficiency of our method on a book. Here we present extended experimental results and evaluate our method on another book with a completely different typeface. We also show that the number of newly created clusters on a page can be used as a criterion for assessing print quality and evaluating the preprocessing phases.
Gaussian process style transfer mapping for historical Chinese character recognition
Jixiong Feng, Liangrui Peng, Franck Lebourgeois
Historical Chinese character recognition is very important for large-scale historical document digitization, but it is a very challenging problem due to the lack of labeled training samples. This paper proposes a novel non-linear transfer learning method, namely Gaussian Process Style Transfer Mapping (GP-STM). GP-STM extends traditional linear Style Transfer Mapping (STM) by using Gaussian processes and kernel methods. With GP-STM, existing printed Chinese character samples are used to help the recognition of historical Chinese characters. To demonstrate this framework, we compare feature extraction methods, train a modified quadratic discriminant function (MQDF) classifier on printed Chinese character samples, and implement the GP-STM model on Dunhuang historical documents. Various kernels and parameters are explored, and the impact of the number of training samples is evaluated. Experimental results show that accuracy increases by nearly 15 percentage points (from 42.8% to 57.5%) using GP-STM, an improvement of more than 8 percentage points (from 49.2% to 57.5%) over the STM approach.
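In the spirit of GP-STM, though not the authors' exact formulation, a hedged scikit-learn sketch: learn a non-linear regression from historical-character features to the printed-style feature space, then reuse a classifier trained on printed samples:

    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def fit_style_mapping(X_historical, X_printed):
        """Regress printed-style features from historical-character features.
        Rows are assumed to be paired samples of the same character classes."""
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
        gp.fit(X_historical, X_printed)  # multi-output regression, shared RBF kernel
        return gp

    # Usage idea: map new historical features into the printed-style space, then
    # classify them with a model trained on printed characters, e.g.
    #   y_pred = printed_classifier.predict(style_map.predict(X_historical_new))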
Boost OCR accuracy using iVector based system combination approach
Optical character recognition (OCR) is a challenging task because most existing preprocessing approaches are sensitive to writing style, writing material, noise and image resolution. Thus, a single recognition system cannot address all factors of real document images. In this paper, we describe an approach to combining diverse recognition systems by using iVector-based features, a method recently developed in the field of speaker verification. Prior to system combination, document images are preprocessed and text line images are extracted with different approaches for each system; an iVector is derived from the high-dimensional supervector of each text line and used to predict OCR accuracy. We merge hypotheses from multiple recognition systems according to the overlap ratio and the predicted OCR score of the text line images. We present evaluation results on an Arabic document database where the proposed method is compared against the single best OCR system using the word error rate (WER) metric.
Handwriting I
Exploring multiple feature combination strategies with a recurrent neural network architecture for off-line handwriting recognition
L. Mioulet, G. Bideault, C. Chatelain, et al.
The BLSTM-CTC is a novel recurrent neural network architecture that has outperformed previous state-of-the-art algorithms in tasks such as speech recognition and handwriting recognition. It has the ability to process long-term dependencies in temporal signals in order to label unsegmented data. This paper describes different ways of combining features using a BLSTM-CTC architecture. We explore not only low-level combination (feature space combination), but also mid-level combination (internal system representation combination) and high-level combination (decoding combination). The results are compared on the RIMES word database. Our results show that the low-level combination works best, thanks to the powerful data modeling of the LSTM neurons.
Spotting handwritten words and REGEX using a two stage BLSTM-HMM architecture
Gautier Bideault, Luc Mioulet, Clément Chatelain, et al.
In this article, we propose a hybrid model for spotting words and regular expressions (REGEX) in handwritten documents. The model combines a state-of-the-art BLSTM (Bidirectional Long Short-Term Memory) neural network for recognizing and segmenting characters with an HMM that builds line models able to spot the desired sequences. Experiments on the RIMES database show very promising results.
A comparison of 1D and 2D LSTM architectures for the recognition of handwritten Arabic
In this paper, we present an Arabic handwriting recognition method based on recurrent neural networks. We use the Long Short-Term Memory (LSTM) architecture, which has proven successful in various printed and handwritten OCR tasks. Applications of LSTM to handwriting recognition typically employ the two-dimensional architecture to deal with variations along both the vertical and horizontal axes. However, we show that with a simple pre-processing step that normalizes the position and baseline of letters, we can make use of a 1D LSTM, which is faster in learning and convergence, and yet achieves superior performance. In a series of experiments on the IFN/ENIT database for Arabic handwriting recognition, we demonstrate that our proposed pipeline can outperform 2D LSTM networks. Furthermore, we provide comparisons with 1D LSTM networks trained on manually crafted features to show that the automatically learned features in a globally trained 1D LSTM network with our normalization step can even outperform such systems.
Aligning transcript of historical documents using dynamic programming
Irina Rabaev, Rafi Cohen, Jihad El-Sana, et al.
We present a simple and accurate approach for aligning historical documents with their corresponding transcriptions. First, a representative of each letter in the historical document is cropped. Then, the transcription is transformed into synthetic word images by representing the letters in the transcription by the cropped letters. These synthetic word images are aligned to groups of connected components in the original text, along each line, using dynamic programming. For measuring image similarities, we experimented with a variety of feature extraction and matching methods. The presented alignment algorithm was tested on two historical datasets and provided excellent results.
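A schematic Python sketch of such a dynamic-programming alignment, with a hypothetical dissimilarity() callback standing in for the image similarity measures the authors experimented with, and an assumed constant skip cost:

    import numpy as np

    def align_line(word_imgs, cc_groups, dissimilarity, skip=1.0):
        """Align synthetic word images to connected-component groups on one line.
        dissimilarity(a, b) returns a matching cost; skip is the cost of leaving
        a word or a group unmatched (an assumed constant here)."""
        n, m = len(word_imgs), len(cc_groups)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, :] = np.arange(m + 1) * skip
        D[:, 0] = np.arange(n + 1) * skip
        back = {}
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                candidates = {
                    (i - 1, j - 1): D[i - 1, j - 1]
                                    + dissimilarity(word_imgs[i - 1], cc_groups[j - 1]),
                    (i - 1, j): D[i - 1, j] + skip,   # skip a word
                    (i, j - 1): D[i, j - 1] + skip,   # skip a component group
                }
                prev = min(candidates, key=candidates.get)
                D[i, j], back[(i, j)] = candidates[prev], prev
        # backtrack to recover matched (word index, group index) pairs
        pairs, cell = [], (n, m)
        while cell in back:
            pi, pj = back[cell]
            if pi == cell[0] - 1 and pj == cell[1] - 1:
                pairs.append((pi, pj))
            cell = (pi, pj)
        return D[n, m], list(reversed(pairs))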
Offline handwritten word recognition using MQDF-HMMs
Sitaram Ramachandrula, Mangesh Hambarde, Ajay Patial, et al.
We propose an improved HMM formulation for offline handwriting recognition (HWR). The main contribution of this work is the use of the modified quadratic discriminant function (MQDF) [1] within the HMM framework. In an MQDF-HMM, the state observation likelihood is calculated by a weighted combination of the MQDF likelihoods of the individual Gaussians of a Gaussian mixture model (GMM). The quadratic discriminant function (QDF) of a multivariate Gaussian can be rewritten in terms of the eigenvalues and eigenvectors of the covariance matrix, avoiding its explicit inverse. The MQDF is derived from the QDF by substituting an appropriate constant for a few of the badly estimated smallest eigenvalues. This approach controls the estimation errors of the non-dominant eigenvectors and eigenvalues of the covariance matrix when training data is insufficient. MQDF has been shown to improve character recognition performance [1]. Using MQDF within an HMM improves its computation, storage, and modeling power when training data is limited. We obtained encouraging results with MQDF-HMMs on offline handwritten character recognition (NIST database) and English word recognition.
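For illustration, a small NumPy sketch of a generic MQDF score: keep the k dominant eigenvalues of the class covariance and replace the remaining, poorly estimated ones by a constant delta. The choice of constant and the weighting inside the HMM are the paper's contribution and are not reproduced here:

    import numpy as np

    def mqdf_score(x, mean, cov, k, delta):
        """x, mean: 1D feature vectors; cov: class covariance matrix (d x d)."""
        d = len(mean)
        eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]
        lam = eigvals[order][:k]                     # k dominant eigenvalues
        phi = eigvecs[:, order][:, :k]               # corresponding eigenvectors
        diff = x - mean
        proj = phi.T @ diff                          # projections on dominant axes
        maha_major = np.sum(proj ** 2 / lam)
        residual = diff @ diff - np.sum(proj ** 2)   # energy outside the k-D subspace
        return (maha_major + residual / delta
                + np.sum(np.log(lam)) + (d - k) * np.log(delta))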
Quality and Compression
Separation of text and background regions for high performance document image compression
Wei Fan, Jun Sun, Satoshi Naoi
We describe a document image segmentation algorithm to classify a scanned document into different regions such as text/line drawings, pictures, and smooth background. The proposed scheme is relatively independent of variations in text font style, size, intensity polarity, and string orientation. It is intended for use in an adaptive system for document image compression. The principal parts of the algorithm are the generation of the foreground and background layers and the application of hierarchical singular value decomposition (SVD) to smoothly fill the blank regions of both layers so that a high compression ratio can be achieved. The performance of the algorithm, both in terms of its effectiveness and computational efficiency, was evaluated on several test images and showed superior performance compared to other techniques.
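A minimal NumPy sketch of low-rank filling of the blank (masked) regions of a layer; the paper's hierarchical SVD is approximated here by a plain iterative rank-r completion:

    import numpy as np

    def lowrank_fill(layer, mask, rank=3, iters=20):
        """layer: 2D gray-level layer; mask: boolean array of unknown (blank) pixels.
        Fills unknown pixels with a rank-`rank` approximation so the layer is smooth."""
        filled = layer.astype(float).copy()
        filled[mask] = filled[~mask].mean()          # crude initial guess
        for _ in range(iters):
            U, s, Vt = np.linalg.svd(filled, full_matrices=False)
            approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
            filled[mask] = approx[mask]              # only unknown pixels are replaced
        return filled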
Metric-based no-reference quality assessment of heterogeneous document images
Nibal Nayef, Jean-Marc Ogier
No-reference image quality assessment (NR-IQA) aims at computing an image quality score that best correlates with either human-perceived image quality or an objective quality measure, without any prior knowledge of reference images. Although learning-based NR-IQA methods have achieved the best state-of-the-art results so far, those methods perform well only on the datasets on which they were trained. The datasets usually contain homogeneous documents, whereas in reality, document images come from different sources. It is unrealistic to collect training samples of images from every possible capturing device and every document type. Hence, we argue that a metric-based IQA method is more suitable for heterogeneous documents. We propose an NR-IQA method with OCR accuracy as the objective quality measure. The method combines distortion-specific quality metrics, and the final quality score is calculated taking into account the proportions of, and the dependencies among, the different distortions. Experimental results show that the method achieves competitive results with learning-based NR-IQA methods on standard datasets, and performs better on heterogeneous documents.
Graphics and Structure
Clustering header categories extracted from web tables
George Nagy, David W. Embley, Mukkai Krishnamoorthy, et al.
Revealing related content among heterogeneous web tables is part of our long term objective of formulating queries over multiple sources of information. Two hundred HTML tables from institutional web sites are segmented and each table cell is classified according to the fundamental indexing property of row and column headers. The categories that correspond to the multi-dimensional data cube view of a table are extracted by factoring the (often multi-row/column) headers. To reveal commonalities between tables from diverse sources, the Jaccard distances between pairs of category headers (and also table titles) are computed. We show how about one third of our heterogeneous collection can be clustered into a dozen groups that exhibit table-title and header similarities that can be exploited for queries.
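A small Python sketch of the clustering step, assuming each table has been reduced to a set of normalized header strings and using SciPy's average-linkage clustering (the distance threshold is illustrative):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def cluster_tables(header_sets, threshold=0.8):
        """header_sets: list of sets of category-header strings, one set per table.
        Returns a cluster label per table based on pairwise Jaccard distances."""
        n = len(header_sets)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                a, b = header_sets[i], header_sets[j]
                jaccard = 1.0 - len(a & b) / len(a | b) if (a | b) else 0.0
                D[i, j] = D[j, i] = jaccard
        Z = linkage(squareform(D), method="average")
        return fcluster(Z, t=threshold, criterion="distance")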
A diagram retrieval method with multi-label learning
Songping Fu, Xiaoqing Lu, Lu Liu, et al.
In recent years, the retrieval of plane geometry figures (PGFs) has attracted increasing attention in the fields of mathematics education and computer science. However, the high cost of matching complex PGF features leads to the low efficiency of most retrieval systems. This paper proposes an indirect classification method based on multi-label learning, which improves retrieval efficiency by reducing the scope of the comparison operation from the whole database to small candidate groups. Label correlations among PGFs are taken into account for the multi-label classification task. The primitive feature selection for multi-label learning and the feature description of visual geometric elements are conducted individually to match similar PGFs. The experimental results show the competitive performance of the proposed method compared with existing PGF retrieval methods in terms of both time consumption and retrieval quality.
Detection of electrical circuit elements from document images
Paramita De, Sekhar Mandal, Amit Das, et al.
In this paper, a method is proposed to detect electrical circuit elements in scanned images of electrical drawings. The method, based on histogram analysis and mathematical morphology, detects the circuit elements (for example, circuit components and wires) and generates a connectivity matrix, which may be used to find similar but spatially different-looking circuits using graph isomorphism. The work may also be used for vectorization of circuit drawings, utilizing the information on the segmented circuit elements and the corresponding connectivity matrix. The novelty of the method lies in its simplicity, its ability to tolerate moderately skewed images, and its capability to segment symbols irrespective of their orientation. The proposed method is tested on a dataset containing more than one hundred scanned images of a variety of electrical drawings. Some of the results are presented in this paper to show the efficacy and robustness of the proposed method.
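A hedged Python sketch of the connectivity-matrix comparison suggested by the abstract, using networkx graph isomorphism with detected component types as node labels:

    import networkx as nx
    import numpy as np

    def circuits_match(conn_a, conn_b, labels_a, labels_b):
        """conn_a, conn_b: square connectivity (adjacency) matrices;
        labels_a, labels_b: detected component type for each node."""
        Ga = nx.from_numpy_array(np.asarray(conn_a))
        Gb = nx.from_numpy_array(np.asarray(conn_b))
        nx.set_node_attributes(Ga, dict(enumerate(labels_a)), "kind")
        nx.set_node_attributes(Gb, dict(enumerate(labels_b)), "kind")
        # circuits are considered equivalent if an isomorphism preserves component types
        return nx.is_isomorphic(Ga, Gb,
                                node_match=lambda u, v: u["kind"] == v["kind"])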
Handwriting II
Missing value imputation: with application to handwriting data
Missing values make pattern analysis difficult, particularly when available data is limited. In longitudinal research, missing values accumulate, thereby aggravating the problem. Here we consider how to deal with temporal data with missing values in handwriting analysis. In studying the development of individuality in handwriting, we encountered feature values that are missing for several individuals at several time instances. Six algorithms, i.e., random imputation, mean imputation, most likely independent value imputation, and three methods based on Bayesian networks (static Bayesian network, parameter EM, and structural EM), are compared on children's handwriting data. We evaluate the accuracy and robustness of the algorithms under different ratios of missing data and missing values, and draw useful conclusions. Specifically, the static Bayesian network is used for our data, which contain around 5% missing values, as it provides adequate accuracy at low computational cost.
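As a toy illustration of the two simplest baselines mentioned, mean imputation and random imputation, applied to a feature matrix with NaNs (the Bayesian-network methods are not sketched here):

    import numpy as np
    from sklearn.impute import SimpleImputer

    rng = np.random.default_rng(0)
    X = np.array([[1.0, 2.0, np.nan],
                  [4.0, np.nan, 6.0],
                  [7.0, 8.0, 9.0]])

    # mean imputation: replace each NaN by its column mean
    X_mean = SimpleImputer(strategy="mean").fit_transform(X)

    # random imputation: replace each NaN by a value drawn from the observed
    # entries of the same column
    X_rand = X.copy()
    for j in range(X.shape[1]):
        observed = X[~np.isnan(X[:, j]), j]
        missing = np.isnan(X_rand[:, j])
        X_rand[missing, j] = rng.choice(observed, size=missing.sum())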