Proceedings Volume 8297

Document Recognition and Retrieval XIX

Christian Viard-Gaudin, Richard Zanibbi
View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 27 December 2011
Contents: 8 Sessions, 34 Papers, 0 Presentations
Conference: IS&T/SPIE Electronic Imaging 2012
Volume Number: 8297

Table of Contents


All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 8297
  • Region Labeling
  • Handwriting Recognition
  • Graphics Recognition
  • Information Retrieval
  • Human Computer Interaction
  • Style or Writer Identification
  • Interactive Paper Session
Front Matter: Volume 8297
Front Matter: Volume 8297
This PDF file contains the front matter associated with SPIE Proceedings Volume 8297, including the Title Page, Copyright information, Table of Contents, and the Conference Committee listing.
Region Labeling
Graphical image classification combining an evolutionary algorithm and binary particle swarm optimization
Beibei Cheng, Renzhong Wang, Sameer Antani, et al.
Biomedical journal articles contain a variety of image types that can be broadly classified into two categories: regular images and graphical images. Graphical images can be further classified into four classes: diagrams, statistical figures, flow charts, and tables. Automatic figure type identification is an important step toward improved multimodal (text + image) information retrieval and clinical decision support applications. This paper describes a feature-based learning approach to automatically identify these four graphical figure types. We apply Evolutionary Algorithm (EA), Binary Particle Swarm Optimization (BPSO), and a hybrid of EA and BPSO (EABPSO) methods to select an optimal subset of extracted image features, which are then classified using a Support Vector Machine (SVM) classifier. Evaluation on 1,038 figure images extracted from ten BioMedCentral® journals, using the features selected by EABPSO, yielded classification accuracy as high as 87.5%.
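For illustration, the following is a minimal sketch of the BPSO wrapper idea described above, assuming a feature matrix X and label vector y and using a cross-validated SVM as the fitness function; the swarm parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def bpso_feature_selection(X, y, n_particles=20, n_iter=30, seed=0):
    """Binary PSO: each particle is a 0/1 mask over features; fitness is
    cross-validated SVM accuracy on the selected feature subset."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pos = rng.integers(0, 2, size=(n_particles, n_feat))   # binary positions
    vel = rng.normal(0, 1, size=(n_particles, n_feat))     # real-valued velocities

    def fitness(mask):
        if mask.sum() == 0:
            return 0.0
        clf = SVC(kernel="rbf", gamma="scale")
        return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()

    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, n_feat))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        prob = 1.0 / (1.0 + np.exp(-vel))                   # sigmoid transfer to [0, 1]
        pos = (rng.random((n_particles, n_feat)) < prob).astype(int)
        fit = np.array([fitness(p) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[pbest_fit.argmax()].copy()

    return gbest.astype(bool)   # selected feature mask
```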
Combining SVM classifiers to identify investigator name zones in biomedical articles
Jongwoo Kim, Daniel X. Le, George R. Thoma
This paper describes an automated system to label zones containing Investigator Names (IN) in biomedical articles, a key item in a MEDLINE® citation. The correct identification of these zones is necessary for the subsequent extraction of IN from them. A hierarchical classification model is proposed using two Support Vector Machine (SVM) classifiers. The first classifier identifies the IN zone with the highest confidence, and the other classifier identifies the remaining IN zones. Eight sets of word lists, each containing between 100 and 1,200 words, are collected to train and test the classifiers. Experiments on a test set of 105 journal articles show a Precision of 0.88, Recall of 0.97, F-Measure of 0.92, and Accuracy of 0.99.
Comprehensive color segmentation system for noisy digitized documents to enhance text extraction
Asma Ouji, Yann Leydier, Frank LeBourgeois
This paper presents a novel, general-purpose, multi-application color segmentation system that produces optimal chromatic and achromatic layers and filters out hue and illumination distortions with minimal information loss. A text extraction method based on the resulting segmentation is proposed to illustrate the usefulness of the method. The system is validated by evaluating the line segmentation performance of a well-known commercial OCR engine on the processed images.
Ensemble methods with simple features for document zone classification
Document layout analysis is of fundamental importance for document image understanding and information retrieval. It requires the identification of blocks extracted from a document image via feature extraction and block classification. In this paper, we focus on the classification of the extracted blocks into five classes: text (machine printed), handwriting, graphics, images, and noise. We propose a new set of features for efficient classification of these blocks. We present a comparative evaluation of three ensemble-based classification algorithms (boosting, bagging, and combined model trees) in addition to other known learning algorithms. Experimental results are reported for a set of 36,503 zones extracted from 416 document images randomly selected from the tobacco legacy document collection. The results verify the robustness and effectiveness of the proposed feature set in comparison to the commonly used Ocropus recognition features. When used in conjunction with the Ocropus feature set, we further improve the performance of the block classification system to a classification accuracy of 99.21%.
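A hedged sketch of how two of the ensemble methods named above (bagging and boosting) could be compared on zone features with scikit-learn; the feature matrix X and labels y are assumed inputs, and the paper's combined model trees and Ocropus features are not shown.

```python
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# X: one row of simple features per zone; y: labels in
# {text, handwriting, graphics, image, noise}
def compare_ensembles(X, y):
    models = {
        "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100),
        "boosting": AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                                       n_estimators=100),
    }
    # 5-fold cross-validated accuracy for each ensemble
    return {name: cross_val_score(m, X, y, cv=5).mean()
            for name, m in models.items()}
```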
Handwriting Recognition
A robust omnifont open-vocabulary Arabic OCR system using pseudo-2D-HMM
Abdullah M. Rashwan, Mohsen A. Rashwan, Ahmed Abdel-Hameed, et al.
Recognizing old documents is highly desirable since the demand for quickly searching millions of archived documents has recently increased. Hidden Markov Models (HMMs) have proven to be a good solution to the main problems of recognizing typewritten Arabic characters. Although these attempts achieved remarkable success for omnifont OCR under very favorable conditions, they did not achieve the same performance in practical conditions, i.e., on noisy documents. In this paper we present an omnifont, large-vocabulary Arabic OCR system using a Pseudo Two-Dimensional Hidden Markov Model (P2DHMM), which is a generalization of the HMM. The P2DHMM offers a more efficient way to model Arabic characters: such a model offers both minimal dependency on font size/style (omnifont) and a high level of robustness against noise. The evaluation results of this system are very promising compared to a baseline HMM system and to the best OCR products available on the market (Sakhr and NovoDynamics). Measured against the classic HMM classifier, the average word accuracy rates for the P2DHMM and HMM classifiers are 79% and 66%, respectively. Measured against the Sakhr and NovoDynamics OCR systems, the average word accuracy rates for P2DHMM, NovoDynamics, and Sakhr are 74%, 71%, and 61%, respectively.
Variable length and context-dependent HMM letter form models for Arabic handwritten word recognition
We present in this paper an HMM-based recognizer for unconstrained Arabic handwritten words. The recognizer is a context-dependent HMM that uses variable topology and contextual information for better modeling of writing units. We propose an algorithm to adapt the topology of each HMM to the character to be modeled. For modeling the contextual units, a state-tying process based on decision tree clustering is introduced, which significantly reduces the number of parameters. Decision trees are built according to a set of expert-based questions on how characters are written. Questions are divided into global questions, yielding larger clusters, and precise questions, yielding smaller ones. We apply this modeling to the recognition of Arabic handwritten words. Experiments conducted on the OpenHaRT2010 database show that variable-length topology and contextual information significantly improve the recognition rate.
Post processing for offline Chinese handwritten character string recognition
Offline Chinese handwritten character string recognition is one of the most important research fields in pattern recognition. Due to free writing styles, large variability in character shapes, and differing geometric characteristics, Chinese handwritten character string recognition is a challenging problem. Among current methods, the over-segmentation-and-merging approach, which integrates geometric, character recognition, and contextual information, shows promising results. It is found experimentally that a large part of the errors are segmentation errors and mainly occur around non-Chinese characters. A Chinese character string contains not only wide characters, namely Chinese characters, but also narrow characters such as digits and letters of the alphabet. The segmentation errors are mainly caused by a uniform geometric model imposed on all segmented candidate characters. To solve this problem, post processing is employed to improve the recognition accuracy of narrow characters. On one hand, separate geometric models are established for wide and narrow characters; under these multi-geometric models, narrow characters are less prone to be merged. On the other hand, top-ranked recognition results of candidate paths are integrated to boost the final recognition of narrow characters. The post processing method is investigated on two datasets totaling 1,405 handwritten address strings. The wide-character recognition accuracy improves slightly, and the narrow-character recognition accuracy increases by 10.41% and 10.03% on the two datasets, respectively. This indicates that the post processing method is effective in improving the recognition accuracy of narrow characters.
Complexity reduction with recognition rate maintained for online handwritten Japanese text recognition
Jinfeng Gao, Bilan Zhu, Masaki Nakagawa
The paper presents complexity reduction of an on-line handwritten Japanese text recognition system by selecting an optimal off-line recognizer in combination with an on-line recognizer, geometric context evaluation, and linguistic context evaluation. The result is that a surprisingly small off-line recognizer, which alone is weak, produces nearly the best recognition rate when combined with the other evaluation factors, at remarkably small space and time complexity. Generally speaking, lower dimensionality with fewer principal components produces a smaller set of prototypes, which reduces memory and time cost but degrades the recognition rate, so a compromise is needed. With the evaluation function combining the above-mentioned factors, a configuration of only 50 dimensions with as few as 5 principal components for the off-line recognizer keeps nearly the best text recognition accuracy, 97.87% (versus the best accuracy of 97.92%), while reducing the total memory cost from 99.4 MB to 32 MB and the average character recognition time during text recognition from 0.1621 ms to 0.1191 ms, compared with the traditional off-line recognizer with 160 dimensions and 50 principal components.
Improving isolated and in-context classification of handwritten characters
Vadim Mazalov, Stephen M. Watt
Earlier work has shown how to recognize handwritten characters by representing coordinate functions or integral invariants as truncated orthogonal series. The series basis functions are orthogonal polynomials defined by a Legendre-Sobolev inner product. It has been shown that the free parameter in the inner product, the 'jet scale', has an impact on recognition both using coordinate functions and integral invariants. This paper develops methods of improving series-based recognition. For isolated classification, the first consideration is to identify optimal values for the jet scale in different settings. For coordinate functions, we find the optimum to be in a small interval with the precise value not strongly correlated to the geometric complexity of the character. For integral invariants, used in orientation-independent recognition, we find the optimal value of the jet scale for each invariant. Furthermore, we examine the optimal degree for the truncated series. For in-context classification, we develop a rotation-invariant algorithm that takes advantage of sequences of samples that are subject to similar distortion. The algorithm yields significant improvement over orientation-independent isolated recognition and can be extended to shear and, more generally, affine transformations.
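As a rough illustration of representing coordinate functions by truncated orthogonal series, the sketch below fits plain least-squares Legendre series to a stroke's x(t) and y(t); the Legendre-Sobolev inner product with its jet-scale parameter, which the paper actually studies, would add a derivative term and is not reproduced here.

```python
import numpy as np
from numpy.polynomial import legendre

def series_coefficients(x, y, degree=10):
    """Approximate the coordinate functions x(t), y(t) of one stroke by
    truncated Legendre series on [-1, 1] (plain L2 fit, for illustration)."""
    t = np.linspace(-1, 1, len(x))
    cx = legendre.legfit(t, x, degree)   # degree+1 coefficients for x(t)
    cy = legendre.legfit(t, y, degree)   # degree+1 coefficients for y(t)
    return np.concatenate([cx, cy])      # fixed-size descriptor of the stroke
```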
Graphics Recognition
Using specific evaluation for comparing and combining competing algorithms: applying it to table column detection
Ana Costa e Silva
A commonly used evaluation strategy is to run competing algorithms on a test dataset and state which performs better on average over the whole set. We call this generic evaluation. Although it is important, we believe this type of evaluation is incomplete. In this paper, we propose a methodology for algorithm comparison, which we call specific evaluation. This approach attempts to identify subsets of the data where one algorithm is better than the other. This not only gives a better picture of each algorithm's strengths and weaknesses, but also constitutes a simple way to develop a combination policy that enjoys the best of both. We apply specific evaluation to an experiment that groups pre-obtained table cells into columns; we demonstrate how it identifies a subset of data for which the on-average weaker but faster algorithm is equivalent or better, and how it then produces a policy for combining the two competing table column delimitation algorithms.
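A minimal sketch of the specific-evaluation idea, assuming per-table correctness indicators for two algorithms and hand-defined data subsets; the subset names and numbers below are illustrative only.

```python
import numpy as np

def specific_evaluation(correct_a, correct_b, subset_masks):
    """correct_a / correct_b: boolean arrays, one entry per test table, saying
    whether algorithm A / B produced the right column grouping.
    subset_masks: {subset_name: boolean mask} defining data subsets."""
    policy = {}
    for name, mask in subset_masks.items():
        acc_a, acc_b = correct_a[mask].mean(), correct_b[mask].mean()
        policy[name] = "A" if acc_a >= acc_b else "B"
        print(f"{name}: A={acc_a:.2%}  B={acc_b:.2%}  -> use {policy[name]}")
    return policy   # per-subset choice = a simple combination policy

# Example with synthetic outcomes on six test tables:
a = np.array([1, 1, 0, 1, 0, 1], dtype=bool)
b = np.array([1, 0, 1, 1, 1, 0], dtype=bool)
masks = {"ruled": np.array([1, 1, 1, 0, 0, 0], dtype=bool),
         "unruled": np.array([0, 0, 0, 1, 1, 1], dtype=bool)}
specific_evaluation(a, b, masks)
```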
Identification of embedded mathematical formulas in PDF documents using SVM
Xiaoyan Lin, Liangcai Gao, Zhi Tang, et al.
With the tremendous popularity of PDF format, recognizing mathematical formulas in PDF documents becomes a new and important problem in document analysis field. In this paper, we present a method of embedded mathematical formula identification in PDF documents, based on Support Vector Machine (SVM). The method first segments text lines into words, and then classifies each word into two classes, namely formula or ordinary text. Various features of embedded formulas, including geometric layout, character and context content, are utilized to build a robust and adaptable SVM classifier. Embedded formulas are then extracted through merging the words labeled as formulas. Experimental results show good performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale e-Book production.
Chemical structure recognition: a rule-based approach
Noureddin M. Sadawi, Alan P. Sexton, Volker Sorge
In chemical literature much information is given in the form of diagrams depicting molecules. In order to access this information, diagrams have to be recognised and translated into a processable format. We present an approach that models the principal recognition steps for molecule diagrams in a strictly rule-based system, providing rules to identify the main components - atoms and bonds - as well as to resolve possible ambiguities. The result of the process is a translation into a graph representation that can be used for further processing. We show the effectiveness of our approach by describing its embedding into a full recognition system and present an experimental evaluation that demonstrates how our current implementation outperforms the leading open source system currently available.
Quantify spatial relations to discover handwritten graphical symbols
Jinpeng Li, Harold Mouchère, Christian Viard-Gaudin
To model a handwritten graphical language, spatial relations describe how strokes are positioned in two-dimensional space. Most existing handwriting recognition systems make use of predefined spatial relations. However, for a complex graphical language, it is hard to express all the spatial relations manually. Another possibility is to use a clustering technique to discover the spatial relations. In this paper, we discuss how to create a relational graph between strokes (nodes) labeled with graphemes in a graphical language. We then vectorize the spatial relations (edges) for clustering and quantization. As the targeted application, we extract repetitive sub-graphs (graphical symbols) composed of graphemes and learned spatial relations. On two handwriting databases, a simple mathematical expression database and a complex flowchart database, the unsupervised spatial relations outperform the predefined spatial relations. In addition, we visualize the frequent patterns on two text lines containing Chinese characters.
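A hedged sketch of one way to vectorize and cluster pairwise spatial relations between bounding boxes; the four-dimensional relation vector and the k-means quantizer are illustrative choices, not the descriptors or clustering used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def relation_vector(box_a, box_b):
    """Describe how stroke/grapheme b is placed relative to a.
    Boxes are (x_min, y_min, x_max, y_max)."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    scale = max(box_a[2] - box_a[0], box_a[3] - box_a[1], 1e-6)
    return np.array([(bx - ax) / scale,               # normalized horizontal offset
                     (by - ay) / scale,               # normalized vertical offset
                     (box_b[2] - box_b[0]) / scale,   # relative width
                     (box_b[3] - box_b[1]) / scale])  # relative height

def quantize_relations(vectors, n_relations=6):
    """Unsupervised discovery of spatial-relation labels (edge labels)."""
    km = KMeans(n_clusters=n_relations, n_init=10, random_state=0).fit(vectors)
    return km   # km.predict([v]) gives the relation label for a new edge
```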
Information Retrieval
Automatic indexing of scanned documents: a layout-based approach
Daniel Esser, Daniel Schuster, Klemens Muthmann, et al.
Archiving official written documents such as invoices, reminders, and account statements is becoming more and more important in both business and private contexts. Creating appropriate index entries for document archives, such as the sender's name, creation date, or document number, is tedious manual work. We present a novel approach to automatic indexing of documents based on generic positional extraction of index terms. For this purpose we apply knowledge of document templates, stored in a common full-text search index, to find index positions that were successfully extracted in the past.
Layout-based substitution tree indexing and retrieval for mathematical expressions
Thomas Schellenberg, Bo Yuan, Richard Zanibbi
We introduce a new system for layout-based (LaTeX) indexing and retrieval of mathematical expressions using substitution trees. Substitution trees can efficiently store and find expressions based on the similarity of their symbols, symbol layout, sub-expressions, and size. We describe our novel implementation and some of our modifications to the substitution tree indexing and retrieval algorithms. We report an experiment testing our system against the TF-IDF keyword-based system of Zanibbi and Yuan and demonstrate that, in many cases, the quality of search results returned by both systems is comparable (overall means, substitution tree vs. keyword-based: 100% vs. 89% for top 1; 48% vs. 51% for top 5; 22% vs. 28% for top 20). Overall, we present a promising first attempt at layout-based substitution tree indexing and retrieval for mathematical expressions and believe that this method will prove beneficial to the field of mathematical information retrieval.
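For context, a minimal sketch of a TF-IDF keyword-style baseline over tokenized LaTeX strings (the general flavor of the comparison system, not its actual implementation, and not the substitution-tree index); the tokenizer and example expressions are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Treat each expression as a bag of symbol tokens; this tokenizer is a
# placeholder, not the indexing scheme of either system in the paper.
expressions = [r"\frac{a}{b} + c", r"a + b", r"\sqrt{a+b}"]

vec = TfidfVectorizer(token_pattern=r"\\?[A-Za-z]+|[+\-=^_{}()]")
index = vec.fit_transform(expressions)

def search(query, top_k=5):
    """Rank indexed expressions by cosine similarity to the query."""
    scores = cosine_similarity(vec.transform([query]), index).ravel()
    return sorted(enumerate(scores), key=lambda t: -t[1])[:top_k]

print(search(r"\frac{a}{b}"))
```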
Human Computer Interaction
Efficient cost-sensitive human-machine collaboration for offline signature verification
Johannes Coetzer, Jacques Swanepoel, Robert Sabourin
We propose a novel strategy for the optimal combination of human and machine decisions in a cost-sensitive environment. The proposed algorithm should be especially beneficial to financial institutions where off-line signatures, each associated with a specific transaction value, require authentication. When presented with a collection of genuine and fraudulent training signatures, produced by so-called guinea pig writers, the proficiency of a workforce of human employees and a score-generating machine can be estimated and represented in receiver operating characteristic (ROC) space. Using a set of Boolean fusion functions, the majority vote decision of the human workforce is combined with each threshold-specific machine-generated decision. The performance of the candidate ensembles is estimated and represented in ROC space, after which only the optimal ensembles and associated decision trees are retained. When presented with a questioned signature linked to an arbitrary writer, the system first uses the ROC-based cost gradient associated with the transaction value to select the ensemble that minimises the expected cost, and then uses the corresponding decision tree to authenticate the signature in question. We show that, when utilising the entire human workforce, the incorporation of a machine streamlines the authentication process and decreases the expected cost for all operating conditions.
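A small sketch of the underlying cost-minimising choice of an ROC operating point, given false-positive and false-negative costs that may scale with the transaction value; selecting among human-machine ensembles, as the paper does, follows the same principle but is not shown here.

```python
import numpy as np

def min_cost_operating_point(fpr, tpr, thresholds, c_fp, c_fn, p_pos):
    """Pick the ROC operating point that minimises expected cost.
    c_fp: cost of accepting a forgery; c_fn: cost of rejecting a genuine
    signature (both may scale with the transaction value); p_pos is the
    prior probability of a genuine signature."""
    fpr, tpr = np.asarray(fpr), np.asarray(tpr)
    expected_cost = c_fp * (1 - p_pos) * fpr + c_fn * p_pos * (1 - tpr)
    best = expected_cost.argmin()
    return thresholds[best], expected_cost[best]
```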
Questioned document workflow for handwriting with automated tools
During the last few years many document recognition methods have been developed to determine whether a handwriting specimen can be attributed to a known writer. In practice, however, the work-flow of the document examiner continues to be manual-intensive. Before a systematic, computational approach can be developed, an articulation of the steps involved in handwriting comparison is needed. We describe the work flow of handwritten questioned document examination, as described in a standards manual, and the steps where existing automation tools can be used. A well-known ransom note case is considered as an example, where one encounters testing for multiple writers of the same document, determining whether the writing is disguised, known writing that is formal while the questioned writing is informal, etc. The findings for this ransom note case using the tools are given. Observations are also made toward developing a more fully automated approach to handwriting examination.
Iterative analysis of document collections enables efficient human-initiated interaction
Joseph Chazalon, Bertrand Coüasnon
Document analysis and recognition systems often fail to produce results of sufficient quality when processing old and damaged document sets, and require manual corrections to improve the results. This paper presents how, using the iterative analysis of document pages we recently proposed, we can implement a spontaneous interaction model suitable for mass document processing. It enables human operators to detect and correct errors made by the automatic system, and reintegrates their corrections into subsequent steps of the iterative analysis process. Thus, a page analyzer can reprocess erroneous parts and those that depend on them, avoiding the need to manually fix, during post-processing, all the consequences of errors made by the automatic system. After presenting the global system architecture and a prototype implementation of our proposal, we show that the document model can be simply enriched to enable the proposed spontaneous interaction model. We present how to use it in a practical example to correct under-segmentation issues during the localization of numbers in documents from the 18th century. Evaluations conducted on this example, with 50 pages containing 1,637 numbers to localize, show that the proposed interaction model can reduce the human workload (29.8% fewer elements to provide) for a given target quality level when compared to manual post-processing.
VeriClick: an efficient tool for table format verification
George Nagy, Mangesh Tamhankar
The essential layout attributes of a visual table can be defined by the location of four critical grid cells. Although these critical cells can often be located by automated analysis, some means of human interaction is necessary for correcting residual errors. VeriClick is a macro-enabled spreadsheet interface that provides ground-truthing, confirmation, correction, and verification functions for CSV tables. All user actions are logged. Experimental results of seven subjects on one hundred tables suggest that VeriClick can provide a ten- to twenty-fold speedup over performing the same functions with standard spreadsheet editing commands.
Asymptotic cost in document conversion
In spite of a hundredfold decrease in the cost of relevant technologies, the role of document image processing systems is gradually declining due to the transition to an on-line world. Nevertheless, in some high-volume applications, document image processing software still saves millions of dollars by accelerating workflow, and similarly large savings could be realized by more effective automation of the multitude of low-volume personal document conversions. While potential cost savings, based on estimates of costs and values, are a driving force for new developments, quantifying such savings is difficult. The most important trend is that the cost of computing resources for DIA is becoming insignificant compared to the associated labor costs. An econometric treatment of document processing complements traditional performance evaluation, which focuses on assessing the correctness of the results produced by document conversion software. Researchers should look beyond the error rate for advancing both production and personal document conversion.
Style or Writer Identification
Style comparisons in calligraphy
Calligraphic style is considered, for this research, to be the visual attributes of images of calligraphic characters sampled randomly from a "work" created by a single artist. It is independent of page layout or textual content. An experimental design is developed to investigate to what extent the source of a single character image, or of a few pairs of character images, can be assigned to either the same work or two different works. The experiments are conducted on the 13,571 segmented and labeled 600-dpi character images of the CADAL database. The classifier is not trained on the works tested, only on other works. Even when only a few samples of same-class pairs are available, the difference vector of a few simple features extracted from each image of a pair yields over 80% classification accuracy for the same-work vs. different-work dichotomy. When many pairs of different classes are available, the accuracy, using the same features, is almost the same. These style-verification experiments are part of our larger goal of style identification and forgery detection.
An Oracle-based co-training framework for writer identification in offline handwriting
Utkarsh Porwal, Sreeranga Rajan, Venu Govindaraju
State-of-the-art techniques for writer identification have been centered primarily on enhancing the performance of the identification system. Machine learning algorithms have been used extensively to improve the accuracy of such systems, assuming a sufficient amount of data is available for training. Little attention has been paid to the prospect of harnessing the information contained in large amounts of un-annotated data. This paper focuses on a co-training-based framework that can be used for iterative labeling of an unlabeled data set, exploiting the independence between multiple views (features) of the data. This paradigm relaxes the assumption that sufficient labeled data is available and tries to generate labeled data from the unlabeled data set while improving the accuracy of the system. However, the performance of a co-training-based framework depends on the effectiveness of the algorithm used to select the data points to be added to the labeled set. We propose an Oracle-based approach for data selection that learns the patterns in the score distribution of classes for labeled data points and then predicts the labels (writers) of the unlabeled data points. This selection method statistically learns the class distribution and predicts the most probable class, unlike traditional selection algorithms based on heuristic approaches. We conducted experiments on the publicly available IAM dataset and illustrate the efficacy of the proposed approach.
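A generic co-training sketch for two feature views; the confidence-based selection below is the standard heuristic that the paper's Oracle-based selection is designed to replace, and the classifiers, views, and round sizes are assumptions.

```python
import numpy as np
from sklearn.base import clone

def co_train(clf_a, clf_b, Xa_l, Xb_l, y_l, Xa_u, Xb_u, rounds=10, per_round=20):
    """clf_a / clf_b: classifiers with predict_proba, one per view.
    (Xa_l, Xb_l, y_l): labeled pool; (Xa_u, Xb_u): unlabeled pool.
    Each round, the most confidently predicted unlabeled points are moved
    to the labeled pool with their predicted labels."""
    Xa_l, Xb_l, y_l = Xa_l.copy(), Xb_l.copy(), y_l.copy()
    for _ in range(rounds):
        if len(Xa_u) == 0:
            break
        clf_a, clf_b = clone(clf_a).fit(Xa_l, y_l), clone(clf_b).fit(Xb_l, y_l)
        proba = (clf_a.predict_proba(Xa_u) + clf_b.predict_proba(Xb_u)) / 2
        pick = np.argsort(-proba.max(axis=1))[:per_round]     # most confident points
        y_new = clf_a.classes_[proba[pick].argmax(axis=1)]    # predicted labels
        Xa_l = np.vstack([Xa_l, Xa_u[pick]])
        Xb_l = np.vstack([Xb_l, Xb_u[pick]])
        y_l = np.concatenate([y_l, y_new])
        keep = np.setdiff1d(np.arange(len(Xa_u)), pick)
        Xa_u, Xb_u = Xa_u[keep], Xb_u[keep]
    return clf_a, clf_b
```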
Handwritten document age classification based on handwriting styles
Chetan Ramaiah, Gaurav Kumar, Venu Govindaraju
Handwriting styles are constantly changing over time. We approach the novel problem of estimating the approximate age of historical handwritten documents from their handwriting styles. Such a system has many applications in handwritten document processing engines, where specialized processing techniques can be applied based on the estimated age of the document. We propose to learn a distribution over styles across centuries using topic models and to apply a classifier over the learned weights in order to estimate the approximate age of a document. We present a comparison of different distance metrics, such as the Euclidean distance and the Hellinger distance, within this application.
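For reference, the Hellinger distance between two topic-weight vectors (compared against the Euclidean distance in the paper) can be computed as follows; the example weights are illustrative.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two topic-proportion vectors
    (e.g. per-document style/topic weights); both must sum to 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Example: two documents with weights over three latent "style" topics.
print(hellinger([0.7, 0.2, 0.1], [0.1, 0.3, 0.6]))
```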
Handwriting individualization using distance and rarity
Forensic individualization is the task of associating observed evidence with a specific source. The likelihood ratio (LR) is a quantitative measure that expresses the degree of uncertainty in individualization, where the numerator represents the likelihood that the evidence corresponds to the known and the denominator the likelihood that it does not correspond to the known. Since the number of parameters needed to compute the LR is exponential with the number of feature measurements, a commonly used simplification is the use of likelihoods based on distance (or similarity) given the two alternative hypotheses. This paper proposes an intermediate method which decomposes the LR as the product of two factors, one based on distance and the other on rarity. It was evaluated using a data set of handwriting samples, by determining whether two writing samples were written by the same/different writer(s). The accuracy of the distance and rarity method, as measured by error rates, is significantly better than the distance method.
Construction of language models for a handwritten mail reading system
This paper presents a system for the recognition of unconstrained handwritten mail. The main part of this system is an HMM recognizer that uses trigraphs to model contextual information. This recognition system does not require any segmentation into words or characters and works directly at the line level. To take linguistic information into account and enhance performance, a language model is introduced. This language model is based on bigrams and is built from the training document transcriptions only. Different experiments with various vocabulary sizes and language models have been conducted. Word Error Rate and perplexity values are compared to show the benefit of specific language models fitted to the handwritten mail recognition task.
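A minimal sketch of an add-alpha smoothed bigram language model built from training transcriptions only, with a perplexity computation; the smoothing choice and tokenization are assumptions, not the paper's exact language model.

```python
import math
from collections import Counter

def train_bigram_lm(sentences, alpha=0.1):
    """sentences: lists of word tokens. Returns a log-probability function
    for add-alpha smoothed bigrams."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for words in sentences:
        toks = ["<s>"] + words + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])               # context counts
        bigrams.update(zip(toks[:-1], toks[1:])) # bigram counts
    V = len(vocab)
    def logprob(w_prev, w):
        return math.log((bigrams[(w_prev, w)] + alpha) /
                        (unigrams[w_prev] + alpha * V))
    return logprob

def perplexity(logprob, sentences):
    logp, n = 0.0, 0
    for words in sentences:
        toks = ["<s>"] + words + ["</s>"]
        for a, b in zip(toks[:-1], toks[1:]):
            logp += logprob(a, b)
            n += 1
    return math.exp(-logp / n)
```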
Interactive Paper Session
Bleed-through removal in degraded documents
Róisín Rowley-Brooke, Anil Kokaram
This paper presents a linear-based restoration method for bleed-through degraded document images, using a Bayesian approach for bleed-through reduction. A variation of iterated conditional modes (ICM) optimisation is used whereby samples are drawn for the clean image estimates, whilst the remaining variables are estimated via the mode of their conditional probabilities. The proposed method is tested on various samples of scanned manuscript images with different degrees of degradation, and the results are visually compared with a recent user-assisted restoration method.
Clustering document fragments using background color and texture information
Sukalpa Chanda, Katrin Franke, Umapada Pal
Forensic analysis of questioned documents can sometimes be extremely data intensive. A forensic expert might need to analyze a heap of document fragments, and in such cases, to ensure reliability, he or she should focus only on the relevant evidence hidden in those fragments. Retrieving relevant documents requires finding similar document fragments. One way of obtaining such similar documents is to use the document fragments' physical characteristics, like color, texture, etc. In this article we propose an automatic scheme to retrieve similar document fragments based on the visual appearance of the document paper and its texture. Multispectral color characteristics are obtained using biologically inspired color differentiation techniques, by projecting the document color characteristics into the Lab color space. Gabor filter-based texture analysis is used to identify document texture. It is expected that document fragments from the same source will have similar color and texture. For clustering similar document fragments of our test dataset we use a Self-Organizing Map (SOM) of dimension 5×5, where the document color and texture information are used as features. We obtained an encouraging accuracy of 97.17% on 1,063 test images.
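A hedged sketch of clustering fragment feature vectors with a 5×5 SOM using the third-party MiniSom library; the feature construction (Lab colour statistics plus Gabor responses) is assumed to be done elsewhere, and this is not the authors' implementation.

```python
import numpy as np
from minisom import MiniSom  # third-party SOM library, used here for illustration

def cluster_fragments(features, grid=(5, 5), iterations=5000):
    """features: one row per fragment, e.g. Lab colour statistics concatenated
    with Gabor filter responses. Returns the winning SOM node per fragment."""
    features = np.asarray(features, dtype=float)
    som = MiniSom(grid[0], grid[1], features.shape[1],
                  sigma=1.0, learning_rate=0.5, random_seed=0)
    som.random_weights_init(features)
    som.train_random(features, iterations)
    return [som.winner(f) for f in features]   # (row, col) cluster per fragment
```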
Lecture video segmentation and indexing
Di Ma, Gady Agam
Video structuring and indexing are two crucial processes for multimedia document understanding and information retrieval. This paper presents a novel approach to automatically structuring and indexing lecture videos for an educational video system. By structuring and indexing video content, we can support both topic indexing and semantic querying of multimedia documents. In this paper, our goal is to extract indices of topics and link them with their associated video and audio segments. The two main techniques used in our approach are video image analysis and video text analysis. Using this approach, we obtain an accuracy of over 90.0% on our test collection.
Unsupervised categorization method of graphemes on handwritten manuscripts: application to style recognition
H. Daher, D. Gaceb, V. Eglin, et al.
We present in this paper a feature selection and weighting method for medieval handwriting images that relies on codebooks of shapes of small strokes of characters (graphemes issued from the decomposition of manuscripts). These codebooks are important for simplifying the automation of the analysis, the transcription of manuscripts, and the recognition of styles or writers. Our approach provides precise feature weighting by genetic algorithms and a high-performance methodology for categorizing grapheme shapes, using graph coloring, into codebooks, which are in turn applied to CBIR (Content-Based Image Retrieval) in a mixed handwriting database containing pages from different writers, historical periods, and quality levels. We show how the coupling of these two mechanisms, feature weighting and grapheme classification, can offer a better separation of the forms to be categorized by exploiting their grapho-morphological characteristics, density, and significant orientations.
Retrieving handwriting by combining word spotting and manifold ranking
Online handwritten data, produced with Tablet PCs or digital pens, consists of a sequence of points (x, y). As the amount of data available in this form increases, algorithms for the retrieval of online data are needed. Word spotting is a common approach used for the retrieval of handwriting. However, from an information retrieval (IR) perspective, word spotting is a primitive keyword-based matching and retrieval strategy. We propose a framework for handwriting retrieval in which an arbitrary word spotting method is used, and a manifold ranking algorithm is then applied to the initial retrieval scores. Experimental results on a database of more than 2,000 handwritten newswires show that our method can improve the performance of a state-of-the-art word spotting system by more than 10%.
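A minimal sketch of the standard manifold-ranking iteration applied to word-spotting scores; the similarity graph W and the initial scores y are assumed inputs, and the parameter values are illustrative.

```python
import numpy as np

def manifold_rank(W, y, alpha=0.9, iters=100):
    """Re-rank word-spotting scores by propagating them on a similarity graph.
    W: symmetric similarity matrix between handwritten items (zero diagonal);
    y: initial retrieval scores from the word spotter."""
    d = W.sum(axis=1)
    d[d == 0] = 1e-12
    S = W / np.sqrt(np.outer(d, d))             # normalised graph: D^-1/2 W D^-1/2
    f = y.astype(float).copy()
    for _ in range(iters):
        f = alpha * S.dot(f) + (1 - alpha) * y  # propagate, then pull back toward y
    return f                                    # higher score = more relevant
```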
The A2iA French handwriting recognition system at the Rimes-ICDAR2011 competition
Farès Menasri, Jérôme Louradour, Anne-Laure Bianne-Bernard, et al.
This paper describes the system for the recognition of French handwriting submitted by A2iA to the competition organized at ICDAR2011 using the Rimes database. This system is composed of several recognizers based on three different recognition technologies, combined using a novel combination method. A framework for multi-word recognition based on weighted finite-state transducers is presented, using an explicit word segmentation, a combination of isolated word recognizers, and a language model. The system was tested both for isolated word recognition and for multi-word line recognition and submitted to the RIMES-ICDAR2011 competition. It outperformed all previously proposed systems on these tasks.
Using connected component decomposition to detect straight line segments in documents
Xiaofan Feng, Abdou Youssef
Straight line segment detection in digital documents has been studied extensively for the past few decades. One of the challenges is to detect line segments without a priori information about the document images and to obtain good results without much parameter calibration. In this paper, we introduce a novel algorithm that is simple but effective in detecting straight line segments in scanned documents. Our Connected Component Decomposition (CCD) approach first decomposes the connected components based on the gradient direction of the edge contours, then uses Chebyshev's inequality to statistically distinguish lines from characters, followed by a simple post-processing step that examines the straightness of the remaining segments. The CCD approach is simple to follow and fast in its implementation, and its high accuracy and usability are demonstrated empirically on a practical data set with a large variety of content.
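A hedged sketch of one way Chebyshev's inequality can turn a tolerance on edge-gradient directions into a variance test for straightness; the thresholds are illustrative, not the paper's calibration.

```python
import numpy as np

def is_straight_segment(directions, tol_deg=5.0, outlier_frac=0.05):
    """directions: gradient directions (degrees) sampled along the edge contour
    of one decomposed component. Chebyshev's inequality gives
    P(|X - mu| >= k*sigma) <= 1/k^2, so requiring that at most `outlier_frac`
    of directions deviate more than `tol_deg` from the mean is guaranteed
    whenever sigma <= tol_deg * sqrt(outlier_frac)."""
    sigma = float(np.std(directions))
    return sigma <= tol_deg * np.sqrt(outlier_frac)
```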
A synthetic document image dataset for developing and evaluating historical document processing methods
Daniel Walker, William Lund, Eric Ringger
Document images accompanied by OCR output text and ground truth transcriptions are useful for developing and evaluating document recognition and processing methods, especially for historical document images. Additionally, research into improving the performance of such methods often requires further annotation of training and test data (e.g., topical document labels). However, transcribing and labeling historical documents is expensive. As a result, existing real-world document image datasets with such accompanying resources are rare and often relatively small. We introduce synthetic document image datasets of varying levels of noise that have been created from standard (English) text corpora using an existing document degradation model applied in a novel way. Included in the datasets is the OCR output from real OCR engines including the commercial ABBYY FineReader and the open-source Tesseract engines. These synthetic datasets are designed to exhibit some of the characteristics of an example real-world document image dataset, the Eisenhower Communiqués. The new datasets also benefit from additional metadata that exist due to the nature of their collection and prior labeling efforts. We demonstrate the usefulness of the synthetic datasets by training an existing multi-engine OCR correction method on the synthetic data and then applying the model to reduce word error rates on the historical document dataset. The synthetic datasets will be made available for use by other researchers.