Proceedings Volume 8658

Document Recognition and Retrieval XX

Richard Zanibbi, Bertrand Coüasnon
View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 21 January 2013
Contents: 12 Sessions, 43 Papers, 0 Presentations
Conference: IS&T/SPIE Electronic Imaging 2013
Volume Number: 8658

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 8658
  • Keynote Session I
  • Image-Based Retrieval
  • Handwriting
  • Layout Analysis
  • Word and Symbol Spotting
  • Historical Documents
  • Arabic and Chinese Character Recognition
  • Interactive Paper Session
  • Math Recognition
  • Information Retrieval
  • Evaluation
Front Matter: Volume 8658
Front Matter: Volume 8658
This PDF file contains the front matter associated with SPIE Proceedings Volume 8658, including the Title Page, Copyright Information, Table of Contents, and the Conference Committee listing.
Keynote Session I
History of the Tesseract OCR engine: what worked and what didn't
This paper describes the development history of the Tesseract OCR engine, and compares the methods to general changes in the field over a similar time period. Emphasis is placed on the lessons learned with the goal of providing a primer for those interested in OCR research.
Image-Based Retrieval
Semi-structured document image matching and recognition
Olivier Augereau, Nicholas Journet, Jean-Philippe Domenger
This article presents a method to recognize and localize semi-structured documents such as ID cards, tickets, invoices, etc. Standard object recognition methods based on interest points work well on natural images but fail on document images because of repetitive patterns like text. In this article, we propose an adaptation of object recognition for document images. The advantages of our method are that it does not use character recognition or segmentation and that it is robust to rotation, scale, illumination, blur, noise and local distortions. Furthermore, tests show that an average precision of 97.2% and a recall of 94.6% are obtained for matching 7 different kinds of documents in a database of 2155 documents.
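As an illustration of the general idea (not the authors' exact adaptation), a keypoint-plus-homography matcher along these lines can be sketched with OpenCV; ORB features and RANSAC are used here as stand-ins for whatever interest points and estimator the paper actually employs.

```python
# Illustrative sketch (not the authors' exact method): locating a known
# document template inside a query image with ORB keypoints and a
# RANSAC-estimated homography, using OpenCV.
import cv2
import numpy as np

def locate_template(template_gray, scene_gray, min_matches=15):
    orb = cv2.ORB_create(nfeatures=2000)
    kp_t, des_t = orb.detectAndCompute(template_gray, None)
    kp_s, des_s = orb.detectAndCompute(scene_gray, None)
    if des_t is None or des_s is None:
        return None

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_t, des_s), key=lambda m: m.distance)
    if len(matches) < min_matches:
        return None  # template probably not present in the scene

    src = np.float32([kp_t[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_s[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None

    # Project the template corners into the scene to localize the document.
    h, w = template_gray.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(corners, H)
```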
Rotation-robust math symbol recognition and retrieval using outer contours and image subsampling
Siyu Zhu, Lei Hu, Richard Zanibbi
This paper presents, for the first time, a unified recognition and retrieval system for isolated offline printed mathematical symbols. The system is based on a nearest neighbor scheme and uses a modified Turning Function and Grid Features to calculate the distance between two symbols based on the Sum of Squared Differences. An unwrap process and an alignment process are applied to the Turning Function to deal with the horizontal and vertical shifts caused by the change of starting point and by rotation. This modified Turning Function makes our system robust against rotation of the symbol image. The system obtains a top-1 recognition rate of 96.90% and an Area Under Curve (AUC) of 47.27% for the precision/recall plot on the InftyCDB-3 dataset. Experimental results show that the system with the modified Turning Function performs significantly better than the system with the original Turning Function on the rotated InftyCDB-3 dataset.
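A minimal sketch of a turning function computed from a closed polygonal contour, together with the SSD distance between two such functions; the unwrap-against-rotation and alignment steps described in the abstract are omitted, and the sampling rate is an arbitrary choice.

```python
# Minimal turning-function sketch for a closed polygonal contour, plus an SSD
# distance between two such functions (the paper's modified alignment steps
# are not reproduced here).
import numpy as np

def turning_function(points, n_samples=128):
    """points: (N, 2) array of contour vertices, ordered along the boundary."""
    pts = np.asarray(points, dtype=float)
    edges = np.roll(pts, -1, axis=0) - pts            # edge vectors (with wraparound)
    angles = np.unwrap(np.arctan2(edges[:, 1], edges[:, 0]))
    lengths = np.linalg.norm(edges, axis=1)
    arclen = np.cumsum(lengths) / lengths.sum()       # normalized arc length
    # Resample the step function at fixed arc-length positions.
    s = np.linspace(0.0, 1.0, n_samples, endpoint=False)
    idx = np.clip(np.searchsorted(arclen, s, side="right"), 0, len(angles) - 1)
    return angles[idx]

def ssd(tf_a, tf_b):
    return float(np.sum((tf_a - tf_b) ** 2))
```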
NESP: Nonlinear enhancement and selection of plane for optimal segmentation and recognition of scene word images
Deepak Kumar, M. N. Anil Prasad, A. G. Ramakrishnan
In this paper, we report a breakthrough result on the difficult task of segmentation and recognition of coloured text from the word image dataset of the ICDAR robust reading competition challenge 2: reading text in scene images. We split the word image into individual colour, gray and lightness planes and enhance the contrast of each of these planes independently by a power-law transform. The discrimination factor of each plane is computed as the maximum between-class variance used in Otsu thresholding. The plane that has the maximum discrimination factor is selected for segmentation. The trial version of Omnipage OCR is then used on the binarized words for recognition. Our recognition results on the ICDAR 2011 and ICDAR 2003 word datasets are compared with those reported in the literature. As a baseline, images binarized by simple global and local thresholding techniques were also recognized. The word recognition rate obtained by our nonlinear enhancement and selection of plane method is 72.8% and 66.2% for the ICDAR 2011 and 2003 word datasets, respectively. We have created ground truth for each image at the pixel level to benchmark these datasets, using a toolkit developed by us. The recognition rate on the benchmarked images is 86.7% and 83.9% for the ICDAR 2011 and 2003 datasets, respectively.
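A hedged sketch of the plane-selection step as described: apply a power-law (gamma) transform to each plane and keep the plane whose Otsu between-class variance (the discrimination factor) is largest. The gamma value and the list of planes are placeholders, not the paper's settings.

```python
# Sketch of plane selection by maximum Otsu between-class variance after a
# power-law transform; gamma and the plane list are illustrative assumptions.
import numpy as np

def otsu_between_class_variance(plane):
    hist, _ = np.histogram(plane.ravel(), bins=256, range=(0, 256))
    p = hist.astype(float) / hist.sum()
    omega = np.cumsum(p)                    # class-0 probability
    mu = np.cumsum(p * np.arange(256))      # class-0 cumulative mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    return np.nanmax(sigma_b)               # maximum between-class variance

def select_plane(planes, gamma=1.5):
    """planes: list of uint8 2-D arrays (e.g. R, G, B, gray, lightness)."""
    best, best_score = None, -1.0
    for plane in planes:
        enhanced = (255.0 * (plane / 255.0) ** gamma).astype(np.uint8)
        score = otsu_between_class_variance(enhanced)
        if score > best_score:
            best, best_score = enhanced, score
    return best, best_score
```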
Handwriting
Combining evidence using likelihood ratios in writer verification
Sargur Srihari, Dimitry Kovalenko, Yi Tang, et al.
Forensic identification is the task of determining whether or not observed evidence arose from a known source. It involves determining a likelihood ratio (LR): the ratio of the joint probability of the evidence and source under the identification hypothesis (that the evidence came from the source) and under the exclusion hypothesis (that the evidence did not arise from the source). In LR-based decision methods, particularly handwriting comparison, a variable number of pieces of evidence is used. A decision based on many pieces of evidence can result in nearly the same LR as one based on few pieces of evidence. We consider methods for distinguishing between such situations. One of these is to provide confidence intervals together with the decisions, and another is to combine the inputs using weights. We propose a new method that generalizes the Bayesian approach and uses an explicitly defined discount function. Empirical evaluation with several data sets, including synthetically generated ones and handwriting comparison, shows greater flexibility of the proposed method.
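For concreteness, the likelihood ratio described above can be written in standard form; the notation here is generic and not necessarily the paper's.

```latex
% Likelihood ratio for evidence e and known source s, with H_id the
% identification hypothesis and H_ex the exclusion hypothesis.
\[
  \mathrm{LR} \;=\; \frac{P(e, s \mid H_{\mathrm{id}})}{P(e, s \mid H_{\mathrm{ex}})}
\]
```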
Handwritten word preprocessing for database adaptation
Handwriting recognition systems are typically trained using publicly available databases, where data have been collected in controlled conditions (image resolution, paper background, noise level, etc.). Since this is often not the case in real-world scenarios, classification performance can be affected when novel data are presented to the word recognition system. To overcome this problem, we present in this paper a new approach called database adaptation. It consists of processing one set (training or test) in order to adapt it to the other set (test or training, respectively). Specifically, two kinds of preprocessing, namely stroke thickness normalization and pixel intensity normalization, are considered. The advantage of such an approach is that we can re-use the existing recognition system trained on controlled data. We conduct several experiments with the Rimes 2011 word database and with a real-world database, adapting either the test set or the training set. Results show that training set adaptation achieves better results than test set adaptation, at the cost of a second training stage on the adapted data. Accuracy with data set adaptation is increased by 2% to 3% in absolute value over no adaptation.
Optimal policy for labeling training samples
Lester Lipsky, Daniel Lopresti, George Nagy
Confirming the labels of automatically classified patterns is generally faster than entering new labels or correcting incorrect labels. Most labels assigned by a classifier, even if trained only on relatively few pre-labeled patterns, are correct. Therefore the overall cost of human labeling can be decreased by interspersing labeling and classification. Given a parameterized model of the error rate as an inverse power law function of the size of the training set, the optimal splits can be computed rapidly. Projected savings in operator time are over 60% for a range of empirical error functions for hand-printed digit classification with ten different classifiers.
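A toy numerical illustration of the policy question, under an assumed inverse power-law error model and assumed unit costs (none of these values come from the paper): hand-label n samples, train, then have the operator confirm or correct the classifier's labels on the rest, and scan for the cheapest split.

```python
# Toy illustration with assumed costs and error model, not the paper's values:
# error rate after training on n samples is e(n) = a * n**(-b); confirming a
# correct label is cheaper than correcting a wrong one.
def expected_cost(n, N, a=0.5, b=0.4, c_label=1.0, c_confirm=0.2, c_correct=1.2):
    err = a * n ** (-b)                      # projected error rate on the rest
    rest = N - n
    return n * c_label + rest * ((1 - err) * c_confirm + err * c_correct)

N = 10_000
best_n = min(range(100, N, 100), key=lambda n: expected_cost(n, N))
print(best_n, expected_cost(best_n, N))      # optimal split and its cost
print(expected_cost(N, N))                   # cost of hand-labeling everything
```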
Evaluation of lexicon size variations on a verification and rejection system based on SVM, for accurate and robust recognition of handwritten words
Yann Ricquebourg, Bertrand Coüasnon, Laurent Guichard
The transcription of handwritten words remains a challenging and difficult task. When processing full pages, approaches are limited by the trade-off between automatic recognition errors and the tedious aspect of human user verification. In this article, we present our investigations to improve the capabilities of an automatic recognizer so that it can reject unknown words (and thus avoid wrong decisions) while still recognizing as much as possible from the lexicon of known words. This is the active research topic of developing a verification system that optimizes the trade-off between performance and reliability. To minimize recognition errors, a verification system is usually used to accept or reject the hypotheses produced by an existing recognition system. Thus, we re-use our novel verification architecture [1] here: the recognition hypotheses are re-scored by a set of support vector machines and validated by a verification mechanism based on multiple rejection thresholds. In order to tune these (class-dependent) rejection thresholds, an algorithm based on dynamic programming has been proposed which focuses on maximizing the recognition rate for a given error rate. Experiments have been carried out on the RIMES database in three steps. The first two showed that this approach results in a performance superior or equal to other state-of-the-art rejection methods. We focus here on the third one, showing that this verification system also greatly improves the results of keyword extraction from a set of handwritten words, with strong robustness to lexicon size variations (21 lexicons have been tested, from 167 entries up to 5,600 entries). This is particularly relevant to our application context of cooperating with humans and is only made possible thanks to the rejection ability of the proposed system. Compared to an HMM with simple rejection, the proposed verification system improves the recognition rate on average by 57% (resp. 33% and 21%) for a given error rate of 1% (resp. 5% and 10%).
Layout Analysis
Comic image understanding based on polygon detection
Luyuan Li, Yongtao Wang, Zhi Tang, et al.
Comic image understanding aims to automatically decompose scanned comic page images into storyboards and then identify their reading order, which is the key technique to produce digital comic documents suitable for reading on mobile devices. In this paper, we propose a novel comic image understanding method based on polygon detection. First, we segment a comic page image into storyboards by finding the polygonal enclosing box of each storyboard. Then, each storyboard can be represented by a polygon, and the reading order is determined by analyzing the relative geometric relationship between each pair of polygons. The proposed method is tested on 2000 comic images from ten printed comic series, and the experimental results demonstrate that it works well on different types of comic images.
Context modeling for text/non-text separation in free-form online handwritten documents
Adrien Delaye, Cheng-Lin Liu
Free-form online handwritten documents contain a high diversity of content, organized without constraints imposed on the user. The lack of prior knowledge about content and layout makes the modeling of contextual information of crucial importance for the interpretation of such documents. In this work, we present a comprehensive investigation of the sources of contextual information that can benefit the task of discerning textual from non-textual strokes in online handwritten documents. An in-depth analysis of interactions between strokes is conducted through the design of various pairwise clique systems that are combined within a Conditional Random Field formulation of the stroke labeling problem. Our results demonstrate the benefits of combining complementary sources of context for improving the text/non-text recognition performance.
Annotating image ROIs with text descriptions for multimodal biomedical document retrieval
Daekeun You, Matthew Simpson, Sameer Antani, et al.
Regions of interest (ROIs) that are pointed to by overlaid markers (arrows, asterisks, etc.) in biomedical images are expected to contain more important and relevant information than other regions for biomedical article indexing and retrieval. We have developed several algorithms that localize and extract ROIs by recognizing markers on images. Cropped ROIs then need to be annotated with the content that describes them best. In most cases accurate textual descriptions of the ROIs can be found in figure captions, and these need to be combined with the image ROIs for annotation. The annotated ROIs can then be used, for example, to train classifiers that separate ROIs into known categories (medical concepts), or to build visual ontologies, for the indexing and retrieval of biomedical articles. We propose an algorithm that pairs visual and textual ROIs extracted from images and figure captions, respectively. This algorithm, based on dynamic time warping (DTW), clusters recognized pointers into groups, each of which contains pointers with identical visual properties (shape, size, color, etc.). A rule-based matching algorithm then finds the best matching group for each textual ROI mention. Our method yields a precision and recall of 96% and 79%, respectively, when ground-truth textual ROI data are used.
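The core primitive named here, dynamic time warping, can be sketched generically as below; the pointer clustering and rule-based matching built on top of it are not shown, and the Euclidean local cost is an assumption.

```python
# Generic dynamic time warping (DTW) distance between two feature sequences;
# only the core primitive named in the abstract, not the clustering pipeline.
import numpy as np

def _as_2d(x):
    x = np.asarray(x, dtype=float)
    return x[:, None] if x.ndim == 1 else x

def dtw_distance(a, b):
    """a, b: sequences of feature vectors (or of scalars)."""
    a, b = _as_2d(a), _as_2d(b)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local (Euclidean) cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```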
Graphic composite segmentation for PDF documents with complex layouts
Canhui Xu, Zhi Tang, Xin Tao, et al.
Converting PDF books to a re-flowable format has recently attracted various interests in the area of e-book reading. Robust graphic segmentation is highly desirable for increasing the practicability of PDF converters. To cope with various layouts, a multi-layer concept is introduced to segment graphic composites, including photographic images and drawings with text insets or surrounded by text elements. Both image-based analysis and the inherent advantages of born-digital documents are exploited in this multi-layer layout analysis method. By combining low-level page element clustering applied to PDF documents and connected component analysis on synthetically generated PNG renderings of the documents, graphic composites can be segmented in PDF documents with complex layouts. The experimental results on graphic composite segmentation of PDF document pages show satisfactory performance.
Word and Symbol Spotting
A classification-free word-spotting system
Nikos Vasilopoulos, Ergina Kavallieratou
In this paper, a classification-free word-spotting system appropriate for the retrieval of printed historical document images is proposed. The system skips many of the procedures of a common approach: it does not include segmentation, feature extraction or classification. Instead it treats the queries as compact shapes and uses image processing techniques in order to localize a query in the document images. Our system was tested on a historical document collection with many problems and on a Google book printed in 1675. Moreover, some comparative results are given for a traditional word spotting system.
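One simple way to realize the classification-free idea, sketched here with normalized cross-correlation template matching in OpenCV; the authors' actual shape-based localization may well differ, and the threshold is an arbitrary assumption.

```python
# Illustrative classification-free spotting: slide the query word image over
# the page with normalized cross-correlation and keep high-response positions.
import cv2
import numpy as np

def spot_word(page_gray, query_gray, threshold=0.6):
    """Return candidate top-left corners where the query word likely occurs."""
    response = cv2.matchTemplate(page_gray, query_gray, cv2.TM_CCOEFF_NORMED)
    ys, xs = np.where(response >= threshold)
    return list(zip(xs.tolist(), ys.tolist()))
```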
Combining geometric matching with SVM to improve symbol spotting
Symbol spotting is important for the automatic interpretation of technical line drawings. Current spotting methods are not reliable enough for such tasks due to low precision rates. In this paper, we combine a geometric matching-based spotting method with an SVM classifier to improve the precision of the spotting. In symbol spotting, a query symbol is to be located within a line drawing. Candidate matches can be found; however, the found matches may be true or false. To distinguish false matches, an SVM classifier is used. The classifier is trained on true and false matches of a query symbol. The matches are represented as vectors that indicate how well the query features are matched; these qualities are obtained via geometric matching. Using the classification, the precision of the spotting improved from an average of 76.6% to an average of 97.2% on a database of technical line drawings.
Segmentation-free keyword spotting framework using dynamic background model
Gaurav Kumar, Safwan Wshah, Venu Govindaraju, et al.
We propose a segmentation-free word spotting framework using a Dynamic Background Model. The proposed approach is an extension of our previous work, where the dynamic background model was introduced and integrated with a segmentation-based recognizer for keyword spotting. The dynamic background model uses the local character matching scores and the global word-level hypothesis scores to separate keywords from non-keywords. We integrate and evaluate this model on a Hidden Markov Model (HMM) based segmentation-free recognizer which works at the line level without any need for word segmentation. We outperform the state-of-the-art line-level word spotting system on the IAM dataset.
Historical Documents
Data acquisition from cemetery headstones
Cameron S. Christiansen, William A. Barrett
Data extraction from engraved text is discussed rarely, and nothing in the open literature discusses data extraction from cemetery headstones. Headstone images present unique challenges such as engraved or embossed characters (causing inner-character shadows), low contrast with the background, and significant noise due to inconsistent stone texture and weathering. Current systems for extracting text from outdoor environments (billboards, signs, etc.) make assumptions (i.e. clean and/or consistently-textured background and text) that fail when applied to the domain of engraved text. The ability to extract the data found on headstones is of great historical value. This paper describes a novel and efficient feature-based text zoning and segmentation method for the extraction of noisy text from a highly textured engraved medium. This paper also demonstrates the usefulness of constraining a problem to a specific domain. The transcriptions of images zoned and segmented through the proposed system have a precision of 55% compared to 1% precision without zoning, a 62% recall compared to 39%, and an error rate of 78% compared to 8303%.
Automated recognition and extraction of tabular fields for the indexing of census records
Robert Clawson, Kevin Bauer, Glen Chidester, et al.
We describe a system for the indexing of census records in tabular documents, with the goal of recognizing the content of each cell, including both headers and handwritten entries. Each document is automatically rectified, registered and scaled to a known template, following which lines and fields are detected and delimited as cells in a tabular form. Whole-word or whole-phrase recognition of noisy machine-printed text is performed using a glyph library, providing greatly increased efficiency and accuracy (approaching 100%), while avoiding the problems inherent in traditional OCR approaches. Constrained handwriting recognition results for a single author reach as high as 98% and 94.5% for the Gender and Birthplace fields, respectively. Multi-author accuracy (currently 82%) can be improved through an increased training set. Active integration of user feedback in the system will accelerate the indexing of records while providing a tightly coupled learning mechanism for system improvement.
Old document image segmentation using the autocorrelation function and multiresolution analysis
Maroua Mehri, Petra Gomez-Krämer, Pierre Héroux, et al.
Recent progress in the digitization of heterogeneous collections of ancient documents has rekindled new challenges in information retrieval in digital libraries and document layout analysis. Therefore, in order to control the quality of historical document image digitization and to meet the need for a characterization of their content using intermediate-level metadata (between image and document structure), we propose a fast automatic layout segmentation of old document images based on five descriptors. These descriptors, based on the autocorrelation function, are obtained by multiresolution analysis and used afterwards in a specific clustering method. The method proposed in this article has the advantage that it is performed without any hypothesis on the document structure, either about the document model (physical structure) or the typographical parameters (logical structure). It is also parameter-free since it automatically adapts to the image content. In this paper, we first detail our proposal to characterize the content of old documents by extracting the autocorrelation features in the different areas of a page and at several resolutions. Then, we show that it is possible to automatically find the homogeneous regions defined by similar autocorrelation indices, without knowledge of the number of clusters, using adapted hierarchical ascendant classification and consensus clustering approaches. To assess our method, we apply our algorithm to 316 old document images, which span six centuries (1200-1900) of French history, in order to demonstrate the performance of our proposal in terms of segmentation and characterization of heterogeneous corpus content. Moreover, we define a new evaluation metric, the homogeneity measure, which aims at evaluating the segmentation and characterization accuracy of our methodology. We obtain a mean homogeneity accuracy of 85%. These results help to represent a document by a hierarchy of layout structure and content, and to define one or more signatures for each page, on the basis of a hierarchical representation of homogeneous blocks and their topology.
Lexicon-supported OCR of eighteenth century Dutch books: a case study
Jesse de Does, Katrien Depuydt
We report on a case study on the OCR of eighteenth-century books conducted in the IMPACT project. After introducing the IMPACT project and its approach to lexicon building and deployment, we zoom in on the application of IMPACT tools and data to the Dutch EDBO collection. The results are exemplified by a detailed discussion of various practical options to improve text recognition beyond a baseline of running an uncustomized FineReader 10. In particular, we discuss improved recognition of the long s.
Arabic and Chinese Character Recognition
Character feature integration of Chinese calligraphy and font
Cao Shi, Jianguo Xiao, Wenhua Jia, et al.
A framework is proposed in this paper to effectively generate a new hybrid character type by integrating the local contour features of Chinese calligraphy with the structural features of a font in a computer system. To explore the traditional artistic manifestation of calligraphy, a multi-directional spatial filter is applied for local contour feature extraction. Then the contour of the character image is divided into sub-images. The sub-images in the identical position from various characters are estimated by a Gaussian distribution. According to this probability distribution, dilation and erosion operators are designed to adjust the boundary of the font image. New Chinese character images are then generated which possess both the contour features of artistic calligraphy and the elaborate structural features of the font. Experimental results demonstrate that the new characters are visually acceptable, and that the proposed framework is an effective and efficient strategy to automatically generate a new hybrid character type from calligraphy and font.
A segmentation-free approach to Arabic and Urdu OCR
In this paper, we present a generic Optical Character Recognition system for Arabic script languages called Nabocr. Nabocr uses OCR approaches specific to Arabic script recognition. Performing recognition on Arabic script text is relatively more difficult than on Latin text due to the nature of Arabic script, which is cursive and context sensitive. Moreover, Arabic script has different writing styles that vary in complexity. Nabocr is initially trained to recognize both the Urdu Nastaleeq and Arabic Naskh fonts. However, it can be trained by users for other Arabic script languages. We have evaluated our system's performance for both Urdu and Arabic. In order to evaluate Urdu recognition, we have generated a dataset of Urdu text called UPTI (Urdu Printed Text Image Database), which measures different aspects of a recognition system. The performance of our system on Urdu clean text is 91%. For Arabic clean text, the performance is 86%. Moreover, we have compared the performance of our system against Tesseract's newly released Arabic recognition, and the performance of both systems on clean images is almost the same.
Local projection-based character segmentation method for historical Chinese documents
Linjie Yang, Liangrui Peng
The digitization of historical Chinese documents involves two key technologies: character segmentation and character recognition. This paper focuses on developing a character segmentation algorithm. As a preprocessing step, we combine several effective measures to remove noise in a historical Chinese document image. After binarization, a new character segmentation algorithm segments single characters based on projections of a cost image in local windows. The cost image is constructed by utilizing the information of stroke bounding boxes and a skeleton image extracted from the binarized image. We evaluate the proposed algorithm based on the matching degrees of character bounding boxes between segmentation results and ground-truth data, and achieve a recall rate of 74.3% on a test set, which shows the effectiveness of the proposed algorithm.
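A minimal sketch of the underlying projection idea on a binarized text-line image; the paper's cost image built from stroke bounding boxes and a skeleton, and its local windows, are not reproduced here.

```python
# Basic projection-profile segmentation of a binarized text-line image
# (foreground = 1): cut at zero-projection columns, then merge regions
# separated by very narrow gaps. Only illustrates the projection idea.
import numpy as np

def segment_by_projection(binary_line, min_gap=2):
    profile = binary_line.sum(axis=0)          # vertical projection profile
    is_gap = profile == 0
    regions, start = [], None
    for x, gap in enumerate(is_gap):
        if not gap and start is None:
            start = x                          # character region begins
        elif gap and start is not None:
            regions.append((start, x))         # region ends at first gap column
            start = None
    if start is not None:
        regions.append((start, len(profile)))
    merged = []
    for s, e in regions:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)    # merge across narrow gaps
        else:
            merged.append((s, e))
    return merged
```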
Interactive Paper Session
A super resolution framework for low resolution document image OCR
Di Ma, Gady Agam
Optical character recognition is widely used for converting document images into digital media. Existing OCR algorithms and tools produce good results from high resolution, good quality document images. In this paper, we propose a machine learning based super resolution framework for low resolution document image OCR. Two main techniques are used in our proposed approach: a document page segmentation algorithm and a modified K-means clustering algorithm. Using this approach, by exploiting coherence in the document, we reconstruct a better resolution image from a low resolution document image and improve OCR results. Experimental results show substantial gains on low resolution documents such as those captured from video.
A robust pointer segmentation in biomedical images toward building a visual ontology for biomedical article retrieval
Daekeun You, Matthew Simpson, Sameer Antani, et al.
Pointers (arrows and symbols) are frequently used in biomedical images to highlight specific image regions of interest (ROIs) that are mentioned in figure captions and/or text discussion. Detection of pointers is the first step toward extracting relevant visual features from ROIs and combining them with textual descriptions for a multimodal (text and image) biomedical article retrieval system. Recently we developed a pointer recognition algorithm based on an edge-based pointer segmentation method, and subsequently reported improvements made on our initial approach involving the use of Active Shape Models (ASM) for pointer recognition and a region growing-based method for pointer segmentation. These methods contributed to improving the recall of pointer recognition but not much to the precision. The method discussed in this article is our recent effort to improve the precision rate. Evaluation performed on two datasets and compared with other pointer segmentation methods shows significantly improved precision and the highest F1 score.
Combining multiple thresholding binarization values to improve OCR output
For noisy, historical documents, a high optical character recognition (OCR) word error rate (WER) can render the OCR text unusable. Since image binarization is often the method used to identify foreground pixels, a body of research seeks to improve image-wide binarization directly. Instead of relying on any one imperfect binarization technique, our method incorporates information from multiple simple thresholding binarizations of the same image to improve the text output. Using a new corpus of 19th century newspaper grayscale images for which the text transcription is known, we observe WERs of 13.8% and higher using current binarization techniques and a state-of-the-art OCR engine. Our novel approach combines the OCR outputs from multiple thresholded images by aligning the text output and producing a lattice of word alternatives, from which a lattice word error rate (LWER) is calculated. Our results show an LWER of 7.6% when aligning two threshold images and an LWER of 6.8% when aligning five. From the word lattice we commit to one hypothesis by applying the methods of Lund et al. (2011), achieving an improvement over the original OCR output and an 8.41% WER result on this data set.
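A hedged sketch of the combination idea: binarize the same grayscale page at several global thresholds, run each result through an external OCR engine, and align the resulting token sequences into pairs of word alternatives. The threshold values are arbitrary and the alignment below is a simple pairwise difflib alignment, not the paper's full multi-sequence lattice.

```python
# Sketch of the combination idea: multiple global-threshold binarizations of
# one page, plus a simple pairwise alignment of two OCR token sequences.
from difflib import SequenceMatcher
import numpy as np

def binarize_at(gray, thresholds=(100, 130, 160, 190, 220)):
    """Return one binary image per global threshold (to be OCRed externally)."""
    return [(gray > t).astype(np.uint8) * 255 for t in thresholds]

def align_tokens(ref_tokens, hyp_tokens):
    """Pair words from two OCR outputs; None marks a missing alternative."""
    pairs = []
    sm = SequenceMatcher(a=ref_tokens, b=hyp_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            pairs += list(zip(ref_tokens[i1:i2], hyp_tokens[j1:j2]))
        else:
            hyps = hyp_tokens[j1:j2] + [None] * max(0, (i2 - i1) - (j2 - j1))
            pairs += list(zip(ref_tokens[i1:i2], hyps))
    return pairs
```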
Goal-oriented evaluation of binarization algorithms for historical document images
Binarization is of significant importance in document analysis systems. It is an essential first step, prior to further stages such as Optical Character Recognition (OCR), document segmentation, or enhancement of readability of the document after some restoration stages. Hence, proper evaluation of binarization methods to verify their effectiveness is of great value to the document analysis community. In this work, we perform a detailed goal-oriented evaluation of image quality assessment of the 18 binarization methods that participated in the DIBCO 2011 competition using the 16 historical document test images used in the contest. We are interested in the image quality assessment of the outputs generated by the different binarization algorithms as well as the OCR performance, where possible. We compare our evaluation of the algorithms based on human perception of quality to the DIBCO evaluation metrics. The results obtained provide an insight into the effectiveness of these methods with respect to human perception of image quality as well as OCR performance.
Document segmentation via oblique cuts
Jeremy Svendsen, Alexandra Branzan-Albu
This paper presents a novel solution for the layout segmentation of graphical elements in Business Intelligence documents. We propose a generalization of the recursive X-Y cut algorithm which allows for cutting along arbitrary oblique directions. An intermediate processing step consisting of line and solid region removal is also necessary due to the presence of decorative elements. The output of the proposed segmentation is a hierarchical structure which allows for the identification of primitives in pie and bar charts. The algorithm was tested on a database composed of charts from business documents. Results are very promising.
Preprocessing document images by resampling is error prone and unnecessary
Integrity tests are proposed for image processing algorithms that should yield essentially the same output under 90 degree rotations, edge-padding and monotonic gray-scale transformations of scanned documents. The tests are demonstrated on built-in functions of the Matlab Image Processing Toolbox. Only the routine that reports the area of the convex hull of foreground components fails the rotation test. Ensuring error-free preprocessing operations like size and skew normalization that are based on resampling an image requires more radical treatment. Even if faultlessly implemented, resampling is generally irreversible and may introduce artifacts. Fortunately, advances in storage and processor technology have all but eliminated any advantage of preprocessing or compressing document images by resampling them. Using floating point coordinate transformations instead of resampling images yields accurate run-length, moment, slope, and other geometric features.
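A small illustration of the point being made: a geometric feature such as the foreground centroid can be computed from floating-point transformed pixel coordinates directly, with no resampling of the image; the transform and toy image below are arbitrary examples, not the paper's tests.

```python
# Compute a geometric feature (centroid) after a coordinate transform,
# without ever resampling the image itself.
import numpy as np

def centroid_after_transform(binary_img, A, t):
    """A: 2x2 linear transform, t: 2-vector translation, applied to (x, y)."""
    ys, xs = np.nonzero(binary_img)
    coords = np.stack([xs, ys], axis=1).astype(float)     # (N, 2) points as (x, y)
    transformed = coords @ np.asarray(A, dtype=float).T + np.asarray(t, dtype=float)
    return transformed.mean(axis=0)

# Example: a 30-degree rotation applied to the coordinates, not to the pixels.
theta = np.deg2rad(30)
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
img = np.zeros((10, 10), dtype=np.uint8)
img[2:5, 3:8] = 1
print(centroid_after_transform(img, A, t=[0.0, 0.0]))
```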
Multilingual artificial text detection and extraction from still images
Ahsen Raza, Ali Abidi, Imran Siddiqi
In this paper we present a novel method for multilingual artificial text extraction from still images. We propose a lexicon-independent, block-based technique that employs a combination of spatial transforms and texture, edge, and gradient based operations to detect unconstrained textual regions in still images. Finally, some morphological and geometrical constraints are applied for fine localization of textual content. The proposed method was evaluated on two standard and three custom-developed datasets comprising a wide variety of images with artificial text occurrences in five different languages, namely English, Urdu, Arabic, Chinese and Hindi.
A proposal system for historic Arabic manuscript transcription and retrieval
In this paper, we propose a computer-assisted transcription system for old registers, handwritten in Arabic from the 19th century onwards, held in the National Archives of Tunisia (NAT). The proposed system assists the human supervisor in completing the transcription task as efficiently as possible. This assistance is given at all the different recognition levels. Our system addresses different approaches for the transcription of document images. It also implements an alignment method to find mappings between the word images of a handwritten document and their respective words in its given transcription.
Evaluation of document binarization using eigen value decomposition
Deepak Kumar, M. N. Anil Prasad, A. G. Ramakrishnan
A necessary step for the recognition of scanned documents is binarization, which is essentially the segmentation of the document. Several algorithms for binarizing a scanned document can be found in the literature. What is the best binarization result for a given document image? To answer this question, a user needs to check different binarization algorithms for suitability, since different algorithms may work better for different types of documents. Manually choosing the best from a set of binarized documents is time consuming. To automate the selection of the best segmented document, we either need to use the ground truth of the document or propose an evaluation metric. If ground truth is available, then precision and recall can be used to choose the best binarized document. What about the case when ground truth is not available? Can we come up with a metric that evaluates these binarized documents? Hence, we propose a metric to evaluate binarized document images using eigenvalue decomposition. We have evaluated this measure on the DIBCO and H-DIBCO datasets. The proposed method chooses the best binarized document, i.e., the one closest to the ground truth of the document.
Efficient symbol retrieval by building a symbol index from a collection of line drawings
Symbol retrieval is important for content-based search in digital libraries and for the automatic interpretation of line drawings. In this work, we present a complete symbol retrieval system. The proposed system has an off-line content-analysis stage, where the contents of a database of line drawings are represented as a symbol index, a compact indexable representation of the database. Such a representation allows efficient on-line query retrieval. Within the retrieval system, three methods are presented. First, a feature grouping method identifies local regions of interest (ROIs) in the drawings; the found ROIs represent symbol parts. Second, a clustering method based on geometric matching is used to cluster the similar parts from all the drawings together, and a symbol index is then constructed from the cluster representatives. Finally, the ROIs of a query symbol are matched to the cluster representatives; the matching symbol parts are retrieved from the clusters, and spatial verification is performed on the matching parts. By using the symbol index we are able to achieve a query look-up time that is independent of the database size and dependent only on the size of the symbol index. The retrieval system achieves higher recall and precision than state-of-the-art methods.
Math Recognition
Structural analysis of online handwritten mathematical symbols based on support vector machines
Foteini Simistira, Vassilis Papavassiliou, Vassilis Katsouros, et al.
Mathematical expression recognition is still a very challenging task for the research community, mainly because of the two-dimensional (2D) structure of mathematical expressions (MEs). In this paper, we present a novel approach for the structural analysis between two online handwritten mathematical symbols of an ME, based on spatial features of the symbols. We introduce six features to represent the spatial affinity of the symbols and compare two multi-class classification methods that employ support vector machines (SVMs), one based on the “one-against-one” technique and one based on “one-against-all”, for identifying the relation between a pair of symbols (e.g. subscript, numerator, etc.). A dataset containing 1906 spatial relations derived from the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME) 2012 training dataset is constructed to evaluate the classifiers and compare them with the rule-based classifier of the ILSP-1 system that participated in the contest. The experimental results give an overall mean error rate of 2.61% for the “one-against-one” SVM approach, 6.57% for the “one-against-all” SVM technique, and 12.31% for the ILSP-1 classifier.
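A sketch of the two multi-class SVM strategies being compared, using scikit-learn on placeholder feature vectors; the paper's six spatial features, kernel choice and tuned parameters are not reproduced here.

```python
# Comparing one-vs-one and one-vs-all multi-class SVMs (scikit-learn) on
# placeholder data standing in for the paper's six spatial features and
# relation labels (subscript, superscript, ...).
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1906, 6))          # placeholder spatial-feature vectors
y = rng.integers(0, 6, size=1906)       # placeholder relation labels

ovo = SVC(kernel="rbf", C=10.0)                         # SVC is one-vs-one internally
ova = OneVsRestClassifier(SVC(kernel="rbf", C=10.0))    # explicit one-vs-all wrapper

print("one-vs-one:", cross_val_score(ovo, X, y, cv=5).mean())
print("one-vs-all:", cross_val_score(ova, X, y, cv=5).mean())
```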
Using online handwriting and audio streams for mathematical expressions recognition: a bimodal approach
Sofiane Medjkoune, Harold Mouchère, Simon Petitrenaud, et al.
The work reported in this paper concerns the problem of mathematical expression recognition, which is known to be a very hard task. We propose to alleviate the difficulties by taking into account two complementary modalities: handwriting and audio. To combine the signals coming from both modalities, various fusion methods are explored. Performances evaluated on the HAMEX dataset show a significant improvement compared to a single-modality (handwriting) based system.
Information Retrieval
Using clustering and a modified classification algorithm for automatic text summarization
Abdelkrime Aries, Houda Oufaida, Omar Nouali
In this paper we describe a modified classification method intended for extractive summarization. The classification in this method does not need a learning corpus; it uses the input text itself. First, we cluster the document sentences to exploit the diversity of topics; then we apply a learning algorithm (here, Naive Bayes) treating each cluster as a class. After obtaining the classification model, we calculate the score of each sentence in each class using a scoring model derived from the classification algorithm. These scores are then used to reorder the sentences and extract the first ones as the output summary. We conducted experiments using a corpus of scientific papers and compared our results to another summarization system called UNIS [1]. We also study the impact of tuning the clustering threshold on the resulting summary, as well as the impact of adding more features to the classifier. We found that this method gives good performance, and that the addition of new features (which is simple with this method) can improve the summary's accuracy.
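A hedged sketch of the described pipeline with off-the-shelf scikit-learn components (TF-IDF, KMeans, Multinomial Naive Bayes); the paper's own threshold-based clustering, scoring model and feature set may differ.

```python
# Sketch: cluster the document's sentences, train Naive Bayes with cluster ids
# as classes, score each sentence by the probability of its own cluster, and
# keep the top-scoring sentences in document order.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

def summarize(sentences, n_clusters=3, n_keep=3):
    vec = TfidfVectorizer()
    X = vec.fit_transform(sentences)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    nb = MultinomialNB().fit(X, clusters)
    proba = nb.predict_proba(X)
    class_index = {c: i for i, c in enumerate(nb.classes_)}
    scores = [proba[i, class_index[c]] for i, c in enumerate(clusters)]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_keep])]
```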
Evaluating supervised topic models in the presence of OCR errors
Daniel Walker, Eric Ringger, Kevin Seppi
Supervised topic models are promising tools for text analytics that simultaneously model topical patterns in document collections and relationships between those topics and document metadata, such as timestamps. We examine empirically the effect of OCR noise on the ability of supervised topic models to produce high quality output through a series of experiments in which we evaluate three supervised topic models and a naive baseline on synthetic OCR data having various levels of degradation and on real OCR data from two different decades. The evaluation includes experiments with and without feature selection. Our results suggest that supervised topic models are no better, or at least not much better in terms of their robustness to OCR errors, than unsupervised topic models and that feature selection has the mixed result of improving topic quality while harming metadata prediction quality. For users of topic modeling methods on OCR data, supervised topic models do not yet solve the problem of finding better topics than the original unsupervised topic models.
Rule-based versus training-based extraction of index terms from business documents: how to combine the results
Daniel Schuster, Marcel Hanke, Klemens Muthmann, et al.
Current systems for automatic extraction of index terms from business documents take either a rule-based or a training-based approach. As both approaches have their advantages and disadvantages, it seems natural to combine both methods to get the best of both worlds. We present a combination method with the steps selection, normalization, and combination, based on comparable scores produced during extraction. Furthermore, novel evaluation metrics are developed to support the assessment of each step in an existing extraction system. Our methods were evaluated on an example extraction system with three individual extractors and a corpus of 12,000 scanned business documents.
Post processing with first- and second-order hidden Markov models
Kazem Taghva, Srijana Poudel, Spandana Malreddy
In this paper, we present the implementation and evaluation of first order and second order Hidden Markov Models to identify and correct OCR errors in the post processing of books. Our experiments show that the first order model approximately corrects 10% of the errors with 100% precision, while the second order model corrects a higher percentage of errors with much lower precision.
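For illustration, a compact first-order Viterbi decoder over characters of the kind such a post-processing model requires; the transition (character bigram) and emission (OCR confusion) tables are assumed inputs, and the second-order extension is not shown.

```python
# First-order Viterbi decoding of the most likely "true" character sequence
# given an observed OCR string. `trans[a][b]` is an assumed bigram probability
# and `emit[a][o]` an assumed OCR confusion probability; unseen events are
# smoothed with a tiny floor.
import math

def viterbi_correct(observed, states, start, trans, emit, floor=1e-12):
    V = [{s: math.log(start.get(s, floor))
             + math.log(emit.get(s, {}).get(observed[0], floor)) for s in states}]
    back = []
    for o in observed[1:]:
        col, ptr = {}, {}
        for s in states:
            best_prev = max(states,
                            key=lambda p: V[-1][p] + math.log(trans.get(p, {}).get(s, floor)))
            col[s] = (V[-1][best_prev]
                      + math.log(trans.get(best_prev, {}).get(s, floor))
                      + math.log(emit.get(s, {}).get(o, floor)))
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))
```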
Combining discriminative SVM models for the improved recognition of investigator names in medical articles
Xiaoli Zhang, Jie Zou, Daniel X. Le, et al.
Investigators are people who are listed as members of corporate organizations but not entered as authors in an article. Beginning with journals published in 2008, investigator names are required to be included in a new bibliographic field in MEDLINE citations. Automatic extraction of investigator names is necessary due to the increase in collaborative biomedical research and consequently the large number of such names. We implemented two discriminative SVM models, i.e., SVM and structural SVM, to identify named entities such as the first and last names of investigators from online medical journal articles. Both approaches achieve good performance at the word and name chunk levels. We further conducted an error analysis and found that SVM and structural SVM can offer complementary information about the patterns to be classified. Hence, we combined the two independently trained classifiers where the SVM is chosen as a base learner with its outputs enhanced by the predictions from the structural SVM. The overall performance especially the recall rate of investigator name retrieval exceeds that of the standalone SVM model.
Evaluation
Adaptive detection of missed text areas in OCR outputs: application to the automatic assessment of OCR quality in mass digitization projects
Ahmed Ben Salah, Nicolas Ragot, Thierry Paquet
The French National Library (BnF) has launched many mass digitization projects in order to give access to its collections. The indexation of digital documents on Gallica (the digital library of the BnF) is done through their textual content, obtained thanks to service providers that use Optical Character Recognition (OCR) software. OCR engines have become increasingly complex systems composed of several subsystems dedicated to the analysis and recognition of the elements in a page. However, the reliability of these systems remains an issue. Indeed, in some cases, errors in OCR outputs occur because of an accumulation of several errors at different levels of the OCR process. One of the frequent errors in OCR outputs is missed text components, and the presence of such errors may lead to severe defects in digital libraries. In this paper, we investigate the detection of missed text components to control the OCR results for the collections of the French National Library. Our verification approach uses local information inside the pages, based on Radon transform descriptors and Local Binary Pattern (LBP) descriptors coupled with the OCR results, to control their consistency. The experimental results show that our method detects 84.15% of the missed textual components when comparing the ALTO OCR output files (produced by the service providers) to the images of the documents.
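A sketch of one of the two descriptor families mentioned, uniform LBP histograms computed over fixed-size page blocks with scikit-image; the Radon transform descriptors, the block size, and the consistency check against the OCR output are not shown or are assumptions.

```python
# Uniform LBP histograms over fixed-size page blocks (scikit-image); block
# size and LBP parameters are illustrative, not the paper's settings.
import numpy as np
from skimage.feature import local_binary_pattern

def block_lbp_histograms(gray, block=128, P=8, R=1):
    lbp = local_binary_pattern(gray, P, R, method="uniform")
    n_bins = P + 2                               # uniform patterns + "other"
    feats = []
    h, w = gray.shape
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            patch = lbp[y:y + block, x:x + block]
            hist, _ = np.histogram(patch, bins=n_bins, range=(0, n_bins), density=True)
            feats.append(((y, x), hist))         # block position and its descriptor
    return feats
```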
Evaluating structural pattern recognition for handwritten math via primitive label graphs
Richard Zanibbi, Harold Mouchère, Christian Viard-Gaudin
Currently, structural pattern recognizer evaluations compare graphs of detected structure to target structures (i.e. ground truth) using recognition rates, recall and precision for object segmentation, classification and relationships. In document recognition, these target objects (e.g. symbols) are frequently comprised of multiple primitives (e.g. connected components, or strokes for online handwritten data), but current metrics do not characterize errors at the primitive level, from which object-level structure is obtained. Primitive label graphs are directed graphs defined over primitives and primitive pairs. We define new metrics obtained by Hamming distances over label graphs, which allow classification, segmentation and parsing errors to be characterized separately, or using a single measure. Recall and precision for detected objects may also be computed directly from label graphs. We illustrate the new metrics by comparing a new primitive-level evaluation to the symbol-level evaluation performed for the CROHME 2012 handwritten math recognition competition. A Python-based set of utilities for evaluating, visualizing and translating label graphs is publicly available.
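A small, simplified illustration of Hamming distances over label graphs, with a label graph represented as a dict of primitive labels plus a dict of relation labels on primitive pairs; this representation and counting are a simplification for illustration, not the authors' released utilities.

```python
# Toy label-graph Hamming distances: node-label disagreements capture
# classification/segmentation errors, edge-label disagreements capture
# relation (parsing) errors; their sum is a single combined measure.
def label_graph_hamming(output, target):
    node_out, edge_out = output
    node_tgt, edge_tgt = target
    node_errors = sum(node_out.get(p) != lbl for p, lbl in node_tgt.items())
    pairs = set(edge_out) | set(edge_tgt)
    edge_errors = sum(edge_out.get(pr, "_") != edge_tgt.get(pr, "_") for pr in pairs)
    return node_errors, edge_errors, node_errors + edge_errors

# Two strokes forming an 'x' with a superscript '2' (toy example).
target = ({"s1": "x", "s2": "2"}, {("s1", "s2"): "superscript"})
output = ({"s1": "x", "s2": "z"}, {("s1", "s2"): "right"})
print(label_graph_hamming(output, target))   # (1, 1, 2)
```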
WFST-based ground truth alignment for difficult historical documents with text modification and layout variations
Mayce Al Azawi, Marcus Liwicki, Thomas M. Breuel
This work proposes several approaches that can be used for generating correspondences between real scanned books and their transcriptions, which might have different modifications and layout variations, also taking OCR errors into account. Our approaches for the alignment between the manuscript and the transcription are based on weighted finite state transducers (WFSTs). In particular, we propose adapted WFSTs to represent the transcription to be aligned with the OCR lattices. The character-level alignment has edit rules to allow edit operations (insertion, deletion, substitution). Those edit operations allow the transcription model to deal with OCR segmentation and recognition errors, and also with the task of aligning with different text editions. We implemented the alignment model with a hyphenation model, so it can adapt the non-hyphenated transcription. Our models also work with Fraktur ligatures, which are typically found in historical Fraktur documents. We evaluated our approach on Fraktur documents from the "Wanderungen durch die Mark Brandenburg" volumes (1862-1889) and observed the performance of those models under OCR errors. We compare the performance of our model for different scenarios: having no information about the correspondence at the (i) word, (ii) line, (iii) sentence, or (iv) page level.