Proceedings Volume 4307

Document Recognition and Retrieval VIII

View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 21 December 2000
Contents: 9 Sessions, 38 Papers, 0 Presentations
Conference: Photonics West 2001 - Electronic Imaging 2001
Volume Number: 4307

Table of Contents

  • Performance Evaluation
  • Table Understanding
  • Retrieval Topics
  • Document Layout Analysis
  • Document Compression
  • Special Topics in Document Recognition and Retrieval
  • Text Segmentation
  • Text Recognition and Postprocessing
  • Poster Session
  • Special Topics in Document Recognition and Retrieval
Performance Evaluation
TRUEVIZ: a groundtruth/metadata editing and visualizing toolkit for OCR
Tapas Kanungo, Chang Ha Lee, Jeff Czorapinski, et al.
Tools for visualizing and creating groundtruth and metadata are crucial for document image analysis research. In this paper, we describe TrueViz, a tool for visualizing and editing groundtruth/metadata for OCR. TrueViz is implemented in the Java programming language and works on various platforms, including Windows and Unix. TrueViz reads and stores groundtruth/metadata in XML format and reads a corresponding image stored in TIFF image file format. A multilingual text editing, display, and search module based on the Unicode representation of text is also provided. This software is being made available free of charge to researchers.
Handwritten document image database construction and retrieval system
This paper describes an off-line handwritten document data collection effort conducted at CEDAR and discusses systems that manage the document image data. We introduce the CEDAR letter, discuss its completeness, and then describe the specification of the CEDAR letter image database, consisting of writer data and features obtained from a handwriting sample, statistically representative of the U.S. population. We divide the document image and information management system into four systems: (1) acquisition, (2) archiving, (3) indexing and retrieval, and (4) display. This paper discusses the issues raised in constructing the CEDAR letter database and its potential usefulness to the document image analysis, recognition, and identification fields.
Experimental tool for generating ground truths for skewed page images
Oleg G. Okun, Ari Vesanen, Matti Pietikainen
We describe a new tool called GROTTO for generating ground truths for skewed page images, which can be used for performance evaluation of page segmentation algorithms. Some of these algorithms are claimed to be more or less insensitive to skew. However, this claim is usually supported only by a visual comparison of what one obtains and what one should obtain. As a result, the evaluation is both subjective (that is, prone to errors) and tedious. Our tool allows users to quickly and easily produce many sufficiently accurate ground truths that can be employed in practice, and it therefore facilitates automatic performance evaluation. The main idea is to utilize the ground truths available for upright images, that is, for those without skew, together with the concept of the representative square, in order to produce the ground truths for skewed images. The usefulness of our tool is demonstrated through a number of experiments.
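The geometric core of such a tool can be illustrated with a short sketch: given upright ground-truth boxes and a known skew angle, the boxes are mapped onto the skewed page by rotation. This is only a minimal illustration of the general idea; the rotation-about-the-page-centre convention and the function names are assumptions, not the GROTTO implementation described above.

```python
import math

def rotate_point(x, y, cx, cy, angle_deg):
    """Rotate point (x, y) about centre (cx, cy) by angle_deg degrees."""
    a = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(a) - dy * math.sin(a),
            cy + dx * math.sin(a) + dy * math.cos(a))

def skew_ground_truth(boxes, skew_deg, image_size):
    """Map upright ground-truth boxes (x0, y0, x1, y1) onto a page skewed by
    skew_deg degrees.  Returns the four rotated corners of each box
    (no longer axis-aligned)."""
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0
    skewed = []
    for x0, y0, x1, y1 in boxes:
        corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
        skewed.append([rotate_point(x, y, cx, cy, skew_deg) for x, y in corners])
    return skewed

# Example: one text-line box on a 1000 x 1400 page, skewed by 3 degrees.
print(skew_ground_truth([(100, 200, 600, 240)], 3.0, (1000, 1400)))
```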
Table Understanding
Table analysis for multiline cell identification
A table in a document is a rectilinear arrangement of cells where each cell contains a sequence of words. Several lines of text may compose one cell. Cells may be delimited by horizontal or vertical lines, but often this is not the case. A table analysis system is described which reconstructs table formatting information from table images whether or not the cells are explicitly delimited. Inputs to the system are word bounding boxes and any horizontal and vertical lines that delimit cells. Using a sequence of carefully-crafted rules, multi-line cells and their interrelationships are found even though no explicit delimiters are visible. This robust system is a component of a commercial document recognition system.
Table structure recognition and its evaluation
Jianying Hu, Ramanujan S. Kashi, Daniel P. Lopresti, et al.
Tables are an important means for communicating information in written media, and understanding such tables is a challenging problem in document layout analysis. In this paper we describe a general solution to the problem of recognizing the structure of a detected table region. First, hierarchical clustering is used to identify columns, and then spatial and lexical criteria are used to classify headers. We also address the problem of evaluating table structure recognition. Our model is based on a directed acyclic attribute graph, or table DAG. We describe a new paradigm, 'random graph probing,' for comparing the results returned by the recognition system with the representation created during ground-truthing. Probing is in fact a general concept that could be applied to other document recognition tasks and perhaps even to other computer vision problems as well.
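As a rough illustration of the column-finding step, one-dimensional clustering of word-box centres can stand in for the hierarchical clustering the abstract mentions. This is a simplified sketch under assumed inputs (word bounding boxes and a gap threshold), not the authors' algorithm or evaluation protocol.

```python
def cluster_columns(word_boxes, gap_threshold=40):
    """Group word bounding boxes (x0, y0, x1, y1) into columns by clustering
    their horizontal centres: a gap larger than gap_threshold between sorted
    centres starts a new column."""
    centres = sorted((0.5 * (b[0] + b[2]), b) for b in word_boxes)
    columns, current = [], [centres[0][1]]
    for (prev_c, _), (c, box) in zip(centres, centres[1:]):
        if c - prev_c > gap_threshold:
            columns.append(current)
            current = []
        current.append(box)
    columns.append(current)
    return columns

# Example: four word boxes forming two columns.
boxes = [(10, 0, 60, 10), (15, 20, 70, 30), (200, 0, 260, 10), (205, 20, 255, 30)]
print(len(cluster_columns(boxes)))  # -> 2
```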
Layout and language: an efficient algorithm for detecting text blocks based on spatial and linguistic evidence
The ability to accurately detect those areas in plain text documents that consist of contiguous text is an important pre-process to many applications. This paper introduces a novel method that uses both spatial and linguistic knowledge in an accurate manner to provide an initial analysis of the document. This initial analysis may then be extended to provide a complete analysis of the text areas in the document.
Retrieval Topics
Evaluating text categorization in the presence of OCR errors
In this paper we describe experiments that investigate the effects of OCR errors on text categorization. In particular, we show that in our environment, OCR errors have no effect on categorization when we use a classifier based on the naive Bayes model. We also observe that dimensionality reduction techniques eliminate a large number of OCR errors and improve categorization results.
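The classifier referred to above is the standard multinomial naive Bayes model; a minimal from-scratch sketch with Laplace smoothing is shown below. The toy training data and tokenization are illustrative only and say nothing about the paper's corpus or its OCR-error experiments.

```python
import math
from collections import Counter

class NaiveBayesClassifier:
    """Minimal multinomial naive Bayes over bag-of-words documents,
    with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.class_counts = Counter(labels)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        tokens = doc.lower().split()
        best, best_score = None, -math.inf
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            for t in tokens:
                score += math.log((self.word_counts[c][t] + 1) /
                                  (total + len(self.vocab)))
            if score > best_score:
                best, best_score = c, score
        return best

nb = NaiveBayesClassifier().fit(
    ["stock market shares rise", "election vote parliament"],
    ["finance", "politics"])
print(nb.predict("shares fall on the market"))  # -> finance
```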
QDOC'99: a system for automatic cataloging and searching of document bases
Andreas Myka
The costs of cataloguing a book manually for future access in a library exceed the costs of buying it by far. Therefore, libraries have a need for methods that support the automatic cataloguing of material. This paper describes the system QDOC'99 that implements the automatic extraction of bibliographical data from printed books into electronic catalogues; thereby, errors that originate from digitization of printed material are taken into consideration. The retrieval part of QDOC'99 comprises two approaches: one approach is directed towards a recall-oriented evaluation of queries; the other approach is directed towards a precision-oriented evaluation of queries. The performance of QDOC'99 was evaluated in an international contest on automatic cataloguing for which the results are given in the paper.
Improving retrieval effectiveness by automatically creating multiscaled links between text and pictures
Nicolas Malandain, Mauro Gaio, Jacques Madelaine
This paper describes a method to improve retrieval of composite documents (text and graphics) by creating a set of internal links. We propose the concept of granularity to add structure to this set of typed semantic links, obtained by multi-scale processing. We propose three major classes based on the capacity of a link to include, or to be included by, other links. Global links are obtained with classical IR methods. A computational model allows automatic extraction of the textual information units contained in the text source of the global links. In our geographic corpus the units denote georeferenced entities. A semantic representation of these entities is proposed that allows further cooperation with processing of the graphical part of the document.
Document Layout Analysis
Logical structure detection for heterogeneous document classes
Leon Todoran, Marco Aiello, Christof Monz, et al.
We present a fully implemented system based on generic document knowledge for detecting the logical structure of documents for which only general layout information is assumed. In particular, we focus on detecting the reading order. Our system integrates components based on computer vision, artificial intelligence, and natural language processing techniques. The prominent feature of our framework is its ability to handle documents from heterogeneous collections. The system has been evaluated on a standard collection of documents to measure the quality of the reading order detection. Experimental results for each component and the system as a whole are presented and discussed in detail. The performance of the system is promising, especially when considering the diversity of the document collection.
Automated labeling in document images
Jongwoo Kim, Daniel X. Le, George R. Thoma
The National Library of Medicine (NLM) is developing an automated system to produce bibliographic records for its MEDLINE database. This system, named Medical Article Record System (MARS), employs document image analysis and understanding techniques and optical character recognition (OCR). This paper describes a key module in MARS called the Automated Labeling (AL) module, which labels all zones of interest (title, author, affiliation, and abstract) automatically. The AL algorithm is based on 120 rules that are derived from an analysis of journal page layouts and features extracted from OCR output. Experiments carried out on more than 11,000 articles in over 1,000 biomedical journals show the accuracy of this rule-based algorithm to exceed 96%.
Turbo recognition: a statistical approach to layout analysis
Taku A. Tokuyasu, Philip A. Chou
Turbo recognition (TR) is a communication theory approach to the analysis of rectangular layouts, in the spirit of Document Image Decoding. The TR algorithm, inspired by turbo decoding, is based on a generative model of image production, in which two grammars are used simultaneously to describe structure in orthogonal (horizontal and vertical) directions. This enables TR to strictly embody non-local constraints that cannot be taken into account by local statistical methods. This basis in finite state grammars also allows TR to be quickly retargetable to new domains. We illustrate some of the capabilities of TR with two examples involving realistic images. While TR, like turbo decoding, is not guaranteed to recover the statistically optimal solution, we present an experiment that demonstrates its ability to produce optimal or near-optimal results on a simple yet nontrivial example, the recovery of a filled rectangle in the midst of noise. Unlike methods such as stochastic context-free grammars and exhaustive search, which are often intractable beyond small images, turbo recognition scales linearly with image size, suggesting TR as an efficient yet near-optimal approach to statistical layout analysis.
Recognition techniques for extracting information from semistructured documents
Anna Della Ventura, Isabella Gagliardi, Bruna Zonta
Archives of optical documents are more and more widely employed, with demand driven also by new norms sanctioning the legal value of digital documents, provided they are stored on physically unalterable media. On the supply side there is now a vast and technologically advanced market, where optical memories have solved the problem of the duration and permanence of data at costs comparable to those of magnetic memories. The remaining bottleneck in these systems is indexing. The indexing of documents with a variable structure, while still not completely automated, can be machine supported to a large degree, with evident advantages both in the organization of the work and in extracting information, providing data that is much more detailed and potentially significant for the user. We present here a system for the automatic registration of correspondence to and from a public office. The system is based on a general methodology for the extraction, indexing, archiving, and retrieval of significant information from semi-structured documents. In our prototype application, this information is distributed among the database fields of sender, addressee, subject, date, and body of the document.
Extracting halftones from scanned color documents and converting them into continuous form
Jiyun Byun, Youngmee Han, Minhwan Kim
The purpose of this paper is to propose a procedure that automatically extracts color halftones from a document image and then converts them into continuous-tone images. An extraction method for color halftones is proposed, based on analysis of the Fourier spectrum of the color halftones. The characteristics of color halftone patterns in the Fourier spectrum are used to determine halftone regions in the document image. In particular, this method is designed not to extract the background-halftone region that consists of text and simple halftones. The method is also applicable to extracting arbitrarily shaped halftone regions. The extracted halftone regions are then converted into continuous-tone images using the color inverse halftoning proposed in our previous work. A new color-channel separation method for the color inverse halftoning is proposed in this paper. Additionally, an enhancement method for inverse halftoned images is presented in order to reduce various distortions introduced in the printing or scanning process. The proposed methods can be effectively used in the fields of digital libraries, the World Wide Web (WWW), and content-based image retrieval.
Document Compression
Lossy compression of gray-scale document images by adaptive-offset quantization
This paper describes an adaptive-offset quantization scheme and considers its application to the lossy compression of grayscale document images. The technique involves scalar-quantizing and entropy-coding pixels sequentially, such that the quantizer's offset is always chosen to minimize the expected number of bits emitted for each pixel, where the expectation is based on the predictive distribution used for entropy coding. To accomplish this, information is fed back from the entropy coder's statistical modeling unit to the quantizer. This feedback path is absent in traditional compression schemes. Encouraging but preliminary experimental results are presented comparing the technique with JPEG and with fixed-offset quantization on a scanned grayscale text image.
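One plausible reading of the scheme, sketched below under simplifying assumptions: for each pixel, the encoder tries each candidate offset, computes the expected code length implied by the predictive distribution over the induced quantizer bins, and keeps the offset with the smallest expectation. The discrete pmf, step size, and exhaustive offset search are illustrative choices, not details taken from the paper.

```python
import math

def expected_bits(pmf, step, offset):
    """Expected code length (bits) when pixel values with predictive pmf
    (dict value -> probability) are scalar-quantized with the given step and
    offset, and the entropy coder uses the induced bin probabilities."""
    bins = {}
    for value, p in pmf.items():
        b = (value - offset) // step
        bins[b] = bins.get(b, 0.0) + p
    return -sum(p * math.log2(p) for p in bins.values() if p > 0)

def best_offset(pmf, step):
    """Choose the quantizer offset that minimizes the expected bits per pixel."""
    return min(range(step), key=lambda off: expected_bits(pmf, step, off))

# Example: a predictive distribution concentrated around gray value 128.
pmf = {126: 0.1, 127: 0.2, 128: 0.4, 129: 0.2, 130: 0.1}
print(best_offset(pmf, step=4))
```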
Scalable DSP architecture for high-speed color document compression
Michael Thierschmann, Kai-Uwe Barthel, Uwe-Erik Martin
The processing of color documents with Document Management Systems (DMS) is possible with today's modern document scanning systems. Because of the enormous amount of image data generated when scanning a typical A4 document at 300 dpi resolution, image compression is used. The JPEG compression scheme is widely used for such image data. The loss of image quality caused by the necessary lossy compression can significantly reduce the recognition quality of a subsequent optical character recognition (OCR) process, which is essential to any DMS system. LuraDocument, a high performance system for compressing and archiving scanned documents, particularly those containing text and images, bridges the gap between high compression and the legibility of documents to be managed inside DMS systems. The use of LuraDocument results in substantially higher image quality in comparison to standard compression techniques. This high quality is achieved by combining automatic text detection with bitonal compression of text and color/grayscale wavelet compression of images. Since the innovative LuraDocument compression scheme is a complex image processing system requiring considerable computational performance, a scalable DSP system has been designed to meet the throughput of high-performance document scanners.
Unified design of symbol-matching-based document image compression and merging system
Xing Zhang, Jian Zhang
This paper describes a document image compression and merging system, which provides capabilities for automatically indexing documents to form a document library and for merging partial images of a document page. Because of the nature of document images, the technique described in this paper is intended for processing bi-level text images. The key technology for image merging is correlation analysis. State-of-the-art techniques exist for merging gray-scale and color natural images. However, these techniques do not apply well to document images containing much text: they fail too often when used to merge document images, quite apart from their computationally intensive nature. The proposed system provides a reliable correlation analysis technique for document image merging where primarily only bi-level images are available.
Special Topics in Document Recognition and Retrieval
Estimating scanning characteristics from corners in bilevel images
Degradations that occur during scanning can cause errors in Optical Character Recognition (OCR). Scans made in bilevel mode (no gray scale) from high contrast source patterns are the input to the estimation processes. Two scanner system parameters are estimated from bilevel scans using models of the scanning process and bilevel source patterns. The scanner's point spread function (PSF) width and the binarization threshold are estimated by using corner features in the scanned images. These estimation algorithms were tested in simulation and with scanned test patterns. The resulting estimates are close in value to what is expected based on gray-level analysis. The results of estimation are used to produce synthetically scanned characters that in most cases bear a strong resemblance to the characters scanned on the scanner at the same settings as the test pattern used for estimation.
Design of paper-based user interface for editing documents
Yoji Maeda, Masaki Nakagawa
This paper describes pattern processing and recognition for a user interface based on real paper, rather than a display, keyboard, and mouse, for document editing. We call this UI (User Interface) a 'Paper-based User Interface.' In order to realize this UI, we need a technology to detect and recognize correction marks written on printouts of documents. This paper presents a new technology that detects handwritten marks without restricting the way correction marks are written or forcing users to use a pen of a specific color or a specific scanner. First, documents are printed out with dotted characters, which can be easily dropped out to detect correction marks and easily emphasized to detect where each correction mark is applied. We made a prototype on a PC and verified that the method can extract almost all correction marks correctly.
Text Segmentation
Detection of text strings from mixed text/graphics images
Chien-Hua Tsai, Christos A. Papachristou
A robust system for separating text strings from mixed text/graphics images is presented. Based on a union-find (region growing) strategy, the algorithm is able to distinguish text from graphics and adapts to changes in document type, language category (e.g., English, Chinese, and Japanese), text font style and size, and text string orientation within digital images. In addition, it tolerates the document skew that usually occurs, without requiring skew correction prior to discrimination, whereas methods such as projection profiles or run-length coding are not always suitable under such conditions. The method has been tested with a variety of printed documents from different origins using one common set of parameters, and experimental results on the performance and computational efficiency of the algorithm are demonstrated on several test images from the evaluation.
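As a building block, a union-find (disjoint-set) structure can group connected components whose bounding boxes are mutually close into candidate text strings. The sketch below shows that mechanism only; the horizontal-gap and overlap tests and their thresholds are illustrative assumptions, not the paper's adaptive, orientation-independent criteria.

```python
class UnionFind:
    """Disjoint-set structure with path compression and union by size."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path compression
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return
        if self.size[ri] < self.size[rj]:
            ri, rj = rj, ri
        self.parent[rj] = ri
        self.size[ri] += self.size[rj]

def group_components(boxes, max_gap=15):
    """Merge connected-component boxes (x0, y0, x1, y1) whose horizontal gap is
    small and whose vertical extents overlap -- a crude text-string grouping."""
    uf = UnionFind(len(boxes))
    for i, a in enumerate(boxes):
        for j, b in enumerate(boxes[i + 1:], start=i + 1):
            h_gap = max(b[0] - a[2], a[0] - b[2])
            v_overlap = min(a[3], b[3]) - max(a[1], b[1])
            if h_gap <= max_gap and v_overlap > 0:
                uf.union(i, j)
    return [uf.find(i) for i in range(len(boxes))]

# The first two boxes join one group; the third stays separate.
print(group_components([(0, 0, 20, 10), (25, 0, 45, 10), (200, 0, 220, 10)]))
```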
Document text segmentation using multiband disc model
Chew Lim Tan, Bo Yuan
This paper proposes a multi-band disc model for document page segmentation to segregate text blocks from graphic images. We first introduce the idea of our disc model and go on to discuss the improved multi-band version of the disc model. The disc model takes a bottom-up segmentation approach that tries to establish local neighborhoods of objects on a page and then traces the propagation of such neighborhoods until all objects in text blocks are reached. The significance of the disc model is the link established between the sizes of the objects and their positional, and thus logical, relationships. Furthermore, the disc model is rotationally symmetric. Therefore, it can be applied to text with mixed typefaces and arbitrary outline shapes, and it is tolerant of skew or misalignment of the objects in the input images.
Text segmentation of machine-printed Gurmukhi script
Gurpreet Singh Lehal, Chandan Singh
This paper describes a scheme for text segmentation of machine-printed Gurmukhi script documents. There has been tremendous research on text segmentation of machine-printed Roman script documents. In contrast, there has been very little reported research on text segmentation of Indian language scripts in general and Gurmukhi script in particular. Research in text segmentation of Gurmukhi script faces major problems mainly related to the unique characteristics of the script, such as connectivity of characters on the headline, two or more characters in a word having intersecting minimum bounding rectangles along the horizontal direction, multi-component characters, touching characters that are present even in clean documents, and horizontally overlapping text segments. In our proposed method we use the horizontal projection profile to successively divide the text area into small sub-areas, or horizontal strips, each of which contains (1) a set of text lines, (2) a single text line, or (3) sub-parts of text lines. Using the vertical projection profile, the horizontal strips are split into smaller units such as words, characters, or sub-characters, depending on the type of the strip. Finally, each of these units is segmented into a set of connected components. The classifier is trained to recognize these connected components, which are later merged to form characters.
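The projection-profile splitting that drives the first stage can be sketched generically: sum the ink along one axis and cut wherever the profile stays empty for a few rows or columns. The sketch below is that generic operation only; the Gurmukhi-specific strip typing and headline handling described in the abstract are not modeled.

```python
import numpy as np

def split_by_profile(binary_img, axis, min_gap=2):
    """Split a binary image (1 = ink) into strips along the given axis
    wherever the projection profile stays empty for at least min_gap rows/cols.
    axis=1 gives horizontal strips (text lines); axis=0 gives vertical cuts."""
    profile = binary_img.sum(axis=axis)
    strips, start, gap = [], None, 0
    for i, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                strips.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        strips.append((start, len(profile)))
    return strips

# Example: two "text lines" separated by blank rows.
img = np.zeros((12, 20), dtype=int)
img[1:4, 2:18] = 1
img[7:10, 2:18] = 1
print(split_by_profile(img, axis=1))  # -> [(1, 4), (7, 10)]
```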
Text Recognition and Postprocessing
Approximate string matching algorithms for limited-vocabulary OCR output correction
Thomas A. Lasko, Susan E. Hauser
Five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The methods, including an adaptation of the cross correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian analysis, and Bayesian analysis on an actively thinned reference dictionary were implemented and their accuracy rates compared. Of the five, the Bayesian algorithm produced the most correct matches (87%), and had the advantage of producing scores that have a useful and practical interpretation.
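The edit-distance variants compared in the paper all build on the standard dynamic-programming recurrence; a minimal sketch with a pluggable substitution-cost table for common OCR confusions is shown below. The example costs are illustrative, not the paper's learned probabilistic substitution matrix.

```python
def ocr_edit_distance(word, candidate, sub_cost=None, indel_cost=1.0):
    """Edit distance where substitution costs can be lowered for common OCR
    confusions (e.g. '1' <-> 'l', '0' <-> 'O').  sub_cost maps (a, b) character
    pairs to a cost; unlisted substitutions cost 1."""
    sub_cost = sub_cost or {}
    n, m = len(word), len(candidate)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel_cost
    for j in range(1, m + 1):
        d[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = word[i - 1], candidate[j - 1]
            s = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            d[i][j] = min(d[i - 1][j] + indel_cost,     # deletion
                          d[i][j - 1] + indel_cost,     # insertion
                          d[i - 1][j - 1] + s)          # substitution / match
    return d[n][m]

confusions = {('1', 'l'): 0.1, ('0', 'O'): 0.1}
print(ocr_edit_distance("c1inica1", "clinical", confusions))  # -> 0.2
```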
Pattern matching techniques for correcting low-confidence OCR words in a known context
Glenn Ford, Susan E. Hauser, Daniel X. Le, et al.
A commercial OCR system is a key component of a system developed at the National Library of Medicine for the automated extraction of bibliographic fields from biomedical journals. This 5-engine OCR system, while exhibiting high performance overall, does not reliably convert very small characters, especially those that are in italics. As a result, the 'affiliations' field, which typically contains such characters in most journals, is not captured accurately and requires a disproportionately high amount of manual input. To correct this problem, dictionaries have been created from words occurring in this field (e.g., university, department, street addresses, names of cities, etc.) in 230,000 articles already processed. The OCR output corresponding to the affiliation field is then matched against these dictionary entries by approximate string-matching techniques, and the ranked matches are presented to operators for verification. This paper outlines the techniques employed and the results of a comparative evaluation.
Document image decoding using iterated complete path search
Thomas P. Minka, Dan S. Bloomberg, Kris Popat
The computation time of Document Image Decoding can be significantly reduced by employing heuristics in the search for the best decoding of a text line. By using a cheap upper bound on template match scores, up to 99.9% of the potential template matches can be avoided. In the Iterated Complete Path method, template matches are performed only along the best path found by dynamic programming on each iteration. When the best path stabilizes, the decoding is optimal and no more template matches need be performed. Computation can be further reduced in this scheme by exploiting the incremental nature of the Viterbi iterations. Because only a few trellis edge weights have changed since the last iteration, most of the backpointers do not need to be updated. We describe how to quickly identify these backpointers, without forfeiting optimality of the path. Together these improvements provide a 30x speedup over previous implementations of Document Image Decoding.
Adding linguistic constraints to document image decoding: comparing the iterated complete path and stack algorithms
Kris Popat, Daniel H. Greene, Justin K. Romberg, et al.
Beginning with an observed document image and a model of how the image has been degraded, Document Image Decoding recognizes printed text by attempting to find a most probable path through a hypothesized Markov source. The incorporation of linguistic constraints, which are expressed by a sequential predictive probabilistic language model, can improve recognition accuracy significantly in the case of moderately to severely corrupted documents. Two methods of incorporating linguistic constraints in the best-path search are described, analyzed and compared. The first, called the iterated complete path algorithm, involves iteratively rescoring complete paths using conditional language model probability distributions of increasing order, expanding state only as necessary with each iteration. A property of this approach is that it results in a solution that is exactly optimal with respect to the specified source, degradation, and language models; no approximation is necessary. The second approach considered is the Stack algorithm, which is often used in speech recognition and in the decoding of convolutional codes. Experimental results are presented in which text line images that have been corrupted in a known way are recognized using both the ICP and Stack algorithms. This controlled experimental setting preserves many of the essential features and challenges of real text line decoding, while highlighting the important algorithmic issues.
Secondary classification using key features
Venu Govindaraju, Zhixin Shi, A. Teredesai
A new multiple-level classification method is introduced. With an available feature set, classification can be done in several steps. After the first classification step using the full feature set, a high-confidence recognition result ends the recognition process. Otherwise, a secondary classification designed using a partial feature set and the information available from the earlier classification step helps classify the input further. In comparison with existing methods, our method aims to increase recognition accuracy and reliability. A feature selection mechanism based on genetic algorithms is employed to select important features that provide maximum separability between the classes under consideration. These features are then used to obtain a sharper decision on fewer classes in the secondary classification. The full feature set is still used in the earlier classification to retain complete information, and no features are discarded as they would be in the feature selection methods described in most related publications.
Poster Session
Segmentation of touching handwritten Japanese characters using the graph theory method
Projection analysis methods have been widely used to segment Japanese character strings. However, if adjacent characters have overhanging strokes or a touching point does not correspond to the histogram minimum, these methods are prone to errors. In contrast, non-projection analysis methods proposed for numerals or alphabetic characters cannot simply be applied to Japanese characters because of differences in the structure of the characters. Based on an oversegmenting strategy, a new pre-segmentation method is presented in this paper: touching patterns are represented as graphs and touching strokes are regarded as the elements of proper edge cutsets. Using graph-theoretical techniques, the cutset matrix is calculated. Then, by applying pruning rules, potential touching strokes are determined and the patterns are oversegmented. Moreover, this algorithm was confirmed in simulations to be valid for touching patterns with overhanging strokes and for doubly connected patterns.
Inference process of programmed attributed regular grammars for character recognition
Mihail Prundaru, Ioana Prundaru
The paper presents the grammar inference engine of a pattern recognition system for character recognition. The input characters are identified, thinned to a one-pixel-wide pattern, and given a feature-based description. Using the syntactic recognition paradigm, the features are the set of terminals (or terminal symbols) for the application. The feature-based description includes a set of three attributes (i.e., A, B, C) for each feature. The combined feature and attribute description for each input pattern preserves the structure of the original pattern more accurately. The grammar inference engine uses the feature-based description of each input pattern from the training set to build a grammar for each class of patterns. For each input pattern in the training set, the productions (rewriting rules) are derived together with all the necessary elements, such as the nonterminals and the branch and testing conditions. Since the grammars are regular, the process of deriving the production rules is simple. All the productions are collected together, keeping the tags consecutive and without gaps. The size of the class grammars is reduced to a level acceptable for further processing using a set of Evans heuristic rules. This algorithm identifies the redundant productions, eliminating them and the corresponding nonterminal symbols. The stop criterion for the Evans thinning algorithm ensures that no further reductions are possible. The last step of the grammar inference process enables the grammar to identify class members that were not in the training set: a cycling production rule. The grammars built in this way are used by the syntactic (character) classifier to identify the input patterns as members of a priori known classes.
Text block segmentation using pyramid structure
Chew Lim Tan, Zheng Zhang
Text block segmentation is necessary in document layout analysis. This paper describes an algorithm, and its implementation, that segregates text block by block (a block is either a title or a paragraph) from a given document, e.g., a newspaper image, based on a pyramid structure. The pyramid structure, which is amenable to parallel processing, is a multi-resolution image representation. It also simulates how the human eye sees a document from afar, visualizing its block structure; the block segmentation can identify the titles and distinguish different paragraphs based on the indentation between them. Our implementation will be used in a news article retrieval project.
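A pyramid built by repeated 2x2 reduction conveys the multi-resolution idea: coarse levels expose block structure while fine levels keep character detail. The sketch below uses a simple block-OR reduction as an assumed stand-in; it is not the authors' pyramid construction or their block-classification rules.

```python
import numpy as np

def build_pyramid(binary_img, levels=3):
    """Build an image pyramid by 2x2 block ORing: any ink in a 2x2 block marks
    the corresponding pixel at the coarser level, so coarse levels reveal
    block structure (titles, paragraphs) while fine levels keep the detail."""
    pyramid = [binary_img.astype(bool)]
    for _ in range(levels):
        img = pyramid[-1]
        h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2  # trim odd edges
        img = img[:h, :w]
        coarser = (img[0::2, 0::2] | img[1::2, 0::2] |
                   img[0::2, 1::2] | img[1::2, 1::2])
        pyramid.append(coarser)
    return pyramid

# Example: a 64 x 64 page with one "text line" shrinks to a blob at coarse levels.
page = np.zeros((64, 64), dtype=bool)
page[8:12, 4:60] = True
print([level.shape for level in build_pyramid(page)])  # (64, 64) down to (8, 8)
```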
Two-dimensional wavelet-packet-based feature selection method for image recognition
Min-soo Kim, Jang-sun Baek, Soo-hyung Kim, et al.
We propose a new approach to feature selection for the classification of image data using two-dimensional (2D) wavelet packet bases. To select key features of the image data, dimension reduction techniques are required, for which PCA has been most frequently used. However, because PCA relies on an eigenvalue system, it is not only sensitive to outliers and perturbations but also tends to extract only global features. Since the important features of image data are often characterized by local information such as edges and spikes, PCA does not provide good solutions to such problems. Eigenvalue systems are also computationally expensive, with a complexity of O(n^3), where n is the number of variables, or pixels, in the original data. In this paper, original image data are transformed into 2D wavelet packet bases and the best discriminant basis is searched to extract relevant features from the image data and to discard irrelevant information. In contrast to PCA solutions, the properties of wavelets enable the extraction of detail features along with global features. Also, the computational complexity of computing the best 2D wavelet packet basis is approximately O(n log4 n), where n is the number of pixels in the original image data. Experimental results compare the recognition rates of PCA and our approach, showing that the proposed method gives better results than PCA in most cases.
Gray-scale-image-based character recognition algorithm for low-quality and low-resolution images
Character recognition in low-quality and low-resolution images is still a challenging problem. In this paper a gray-scale-image-based character recognition algorithm is proposed, which is especially suited to gray-scale images captured from the real world or to very low quality character recognition. In our research, we classify the deformations of low-quality and low-resolution character images into two categories: (1) high spatial frequency deformations derived from the blur distortion caused by the point spread function (PSF) of scanners or cameras, from random noise, or from character deformations; and (2) low spatial frequency deformations mainly derived from large-scale background variations. Traditional recognition methods based on binary images cannot give satisfactory results on these images because such deformations result in a great number of touching or broken strokes during binarization. In the proposed method, we directly extract transform features from the gray-scale character images, which avoids the shortcomings introduced by the binarization process. Our method differs from existing gray-scale methods in that it avoids the difficult and unstable step of finding character structures in the images. By applying adequate feature selection algorithms, such as linear discriminant analysis (LDA) or principal component analysis (PCA), we can select the low frequency components that preserve the fundamental shape of characters and discard the high frequency deformation components. We also develop a gray-level-histogram-based algorithm using the native integral ratio (NIR) technique to find a threshold that removes the backgrounds of character images while maintaining the details of the character strokes as much as possible. Experiments have shown that this method is especially effective for recognition of low-quality and low-resolution images.
Page segmentation and text extraction from gray-scale images in microfilm format
Qing Yuan, Chew Lim Tan
The paper deals with a system designed to separate textual regions from graphics regions and to locate textual data against textured backgrounds. We present a method based on edge detection to automatically locate text in noise-infected grayscale newspaper images in microfilm format. The algorithm first finds the edges of textual regions using the Canny edge detector; then, by edge merging, it makes use of edge features to perform block segmentation and classification; afterwards, feature-aided connected component analysis is used to group homogeneous textual regions together within the scope of their bounding boxes. We obtain an efficient block segmentation with reduced memory size by introducing the TLC. The proposed method has been used to locate text in a group of newspaper images with multiple page layouts. Initial results are encouraging; we will expand the experimental data to over 300 microfilm images with different layout structures, and promising results are anticipated with corresponding modifications to the prototype algorithm to make it more robust and suited to different cases.
Hough-based model for recognizing bar charts in document images
YanPing Zhou, Chew Lim Tan
Bar charts are the most basic graphical representation of scientific data in technical and business papers. The objective of bar chart recognition in document image analysis is to extract graphics and text primitives structurally and then to correlate graphic interpretative information with text primitives semantically. This paper proposes a new model for generic bar chart recognition. We first map the image space into the Hough space by applying the Hough Transform to the feature points. Then we use a hypothesis-testing bar-pattern search algorithm to detect the bar patterns. We also apply a new text-primitive grouping algorithm to extract text primitives. Finally, we interpret bar primitives by correlating them with the corresponding text primitives, much as in human visual processing. The results show that the new model can recognize bar charts lying in any orientation, such as slanted bar charts, and even hand-drawn bar charts.
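The first step, mapping edge points into the Hough space of lines, can be sketched as a plain accumulator over (rho, theta) bins; peaks then correspond to the straight sides of bars. The discretization and the toy example below are assumptions for illustration, not the paper's hypothesis-testing bar-pattern search.

```python
import numpy as np

def hough_lines(edge_points, img_diag, n_theta=180, n_rho=200):
    """Accumulate votes for lines rho = x*cos(theta) + y*sin(theta) from a list
    of (x, y) edge points.  Peaks in the accumulator correspond to straight
    edges such as the sides of bars in a bar chart."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rhos = np.linspace(-img_diag, img_diag, n_rho)
    acc = np.zeros((n_rho, n_theta), dtype=int)
    for x, y in edge_points:
        rho_vals = x * np.cos(thetas) + y * np.sin(thetas)
        rho_idx = np.searchsorted(rhos, rho_vals)
        np.add.at(acc, (np.clip(rho_idx, 0, n_rho - 1), np.arange(n_theta)), 1)
    return acc, thetas, rhos

# Example: points on the vertical line x = 50 vote heavily at theta = 0.
pts = [(50, y) for y in range(100)]
acc, thetas, rhos = hough_lines(pts, img_diag=150)
r, t = np.unravel_index(acc.argmax(), acc.shape)
print(round(np.degrees(thetas[t])), round(rhos[r]))  # -> theta 0, rho near 50
```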
Partial matching: an efficient form classification method
Yungcheol Byun, Yeongwoo Choi, Gyungwhan Kim, et al.
In this paper, we propose an efficient method of classifying forms that is applicable in real life. Our method identifies a small number of matching areas whose images are distinctive with respect to the layout structure, and form classification is then performed by partially matching only these local regions. The partial matching method overcomes the problems of lengthy computation time and low recognition rate. The process is summarized as follows. First, each form image is partitioned into rectangular local regions along specific locations of the horizontal and vertical lines of the form. Next, the disparity in each local region of the compared form images is defined and measured. A penalty for each local area is computed using the pre-printed text, the filled-in data, and the size of the partitioned local area, to prevent extracting erroneous lines. The disparity and penalty are combined into a score used to select matching areas. A genetic algorithm is also applied to select the best matching regions. Our approach of searching and matching only a small number of structurally distinctive local regions reduces the processing time and yields a high rate of classification.
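The idea of scoring local regions for distinctiveness can be illustrated with a toy sketch: partition each form into grid cells, score each cell by how much its ink density varies across form classes, and keep the highest-scoring cells for partial matching. The grid partition and variance score below are illustrative stand-ins for the disparity and penalty measures defined in the paper.

```python
import numpy as np

def distinctive_regions(form_images, grid=(4, 4), top_k=3):
    """Partition each binary form image into a grid of cells and score each
    cell by the variance of its ink density across the different form classes.
    High-variance cells are the most distinctive and are kept for matching."""
    h, w = form_images[0].shape
    gh, gw = h // grid[0], w // grid[1]
    scores = {}
    for r in range(grid[0]):
        for c in range(grid[1]):
            densities = [img[r * gh:(r + 1) * gh, c * gw:(c + 1) * gw].mean()
                         for img in form_images]
            scores[(r, c)] = np.var(densities)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example: two toy "forms" that differ only in the top-left cell.
a = np.zeros((40, 40)); a[:10, :10] = 1
b = np.zeros((40, 40))
print(distinctive_regions([a, b]))  # top-left cell (0, 0) ranks first
```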
Segmentation of unconstrained handwritten numeral strings using the continuation property
Sungsoo Yoon, Gyeonghwan Kim, Yeongwoo Choi, et al.
Digit string recognition differs from isolated digit recognition because it requires segmentation of a given string into individual digits. However, proper segmentation requires a priori knowledge of the patterns that form meaningful units, which implies recognition capability. Therefore, segmentation and recognition are not separate tasks but rather one task composed of two mutually dependent procedures. In this paper, we propose a new approach to segmenting unconstrained handwritten numeral strings without explicitly guessing break points. To segment the string of digits naturally, we adopt the concept of continuation and introduce the technique of subgraph matching to predefined prototypes. This approach makes explicit segmentation unnecessary because it does not guess possible break positions, and it also makes it possible to recognize a digit even if strokes not belonging to the digit are attached to it. The correct segmentation rate of our method for 20 handwritten numeral strings from the NIST database is 97.5%.
Word extraction using irregular pyramid
PohKok Loo, Chew Lim Tan
This paper proposes a new algorithm for text extraction from imaged documents, focusing on the extraction of word groups. An irregular pyramid structure is used as the basis of the algorithm. The uniqueness of this algorithm is its inclusion of strategic background information in the analysis, which most techniques discard. Both foreground (i.e., text area) regions and portions of background (i.e., white area) regions are examined. The algorithm is based on the concept of 'closeness', whereby text information within a group is closer, in terms of spatial distance, to other members of the group than to other text areas. The results produced by the algorithm are encouraging, with the ability to correctly group words of different sizes, fonts, arrangements, and orientations.
Special Topics in Document Recognition and Retrieval
Modeling the sample distribution for clustering OCR
The paper re-examines a well-known technique in OCR, recognition by clustering followed by cryptanalysis, from a Bayesian perspective. The advantage of such techniques is that they are font-independent, but they appear not to have offered competitive performance with other pattern recognition techniques in the past. The analysis presented in this paper suggests an approach to OCR that is based on modeling the sample distribution as a mixture of Gaussians. Results suggest that such an approach may combine the advantages of cluster-based OCR with the performance of traditional classification algorithms.
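A minimal sketch of the central modeling step, fitting a mixture of Gaussians to character feature vectors and reading off cluster labels, is shown below using scikit-learn. The synthetic features, component count, and covariance choice are assumptions for illustration; mapping clusters to character identities (the cryptanalysis step) is not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-in for character feature vectors: each "glyph" is a 16-dimensional
# vector; a real system would use normalized bitmaps or shape moments.
rng = np.random.default_rng(0)
class_means = rng.normal(size=(3, 16))              # three hypothetical symbol shapes
features = np.vstack([m + 0.05 * rng.normal(size=(50, 16)) for m in class_means])

# Model the sample distribution as a mixture of Gaussians and read off cluster
# labels; a cryptanalysis step would then map clusters to character identities.
gmm = GaussianMixture(n_components=3, covariance_type='diag', random_state=0)
labels = gmm.fit_predict(features)
print(np.bincount(labels))  # roughly 50 samples per cluster
```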