Proceedings Volume 6815

Document Recognition and Retrieval XV


View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 27 January 2008
Contents: 9 Sessions, 36 Papers, 0 Presentations
Conference: Electronic Imaging 2008
Volume Number: 6815

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 6815
  • Invited Presentation
  • Classification and Recognition I
  • Image Processing and Enhancement
  • Segmentation I
  • Invited Presentation
  • Classification and Recognition II
  • Segmentation II
  • Information Extraction and Document Retrieval
  • Interactive Paper and Symposium Demonstration Session-Tuesday
Front Matter: Volume 6815
This PDF file contains the front matter associated with SPIE-IS&T Proceedings Volume 6815, including the Title Page, Copyright information, Table of Contents, Introduction (if any), and the Conference Committee listing.
Invited Presentation
DRR is a teenager
The fifteenth anniversary of the first SPIE symposium (titled Character Recognition Technologies) on Document Recognition and Retrieval provides an opportunity to examine DRR's contributions to the development of document technologies. Many of the tools taken for granted today, including workable general purpose OCR, large-scale, semi-automatic forms processing, inter-format table conversion, and text mining, followed research presented at this venue. This occasion also affords an opportunity to offer tribute to the conference organizers and proceedings editors and to the coterie of professionals who regularly participate in DRR.
Classification and Recognition I
Recognition of Arabic handwritten words using contextual character models
In this paper we present a system for the off-line recognition of cursive Arabic handwritten words. This system is an enhanced version of our reference system presented in [El-Hajj et al., 05], which is based on Hidden Markov Models (HMMs) and uses a sliding window approach. The enhanced version proposed here uses contextual character models. This approach is motivated by the fact that the set of Arabic characters includes many ascending and descending strokes which overlap with one or two neighboring characters. Additional character models are constructed according to the characters in their left or right neighborhood. Our experiments on images of the benchmark IFN/ENIT database of handwritten village/town names show that using contextual character models improves recognition. For a lexicon of 306 name classes, accuracy increases by 0.6% absolute, which corresponds to a 7.8% reduction in error rate.
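The abstract does not give the labeling scheme, but context-dependent character models are commonly built by tagging each character with its left and right neighbors, much like triphones in speech recognition. The sketch below illustrates that idea; the label format and the boundary markers are assumptions, not the authors' notation.

```python
# Minimal sketch (not the authors' code): building context-dependent character
# labels for an HMM-based recognizer, analogous to triphones in speech recognition.

def contextual_labels(word):
    """Return one context-dependent label per character of `word`,
    encoding the left and right neighbors as 'left-char+right'."""
    labels = []
    for i, ch in enumerate(word):
        left = word[i - 1] if i > 0 else "<s>"                    # word-start marker
        right = word[i + 1] if i < len(word) - 1 else "</s>"      # word-end marker
        labels.append(f"{left}-{ch}+{right}")
    return labels

print(contextual_labels("tunis"))
# ['<s>-t+u', 't-u+n', 'u-n+i', 'n-i+s', 'i-s+</s>']
```

Each such context-dependent label would get its own HMM, trained only on instances of the character appearing in that neighborhood.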
Combining different classification approaches to improve off-line Arabic handwritten word recognition
Ilya Zavorin, Eugene Borovikov, Ericson Davis, et al.
Machine perception and recognition of handwritten text in any language is a difficult problem. Even for Latin script, most solutions are restricted to specific domains like bank check courtesy amount recognition. Arabic script presents additional challenges for handwriting recognition systems due to its highly connected nature, numerous forms of each letter, and other factors. In this paper we address the problem of offline Arabic handwriting recognition of pre-segmented words. Rather than focusing on a single classification approach and trying to perfect it, we propose to combine heterogeneous classification methodologies. We evaluate our system on the IFN/ENIT corpus of Tunisian village and town names and demonstrate that the combined approach yields results that are better than those of the individual classifiers.
Writer adaptation in off-line Arabic handwriting recognition
Writer adaptation or specialization is the adjustment of handwriting recognition algorithms to a specific writer's style of handwriting. Such adjustment yields significantly improved recognition rates over counterpart general recognition algorithms. We present the first unconstrained off-line handwriting adaptation algorithm for Arabic reported in the literature. We discuss an iterative bootstrapping model which adapts a writer-independent model to a writer-dependent model using a small number of words, achieving a large increase in recognition rate in the process. Furthermore, we describe a confidence weighting method which generates better results by weighting words based on their length. We also discuss script features unique to Arabic, and how we incorporate them into our adaptation process. Even though Arabic has many more character classes than languages such as English, significant improvement was observed. The testing set, consisting of about 100 pages of handwritten text, had an initial average overall recognition rate of 67%. After the basic adaptation was finished, the overall recognition rate was 73.3%. As the improvement was most marked for the longer words, and the set of confidently recognized longer words contained many fewer false results, a second method was presented using them alone, resulting in a recognition rate of about 75%. Initially, these words had a 69.5% recognition rate, improving to about a 92% recognition rate after adaptation. A novel hybrid method is presented with a rate of about 77.2%.
Whole-book recognition using mutual-entropy-driven model adaptation
We describe an approach to unsupervised high-accuracy recognition of the textual contents of an entire book using fully automatic mutual-entropy-based model adaptation. Given images of all the pages of a book together with approximate models of image formation (e.g. a character-image classifier) and linguistics (e.g. a word-occurrence probability model), we detect evidence for disagreements between the two models by analyzing the mutual entropy between two kinds of probability distributions: (1) the a posteriori probabilities of character classes (the recognition results from image classification alone), and (2) the a posteriori probabilities of word classes (the recognition results from image classification combined with linguistic constraints). The most serious of these disagreements are identified as candidates for automatic corrections to one or the other of the models. We describe a formal information-theoretic framework for detecting model disagreement and for proposing corrections. We illustrate this approach on a small test case selected from real book-image data. This reveals that a sequence of automatic model corrections can drive improvements in both models, and can achieve a lower recognition error rate. The importance of considering the contents of the whole book is motivated by a series of studies, over the last decade, showing that isogeny can be exploited to achieve unsupervised improvements in recognition accuracy.
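As a concrete reading of the disagreement test described above, the sketch below scores one character position by the cross entropy between the image-only posterior and the linguistically constrained posterior; the exact mutual-entropy definition used in the paper may differ, and the distributions shown are hypothetical.

```python
# Minimal sketch (my reading of the abstract, not the authors' code): measuring
# disagreement between two posterior distributions over character classes for a
# single character position -- p_img from the image classifier alone, and
# p_lang from image classification plus linguistic constraints.

import math

def cross_entropy(p_lang, p_img, eps=1e-12):
    """H(p_lang, p_img) = -sum_c p_lang(c) * log p_img(c)."""
    return -sum(p * math.log(p_img.get(c, eps) + eps) for c, p in p_lang.items())

# Hypothetical posteriors for one character position.
p_img  = {"o": 0.55, "0": 0.40, "c": 0.05}   # image classifier alone
p_lang = {"o": 0.95, "0": 0.03, "c": 0.02}   # with word-level constraints

score = cross_entropy(p_lang, p_img)
print(f"disagreement score: {score:.3f}")    # large values flag candidate corrections
```

Positions with the largest disagreement scores would be the candidates for correcting either the image model or the linguistic model.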
Image Processing and Enhancement
Interactive evolutionary computing for the binarization of degenerated handwritten images
The digital cleaning of dirty and old documents and their binarization into a black/white image can be a tedious process, and it is usually done by experts. In this article, a method is presented that is easy for the end user: untrained persons are now able to do this task, whereas previously an expert was needed. The method uses interactive evolutionary computing to program the image processing operations that act on the document image.
Correlating degradation models and image quality metrics
OCR often performs poorly on degraded documents. One approach to improving performance is to determine a good filter to improve the appearance of the document image before sending it to the OCR engine. Quality metrics have been measured in document images to determine what type of filtering would most likely improve the OCR response for that document image. In this paper, those same quality metrics are measured for several word images degraded by known parameters in a document degradation model, and the correlation between the degradation model parameters and the quality metrics is measured. High correlations appear in many of the expected places, but they are also absent from some expected places, which offers a comparison of the quality metric definitions proposed by different authors.
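As a minimal illustration of the kind of measurement involved, the sketch below computes the Pearson correlation between one degradation parameter and one quality metric over a set of word images; the parameter, the metric, and the numbers are hypothetical.

```python
# Minimal sketch with hypothetical numbers (not the paper's data): Pearson
# correlation between one degradation-model parameter (e.g. a blur width) and
# one image quality metric measured on the degraded word images.

import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

blur_width     = [0.5, 1.0, 1.5, 2.0, 2.5]       # degradation parameter
edge_sharpness = [0.92, 0.81, 0.65, 0.52, 0.40]  # a quality metric (hypothetical)
print(f"r = {pearson(blur_width, edge_sharpness):+.2f}")  # strongly negative here
```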
Ensemble LUT classification for degraded document enhancement
The fast evolution of scanning and computing technologies has led to the creation of large collections of scanned paper documents. Examples of such collections include historical collections, legal depositories, medical archives, and business archives. Moreover, in many situations such as legal litigation and security investigations, scanned collections are being used to facilitate systematic exploration of the data. It is almost always the case that scanned documents suffer from some form of degradation. Large degradations make documents hard to read and substantially deteriorate the performance of automated document processing systems. Enhancement of degraded document images is normally performed assuming global degradation models. When the degradation is large, global degradation models do not perform well. In contrast, we propose to estimate local degradation models and use them in enhancing degraded document images. Using a semi-automated enhancement system we have labeled a subset of the Frieder diaries collection.1 This labeled subset was then used to train an ensemble classifier. The component classifiers are based on lookup tables (LUT) in conjunction with the approximated nearest neighbor algorithm. The resulting algorithm is highly efficient. Experimental evaluation results are provided using the Frieder diaries collection.1
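The abstract does not detail the component classifiers, but a LUT classifier over local pixel neighborhoods is a common construction; the sketch below shows one such table trained on labeled patches. The 3x3 neighborhood size and the majority-vote rule are assumptions, and the ensemble and approximate nearest-neighbor stages are omitted.

```python
# Minimal sketch (assumptions, not the authors' implementation): one lookup-table
# (LUT) classifier that predicts the cleaned value of a pixel from its binarized
# 3x3 neighborhood. The paper's method combines several such classifiers.

from collections import Counter, defaultdict

def patch_key(patch):
    """Encode a 3x3 binary patch (tuple of 9 values in {0, 1}) as an integer."""
    key = 0
    for bit in patch:
        key = (key << 1) | bit
    return key

def train_lut(patches, labels):
    """patches: iterable of 3x3 binary patches; labels: cleaned center pixels."""
    votes = defaultdict(Counter)
    for patch, label in zip(patches, labels):
        votes[patch_key(patch)][label] += 1
    return {k: c.most_common(1)[0][0] for k, c in votes.items()}   # majority vote

def classify(lut, patch, default=1):
    return lut.get(patch_key(patch), default)   # default: keep background (white)
```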
Automatic removal of crossed-out handwritten text and the effect on writer verification and identification
Axel Brink, Harro van der Klauw, Lambert Schomaker
A method is presented for automatically identifying and removing crossed-out text in off-line handwriting. It classifies connected components by simply comparing two scalar features with thresholds. The performance is quantified based on manually labeled connected components of 250 pages of a forensic dataset. 47% of connected components consisting of crossed-out text can be removed automatically while 99% of the normal text components are preserved. The influence of automatically removing crossed-out text on writer verification and identification is also quantified. This influence is not significant.
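The classification rule in the abstract, two scalar features compared with thresholds, is simple enough to sketch directly; the feature names and threshold values below are hypothetical stand-ins, not the paper's.

```python
# Minimal sketch (feature names and thresholds are hypothetical, not the paper's):
# classify a connected component as crossed-out text by comparing two scalar
# features against fixed thresholds, as the abstract describes.

def is_crossed_out(component, density_threshold=0.45, branch_threshold=8):
    """component: dict with precomputed scalar features of one connected component."""
    dense = component["ink_density"] > density_threshold     # heavily inked
    tangled = component["branch_points"] > branch_threshold  # many stroke crossings
    return dense and tangled

# Usage on a hypothetical component:
cc = {"ink_density": 0.52, "branch_points": 11}
print(is_crossed_out(cc))   # True -> candidate for removal
```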
Segmentation I
A mixed approach to book splitting
Liangcai Gao, Zhi Tang
In this paper, we present a hybrid approach to splitting a book document into individual chapters. We use multiple sources of information to obtain a reliable assessment of the chapter title pages. These sources are produced by four methods: blank space detection, font analysis, header and footer association, and table of contents (TOC) analysis. Finally, a combination component is used to score potential chapter title pages and select the best candidates. This approach takes full advantage of various kinds of information such as page headers and footers, layout, and keywords. It works well even without the TOC information that most similar previous approaches depend on. Experiments show that this approach is robust and reliable.
Robust line segmentation for handwritten documents
Kamal Kuzhinjedathu, Harish Srinivasan, Sargur Srihari
Line segmentation is the first and the most critical pre-processing step for a document recognition/analysis task. Complex handwritten documents with lines running into each other pose a great challenge for the line segmentation problem due to the absence of online stroke information. This paper describes a method to disentangle lines running into each other by splitting and associating the correct character strokes to the appropriate lines. The proposed method can be used along with the existing algorithm1 that identifies such overlapping lines in documents. A stroke tracing method is used to intelligently segment the overlapping components. The method uses slope and curvature information of the stroke to disambiguate the course of the stroke at cross points. Once the overlapping components are segmented into strokes, a statistical method is used to associate the strokes with the appropriate lines.
Line-touching character recognition based on dynamic reference feature synthesis
In recognizing characters written on forms, it often happens that characters overlap with pre-printed form lines. In order to recognize overlapped characters, removal of the line and restoration of the broken character strokes caused by line removal are generally conducted. But it is not easy to restore the broken character strokes accurately, especially when the direction of the line and the character stroke are almost the same. In this paper, a novel recognition method for line-touching characters without line removal is proposed in order to avoid the difficulty of the stroke restoration problem. A line-touching character is recognized as a whole by matching it with reference character features which include a line feature. The reference features are synthesized dynamically from a character feature and a line feature based on the touching condition of an input line-touching character string. We compared the performance of the proposed method with a conventional method in which a touching line is removed while leaving the overlapped character stroke by mathematical morphology. Experimental results show that the proposed method achieves a 96.26% character recognition rate, whereas the conventional method achieves 92.77%.
Word segmentation of off-line handwritten documents
Word segmentation is the most critical pre-processing step for any handwritten document recognition and/or retrieval system. When the writing style is unconstrained (written in a natural manner), recognition of individual components may be unreliable, so they must be grouped together into word hypotheses before recognition algorithms can be used. This paper describes a gap-metrics-based machine learning approach to separating a line of unconstrained handwritten text into words. Our approach uses a set of both local and global features, which is motivated by the ways in which human beings perform this kind of task. In addition, in order to overcome the disadvantages of different distance computation methods, we propose a combined distance measure computed using three different methods. The classification is done using a three-layer neural network. The algorithm is evaluated using an unconstrained handwriting database that contains 50 pages (1,026 lines, 7,562 word images) of handwritten documents. The overall accuracy is 90.8%, which shows better performance than a previous method.
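As a small illustration of combining gap measures, the sketch below mixes three inter-component distances into one feature for the word-boundary classifier; the choice of component distances and the weights are assumptions, since the abstract does not specify them.

```python
# Minimal sketch (component distances and weights are assumptions, not the
# paper's exact definitions): combining several inter-component gap measures
# into one feature for deciding whether a gap is a word boundary.

def bounding_box_gap(left_box, right_box):
    """Horizontal gap between two bounding boxes (x_min, y_min, x_max, y_max)."""
    return max(0, right_box[0] - left_box[2])

def combined_gap(left_box, right_box, convex_hull_gap, run_length_gap,
                 weights=(0.4, 0.3, 0.3)):
    """Weighted combination of three gap measures for one adjacent pair."""
    gaps = (bounding_box_gap(left_box, right_box), convex_hull_gap, run_length_gap)
    return sum(w * g for w, g in zip(weights, gaps))

# Usage with hypothetical geometry (other two gaps assumed precomputed):
left, right = (10, 5, 40, 30), (55, 6, 90, 31)
print(combined_gap(left, right, convex_hull_gap=12.0, run_length_gap=14.0))
```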
Invited Presentation
The OCRopus open source OCR system
OCRopus is a new, open source OCR system emphasizing modularity, easy extensibility, and reuse, aimed at both the research community and large scale commercial document conversions. This paper describes the current status of the system, its general architecture, as well as the major algorithms currently being used for layout analysis and text line recognition.
Classification and Recognition II
Measuring the impact of character recognition errors on downstream text analysis
Noise presents a serious challenge in optical character recognition, as well as in the downstream applications that make use of its outputs as inputs. In this paper, we describe a paradigm for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Employing a hierarchical methodology based on approximate string matching for classifying errors, their cascading effects as they travel through the pipeline are isolated and analyzed. We present experimental results based on injecting single errors into a large corpus of test documents to study their varying impacts depending on the nature of the error and the character(s) involved. While most such errors are found to be localized, in the worst case some can have an amplifying effect that extends well beyond the site of the original error, thereby degrading the performance of the end-to-end system.
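The sketch below illustrates the measurement paradigm at its smallest scale: inject a single character error into a sentence and count how many downstream tokens change. The tokenizer and the example error are illustrative only, not the paper's pipeline.

```python
# Minimal sketch (illustrative only, not the paper's pipeline): inject a single
# OCR-style character error and measure its effect on a simple tokenizer, one of
# the downstream stages the abstract mentions.

import re

def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

clean = "Dr. Smith arrived. He sat down."
noisy = "Dr. Smith arrived, He sat down."   # single substitution: '.' -> ','

clean_toks, noisy_toks = tokenize(clean), tokenize(noisy)
changed = sum(a != b for a, b in zip(clean_toks, noisy_toks)) + \
          abs(len(clean_toks) - len(noisy_toks))
print(f"tokens affected by one character error: {changed}")
```

The same comparison applied to sentence boundary detection and part-of-speech tagging outputs is what reveals whether an error stays localized or propagates.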
Online writer identification using character prototypes distributions
Siew Keng Chan, Christian Viard-Gaudin, Yong Haur Tay
Writer identification is a process which aims to identify the writer of a given handwritten document. Its implementation is needed in applications such as forensic document analysis and document retrieval which involve the use of offline handwritten documents. With the recent advances of technology, the invention of digital pen and paper has extended the field of writer identification to cover online handwritten documents. In this communication, a methodology is proposed to solve the problem of text-independent writer identification using online handwritten documents. The proposed methodology strives to identify the writer of a given handwritten document regardless of its text content by comparing his or her handwriting with the samples stored in a reference database. The output of this process is a ranked list of the writers whose handwritings are stored in the reference database. The main idea is to use a distance measurement between the distributions of reference patterns defined at the character level. Very few, if any, attempts have been made at this character level. Two sets of handwritten document databases, each with 82 online documents contributed by 82 subjects, were used in the experiments. The reported Top-1 accuracy was 95%; only four writers were identified wrongly, with the correct writer returned as the 2nd, 4th, 5th, and 12th choice, respectively.
Stroke frequency descriptors for handwriting-based writer identification
Bart Dolega, Gady Agam, Shlomo Argamon
Writer identification in offline handwritten documents is a difficult task with multiple applications such as authentication, identification, and clustering in document collections. For example, in the context of content-based document image retrieval, given a document with handwritten annotations it is possible to determine whether the comments were added by a specific individual and find other documents annotated by the same person. In contrast to online writer identification, in which temporal stroke information is available, such information is not readily available in offline writer identification. The base approach and the main contribution of our work is the idea of using canonical stroke frequency descriptors derived from handwritten text to identify writers. We show that a relatively small set of canonical strokes can be successfully employed for generating discriminative frequency descriptors. Moreover, we show that by using frequency descriptors alone it is possible to perform writer identification with a success rate comparable to the known state of the art in offline writer identification, at close to 90% accuracy. As frequency descriptors are independent of existing descriptors, the performance of offline writer identification may be improved by combining both standard and frequency descriptors. Experimental evaluation with quantitative performance evaluation is provided using the IAM dataset.1
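A frequency descriptor of this kind is essentially a normalized histogram over a small vocabulary of canonical stroke types. The sketch below builds such a descriptor per document and compares two documents with a chi-square distance; the stroke categories, the counts, and the distance choice are assumptions rather than the authors' exact design.

```python
# Minimal sketch (stroke categories and distance choice are assumptions, not the
# authors' canonical stroke set): a per-document stroke frequency descriptor and
# a chi-square comparison between two documents.

def frequency_descriptor(stroke_counts, n_types):
    """stroke_counts: dict {stroke_type_id: count} for one document."""
    total = sum(stroke_counts.values()) or 1
    return [stroke_counts.get(t, 0) / total for t in range(n_types)]

def chi_square(p, q, eps=1e-9):
    return sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))

doc_a = frequency_descriptor({0: 40, 1: 25, 2: 10, 3: 5}, n_types=4)
doc_b = frequency_descriptor({0: 38, 1: 27, 2: 11, 3: 4}, n_types=4)
print(f"writer distance: {chi_square(doc_a, doc_b):.4f}")  # small -> likely same writer
```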
Segmentation II
Address block localization based on graph theory
Djamel Gaceb, Véronique Eglin, Frank Lebourgeois, et al.
An efficient mail sorting system is mainly based on accurate optical recognition of the addresses on the envelopes. However, address block localization (ABL) must be done before the OCR process. The localization step is crucial, as it has a great impact on the overall performance of the system: a good localization step leads to a better recognition rate. The limitation of current methods is mainly caused by the modular linear architectures used for ABL: their performance depends greatly on the performance of each independent module. In this paper we present a new approach to ABL based on a pyramidal data organization and on hierarchical graph coloring for the classification process. This new approach has the advantage of guaranteeing good coherence between the different modules, and it reduces both the computation time and the rejection rate. The proposed method achieves a very satisfactory rate of 98% correct localizations on a set of 750 envelope images.
Versatile page numbering analysis
Hervé Déjean, Jean-Luc Meunier
In this paper, we revisit the problem of detecting the page numbers of a document. This work is motivated by the need for a generic method that applies to a large variety of documents, as well as the need for analyzing the document's page numbering scheme rather than spotting one number per page. We propose here a novel method, based on the notion of sequence, which goes beyond any previously described work, and we report on an extensive evaluation of its performance.
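The sequence-based method itself is not spelled out in the abstract; as one possible reading, the sketch below searches the numeric tokens found on consecutive pages for the assignment that best forms a step-one arithmetic progression. For simplicity the candidate sequence is anchored at the first page, which the actual method need not assume.

```python
# Minimal sketch (an assumption about what a "sequence" check might look like,
# not the authors' algorithm): pick, among numeric tokens found on each page,
# the assignment that best forms a +1 arithmetic progression.

def best_numbering(pages):
    """pages: list of sets of integers found on each physical page.
    Returns one chosen number per page (or None) forming a +1 progression."""
    chosen = [None] * len(pages)
    for start in (pages[0] if pages else set()):
        candidate = [start + i if (start + i) in nums else None
                     for i, nums in enumerate(pages)]
        if sum(v is not None for v in candidate) > sum(v is not None for v in chosen):
            chosen = candidate
    return chosen

pages = [{3, 1987}, {4}, {2007, 5}, set(), {7}]
print(best_numbering(pages))   # [3, 4, 5, None, 7] -- the unnumbered page is skipped
```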
Segmentation-based retrieval of document images from diverse collections
We describe a methodology for retrieving document images from large, extremely diverse collections. First we perform content extraction, that is, the location and measurement of regions containing handwriting, machine-printed text, photographs, blank space, etc., in documents represented as bilevel, greylevel, or color images. Recent experiments have shown that even modest per-pixel content classification accuracies can support usefully high recall and precision rates (of, e.g., 80-90%) for retrieval queries within document collections seeking pages that contain a fraction of a certain type of content. When the distribution of content and error rates are uniform across the entire collection, it is possible to derive IR measures from classification measures and vice versa. Our largest experiments to date, consisting of 80 training images totaling over 416 million pixels, are presented to illustrate these conclusions. This data set is more representative than previous experiments, containing a more balanced distribution of content types. The data set also contains images of text obtained from handheld digital cameras, and the success of existing methods (with no modification) in classifying these images is discussed. Initial experiments in discriminating line art from the four classes mentioned above are also described. We also discuss methodological issues that affect both ground-truthing and evaluation measures.
Transcript mapping for handwritten English documents
Transcript mapping, or text alignment with handwritten documents, is the automatic alignment of words in a text file with word images in a handwritten document. Such a mapping has several applications in fields ranging from machine learning, where large quantities of truth data are required for evaluating handwriting recognition algorithms, to data mining, where word image indexes are used in ranked retrieval of scanned documents in a digital library. The alignment also aids "writer identity" verification algorithms. Interfaces which display scanned handwritten documents may use this alignment to highlight manuscript tokens when a person examines the corresponding transcript word. We propose an adaptation of the True DTW dynamic programming algorithm for English handwritten documents. Our primary contribution is the integration, as the cost metric in the DTW algorithm, of the dissimilarity scores from a word-model word recognizer with the Levenshtein distance between the recognized word and the lexicon word, leading to a fast and accurate alignment. The results provided confirm the effectiveness of our approach.
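The sketch below shows a plain DTW alignment whose local cost mixes a recognizer dissimilarity score with a length-normalized Levenshtein distance, in the spirit of the cost metric described above; the weighting, the normalization, and the `recognize` interface are assumptions, not the paper's exact formulation.

```python
# Minimal sketch (the cost weighting is an assumption, not the paper's exact
# metric): DTW alignment between transcript words and word images, where the
# local cost mixes a recognizer dissimilarity score with the Levenshtein
# distance between the recognizer's top hypothesis and the transcript word.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def align_cost(transcript, images, recognize, alpha=0.5):
    """recognize(image) -> (top_hypothesis, dissimilarity_score in [0, 1])."""
    hyps = [recognize(img) for img in images]
    INF = float("inf")
    n, m = len(transcript), len(images)
    dtw = [[INF] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            word, (hyp, score) = transcript[i - 1], hyps[j - 1]
            lev = levenshtein(word, hyp) / max(len(word), len(hyp), 1)
            cost = alpha * score + (1 - alpha) * lev
            dtw[i][j] = cost + min(dtw[i - 1][j], dtw[i][j - 1], dtw[i - 1][j - 1])
    return dtw[n][m]   # backtracking through dtw recovers the word-to-image mapping
```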
Information Extraction and Document Retrieval
Word mining in a sparsely labeled handwritten collection
Word-spotting techniques are usually based on detailed modeling of target words, followed by a search for the locations of such a target word in images of handwriting. In this study, the focus is on deciding on the presence of target words in lines of text, disregarding their horizontal position. Line strips are modeled with a Bag-of-Glyphs approach based on a self-organizing map. This approach uses the presence of fragmented connected-component shapes (glyphs) in a line strip to characterize that text passage, similar to the Bag-of-Words approach for 'ASCII'-encoded documents in regular Information Retrieval. Subsequently, a support-vector machine is trained on the presence of a word or word category in an iterative setup which involves an active group of users. Results are promising for a large proportion of words and depend both on the amount of labeled lines and on shape uniqueness. Particularly useful is the ability to train on abstract content classes such as proper names, municipalities, or word-bigram presence in the line-strip images.
An OCR based approach for word spotting in Devanagari documents
Anurag Bhardwaj, Suryaprakash Kompalli, Srirangaraj Setlur, et al.
This paper describes an OCR-based technique for word spotting in Devanagari printed documents. The system accepts a Devanagari word as input and returns a sequence of word images that are ranked according to their similarity with the input query. The methodology involves line and word separation, pre-processing document words, word recognition using OCR and similarity matching. We demonstrate a Block Adjacency Graph (BAG) based document cleanup in the pre-processing phase. During word recognition, multiple recognition hypotheses are generated for each document word using a font-independent Devanagari OCR. The similarity matching phase uses a cost based model to match the word input by a user and the OCR results. Experiments are conducted on document images from the publicly available ILT and Million Book Project dataset. The technique achieves an average precision of 80% for 10 queries and 67% for 20 queries for a set of 64 documents containing 5780 word images. The paper also presents a comparison of our method with template-based word spotting techniques.
Extracting a sparsely located named entity from online HTML medical articles using support vector machine
Jie Zou, Daniel Le, George R. Thoma
We describe a statistical machine learning method for extracting databank accession numbers (DANs) from online medical journal articles. Because the DANs are sparsely located in the articles, we take a hierarchical approach. The HTML journal articles are first segmented into zones according to text and geometric features. The zones are then classified as DAN zones or other zones by an SVM classifier. A set of heuristic rules is applied to the candidate DAN zones to extract DANs according to their edit distances to the DAN formats. An evaluation shows that the proposed method can achieve a very high recall rate (above 99%) and a significantly better precision rate compared to extraction through brute-force regular expression matching.
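One simple way to realize an "edit distance to a format" test is to compare the character-class pattern of a candidate string against a reference pattern, as sketched below. The reference format shown is purely hypothetical and is not an actual accession-number format, and difflib's similarity ratio is used here in place of a raw edit distance.

```python
# Minimal sketch (the format template is hypothetical, not a real databank
# accession-number format): score a candidate by how closely its character-class
# pattern matches a reference pattern.

import difflib

def char_class(c):
    return "A" if c.isalpha() else ("9" if c.isdigit() else c)

def pattern(s):
    return "".join(char_class(c) for c in s)

def format_similarity(candidate, reference_pattern):
    """Similarity in [0, 1] between the candidate's character-class pattern
    and a reference format pattern."""
    return difflib.SequenceMatcher(None, pattern(candidate), reference_pattern).ratio()

reference = "AA999999"   # hypothetical "two letters + six digits" format
print(format_similarity("AB123456", reference))   # 1.0 -> strong candidate
print(format_similarity("Fig.2b", reference))     # low  -> reject
```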
Exploring use of images in clinical articles for decision support in evidence-based medicine
Sameer Antani, Dina Demner-Fushman M.D., Jiang Li, et al.
Essential information is often conveyed pictorially (images, illustrations, graphs, charts, etc.) in biomedical publications. A clinician's decision to access the full text when searching for evidence in support of a clinical decision is frequently based solely on a short bibliographic reference. We seek to automatically augment these references with images from the article that may assist in finding evidence. In a previous study, the feasibility of automatically classifying images by usefulness (utility) in finding evidence was explored using supervised machine learning; it achieved 84.3% accuracy using image captions for modality and 76.6% accuracy combining captions and image data for utility on 743 images from two years of articles in a clinical journal. Our results indicated that automatic augmentation of bibliographic references with relevant images was feasible. Other research in this area has found an improved user experience when images are shown in addition to the short bibliographic reference. However, the multi-panel images used in our study had to be manually pre-processed for image analysis, and all image-text on figures was ignored. In this article, we report on methods developed for automatic multi-panel image segmentation using not only image features but also clues from text analysis applied to figure captions. In initial experiments on 516 figure images we obtained 95.54% accuracy in correctly identifying and segmenting the sub-images. The errors were flagged as disagreements with the automatic parsing of figure caption text, allowing for supervised segmentation. For localizing text and symbols, on a randomly selected test set of 100 single-panel images our methods reported, on average, precision and recall of 78.42% and 89.38%, respectively, with an accuracy of 72.02%.
Interactive Paper and Symposium Demonstration Session-Tuesday
Model-based document categorization employing semantic pattern analysis and local structure clustering
Kosei Fume, Yasuto Ishitani
We propose a document categorization method based on a document model that can be defined externally for each task and that categorizes Web content or business documents into a target category in accordance with their similarity to the model. The main feature of the proposed method consists of two aspects of semantics extraction from an input document. The semantics of terms are extracted by semantic pattern analysis, and implicit meanings of the document substructure are specified by a bottom-up text clustering technique focusing on the similarity of text line attributes. We have constructed a system based on the proposed method for trial purposes. The experimental results show that the system achieves more than 80% classification accuracy in categorizing Web content and business documents into 15 or 70 categories.
Large scale parallel document image processing
Building a system which allows searching a very large database of document images requires professionalization of hardware and software, e-science, and web access. In astrophysics there is ample experience in dealing with large data sets due to an increasing number of measurement instruments. The digitization of historical documents of the Dutch cultural heritage is a similar problem. This paper discusses the use of a system developed at the Kapteyn Institute of Astrophysics for the processing of large data sets, applied to the problem of creating a very large searchable archive of connected cursive handwritten texts. The system is adapted to the specific needs of processing document images. It shows that interdisciplinary collaboration can be beneficial in the context of machine learning, data processing, and the professionalization of image processing and retrieval systems.
A mixed approach to auto-detection of page body
Liangcai Gao, Zhi Tang, Ruiheng Qiu
The page body holds the central information of a page in most documents. This paper addresses the problem of automatically detecting the page body area in digital books or journals. A novel method based on font expansion and header and footer elimination is detailed. This method first extracts the body text font (BFont) and the headers and footers from a document, and then draws two page body bounding boxes for each page, one by analyzing the distribution of the BFont in pages and the other by removing headers and footers from pages. Finally, the two bounding boxes are combined to obtain the resultant page body bounding box. The test results demonstrate a very high recognition rate: up to 99.49% precision.
Extracting curved text lines using the chain composition and the expanded grouping method
In this paper, we present a method to extract the text lines in poorly structured documents. The text lines may have different orientations and considerably curved shapes, and there may be a few wide inter-word gaps in a text line. Such text lines can be found in posters, blocks of addresses, and artistic documents. Our method is an expansion of traditional perceptual grouping. We develop novel solutions to overcome the problems of insufficient seed points and varied orientations within a single line. In this paper, we assume that text lines consist of connected components, where each connected component is a set of black pixels within a letter or several touching letters. In our scheme, connected components closer than an iteratively incremented threshold are combined to make chains of connected components. Elongated chains are identified as the seed chains of lines. The seed chains are then extended to the left and the right according to the local orientations, which are reevaluated at each side of a chain as it is extended. By this process, all text lines are finally constructed. The advantage of the proposed method over prior work on the extraction of curved text lines is that it can both deal with more than one specific language and extract text lines containing wide inter-word gaps. The proposed method performs well in extracting considerably curved text lines from logos and slogans in our experiment: 98% and 94% for straight-line extraction and curved-line extraction, respectively.
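The chain composition step lends itself to a short sketch: merge connected components whose centroids lie within a distance threshold using union-find. This is a simplification; the full method increments the threshold iteratively and keeps elongated chains as seed chains, which is not shown here.

```python
# Minimal sketch (a simplification, not the authors' full scheme): merge
# connected components whose centroids are closer than a distance threshold into
# chains with union-find. The full method repeats this with an iteratively
# incremented threshold and keeps elongated chains as seed chains of text lines.

import math

def chain_components(centroids, threshold):
    """centroids: list of (x, y) component centroids. Returns one chain id per component."""
    parent = list(range(len(centroids)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if math.dist(centroids[i], centroids[j]) <= threshold:
                parent[find(i)] = find(j)   # union the two chains

    return [find(i) for i in range(len(centroids))]

# The first three components form one chain; the last stays separate.
print(chain_components([(0, 0), (8, 1), (16, 2), (60, 50)], threshold=10))
```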
Achieving high recognition reliability using decision trees and AdaBoost
Jianying Xiang, Xiao Tu, Yue Lu, et al.
Recognition rate is traditionally used as the main criterion for evaluating the performance of a recognition system. High recognition reliability with low misclassification rate is also a must for many applications. To handle the variability of the writing style of different individuals, this paper employs decision trees and WRB AdaBoost to design a classifier with high recognition reliability for recognizing Bangla handwritten numerals. Experiments on the numeral images obtained from real Bangladesh envelopes show that the proposed recognition method is capable of achieving high recognition reliability with acceptable recognition rate.
A generic method for structure recognition of handwritten mail documents
This paper presents a system to extract the logical structure of handwritten mail documents. It consists of two joint tasks: the segmentation of documents into blocks and the labeling of those blocks. The main label classes considered are: addressee details, sender details, date, subject, text body, and signature. This work has to face the difficulties of unconstrained handwritten documents: variable structure and writing. We propose a method based on a geometric analysis of the arrangement of elements in the document. We give a description of the document using a two-dimensional grammatical formalism, which makes it possible to easily introduce knowledge about mail into a generic parser. Our grammatical parser is LL(k), which means several combinations are tried before the correct one is extracted. The main interest of this approach is that we can deal with loosely structured documents. Moreover, as the segmentation into blocks often depends on the associated classes, our method is able to retry a different segmentation until labeling succeeds. We validated this method in the context of the French national project RIMES, which proposed a contest on a large base of documents. We obtain a recognition rate of 91.7% on 1150 images.
Hybrid approach combining contextual and statistical information for identifying MEDLINE citation terms
In Cheol Kim, Daniel X. Le, George R. Thoma
There is a strong demand for automated tools for extracting pertinent information from the biomedical literature, which is a rich, complex, and dramatically growing resource that is increasingly accessed via the web. This paper presents a hybrid method based on contextual and statistical information to automatically identify two MEDLINE citation terms, NIH grant numbers and databank accession numbers, from HTML-formatted online biomedical documents. Their detection is challenging due to many variations and inconsistencies in their format (although recommended formats exist), and also because of their similarity to other technical or biological terms. Our proposed method first extracts potential candidates for these terms using a rule-based method. These are scored, and the final candidates are submitted to a human operator for verification. The confidence score for each term is calculated using statistical information as well as morphological and contextual information. Experiments conducted on more than ten thousand HTML-formatted online biomedical documents show that most NIH grant numbers and databank accession numbers can be successfully identified by the proposed method, with recall rates of 99.8% and 99.6%, respectively. However, owing to the high false alarm rate, the proposed method yields F-measure rates of 86.6% and 87.9% for NIH grants and databanks, respectively.
Form classification
The problem of form classification is to assign a single-page form image to one of a set of predefined form types or classes. We classify the form images using low-level pixel density information from the binary images of the documents. In this paper, we solve the form classification problem with a classifier based on the k-means algorithm, supported by adaptive boosting. Our classification method is tested on the NIST scanned tax forms databases (special forms databases 2 and 6), which include machine-typed and handwritten documents. Our method improves on published results for the same databases, while still using a simple set of image features.
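The sketch below shows the flavor of such low-level features and a nearest-centroid decision against per-class centroids, the kind a k-means-based classifier would use; the grid size, the distance, and the omission of the boosting stage are simplifications of the paper's method.

```python
# Minimal sketch (grid size and distance are assumptions, not the paper's exact
# setup): grid pixel-density features for a binary form image and a
# nearest-centroid decision against per-class centroids such as k-means produces.

import numpy as np

def density_features(binary_img, grid=(8, 8)):
    """binary_img: 2-D numpy array of {0, 1}. Returns per-cell ink densities."""
    h, w = binary_img.shape
    gh, gw = grid
    feats = np.zeros(gh * gw)
    for r in range(gh):
        for c in range(gw):
            cell = binary_img[r * h // gh:(r + 1) * h // gh,
                              c * w // gw:(c + 1) * w // gw]
            feats[r * gw + c] = cell.mean()
    return feats

def classify(feats, centroids):
    """centroids: dict {form_class: feature_vector}. Returns the nearest class."""
    return min(centroids, key=lambda k: np.linalg.norm(feats - centroids[k]))
```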
Interactive degraded document enhancement and ground truth generation
G. Bal, G. Agam, O. Frieder, et al.
Degraded documents are frequently obtained in various situations. Examples of degraded document collections include historical document depositories, documents obtained in legal and security investigations, and legal and medical archives. Degraded document images are hard to read and hard to analyze using computerized techniques. There is hence a need for systems that are capable of enhancing such images. We describe a language-independent semi-automated system for enhancing degraded document images that is capable of exploiting inter- and intra-document coherence. The system is capable of processing document images with high levels of degradation and can be used for ground truthing of degraded document images. Ground truthing of degraded document images is extremely important in several respects: it enables quantitative performance measurements of enhancement systems and facilitates model estimation that can be used to improve performance. Performance evaluation is provided using the historical Frieder diaries collection.1
Efficient implementation of local adaptive thresholding techniques using integral images
Adaptive binarization is an important first step in many document analysis and OCR processes. This paper describes a fast adaptive binarization algorithm that yields the same quality of binarization as the Sauvola method,1 but runs in time close to that of global thresholding methods (like Otsu's method2), independent of the window size. The algorithm combines the statistical constraints of Sauvola's method with integral images.3 Testing on the UW-1 dataset demonstrates a 20-fold speedup compared to the original Sauvola algorithm.