Proceedings Volume 7534

Document Recognition and Retrieval XVII


View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 17 January 2010
Contents: 11 Sessions, 37 Papers, 0 Presentations
Conference: IS&T/SPIE Electronic Imaging 2010
Volume Number: 7534

Table of Contents


All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 7534
  • Invited Presentation I
  • Information Retrieval
  • Content Analysis
  • Text Line and Segmentation
  • Invited Presentation II
  • Document Image Processing
  • Recognition I
  • Recognition II
  • Document Structure Recognition
  • Interactive Paper Session
Front Matter: Volume 7534
This PDF file contains the front matter associated with SPIE Proceedings Volume 7534, including the Title Page, Copyright information, Table of Contents, and the Conference Committee listing.
Invited Presentation I
A general approach to discovering, registering, and extracting features from raster maps
Craig A. Knoblock, Ching-Chien Chen, Yao-Yi Chiang, et al.
Maps can be a great source of information for a given geographic region, but they can be difficult to find and even harder to process. A significant problem is that many interesting and useful maps are only available in raster format; even worse, many have been poorly scanned and are often compressed with lossy compression algorithms. Furthermore, for many of these maps there is no metadata providing the geographic coordinates, scale, or projection. Previous research on map processing has developed techniques that typically work on maps from a single map source. In contrast, we have developed a general approach to finding and processing street maps. This includes techniques for discovering maps online, extracting geographic and textual features from maps, using the extracted features to determine the geographic coordinates of the maps, and aligning the maps with imagery. The resulting system can find, register, and extract a variety of features from raster maps, which can then be used for various applications, such as annotating satellite imagery, creating and updating maps, or constructing detailed gazetteers.
Information Retrieval
Combining approaches to on-line handwriting information retrieval
In this work, we propose to combine two quite different approaches to retrieving handwritten documents. Our hypothesis is that different retrieval algorithms should retrieve different sets of documents for the same query, so significant improvements in retrieval performance can be expected from their combination. The first approach is based on information retrieval techniques applied to the noisy texts obtained through handwriting recognition, while the second approach is recognition-free, using a word spotting algorithm. Results show that for texts with a word error rate (WER) lower than 23%, the performance obtained with the combined system is close to that obtained on clean digital texts. In addition, for poorly recognized texts (WER > 52%), an improvement of nearly 17% can be observed with respect to the best available baseline method.
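The abstract does not specify how the two result lists are merged; as a rough illustration only, the following sketch shows a generic late-fusion rule (min-max normalization followed by a weighted sum). The function names, the weight w_ir, and the document identifiers are hypothetical, not the paper's combination scheme.

```python
# Hypothetical late fusion of two retrieval result lists (CombSUM-style).
# This is only a generic illustration of merging recognition-based IR scores
# with word-spotting scores; it is not the method used in the paper.

def min_max_normalize(scores):
    """Scale a {doc_id: score} dict to [0, 1] so the two systems are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def combine(ir_scores, spotting_scores, w_ir=0.5):
    """Weighted sum of normalized scores; missing documents contribute 0."""
    ir_n = min_max_normalize(ir_scores)
    sp_n = min_max_normalize(spotting_scores)
    docs = set(ir_n) | set(sp_n)
    return sorted(
        ((d, w_ir * ir_n.get(d, 0.0) + (1 - w_ir) * sp_n.get(d, 0.0)) for d in docs),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Example: two systems ranking three handwritten documents for one query.
print(combine({"doc1": 2.0, "doc2": 0.5}, {"doc2": 0.9, "doc3": 0.4}))
```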
A stacked sequential learning method for investigator name recognition from web-based medical articles
Xiaoli Zhang, Jie Zou, Daniel X. Le, et al.
"Investigator Names" is a newly required field in MEDLINE citations. It consists of personal names listed as members of corporate organizations in an article. Extracting investigator names automatically is necessary because of the increasing volume of articles reporting collaborative biomedical research in which a large number of investigators participate. In this paper, we present an SVM-based stacked sequential learning method in a novel application - recognizing named entities such as the first and last names of investigators from online medical journal articles. Stacked sequential learning is a meta-learning algorithm which can boost any base learner. It exploits contextual information by adding the predicted labels of the surrounding tokens as features. We apply this method to tag words in text paragraphs containing investigator names, and demonstrate that stacked sequential learning improves the performance of a nonsequential base learner such as an SVM classifier.
Numbered sequence detection in documents
Hervé Déjean
We present in this work a method for detecting numbered sequences in a document. The method relies on the following steps: first, all potential "numbered patterns" are automatically extracted from the document. Secondly, possible coherent sequences are built using pattern incrementality (called the incremental relation). Finally, possible wrong links between items are corrected using the notion of optimization context. An evaluation of the method is presented, and weaknesses and possible improvements are discussed.
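A much-simplified sketch of the core idea follows: candidate numbered items are extracted with a pattern and chained whenever the incremental relation (next number = previous + 1) holds. The regular expression and the chaining rule are illustrative assumptions; the paper's optimization-context correction step is not shown.

```python
# Simplified sketch: extract numbered items and chain them into sequences
# whenever each item increments the previous one by exactly one.
import re

ITEM = re.compile(r"^\s*(?:(?P<num>\d+)[.)]|\((?P<paren>\d+)\))\s+", re.MULTILINE)

def numbered_sequences(text):
    numbers = [int(m.group("num") or m.group("paren")) for m in ITEM.finditer(text)]
    sequences, current = [], []
    for n in numbers:
        if current and n == current[-1] + 1:
            current.append(n)               # the incremental relation holds
        else:
            if len(current) > 1:
                sequences.append(current)
            current = [n]
    if len(current) > 1:
        sequences.append(current)
    return sequences

print(numbered_sequences("1. intro\n2. method\n7 misc\n3. results\n"))  # [[1, 2, 3]]
```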
Date of birth extraction using precise shallow parsing
This paper presents the implementation and evaluation of a pattern-based program to extract date of birth information from OCR text. Although the program finds date of birth information with high precision and recall, this type of information extraction task seems to be negatively impacted by OCR errors.
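For illustration only, a shallow-parsing extractor of this kind might look like the sketch below. The trigger words and date patterns are assumptions, not the patterns used in the paper, and no OCR-error handling is included.

```python
# Illustrative shallow-parsing patterns for dates of birth in plain text.
import re

MONTH = r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|" \
        r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
DATE = rf"(?:{MONTH}\s+\d{{1,2}},?\s+\d{{4}}|\d{{1,2}}/\d{{1,2}}/\d{{4}})"
# Anchor on a "born"/"b." trigger so that only birth dates are extracted.
DOB = re.compile(rf"\b(?:born(?:\s+on)?|b\.)\s+(?P<dob>{DATE})", re.IGNORECASE)

def extract_dob(text):
    return [m.group("dob") for m in DOB.finditer(text)]

print(extract_dob("John Smith, born March 4, 1921 in Ohio; b. 3/4/1921."))
```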
Content Analysis
The aware toolbox for the detection of law infringements on web pages
Asif Shahab, Thomas Kieninger, Andreas Dengel
In the Aware project we aim to develop an automatic assistant for the detection of law infringements on web pages. The motivation for this project is that many authors of web pages are at some point infringing copyright or other laws, mostly without being aware of that fact, and are more and more often confronted with costly legal warnings. As the legal environment is constantly changing, an important requirement of Aware is that the domain knowledge can be maintained (and initially defined) by numerous legal experts working remotely without further assistance from computer scientists. Consequently, the software platform was chosen to be a web-based generic toolbox that can be configured to suit individual analysis experts, definitions of analysis flow, information gathering and report generation. The report generated by the system summarizes all critical elements of a given web page and provides case-specific hints to the page author, thus forming a new type of service. Regarding the analysis subsystems, Aware mainly builds on existing state-of-the-art technologies, whose usability has been evaluated for each intended task. In order to control the heterogeneous analysis components and to gather the information, a lightweight scripting shell has been developed. This paper describes the analysis technologies, ranging from text-based information extraction, through optical character recognition and phonetic fuzzy string matching, to a set of image analysis and retrieval tools, as well as the scripting language used to define the analysis flow.
On the usability and security of pseudo-signatures
Jin Chen, Daniel Lopresti
Handwriting has been proposed as a possible biometric for a number of years. However, recent work has shown that handwritten passphrases are vulnerable to both human-based and machine-based forgeries. Pseudo-signatures, as an alternative, are designed to thwart such attacks while still being easy for users to create, remember, and reproduce. In this paper, we briefly review the concept of pseudo-signatures, then describe an evaluation framework that considers aspects of both usability and security. We present results from preliminary experiments that examine user choice in creating pseudo-signatures and discuss the implications when sketching is used for generating cryptographic keys.
Time and space optimization of document content classifiers
Scaling up document-image classifiers to handle an unlimited variety of document and image types poses serious challenges to conventional trainable classifier technologies. Highly versatile classifiers demand representative training sets which can be dauntingly large: in investigating document content extraction systems, we have demonstrated the advantages of employing as many as a billion training samples in approximate k-nearest neighbor (kNN) classifiers sped up using hashed K-d trees. We report here on an algorithm, which we call online bin-decimation, for coping with training sets that are too big to fit in main memory, and we show empirically that it is superior to offline pre-decimation, which simply discards a large fraction of the training samples at random before constructing the classifier. The key idea of bin-decimation is to approximately enforce an upper bound on the number of training samples stored in each K-d hash bin; an adaptive statistical technique allows this to be accomplished online and in linear time, while reading the training data exactly once. An experiment on 86.7M training samples reveals a 23-times speedup with less than 0.1% loss of accuracy (compared to pre-decimation); or, for another value of the upper bound, a 60-times speedup with less than 5% loss of accuracy. We also compare it to four other related algorithms.
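The abstract does not detail the adaptive statistical technique, so the sketch below uses per-bin reservoir sampling as a stand-in: it caps the number of samples kept in each hash bin while streaming the training data exactly once. The hash_bin quantization, the cap, and the cell size are hypothetical choices for illustration, not the paper's algorithm.

```python
# Minimal sketch of the bin-decimation idea: bound the number of training
# samples retained per hash bin in a single pass over the data. Reservoir
# sampling per bin stands in for the paper's adaptive statistical technique.
import random
from collections import defaultdict

def hash_bin(features, cell=0.25):
    """Hypothetical K-d style hashing: quantize each feature into a grid cell."""
    return tuple(int(f // cell) for f in features)

def bin_decimate(stream, cap=1000, cell=0.25, seed=0):
    rng = random.Random(seed)
    bins = defaultdict(list)   # bin key -> retained (features, label) samples
    seen = defaultdict(int)    # bin key -> number of samples seen so far
    for features, label in stream:          # one pass over the training data
        key = hash_bin(features, cell)
        seen[key] += 1
        if len(bins[key]) < cap:
            bins[key].append((features, label))
        else:
            j = rng.randrange(seen[key])    # keep each sample with prob. cap/seen
            if j < cap:
                bins[key][j] = (features, label)
    return bins
```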
Detecting modifications in paper documents: a coding approach
Yogesh Sankarasubramaniam, Badri Narayanan, Kapali Viswanathan, et al.
This paper presents an algorithm called CIPDEC (Content Integrity of Printed Documents using Error Correction), which identifies any modifications made to a printed document. CIPDEC uses an error correcting code for accurate detection of addition/deletion of even a few pixels. A unique advantage of CIPDEC is that it works blind - it does not require the original document for such detection. Instead, it uses fiducial marks and error correcting code parities. CIPDEC is also robust to paper-world artifacts like photocopying, annotations, stains, folds, tears and staples. Furthermore, by working at a pixel level, CIPDEC is independent of language, font, software, and graphics that are used to create paper documents. As a result, any changes made to a printed document can be detected long after the software, font, and graphics have fallen out of use. The utility of CIPDEC is illustrated in the context of tamper-proofing of printed documents and ink extraction for form-filling applications.
Text Line and Segmentation
General text line extraction approach based on locally orientation estimation
Nazih Ouwayed, Abdel Belaïd, François Auger
This paper presents a novel approach for multi-oriented text line extraction from historical handwritten Arabic documents. Because of the multiple orientations of lines and their dispersion across the page, we use an image paving algorithm that can progressively and locally determine the lines. The paving algorithm is initialized with a small window whose size is then corrected by extension until enough lines and connected components are found. We use a snake (active contour) for line extraction. Once the paving is established, the orientation is determined using the Wigner-Ville distribution on the projection profile histogram. This local orientation is then extended to constrain the orientation in the neighborhood. Afterwards, the text lines are extracted locally in each zone, based on following the baselines and the proximity of connected components. Finally, connected components that overlap or touch across adjacent lines are separated, using a morphological analysis of the terminal letters of Arabic words. The proposed approach has been tested on 100 documents, reaching a separation accuracy of about 98.6%.
Semi-supervised learning for detecting text-lines in noisy document images
Zongyi Liu, Hanning Zhou
Document layout analysis is a key step in document image understanding, with wide applications in document digitization and reformatting. Identifying the correct layout from noisy scanned images is especially challenging. In this paper, we introduce a semi-supervised learning framework to detect text-lines from noisy document images. Our framework consists of three steps. The first step is an initial segmentation that extracts text-lines and images using simple morphological operations. The second step is a grouping-based layout analysis that identifies text-lines, image zones, column separators and vertical border noise; it is able to efficiently remove vertical border noise from multi-column pages. The third step is an online classifier that is trained with the high-confidence line detection results from the second step and filters out noise from low-confidence lines. This classifier effectively removes speckle noise embedded inside the content zones. We compare the performance of our algorithm to the state of the art on the UW-III database, using the results reported by the Image Understanding Pattern Recognition Research (IUPR) group and Scansoft Omnipage SDK 15.5, and we evaluate performance at both the page frame level and the text-line level. The results show that our system has a much lower false-alarm rate while maintaining a similar content detection rate. In addition, we show that our online training model generalizes better than algorithms depending on offline training.
Touching character segmentation method for Chinese historical documents
OCR for Chinese historical documents is still an open problem. As these documents are hand-written or hand-carved in various styles, overlapping and touching characters bring great difficulty to the character segmentation module. This paper presents an over-segmentation-based method to handle overlapping and touching Chinese characters in historical documents. The whole segmentation process includes two parts: over-segmentation and segmentation path optimization. In the former, touching strokes are found and segmented by analyzing the geometric information of the white and black connected components. The segmentation cost of the touching strokes is estimated from the connected components' shape and location, as well as the touching stroke width. The latter part uses local-optimization dynamic programming to find the best segmentation path: an HMM is used to express the multiple choices of segmentation paths, and the Viterbi algorithm is used to search for the locally optimal solution. Experimental results on practical Chinese documents show that the proposed method is effective.
Invited Presentation II
Technologies for developing an advanced intelligent ATM with self-defence capabilities
We have developed several technologies for protecting automated teller machines. These technologies are based mainly on pattern recognition and are used to implement various self-defence functions. They include (i) banknote recognition and information retrieval for preventing machines from accepting counterfeit and damaged banknotes and for retrieving information about detected counterfeits from a relational database, (ii) form processing and character recognition for preventing machines from accepting remittance forms without due dates and/or with insufficient payment, (iii) person identification to prevent machines from transacting with non-customers, and (iv) object recognition to guard machines against foreign objects such as spy cams that might be surreptitiously attached to them and to protect users against someone attempting to peek at their user information such as their personal identification number. The person identification technology has been implemented in most ATMs in Japan, and field tests have demonstrated that the banknote recognition technology can recognise more than 200 types of banknote from 30 different countries. We are developing an "advanced intelligent ATM" that incorporates all of these technologies.
Document Image Processing
Learning shape features for document enhancement
In previous work we showed that shape descriptor features can be used in Look Up Table (LUT) classifiers to learn patterns of degradation and correction in historical document images. The algorithm encodes pixel neighborhood information effectively using a variant of a shape descriptor. However, the generation of the shape descriptor features was approached in a heuristic manner. In this work, we propose to learn the shape features from the training data set by using a neural network, a Multilayer Perceptron (MLP), for feature extraction. Given that the MLP may be restricted by a limited dataset, we apply a feature selection algorithm to generalize, and thus improve, the feature set obtained from the MLP. We validate the effectiveness and efficiency of the proposed approach via experimental results.
Enhancement of camera-based whiteboard images
Yuan He, Jun Sun, Satoshi Naoi, et al.
The quality of camera-based whiteboard images is highly dependent on the lighting environment and the writing quality of the content. Specular reflections and low contrast frequently reduce the readability of captured whiteboard images. A novel method to enhance camera-based whiteboard images is proposed in this paper. The images are enhanced by removing highlight specular reflections to improve visibility and by emphasizing the content to improve the readability of the whiteboards. The method can be practically embedded in mobile devices with image-capturing cameras.
Effect of pre-processing on binarization
The effects of different image pre-processing methods on document image binarization are explored. They are compared across five different binarization methods on images with bleed-through and stains as well as on images with uniform background speckle. The choice of binarization method has a significant effect on binarization accuracy, but pre-processing also plays a significant role. The Total Variation method of pre-processing shows the best performance over a variety of pre-processing methods.
Recognition I
Context-dependent HMM modeling using tree-based clustering for the recognition of handwritten words
Anne-Laure Bianne, Christopher Kermorvant, Laurence Likforman-Sulem
This paper presents an HMM-based recognizer for the off-line recognition of handwritten words. Word models are the concatenation of context-dependent character models (trigraphs). The trigraph models we consider are similar to triphone models in speech recognition, where a character adapts its shape according to its adjacent characters. Due to the large number of possible context-dependent models to compute, a top-down clustering is applied on each state position of all models associated with a particular character. This clustering uses decision trees, based on rhetorical questions we designed; decision trees have the advantage of being able to model untrained trigraphs. Our system is shown to perform better than a baseline context-independent system, and reaches an accuracy higher than 74% on the publicly available Rimes database.
Font adaptation of an HMM-based OCR system
Kamel Ait-Mohand, Laurent Heutte, Thierry Paquet, et al.
We create a polyfont OCR recognizer using hidden Markov model (HMM) character models trained on a dataset of various fonts. We compare this system to monofont recognizers, showing that its performance decreases when it is used to recognize unseen fonts. To close this performance gap, we adapt the parameters of the polyfont recognizer's models to a new dataset of unseen fonts using four different adaptation algorithms. The results of our experiments show that the adapted system is far more accurate than the initial system, although it does not reach the accuracy of a monofont recognizer.
A new pre-classification method based on associative matching method
Yutaka Katsuyama, Akihiro Minagawa, Yoshinobu Hotta, et al.
Reducing the time complexity of character matching is critical to the development of efficient Japanese Optical Character Recognition (OCR) systems. To shorten processing time, recognition is usually split into separate pre-classification and recognition stages. For high overall recognition performance, the pre-classification stage must both have very high classification accuracy and return only a small number of putative character categories for further processing. Furthermore, for any practical system, the speed of the pre-classification stage is also critical. The associative matching (AM) method has often been used for fast pre-classification, because its use of a hash table and reliance solely on logical bit operations to select categories make it highly efficient. However, a certain level of redundancy exists in the hash table because it is constructed using only the minimum and maximum values of the data on each axis and therefore does not take account of the distribution of the data. We propose a modified associative matching method that satisfies the performance criteria described above but in a fraction of the time, by modifying the hash table to reflect the underlying distribution of training characters. Furthermore, we show that our approach outperforms pre-classification by clustering, ANN and conventional AM in terms of classification accuracy, discriminative power and speed. Compared to conventional associative matching, the proposed approach results in a 47% reduction in total processing time across an evaluation test set comprising 116,528 Japanese character images.
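As a loose illustration of replacing min/max axis splits with distribution-aware ones, the sketch below builds per-axis bucket boundaries from empirical quantiles and uses the resulting bucket tuple as the hash key. The bucket count, feature layout, and function names are assumptions; this is not the paper's actual associative-matching construction.

```python
# Illustrative per-axis bucketing for an associative-matching style
# pre-classifier: equal-frequency (quantile) buckets follow the training
# distribution instead of splitting each axis uniformly between min and max.
import numpy as np

def quantile_edges(train_features, n_buckets=8):
    """Per-axis bucket boundaries placed at equal-frequency quantiles."""
    qs = np.linspace(0.0, 1.0, n_buckets + 1)[1:-1]
    return np.quantile(train_features, qs, axis=0)   # shape: (n_buckets-1, n_dims)

def bucket_key(x, edges):
    """Map a feature vector to a tuple of per-axis bucket indices (the hash key)."""
    return tuple(int(np.searchsorted(edges[:, d], x[d])) for d in range(len(x)))

def build_table(train_features, labels, edges):
    """Each key lists the candidate character categories seen in that bucket."""
    table = {}
    for x, y in zip(train_features, labels):
        table.setdefault(bucket_key(x, edges), set()).add(y)
    return table

# At query time, bucket_key(query, edges) retrieves a small candidate set
# that is then passed to the full recognition stage.
```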
A neural-linguistic approach for the recognition of a wide Arabic word lexicon
Recently, we have investigated the use of Arabic linguistic knowledge to improve the recognition of a wide Arabic word lexicon. A neural-linguistic approach was proposed to deal mainly with the canonical vocabulary of decomposable words derived from tri-consonant healthy roots. The basic idea is to factorize words by their roots and schemes. In this direction, we designed two neural networks, TNN_R and TNN_S, to recognize roots and schemes, respectively, from the structural primitives of words. The proposed approach achieved promising results. In this paper, we focus on how to reach better results in terms of accuracy and recognition rate. The current improvements concern especially the training stage: 1) exploiting the order of letters within words, 2) considering "sister letters" (letters having the same features), 3) supervising the networks' behavior, 4) splitting up neurons to preserve letter occurrences, and 5) resolving observed ambiguities. With these improvements, experiments carried out on a 1500-word vocabulary show a significant enhancement: the TNN_R (resp. TNN_S) top-4 rate has gone up from 77% to 85.8% (resp. from 65% to 97.9%). Enlarging the vocabulary from 1000 to 1700 words, adding 100 words each time, again confirmed the results without altering the networks' stability.
Recognition II
Incorporating linguistic post-processing into whole-book recognition
We describe a technique for linguistic post-processing of whole-book recognition results. Whole-book recognition improves recognition of book images using fully automatic cross-entropy-based model adaptation. In previously published work, word recognition was performed on individual words separately, without using passage-level information such as word-occurrence frequencies. As a result, some words that are rare in real texts may appear much more often in recognition results, and vice versa. Differences between word frequencies in recognition results and in prior knowledge may therefore indicate recognition errors over a long passage. In this paper, we propose a post-processing technique that enhances whole-book recognition results by minimizing the differences between word frequencies in the recognition results and prior word frequencies. This technique works better when operating on longer passages, and it drives the character error rate down by 20%, from 1.24% to 0.98%, in a 90-page experiment.
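A rough sketch of the frequency-discrepancy idea (not the paper's exact objective) is shown below: words whose relative frequency in the recognition output exceeds the prior expectation by a large factor are flagged as likely errors. The smoothing floor and the ratio threshold are arbitrary illustrative choices.

```python
# Flag words that occur far more often in the recognition output than a prior
# language model predicts; such discrepancies hint at recognition errors.
from collections import Counter
import math

def frequency_outliers(recognized_words, prior_freq, ratio_threshold=10.0):
    """Return (word, log-ratio) pairs whose observed frequency greatly exceeds the prior."""
    counts = Counter(recognized_words)
    total = sum(counts.values())
    floor = 1.0 / (len(prior_freq) * 100 + 1)       # crude smoothing for unseen words
    suspects = []
    for word, c in counts.items():
        observed = c / total
        expected = prior_freq.get(word, floor)
        if observed / expected > ratio_threshold:
            suspects.append((word, math.log(observed / expected)))
    return sorted(suspects, key=lambda t: -t[1])
```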
A word language model based contextual language processing on Chinese character recognition
Chen Huang, Xiaoqing Ding, Yan Chen
This paper investigates language model design and implementation. Unlike previous research, we emphasize the importance of word-based n-gram models in the study of language models. We build a word-based language model using the SRILM toolkit and apply it to contextual language processing of Chinese documents. A modified absolute-discount smoothing algorithm is proposed to reduce the perplexity of the language model. The word-based language model improves the post-processing performance of an online handwritten character recognition system compared with a character-based language model, but it also greatly increases computation and storage cost. Besides quantizing the model data non-uniformly, we design a new tree storage structure to compress the model size, which also increases search efficiency. We evaluate this set of approaches on a test corpus of recognition results of online handwritten Chinese characters, and propose a modified confidence measure for candidate characters to obtain accurate posterior probabilities while reducing complexity. The weighted combination of linguistic knowledge and candidate confidence information proves successful and can be further developed to improve recognition accuracy.
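For reference, a compact sketch of standard absolute-discount smoothing for a word bigram model is given below; the paper's modified variant, its SRILM configuration, and its compression scheme are not reproduced, and the discount value and the simple add-one unigram back-off are assumptions.

```python
# Compact sketch of a word bigram model with standard absolute discounting:
# P(w|v) = max(c(v,w) - D, 0)/c(v) + D * |followers(v)|/c(v) * P_uni(w)
from collections import Counter, defaultdict

class AbsoluteDiscountBigram:
    def __init__(self, sentences, discount=0.75):
        self.d = discount
        self.bigrams = Counter()
        self.unigrams = Counter()                 # counts of words as histories
        self.followers = defaultdict(set)         # v -> distinct words seen after v
        for words in sentences:
            for v, w in zip(["<s>"] + words, words + ["</s>"]):
                self.bigrams[(v, w)] += 1
                self.unigrams[v] += 1
                self.followers[v].add(w)
        self.total = sum(self.unigrams.values())

    def prob(self, v, w):
        c_v = self.unigrams[v]
        p_uni = (self.unigrams[w] + 1) / (self.total + len(self.unigrams))
        if c_v == 0:
            return p_uni                          # back off fully for unseen history
        discounted = max(self.bigrams[(v, w)] - self.d, 0.0) / c_v
        backoff_mass = self.d * len(self.followers[v]) / c_v
        return discounted + backoff_mass * p_uni

lm = AbsoluteDiscountBigram([["我", "是", "学生"], ["他", "是", "老师"]])
print(lm.prob("是", "学生"))
```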
Efficient automatic OCR word validation using word partial format derivation and language model
Siyuan Chen, Dharitri Misra, George R. Thoma
In this paper we present an OCR validation module, implemented for the System for Preservation of Electronic Resources (SPER) developed at the U.S. National Library of Medicine.1 The module detects and corrects suspicious words in the OCR output of scanned textual documents through a procedure of deriving partial formats for each suspicious word, retrieving candidate words by partial-match search from lexicons, and comparing the joint probabilities of N-gram and OCR edit transformation corresponding to the candidates. The partial format derivation, based on OCR error analysis, efficiently and accurately generates candidate words from lexicons represented by ternary search trees. In our test case comprising a historic medico-legal document collection, this OCR validation module yielded the correct words with 87% accuracy and reduced the overall OCR word errors by around 60%.
Comparison of historical documents for writership
Gregory R. Ball, Danjun Pu, Roger Stritmatter, et al.
Over the last century, forensic document science has developed progressively more sophisticated pattern recognition methodologies for ascertaining the authorship of disputed documents. These include advances not only in computer-assisted stylometrics, but also in forensic handwriting analysis. We present a writer verification method and an evaluation of an actual historical document written by an unknown writer. The questioned document is compared against two known handwriting samples of Herman Melville, a 19th-century American author who has been hypothesized to be the writer of this document. The comparison led to a high-confidence result that the questioned document was written by the same writer as the known documents. Such methodology can be applied to many questioned documents in historical writing, in both literary and legal fields.
Document Structure Recognition
Interactive-predictive detection of handwritten text blocks
O. Ramos Terrades, N. Serrano, A. Gordó, et al.
A method for text block detection is introduced for old handwritten documents. The proposed method takes advantage of sequential book structure, taking into account layout information from pages previously transcribed. This glance at the past is used to predict the position of text blocks in the current page with the help of conventional layout analysis methods. The method is integrated into the GIDOC prototype: a first attempt to provide integrated support for interactive-predictive page layout analysis, text line detection and handwritten text transcription. Results are given in a transcription task on a 764-page Spanish manuscript from 1891.
Using definite clause grammars to build a global system for analyzing collections of documents
Joseph Chazalon, Bertrand Coüasnon
Collections of documents are sets of heterogeneous documents, such as a specific ancient book series, linked by their own structural and semantic properties. A particular collection contains document images with specific physical layouts, like text pages or full-page illustrations, appearing in a specific order. Its contents, like journal articles, may be shared by several pages that are not necessarily consecutive, producing strong dependencies between page interpretations. In order to build an analysis system which can bring contextual information from the collection to the appropriate recognition modules for each page, we propose to express the structural and semantic properties of a collection with a definite clause grammar. This is made possible by representing collections as streams of document images, and by using extensions to the formalism that we present here. We are then able to automatically generate a parser dedicated to a collection. Besides allowing structural variations and complex information flows, we also show that this approach enables the design of analysis stages operating on a document or a set of documents. The interest of context usage is illustrated with several examples and their formalization in this framework.
Detection of figure and caption pairs based on disorder measurements
Claudie Faure, Nicole Vincent
Figures inserted in documents convey a kind of information for which the visual modality is more appropriate than text. A complete understanding of a figure often requires reading its caption or establishing a relationship with the main text through a numbered figure identifier that is replicated in the caption and in the main text. A figure and its caption are closely related; together they constitute a single multimodal component (FC-pair) that Document Image Analysis cannot extract with text and graphics segmentation alone. We propose a method that goes beyond graphics and text segmentation in order to extract FC-pairs without performing a full labelling of the page components. Horizontal and vertical text lines are detected in the pages, and the graphics are associated with selected text lines to initialize the FC-pair detector. Spatial and visual disorder measures are introduced to define a layout model in terms of properties, which makes it possible to cope with most of the numerous spatial arrangements of graphics and text lines. The FC-pair detector performs operations to eliminate the layout disorder and assigns a quality value to each FC-pair. The processed documents were collected from medic@, the digital historical collection of the BIUM (Bibliothèque InterUniversitaire Médicale). A first set of 98 pages constitutes the design set; 298 further pages were then collected to evaluate the system. The reported performance is the result of the full process, from the binarisation of the digital images to the detection of FC-pairs.
Interactive Paper Session
Evaluation of human perception of degradation in document images
Large degradations in document images impede their readability and substantially deteriorate the performance of automated document processing systems. Image quality metrics have been defined to correlate with OCR accuracy; however, these do not always correlate with human perception of image quality. When enhancing document images with the goal of improving readability, it is important to understand human perception of quality. The goal of this work is to evaluate human perception of degradation and correlate it with known degradation parameters and existing image quality metrics. The information captured enables the learning and estimation of human perception of document image quality.
Naïve Bayes and SVM classifiers for classifying databank accession number sentences from online biomedical articles
Jongwoo Kim, Daniel X. Le, George R. Thoma
This paper describes two classifiers, Naïve Bayes and Support Vector Machine (SVM), used to classify sentences containing Databank Accession Numbers, a key piece of bibliographic information, from online biomedical articles. The correct identification of these sentences is necessary for the subsequent extraction of the numbers. The classifiers use the words that occur most frequently in such sentences as features for the classification. Twelve sets of word features are collected to train and test the classifiers, each with a different number of word features ranging from 100 to 1,200. The performance of each classifier is evaluated using four measures: Precision, Recall, F-Measure, and Accuracy. The Naïve Bayes classifier shows performance above 93.91% at 200 word features for all four measures. The SVM shows 98.80% Precision at 200 word features, 94.90% Recall at 500 and 700, 96.46% F-Measure at 200, and 99.14% Accuracy at 200 and 400. To improve classification performance, we propose two merging operators, Max and Harmonic Mean, to combine the results of the two classifiers. The final results show a measurable improvement in Recall, F-Measure, and Accuracy rates.
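The two merging operators are simple to state; the sketch below applies them to per-sentence confidence scores, assuming both classifier outputs have already been mapped to comparable [0, 1] scores (how the paper performs this mapping is not described in the abstract).

```python
# The Max and Harmonic Mean merging operators applied to two classifier
# confidences for one sentence. Score calibration and the decision threshold
# are assumptions for this sketch.

def merge_max(p_nb, p_svm):
    """Max operator: accept if either classifier is confident."""
    return max(p_nb, p_svm)

def merge_harmonic(p_nb, p_svm):
    """Harmonic mean: high only when both classifiers agree the sentence is positive."""
    if p_nb <= 0.0 or p_svm <= 0.0:
        return 0.0
    return 2.0 * p_nb * p_svm / (p_nb + p_svm)

def classify(p_nb, p_svm, merge=merge_max, threshold=0.5):
    return merge(p_nb, p_svm) >= threshold

# Example: Naive Bayes is unsure (0.2), the SVM is confident (0.9).
print(classify(0.2, 0.9, merge_max), classify(0.2, 0.9, merge_harmonic))  # True False
```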
Biomedical article retrieval using multimodal features and image annotations in region-based CBIR
Daekeun You, Sameer Antani, Dina Demner-Fushman, et al.
Biomedical images are invaluable in establishing diagnosis, acquiring technical skills, and implementing best practices in many areas of medicine. At present, images needed for instructional purposes or in support of clinical decisions appear in specialized databases and in biomedical articles, and are often not easily accessible to retrieval tools. Our goal is to automatically annotate images extracted from scientific publications with respect to their usefulness for clinical decision support and instructional purposes, and to project the annotations onto images stored in databases by linking images through content-based image similarity. Authors often use text labels and pointers overlaid on figures and illustrations in articles to highlight regions of interest (ROI). These annotations are then referenced in the caption text or in figure citations in the article text. In previous research we developed two methods (a heuristic method and a dynamic time warping-based method) for localizing and recognizing such pointers on biomedical images. In this work, we add robustness to our previous efforts by using a machine learning based approach to localizing and recognizing the pointers. Identifying these can assist in extracting relevant image content at regions within the image that are likely to be highly relevant to the discussion in the article text. Image regions can then be annotated using biomedical concepts from extracted snippets of text pertaining to images in scientific biomedical articles that are identified using the National Library of Medicine's Unified Medical Language System® (UMLS) Metathesaurus. The resulting regional annotations and extracted image content are then used as indices for biomedical article retrieval using multimodal features and region-based content-based image retrieval (CBIR) techniques. The hypothesis that such an approach would improve biomedical document retrieval is validated through experiments on an expert-marked biomedical article dataset.
Trainable multiscript orientation detection
Detecting the correct orientation of document images is an important step in large-scale digitization processes, as most subsequent document analysis and optical character recognition methods assume an upright document page. Many methods have been proposed to solve the problem, most of which are based on computing the ascender-to-descender ratio. Unfortunately, this cannot be used for scripts that have neither ascenders nor descenders. We therefore present a trainable method that uses character similarity to compute the correct orientation. A connected-component-based distance measure is computed to compare the characters of the document image to characters whose orientation is known; the orientation for which the distance is lowest is detected as the correct orientation. Training is easily achieved by replacing the reference characters with characters of the script to be analyzed. Evaluation of the proposed approach showed an accuracy above 99% for Latin and Japanese script from the public UW-III and UW-II datasets. An accuracy of 98.9% was obtained for Fraktur on a non-public dataset. Comparison of the proposed method to two methods using ascender/descender ratio based orientation detection shows a significant improvement.
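A minimal sketch of the character-similarity test follows: each connected component, given here as a small square bitmap, is compared at the four candidate rotations against reference characters of known upright orientation, and the rotation with the lowest total distance wins. Component extraction and the mean-absolute-difference distance are assumptions standing in for the paper's connected-component-based measure.

```python
# Orientation detection by comparing component bitmaps, at four rotations,
# against reference character bitmaps of known upright orientation.
import numpy as np

def component_distance(comp, references):
    """Distance of one component bitmap to its nearest reference bitmap."""
    return min(float(np.abs(comp - ref).mean()) for ref in references)

def detect_orientation(components, references):
    """components/references: lists of equally-sized square float arrays in [0, 1]."""
    scores = {}
    for k, angle in enumerate((0, 90, 180, 270)):
        rotated = [np.rot90(c, k) for c in components]
        scores[angle] = sum(component_distance(r, references) for r in rotated)
    return min(scores, key=scores.get), scores   # best angle, all scores
```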
Improved CHAID algorithm for document structure modelling
A. Belaïd, T. Moinel, Y. Rangoni
This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.
Ant colony optimization with selective evaluation for feature selection in character recognition
Il-Seok Oh, Jin-Seon Lee
This paper analyzes the size characteristics of the character recognition domain with the aim of developing a feature selection algorithm adequate for that domain. Based on the results, we further analyze the timing requirements of three popular feature selection algorithms: the greedy algorithm, the genetic algorithm, and ant colony optimization (ACO). For a rigorous timing analysis, we adopt the concept of atomic operations. We propose a novel scheme called selective evaluation to improve the convergence of ACO. The scheme cuts down the computational load by excluding the evaluation of unnecessary or less promising candidate solutions. The scheme is realizable in ACO thanks to the pheromone trail, valuable information that helps identify those solutions. Experimental results show that ACO with selective evaluation is promising in both timing requirements and recognition performance.
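As a toy illustration of selective evaluation (not the paper's algorithm or parameters), the sketch below runs a simple ACO feature selection loop in which candidate subsets with below-average pheromone support are skipped rather than passed to the expensive evaluate() callback, which would typically be a cross-validated recognition accuracy.

```python
# Toy ACO feature selection with "selective evaluation": subsets whose
# pheromone support is low are skipped instead of being evaluated.
import random

def aco_select(n_features, evaluate, n_ants=20, n_iters=30,
               subset_size=5, evap=0.1, seed=0):
    rng = random.Random(seed)
    pheromone = [1.0] * n_features
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iters):
        mean_ph = sum(pheromone) / n_features
        for _ in range(n_ants):
            # Construct a subset by sampling features with probability
            # proportional to pheromone (no heuristic term, for brevity).
            subset = set()
            while len(subset) < subset_size:
                r = rng.random() * sum(pheromone)
                acc = 0.0
                for i, p in enumerate(pheromone):
                    acc += p
                    if acc >= r:
                        subset.add(i)
                        break
            # Selective evaluation: only subsets with above-average pheromone
            # support are passed to the expensive classifier evaluation.
            support = sum(pheromone[i] for i in subset) / subset_size
            if support < mean_ph and best_subset is not None:
                continue
            score = evaluate(sorted(subset))
            if score > best_score:
                best_subset, best_score = sorted(subset), score
        # Evaporate, then deposit pheromone on the best-so-far subset.
        pheromone = [(1 - evap) * p for p in pheromone]
        for i in best_subset:
            pheromone[i] += evap * (1.0 + best_score)
    return best_subset, best_score
```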
Analysis of line structure in handwritten documents using the Hough transform
Gregory R. Ball, Harish Kasiviswanathan, Sargur N. Srihari, et al.
In the analysis of handwriting in documents, a central task is determining the line structure of the text, e.g., the number of text lines, the locations of their start and end points, line width, etc. While simple methods can handle ideal images, real-world documents have complexities such as overlapping line structure, variable line spacing, line skew, document skew, and noisy or degraded images. This paper explores the application of the Hough transform to handwritten documents, with the goal of automatically determining global document line structure in a top-down manner which can then be used in conjunction with a bottom-up method such as connected component analysis. The performance is significantly better than that of other top-down methods, such as the projection profile method. In addition, we evaluate the performance of skew analysis by the Hough transform on handwritten documents.
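The sketch below illustrates the basic Hough voting step over connected-component centroids: each centroid votes for (rho, theta) line parameters, and accumulator peaks give candidate text lines and the dominant skew. The bin sizes, angle range, and vote threshold are assumptions, and the paper's full top-down/bottom-up combination is not implemented.

```python
# Hough voting over component centroids for near-horizontal text lines.
import numpy as np

def hough_lines(centroids, img_diag, theta_range_deg=(-15, 15),
                theta_step=0.5, rho_step=4.0, min_votes=5):
    # Angles near 90 degrees correspond to near-horizontal lines.
    thetas = np.deg2rad(np.arange(*theta_range_deg, theta_step) + 90.0)
    n_rho = int(2 * img_diag / rho_step) + 1
    acc = np.zeros((n_rho, len(thetas)), dtype=int)
    pts = np.asarray(centroids, dtype=float)          # (N, 2) array of (x, y)
    for x, y in pts:
        rho = x * np.cos(thetas) + y * np.sin(thetas)      # one rho per theta
        rho_idx = ((rho + img_diag) / rho_step).astype(int)
        acc[rho_idx, np.arange(len(thetas))] += 1          # cast the votes
    peaks = np.argwhere(acc >= min_votes)
    lines = [(r * rho_step - img_diag, np.rad2deg(thetas[t]) - 90.0) for r, t in peaks]
    return lines, acc   # (rho, skew angle in degrees) per candidate text line
```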
A hybrid classifier for handwritten mathematical expression recognition
In this paper we propose a hybrid symbol classifier within a global framework for online handwritten mathematical expression recognition. The proposed architecture aims at handling mathematical expression recognition as a simultaneous optimization of symbol segmentation, symbol recognition, and 2D structure recognition under the restriction of a mathematical expression grammar. To deal with the junk problem encountered when a segmentation graph approach is used, we consider a two level classifier. A symbol classifier cooperates with a second classifier specialized to accept or reject a segmentation hypothesis. The proposed system is trained with a set of synthetic online handwritten mathematical expressions. When tested on a set of real complex expressions, the system achieves promising results at both symbol and expression interpretation levels.
A combined recognition system for online handwritten Pinyin input
We have developed an online Pinyin recognition system that combines an HMM-based method and a statistical method. Pinyin recognition is useful for users who may forget how to write a certain Chinese character but know how to pronounce it. The combined HMM and statistical models are used to segment a word and recognize it. We achieve a writer-independent accuracy of 91.37% on 17,745 unconstrained-style Pinyin syllables.