Proceedings Paper

Enriching a document collection by integrating information extraction and PDF annotation
Author(s): Brett Powley; Robert Dale; Ilya Anisimoff

Paper Abstract

Modern digital libraries offer all the hyperlinking possibilities of the World Wide Web: when a reader finds a citation of interest, in many cases she can now click on a link to be taken to the cited work. This paper presents work aimed at providing the same ease of navigation for legacy PDF document collections that were created before the possibility of integrating hyperlinks into documents was ever considered. To achieve our goal, we need to carry out two tasks: first, we need to identify and link citations and references in the text with high reliability; and second, we need the ability to determine physical PDF page locations for these elements. We demonstrate the use of a high-accuracy citation extraction algorithm which significantly improves on earlier reported techniques, and a technique for integrating PDF processing with a conventional text-stream-based information extraction pipeline. We demonstrate these techniques in the context of a particular document collection, the ACL Anthology, but the same approach can be applied to other document sets.

Paper Details

Date Published: 19 January 2009
PDF: 10 pages
Proc. SPIE 7247, Document Recognition and Retrieval XVI, 724707 (19 January 2009); doi: 10.1117/12.805548
Brett Powley, Macquarie Univ. (Australia)
Robert Dale, Macquarie Univ. (Australia)
Ilya Anisimoff, Macquarie Univ. (Australia)

Published in SPIE Proceedings Vol. 7247:
Document Recognition and Retrieval XVI
Kathrin Berkner; Laurence Likforman-Sulem, Editor(s)
