You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Bill Janssen <ja...@parc.com> on 2008/05/14 17:35:01 UTC

Re: text extraction from pdf

> > the unix program pdf2text can convert keeping the text places, but I wanted
> > to ask you guys if you know something better,
> 
> AFAIK, PDFBox has a lower-level API that allows you to get hold of text 
> positions.

In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and
font information for each word.  You can get the xpdf sources from
http://www.foolabs.com/xpdf/, and the patch file is at
http://uplib.parc.com/misc/xpdf-3.02-PATCH.  To extract the byte
positions, use pdftotext with the "-wordboxes" switch, and see the
pdftotext man page for more info.  This is run automatically in UpLib
before the indexing is done.

Bill

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org