Posted to java-user@lucene.apache.org by Thomas Arni <ar...@zhwin.ch> on 2007/08/14 08:28:32 UTC

Indexing PDF documents with structure information

Hello Luceners

I have started a new project and need to index PDF documents.
Several projects can extract the content,
such as pdfbox, xpdf and pjclassic.

As far as I can tell from the FAQs and examples, all of these
tools support simple text extraction.

Which of these open source tools would you recommend most?

My PDF documents are quite long (more than 60 pages on average).
I would therefore like to index additional structure information,
so that a search returns not only the whole document but also
the page or chapter where the relevant information is found.

Has anyone had similar requirements? Which of these tools
best fits them?

Thanks for your help
Thomas


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Indexing PDF documents with structure information

Posted by Mathieu Lecarme <ma...@garambrogne.net>.

Thomas Arni wrote:
> Hello Luceners
>
> I have started a new project and need to index PDF documents.
> Several projects can extract the content,
> such as pdfbox, xpdf and pjclassic.
>
> As far as I can tell from the FAQs and examples, all of these
> tools support simple text extraction.
>
> Which of these open source tools would you recommend most?
pdftk or iText?
>
> My PDF documents are quite long (more than 60 pages on average).
> I would therefore like to index additional structure information,
> so that a search returns not only the whole document but also
> the page or chapter where the relevant information is found.
The page is simple to extract. The chapter is trickier, but should be
possible if the document has internal links.
PDF readers accept an argument, like a URL fragment in HTTP, to open a
document at a given page.
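
To make the page argument concrete: Adobe's readers and most browser PDF
viewers honor a "#page=N" open parameter appended to the document URL.
A minimal sketch of building such a link from a search hit follows. The
class name PdfPageLink and the method forPage are hypothetical, chosen
only for illustration, and other viewers may ignore the parameter.

```java
/** Hypothetical helper: builds a link that opens a PDF at a given page. */
final class PdfPageLink {

    /**
     * Appends the "#page=N" open parameter to a PDF URL.
     * Assumes pageNumber is 1-based, as viewers count pages.
     */
    static String forPage(String pdfUrl, int pageNumber) {
        if (pageNumber < 1) {
            throw new IllegalArgumentException("pageNumber must be >= 1");
        }
        // Drop any fragment already present so we do not emit two '#' parts.
        int hash = pdfUrl.indexOf('#');
        String base = hash >= 0 ? pdfUrl.substring(0, hash) : pdfUrl;
        return base + "#page=" + pageNumber;
    }

    public static void main(String[] args) {
        // Example URL is made up for the demonstration.
        System.out.println(forPage("http://example.org/docs/manual.pdf", 42));
    }
}
```

The page number stored with each indexed page is all that is needed to
produce the link at search time.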
>
> Has anyone had similar requirements? Which of these tools
> best fits them?

Have a look at "PDF Hacks" (ISBN: 0596006551). Once your document is
split, it will be easy to index.
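
As a rough illustration of the splitting step, here is a minimal sketch
that turns extracted text into one record per page. It assumes the
extractor separates pages with a form feed character, as xpdf's
pdftotext does. Other tools may need a different approach. PageRecord
and PageSplitter are hypothetical names, and the Lucene calls are left
out: each record would become one Lucene document carrying the file
name, the page number and the page text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical unit of indexing: the text of a single PDF page. */
final class PageRecord {
    final String sourceFile; // path or identifier of the PDF
    final int pageNumber;    // 1-based page number
    final String text;       // extracted text of that page

    PageRecord(String sourceFile, int pageNumber, String text) {
        this.sourceFile = sourceFile;
        this.pageNumber = pageNumber;
        this.text = text;
    }
}

/** Hypothetical helper: splits extracted text into per-page records. */
final class PageSplitter {

    /** Assumes pages are separated by form feed characters. */
    static List<PageRecord> split(String sourceFile, String extractedText) {
        List<PageRecord> pages = new ArrayList<>();
        // Limit -1 keeps empty segments, so page numbers stay aligned
        // with the PDF even when a page has no text.
        String[] parts = extractedText.split("\f", -1);
        for (int i = 0; i < parts.length; i++) {
            String text = parts[i].trim();
            if (!text.isEmpty()) {
                pages.add(new PageRecord(sourceFile, i + 1, text));
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Made-up input: page 3 is blank and is skipped.
        String extracted = "Intro text\fChapter one\f\fAppendix\f";
        for (PageRecord page : split("manual.pdf", extracted)) {
            System.out.println(page.sourceFile + " p." + page.pageNumber
                    + ": " + page.text);
        }
    }
}
```

Indexing one record per page means a hit identifies the page directly,
at the cost of losing matches for phrases that span a page break.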

M.

