Posted to java-user@lucene.apache.org by Vinod Bhagat <vb...@blastradius.com> on 2002/10/14 11:26:32 UTC

Extracting Complete Text from PDF using Lucene and JPEDAL!!!!

Dear People

  I am using Lucene, and one of the requirements is to index PDF files. I am
using JPedal's API to extract text from the PDFs. So far I have managed to get
the text of the first page only, using the ExtractTextObject.java class. But I
want to extract the complete text of the PDF file. Has anyone done this, and
could you possibly guide me towards it?

 I would appreciate a quick reply.

 Cheers
Vin.

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Extracting Complete Text from PDF using Lucene and JPEDAL!!!!

Posted by Mikael Söderman <mi...@excito.se>.
Hi Vin!

With JPedal you process one page at a time by calling the decodePage method
with the number of the page you want to process as its argument.

In the ExtractTextObjects example, the total number of pages is hard-coded to
1 (the variable end is set to 1 in the constructor). Try setting it from the
getPageCount method instead.

Best regards

Mikael Söderman

PS. Don't forget to always call flushObjectValues when done with a page.
This will make JPedal reuse memory.
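
A rough sketch of the loop described above, for illustration only. It does not
use the real JPedal classes: PageDecoder is a hypothetical interface that
borrows the three method names mentioned in this thread (getPageCount,
decodePage, flushObjectValues), and their signatures, the getPageText accessor,
and the 1-based page numbering are all assumptions. FakeDecoder is an in-memory
stand-in so the example runs on its own.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the JPedal decoder. Only the names decodePage,
// getPageCount and flushObjectValues come from this thread; the signatures
// and the getPageText accessor are assumptions made for illustration.
interface PageDecoder {
    int getPageCount();
    void decodePage(int page);
    String getPageText();
    void flushObjectValues();
}

class ExtractAllPages {

    // Decodes every page in turn and concatenates the text, one page per line.
    static String extractAll(PageDecoder decoder) {
        StringBuilder text = new StringBuilder();
        int end = decoder.getPageCount();          // instead of a hard-coded 1
        for (int page = 1; page <= end; page++) {  // assumes 1-based page numbers
            decoder.decodePage(page);
            text.append(decoder.getPageText()).append('\n');
            decoder.flushObjectValues();           // done with this page
        }
        return text.toString();
    }

    // In-memory fake so the sketch runs without JPedal on the classpath.
    static final class FakeDecoder implements PageDecoder {
        private final List<String> pages;
        private String current;
        int flushCount = 0;

        FakeDecoder(List<String> pages) { this.pages = pages; }

        public int getPageCount() { return pages.size(); }
        public void decodePage(int page) { current = pages.get(page - 1); }
        public String getPageText() { return current; }
        public void flushObjectValues() { current = null; flushCount++; }
    }

    public static void main(String[] args) {
        PageDecoder decoder =
            new FakeDecoder(Arrays.asList("page one", "page two", "page three"));
        System.out.print(extractAll(decoder));
    }
}
```

With the real library, the body of the loop would call the corresponding
JPedal methods, and the returned string could then be handed to Lucene as the
contents of a single indexed field.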

