Posted to java-user@lucene.apache.org by Shyam Bhaskaran <sh...@gmail.com> on 2005/12/29 07:40:57 UTC

Lucene parsing for PDF

Hi,

I am working on a search project using Lucene, and I am currently parsing
PDF documents. I have successfully implemented my parser using Lucene and
PDFBox. My question is how to exclude (or perhaps delete) pages from the
index. I am not sure how to do this, or at what point it should be done. The
Lucene book describes removing documents by id or by term, but I was not able
to get this working. Can anyone help me with this?

Regards,
Shyam

Re: Lucene parsing for PDF

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Shyam - I moderated your message through, so please subscribe to the  
list to send to it in the future.

Please provide us with some details - a standalone RAMDirectory-using  
JUnit TestCase is the most ideal way to share an issue like this and  
have someone else take a look at it.  And frequently the act of  
distilling an issue down to a test case points out the error being  
made :)

	Erik




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


AW: Lucene parsing for PDF

Posted by Klaus <kl...@vommond.de>.
Hi,

I think the easiest way is to exclude the pages while you are parsing the
PDF document, so that you pass only the necessary pages to Lucene.
Another solution is to create a separate document for each page. Each
document should have a field such as "pagenumber", and you can then delete
that document from the index.
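
As a rough illustration of the first suggestion, here is a minimal sketch in
plain Java. It assumes the text of each page has already been extracted (for
example with PDFBox) into a list, one entry per page. The class name
PageFilter and the method keepPages are hypothetical and are not part of
Lucene or PDFBox. The indexing step itself is left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not a Lucene or PDFBox class.
class PageFilter {

    // Returns the pages that should be indexed, keyed by 1-based page number.
    // Pages whose number appears in 'excluded' are skipped.
    static Map<Integer, String> keepPages(List<String> pageTexts, Set<Integer> excluded) {
        Map<Integer, String> kept = new LinkedHashMap<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            int pageNumber = i + 1;
            if (!excluded.contains(pageNumber)) {
                kept.put(pageNumber, pageTexts.get(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed input: page texts already extracted from the PDF.
        List<String> pages = List.of("cover page", "introduction", "main text");
        Map<Integer, String> kept = keepPages(pages, Set.of(1));
        // Each remaining entry would be handed to the indexing code.
        kept.forEach((number, text) -> System.out.println(number + ": " + text));
    }
}
```

Keeping the page number as the key means the same value could also be stored
in a "pagenumber" field if each page is indexed as its own document, as in
the second suggestion.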

Peace

