Posted to java-user@lucene.apache.org by Shyam Bhaskaran <sh...@gmail.com> on 2005/12/29 07:40:57 UTC
Lucene parsing for PDF
Hi,
I am working on a search project using Lucene, and I am currently working on
parsing PDF documents. I have successfully implemented my parser using
Lucene and PDFBox. I have a question about how to exclude (or perhaps delete)
pages from the index. I am not sure how to do this, or at what point it has to
be done. The Lucene book describes removing documents by id or by term, but I
was not able to get this working. Can anyone help me with this?
Regards,
Shyam
Re: Lucene parsing for PDF
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Shyam - I moderated your message through, so please subscribe to the
list to send to it in the future.
Please provide us with some details - a standalone RAMDirectory-using
JUnit TestCase is the ideal way to share an issue like this and
have someone else take a look at it. And frequently the act of
distilling an issue down to a test case points out the error being
made :)
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
AW: Lucene parsing for PDF
Posted by Klaus <kl...@vommond.de>.
Hi,
I think the easiest way is to exclude the pages while you are parsing the
PDF document, so that you pass only the necessary pages to Lucene.
Another solution is to create a separate document for each page. Each one
should have a "pagenumber" field, and you can then delete that document from
the index.
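A minimal sketch of the first suggestion, assuming the text has already been
extracted one string per page (for example with PDFBox). The names PageFilter,
Page and pagesToIndex are hypothetical and are not part of Lucene or PDFBox;
the actual extraction and indexing calls are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not a Lucene or PDFBox class.
final class PageFilter {

    // One extracted page: its 1-based page number and its text.
    static final class Page {
        final int number;
        final String text;

        Page(int number, String text) {
            this.number = number;
            this.text = text;
        }
    }

    // Returns the pages that should be indexed, skipping every page
    // whose 1-based number appears in 'excluded'.
    static List<Page> pagesToIndex(List<String> pageTexts, Set<Integer> excluded) {
        List<Page> kept = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            int pageNumber = i + 1;
            if (!excluded.contains(pageNumber)) {
                kept.add(new Page(pageNumber, pageTexts.get(i)));
            }
        }
        return kept;
    }
}
```

Each Page that is kept could then be added to the index as its own document,
with the page number stored in a "pagenumber" field, which is what would make
deleting a single page by term possible later on.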
Peace