You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Joe Hansen <jo...@gmail.com> on 2010/07/20 00:31:27 UTC

How to modify a document Field before the document is indexed?

Hey All,

I am using Apache Lucene (2.9.1) and its fast and it works great! I
have a question in connection with Apache PDFBox.

The following command creates a Lucent Document from a PDF file:
Document document =
org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);

The Lucene Document, document, has a bunch of fields. Among those
fields, is a field named, "content". I need to add some more data to
that field. For example, I would like to add some description and
keywords. How do I go about doing that? Any pointers would be greatly
welcome! :)

Thanks for your time!

Regards,
Joe

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How to modify a document Field before the document is indexed?

Posted by Erick Erickson <er...@gmail.com>.
One subtlety you might be able to use to advantage... This is
where getPositionIncrementGap in your analyzer can be used
to separate the two bits of data in the same field. If I have my
own analyzer (which could be a trivial override of an existing one)
that returns, say 10,000 from getPositionIncrementGap Now, if
you wanted to insure that proximity queries only matched in a particular
add to your "content" field, you could specify that all the terms had
to occur within 10,000 of each other...

FWIW
Erick

On Mon, Jul 19, 2010 at 7:56 PM, Joe Hansen <jo...@gmail.com> wrote:

> Thanks for your reply Koji! Your suggestion worked fine. I thought
> adding a field named "contents" to a document, even though it contains
> a field already named "contents" would NOT do anything. But looks like
> I am wrong!
>
> Thank you for your kind help! :)
>
> Regards,
> Joe
>
> On Mon, Jul 19, 2010 at 5:12 PM, Koji Sekiguchi <ko...@r.email.ne.jp>
> wrote:
> > (10/07/20 7:31), Joe Hansen wrote:
> >>
> >> Hey All,
> >>
> >> I am using Apache Lucene (2.9.1) and its fast and it works great! I
> >> have a question in connection with Apache PDFBox.
> >>
> >> The following command creates a Lucent Document from a PDF file:
> >> Document document =
> >>
> >>
> org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
> >>
> >> The Lucene Document, document, has a bunch of fields. Among those
> >> fields, is a field named, "content". I need to add some more data to
> >> that field. For example, I would like to add some description and
> >> keywords. How do I go about doing that? Any pointers would be greatly
> >> welcome! :)
> >>
> >> Thanks for your time!
> >>
> >> Regards,
> >> Joe
> >>
> >>
> >
> > Joe,
> >
> > You can add your data to the document object:
> >
> > Document document =
> >
> org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
> > document.add( new Field( "content", "your data", Store.YES,
> Index.ANALYZED )
> > );
> >
> >
> http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/document/Document.html#add%28org.apache.lucene.document.Fieldable%29
> >
> > Koji
> >
> > --
> > http://www.rondhuit.com/en/
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: How to modify a document Field before the document is indexed?

Posted by Joe Hansen <jo...@gmail.com>.
Thanks for your reply Koji! Your suggestion worked fine. I thought
adding a field named "contents" to a document, even though it contains
a field already named "contents" would NOT do anything. But looks like
I am wrong!

Thank you for your kind help! :)

Regards,
Joe

On Mon, Jul 19, 2010 at 5:12 PM, Koji Sekiguchi <ko...@r.email.ne.jp> wrote:
> (10/07/20 7:31), Joe Hansen wrote:
>>
>> Hey All,
>>
>> I am using Apache Lucene (2.9.1) and its fast and it works great! I
>> have a question in connection with Apache PDFBox.
>>
>> The following command creates a Lucent Document from a PDF file:
>> Document document =
>>
>> org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
>>
>> The Lucene Document, document, has a bunch of fields. Among those
>> fields, is a field named, "content". I need to add some more data to
>> that field. For example, I would like to add some description and
>> keywords. How do I go about doing that? Any pointers would be greatly
>> welcome! :)
>>
>> Thanks for your time!
>>
>> Regards,
>> Joe
>>
>>
>
> Joe,
>
> You can add your data to the document object:
>
> Document document =
> org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
> document.add( new Field( "content", "your data", Store.YES, Index.ANALYZED )
> );
>
> http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/document/Document.html#add%28org.apache.lucene.document.Fieldable%29
>
> Koji
>
> --
> http://www.rondhuit.com/en/
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How to modify a document Field before the document is indexed?

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
(10/07/20 7:31), Joe Hansen wrote:
> Hey All,
>
> I am using Apache Lucene (2.9.1) and its fast and it works great! I
> have a question in connection with Apache PDFBox.
>
> The following command creates a Lucent Document from a PDF file:
> Document document =
> org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
>
> The Lucene Document, document, has a bunch of fields. Among those
> fields, is a field named, "content". I need to add some more data to
> that field. For example, I would like to add some description and
> keywords. How do I go about doing that? Any pointers would be greatly
> welcome! :)
>
> Thanks for your time!
>
> Regards,
> Joe
>
>    
Joe,

You can add your data to the document object:

Document document =
org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
document.add( new Field( "content", "your data", Store.YES, 
Index.ANALYZED ) );

http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/document/Document.html#add%28org.apache.lucene.document.Fieldable%29

Koji

-- 
http://www.rondhuit.com/en/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How to modify a document Field before the document is indexed?

Posted by Ken Bowen <kb...@als.com>.
Look at this example:

Developers Guide > Cookbook > Dealing with Forms > SetField

On Jul 19, 2010, at 6:38 PM, Joe Hansen wrote:

> Hey All,
>
> I am using Apache Lucene (2.9.1) with PDFBox 0.8.0. The combination
> works really great for our website.
>
> I use the following command to create Lucent Document from a PDF file:
> Document document =
> org
> .apache
> .pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
>
> Now, document, has a bunch of fields. Among those fields, is a field
> named, "content". I need to add some more data to that field. For
> example, I would like to add some description and keywords. How do I
> go about doing that? Any pointers would be greatly welcome! :)
>
> Thanks for your time!
>
> Regards,
> Joe


How to modify a document Field before the document is indexed?

Posted by Joe Hansen <jo...@gmail.com>.
Hey All,

I am using Apache Lucene (2.9.1) with PDFBox 0.8.0. The combination
works really great for our website.

I use the following command to create Lucent Document from a PDF file:
Document document =
org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);

Now, document, has a bunch of fields. Among those fields, is a field
named, "content". I need to add some more data to that field. For
example, I would like to add some description and keywords. How do I
go about doing that? Any pointers would be greatly welcome! :)

Thanks for your time!

Regards,
Joe