Posted to user@nutch.apache.org by John Thompson <jo...@gmail.com> on 2008/06/25 23:58:32 UTC

Understanding Lucene Document Fields

Hi,

I'm trying to understand the members of the Field class.  According to
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/document/Field.html
:

Field.Index.NO implies:

Do not index the field value. This field can thus not be searched, but one
can still access its contents provided it is
stored<http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/document/Field.Store.html>.


But Field.Store.YES implies:

Store the original field value in the index. This is useful for short texts
like a document's title which should be displayed with the results. The
value is stored in its original form, i.e. no analyzer is used before it is
stored.

I'm not sure I understand the relationship between indexing and storing.
According to the above:  I can still access a field's content if I have not
indexed its value, as long as I have stored that field.  But storing a field
is by definition "storing the original field value in the index."

*scratches head*

What is the difference between "indexing a field value" and "storing an
original field value in the index"?

-John

Re: Understanding Lucene Document Fields

Posted by John Thompson <jo...@gmail.com>.
I think I found an answer to my own question:

A Field object contains a name (a String) and a value (a String or a
Reader), and three booleans that control whether or not the value will be
indexed for searches, tokenized prior to indexing, and stored in the index
so it can be returned with the search.

Let me explain those three booleans a bit more.

   - *Indexed for searches* - sometimes you'll want to have fields available
   in your Documents that don't really have anything to do with searching. Two
   examples I can think of off the top of my head are creation dates and file
   names, so you can compare when the Document was created against the file
   modification date, and decide if the document needs to be reindexed. Since
   these fields won't ever make sense to use in an actual search, you can
   decrease the amount of work Lucene does by marking them as not indexed for
   searches.
   - *Tokenized prior to indexing* - tokenizing refers to taking a piece of
   text and cleaning it up, and breaking it down into individual pieces
   (tokens) for the indexer. This is done by the Analyzer. Some fields you may
   not want to be tokenized, for example a serial number field.
   - *Stored in the index* - even if a field is entirely indexed, it doesn't
   necessarily mean that it'll be easy for Lucene to reconstruct it. Although
   Lucene is a search index, and not a database, if your fields are reasonably
   small, you can ask Lucene to store them in the index. With the fields stored
   in the index, instead of using the Document to locate the original file or
   data and load it, you can actually pull the data out of the Document. This
   works best with fairly small fields and documents that you'd need to parse
   for display anyway.
   Some fields contain bulk data and are so large that you don't really want
   to store them in the index. You can still make your life a little easier by
   storing not just the filename, but a Reader object in the Field. This makes
   it simpler for your application to just get the Reader out of the Hit and
   use it to read in the data to display it to the user.

From http://darksleep.com/lucene/
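
To make the distinction concrete, here is a rough toy model in plain Java. It is
not Lucene code: ToyIndex and all of its methods are made-up names, and the
"analyzer" is just a lowercase-and-split stand-in. The assumption it illustrates
is that indexing feeds a term-to-document lookup used for searching, while
storing keeps the original value so it can be handed back with a hit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these names are Lucene classes.
class ToyIndex {
    // "Indexed": maps field:term to the ids of documents containing that term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // "Stored": per-document map of field name to the original, unanalyzed value.
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    int newDocument() {
        storedFields.add(new HashMap<>());
        return storedFields.size() - 1;
    }

    void addField(int docId, String name, String value, boolean store, boolean index) {
        if (store) {
            // Kept verbatim so it can be returned with a search result.
            storedFields.get(docId).put(name, value);
        }
        if (index) {
            // Crude stand-in for an analyzer: lowercase, split on non-alphanumerics.
            for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(name + ":" + token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
    }

    // Search consults only the postings, never the stored values.
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    // Retrieval consults only the stored values; returns null if the field was not stored.
    String stored(int docId, String field) {
        return storedFields.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        int doc = idx.newDocument();
        idx.addField(doc, "path", "/docs/report.txt", true, false);       // stored, not indexed
        idx.addField(doc, "body", "Quarterly sales report", false, true); // indexed, not stored

        System.out.println(idx.search("body", "sales"));  // prints [0]
        System.out.println(idx.stored(doc, "body"));      // prints null
        System.out.println(idx.search("path", "report")); // prints []
        System.out.println(idx.stored(doc, "path"));      // prints /docs/report.txt
    }
}
```

In this sketch the "path" field behaves like the Field.Index.NO plus
Field.Store.YES combination from my question: it cannot be searched, but its
contents can still be read back. The "body" field is the reverse case.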

Sorry for the noise.

-John
