Posted to user@nutch.apache.org by John Thompson <jo...@gmail.com> on 2008/06/25 23:58:32 UTC
Understanding Lucene Document Fields
Hi,
I'm trying to understand the members of the Field class. According to
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/document/Field.html,
Field.Index.NO implies:
Do not index the field value. This field can thus not be searched, but one
can still access its contents provided it is stored
(http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/document/Field.Store.html).
But Field.Store.YES implies:
Store the original field value in the index. This is useful for short texts
like a document's title which should be displayed with the results. The
value is stored in its original form, i.e. no analyzer is used before it is
stored.
I'm not sure I understand the relationship between indexing and storing.
According to the above: I can still access a field's content if I have not
indexed its value, as long as I have stored that field. But storing a field
is by definition "storing the original field value in the index."
*scratches head*
What is the difference between "indexing a field value" and "storing an
original field value in the index"?
-John
Re: Understanding Lucene Document Fields
Posted by John Thompson <jo...@gmail.com>.
I think I found an answer to my own question:
A Field object contains a name (a String) and a value (a String or a
Reader), and three booleans that control whether or not the value will be
indexed for searches, tokenized prior to indexing, and stored in the index
so it can be returned with the search.
Let me explain those three booleans a bit more.
- *Indexed for searches* - sometimes you'll want to have fields available
in your Documents that don't really have anything to do with searching. Two
examples I can think of off the top of my head are creation dates and file
names, so you can compare when the Document was created against the file
modification date, and decide if the document needs to be reindexed. Since
these fields won't ever make sense to use in an actual search, you can
decrease the amount of work Lucene does by marking them as not indexed for
searches.
- *Tokenized prior to indexing* - tokenizing refers to taking a piece of
text and cleaning it up, and breaking it down into individual pieces
(tokens) for the indexer. This is done by the Analyzer. Some fields you may
not want to be tokenized, for example a serial number field.
- *Stored in the index* - even if a field is entirely indexed, it doesn't
necessarily mean that it'll be easy for Lucene to reconstruct it. Although
Lucene is a search index, and not a database, if your fields are reasonably
small, you can ask Lucene to store them in the index. With the fields stored
in the index, instead of using the Document to locate the original file or
data and load it, you can actually pull the data out of the Document. This
works best with fairly small fields and documents that you'd need to parse
for display anyway.
Some fields contain bulk data and are so large that you don't really want
to store them in the index. You can still make your life a little easier by
storing not just the filename, but a Reader object in the Field. This makes
it simpler for your application to just get the Reader out of the Hit and
use it to read in the data to display it to the user.
From http://darksleep.com/lucene/
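To make the distinction concrete for myself, here is a toy model in plain Java. It is not the Lucene API: the ToyIndex class, its method names, and the three boolean parameters are all hypothetical, and the "analyzer" is just lowercasing plus splitting on non-alphanumeric characters. The only point it tries to show is that "indexed" and "stored" feed two separate structures: an inverted index (term to document ids) that answers searches, and a per-document copy of the original value that is handed back with a hit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy model only, NOT the Lucene API. All names here are made up for illustration.
final class ToyIndex {
    // "Indexed": inverted index, field name -> term -> ids of documents containing it.
    private final Map<String, Map<String, TreeSet<Integer>>> postings = new HashMap<>();
    // "Stored": per-document copy of the original, unanalyzed field values.
    private final List<Map<String, String>> stored = new ArrayList<>();

    int newDocument() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    void addField(int docId, String name, String value,
                  boolean store, boolean index, boolean tokenize) {
        if (store) {
            stored.get(docId).put(name, value); // kept verbatim, no analysis applied
        }
        if (index) {
            // Stand-in for an Analyzer (assumption): lowercase, split on non-alphanumerics.
            String[] terms = tokenize
                    ? value.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")
                    : new String[] { value }; // untokenized: the whole value is one term
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(name, k -> new HashMap<>())
                        .computeIfAbsent(term, k -> new TreeSet<>())
                        .add(docId);
            }
        }
    }

    // Searching consults only the inverted index, never the stored values.
    List<Integer> search(String field, String term) {
        Map<String, TreeSet<Integer>> terms = postings.get(field);
        if (terms == null || !terms.containsKey(term)) {
            return new ArrayList<>();
        }
        return new ArrayList<>(terms.get(term));
    }

    // Retrieval consults only the stored values; returns null if the field was not stored.
    String storedValue(int docId, String field) {
        return stored.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        int doc = idx.newDocument();
        idx.addField(doc, "title", "Understanding Lucene Fields", true, true, true);
        idx.addField(doc, "body", "Indexing and storing are separate", false, true, true);
        idx.addField(doc, "path", "/docs/a.txt", true, false, false);

        System.out.println(idx.search("body", "storing"));  // searchable ...
        System.out.println(idx.storedValue(doc, "body"));   // ... but not retrievable
        System.out.println(idx.search("path", "/docs/a.txt")); // not searchable ...
        System.out.println(idx.storedValue(doc, "path"));      // ... but retrievable
    }
}
```

If I read the 2.2 Javadoc correctly, the (store, index, tokenize) flags in this sketch correspond roughly to Field.Store.YES or Field.Store.NO combined with Field.Index.TOKENIZED, Field.Index.UN_TOKENIZED, or Field.Index.NO. So the "path" field above plays the role of a Store.YES plus Index.NO field: you cannot search on it, but you get its value back with the hit.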
Sorry for the noise.
-John