Posted to java-user@lucene.apache.org by cybercouf <cy...@free.fr> on 2007/03/16 10:49:09 UTC

How the Field.Store flag works?

I'm using Lucene to index my Nutch crawls, but I don't really understand
the difference between Field.Store.YES and Field.Store.NO. Using Luke, it
seems I can still read some data from fields that were not stored with
Store.YES. Where is this data kept if it's not in the index? And which
setting is better for small fields (and for medium ones)?
Thanks for shedding some light on this!
-- 
View this message in context: http://www.nabble.com/How-the-Field.Store-flag-works--tf3413510.html#a9511458
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How the Field.Store flag works?

Posted by Erick Erickson <er...@gmail.com>.
This confused me at first too, so here's my current understanding...
When you use YES, you store the actual data as-is with the document.
This is entirely independent of indexing. Internally, I assume that
searching and storing are separate parts of the index that
have nothing to do with each other.

When you use NO (and here I assume you index the data, because
not storing it and not indexing it is logically a no-op), the relevant
terms are put in the index (stop words possibly removed) but the
original text is NOT stored with the document.

Programmatically, you can't say something like
doc.get("field") on a field that's not stored.
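
To make the split concrete, here is a minimal toy model in plain Java. It is
NOT how Lucene is implemented; ToyIndex, addField, get and search are
hypothetical names made up for this sketch. It only assumes what is described
above: stored text lives with the document, indexed terms live in a separate
structure, and the two flags are independent.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only, not Lucene's implementation. All names are hypothetical.
class ToyIndex {
    // "Stored" side: docId -> (field name -> original text, kept as-is)
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();
    // "Indexed" side: "field:term" -> ids of the documents containing the term
    private final Map<String, List<Integer>> postings = new HashMap<>();

    void addField(int docId, String field, String text, boolean store, boolean index) {
        if (store) {
            // Roughly the Store.YES case: keep the original text with the document
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(field, text);
        }
        if (index) {
            // Roughly the indexed case: only the individual terms go into the postings
            for (String term : text.trim().toLowerCase().split("\\s+")) {
                List<Integer> docs =
                        postings.computeIfAbsent(field + ":" + term, k -> new ArrayList<>());
                if (!docs.contains(docId)) {
                    docs.add(docId);
                }
            }
        }
    }

    // Analogue of doc.get("field"): there is nothing to return for an unstored field
    String get(int docId, String field) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? null : fields.get(field);
    }

    // Analogue of a single-term search: works only on fields that were indexed
    List<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(), new ArrayList<>());
    }
}
```

With a "title" field added as stored but not indexed, and a "body" field added
as indexed but not stored, get(1, "title") returns the original text while
get(1, "body") returns null, yet search("body", "indexing") still finds
document 1.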

When you do NOT store data and use Luke's "reconstruct and edit"
button, you will probably not get an entirely accurate version of
the document. What I believe is happening (although I haven't
been in the guts of Luke) is that it uses something like TermEnum
to walk all the terms in the index and orders them by position for each
unstored field in the particular document. Conceptually, I think it's
something like

   For each term in the index
        Assemble an ordered list of the termpositions in this document.
   now merge all those lists by termposition.

So stemmed terms may/may not come back correctly. I don't think
you get stopwords. Etc. Luke does its best to reassemble unstored
fields from the index data, but for unstored data you'll see, in big red
letters, "RESTORED content ONLY - check for errors!" It's
inevitably a lossy process.
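
The conceptual loop above can be sketched in plain Java. This is only a guess
at the idea, not Luke's actual code; ToyReconstruct, its stop-word list and
its tokenizer are assumptions made for illustration, and there is no stemming
in this toy. It shows why the rebuilt text is lossy: case, punctuation and
stop words never made it into the term data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy sketch only, not Luke's implementation. All names are hypothetical.
class ToyReconstruct {
    // Assumed stop-word list, chosen just for this example
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "is"));

    // Index side for one field of one document: term -> positions of that term.
    // Tokens are lowercased, punctuation is dropped, stop words are skipped.
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> termPositions = new TreeMap<>();
        String[] tokens = text.toLowerCase().split("\\W+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty() || STOP_WORDS.contains(tokens[pos])) {
                continue;
            }
            termPositions.computeIfAbsent(tokens[pos], k -> new ArrayList<>()).add(pos);
        }
        return termPositions;
    }

    // For each term, take its positions in this document, then merge by position.
    static String reconstruct(Map<String, List<Integer>> termPositions) {
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : termPositions.entrySet()) {
            for (int pos : entry.getValue()) {
                byPosition.put(pos, entry.getKey());
            }
        }
        return String.join(" ", byPosition.values());
    }
}
```

Feeding "The Quick fox is the King of Foxes." through index() and then
reconstruct() gives back "quick fox king foxes": the word order survives, but
the stop words, capitalization and punctuation are gone.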

As for your question of which is better for small or large fields... It's
not a relevant question. A better question is "Will I ever need the
field exactly as it was originally?". If the answer is YES, store it.

Think of them as two independent questions.
Do I need to show the original to the user? Store.YES. Otherwise Store.NO.
Do I need to search the data? Index.TOKENIZED/UN_TOKENIZED. Otherwise Index.NO.

You do NOT need to store data to search for it.

In general, IMO, it's better to not store the data if you don't need
it since the index that results is significantly smaller if you don't
store data.

On large indexes, BTW, Luke takes a LONG time to reconstruct
a document. It has to do a lot of work behind the scenes.

So think of indexing and storing as putting the data in different
places. Indexing data puts it in with all the searchable terms.
Storing it puts it with the document. Indexing is for finding things,
and storing is for showing the original to the user. You can do
either or both.

I'm not sure whether this has added more confusion or cleared
things up, but at least it's a try <G>.

Best
Erick
