Posted to java-user@lucene.apache.org by ro...@earthlink.net on 2009/02/17 20:19:06 UTC

newbie seeking explanation of semantics of "Field" class

R2.4 

I have been looking through the soon-to-be-superseded (by its 2nd ed.) book "Lucene In Action" (hope it's ok on this newsgroup to say I like that book); also at these two tutorials: http://darksleep.com/lucene/ and http://www.informit.com/articles/article.aspx?p=461633&seqNum=3 and also at the Lucene online docco (http://lucene.apache.org/java/2_4_0/index.html) the last of which has nothing on the topic at all! I've also tried to search http://www.nabble.com/Lucene---Java-Users-f45.html -- but there are almost 10,000 docs there on "Field." so that is too much data. 

The book is consistent with the two tutorials, but all three seem to be out of date (and the design less clear) compared to the code: http://lucene.apache.org/java/2_4_0/api/index.html 

I have copied some code and it is working for me, but I am a little uncertain how to decide what value of Field.Index and Field.Store to choose in order to get the behavior I'd like. If I read the javadocs, and decide to ignore all the "expert" items, it looks like this: 

Field.Store.NO = I'll never see that data again; I wonder why I'd do this? 

Field.Store.YES = good, the data will be stored 

Field.Store.COMPRESS = even better, stored and compressed; why would anyone do anything else? 

========

Field.Index.NO = I cannot search that data, but if I need its value for a given document (e.g., to decorate a result), I can retrieve it (use-case: maybe, the date the document was created -- but why not just make that searchable? I am having a hard time thinking of an actually useful piece of data that could go here and would not want to be one of ANALYZED or NOT_ANALYZED) 

Field.Index.ANALYZED = the normal value, I would guess, except in the special case of stuff not searchable but used to decorate results (Field.Index.NO)

Field.Index.NOT_ANALYZED = I can search for this value, but it won't get analyzed, so it is searched for as the very same value I put in (the docco suggests product numbers: any other interesting use-cases anyone can suggest?) 

========= 

thanks in advance for helping me get clearer on this!

-Paul 






---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: newbie seeking explanation of semantics of "Field" class

Posted by Erick Erickson <er...@gmail.com>.

This confused me on my first encounter, but it all makes
sense after a while....

The first thing to understand is that Store and Index are
orthogonal. That is, when you index a field, that data
is placed in the inverted index and is searchable, whether
or not you store it. But unless you also store it, the
original text is not easily retrievable.

Conversely, when you store the data, a literal copy is
stored, no analysis is done. This preserves case,
stopwords, original word if stemming is used,
punctuation, etc.
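
A toy sketch of that independence in plain Java. This is not Lucene code or Lucene's implementation: the class ToyFieldIndex, its methods, and the crude "analysis" step are all made up for illustration, and the two boolean flags only stand in for the Field.Index and Field.Store choices.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: NOT Lucene's implementation or API.
class ToyFieldIndex {
    // term -> ids of documents containing it (the "inverted index")
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // doc id -> literal field value (the "stored" copy)
    private final Map<Integer, String> stored = new HashMap<>();
    private int nextId = 0;

    /** Adds a one-field document; the two flags are independent of each other. */
    int add(String value, boolean index, boolean store) {
        int id = nextId++;
        if (index) {
            // Crude stand-in for analysis: lower-case, split on non-alphanumerics.
            for (String term : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
                }
            }
        }
        if (store) {
            stored.put(id, value); // literal copy: case and punctuation preserved
        }
        return id;
    }

    /** Returns the ids of documents whose indexed terms include the given term. */
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    /** Returns the literal value, or null if the field was not stored. */
    String retrieve(int id) {
        return stored.get(id);
    }

    public static void main(String[] args) {
        ToyFieldIndex idx = new ToyFieldIndex();
        int ocr = idx.add("The Quick, Brown Fox!", true, false); // indexed, not stored
        int nav = idx.add("Page 21 of 32", false, true);         // stored, not indexed
        System.out.println(idx.search("quick").contains(ocr));   // searchable
        System.out.println(idx.retrieve(ocr));                   // but not retrievable
        System.out.println(idx.search("page").isEmpty());        // not searchable
        System.out.println(idx.retrieve(nav));                   // but retrievable as-is
    }
}
```

The first document can be found by searching but its text cannot be read back; the second is the reverse.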

See below for more details...

On Tue, Feb 17, 2009 at 2:19 PM, <ro...@earthlink.net> wrote:

> Field.Store.NO = I'll never see that data again; I wonder why I'd do this?

Mostly for space reasons. Say you have a version of a document that you
really want to show the user, for instance PDFs or images of pages. We have
cases where we have images of pages in books, and OCRd data of those images.
We'll never want to show the OCR to the user, but that's all we have to
search. What we do show the user is the image of the page from the image
vault. So we don't store the OCR text, just index it.

>
>
> Field.Store.YES = good, the data will be stored

Yes, but this bloats the index. If you're not going to show the user the
field exactly as it exists, there's no reason to store it. See above.

>
>
> Field.Store.COMPRESS = even better, stored and compressed; why would anyone
> do anything else?

Because of the cost involved in decompressing it, assuming you want to store
it in the first place (see above). Say you want to show 500 documents
at a time: the decompression time can add up.

>
>
> ========
>
> Field.Index.NO = I cannot search that data, but if I need its value for a
> given document (e.g., to decorate a result), I can retrieve it (use-case:
> maybe, the date the document was created -- but why not just make that
> searchable? I am having a hard time thinking of an actually useful piece of
> data that could go here and would not want to be one of ANALYZED or
> NOT_ANALYZED)
>

Think harder <G>.... We have an application where we store meta-data in the
document for page navigation. There's no reason to index this data because
we never search on it. We don't want to provide the users with an interface
like "search for Erick on pages 21-32", we don't think it's valuable. One
can solve this kind of problem with an external data store (say a database),
but the added complexity of a second storage mechanism is worth avoiding if
possible.

>
> Field.Index.ANALYZED = the normal value, I would guess, except in the
> special case of stuff not searchable but used to decorate results (
> Field.Index.NO)

Yep, this is the most common case in my experience.

>
>
> Field.Index.NOT_ANALYZED = I can search for this value, but it won't get
> analyzed, so it is searched for as the very same value I put in (the docco
> suggests product numbers: any other interesting use-cases anyone can
> suggest?)

Most analyzers break the stream into tokens. For instance, it would be very
common to break up 1234-5678 into 1234 and 5678. Assume this is a part
number. Matching 1234 is useless. And it's even worse if your analyzer
strips out the numbers entirely. The same goes for SSNs, telephone numbers,
or any case where you don't want to break up on whitespace. In truth I don't
use this very often, but it saves a world of hurt when needed.
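
To make the part-number case concrete, here is a small plain-Java sketch. It does not use any Lucene analyzer: the class PartNumberTerms and the regex-based split are assumptions standing in for "an analyzer that breaks on punctuation", next to the NOT_ANALYZED-style alternative of keeping the whole value as one term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: a hypothetical stand-in for analyzer behavior, not Lucene code.
class PartNumberTerms {
    /** Splits on anything that is not a letter or digit, as many analyzers would. */
    static List<String> analyzed(String value) {
        List<String> terms = new ArrayList<>();
        for (String t : value.split("[^A-Za-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** NOT_ANALYZED-style: the whole value becomes a single term. */
    static List<String> notAnalyzed(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        System.out.println(analyzed("1234-5678"));    // prints [1234, 5678]
        System.out.println(notAnalyzed("1234-5678")); // prints [1234-5678]
    }
}
```

With the split version, a query for 1234 alone would match this part number; with the single-term version, only the exact value 1234-5678 would.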

Best
Erick


Re: newbie seeking explanation of semantics of "Field" class

Posted by Matthew Hall <mh...@informatics.jax.org>.

Comments inline:

rolarenfan@earthlink.net wrote:
> Field.Store.NO = I'll never see that data again; I wonder why I'd do this?
This is useful in the cases where you have data you want to be able to 
search by, but never need to display it.

For example in my application we have complex data like:

kit<Gsfco1>
<http://www.informatics.jax.org/javawi2/servlet/WIFetch?page=alleleDetail&id=MGI:3530308>

In one of our searchable indexes we do quite a bit of transformation to
this data, and remove all of the punctuation, etc.

so it turns into: kit gsfco1

This is great for searching, because it makes punctuation irrelevant to
the search results, but the user simply doesn't care about the transformed
form. So at display time, we show them the unmodified, case-sensitive
version of this data, which is stored in another field.
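
A rough sketch of that two-field idea in plain Java. The actual transformation used in the application above is not shown in this thread, so the helper SearchKeyNormalizer and its regex are assumptions made purely for illustration.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not the real transformation
// used by the application described above.
class SearchKeyNormalizer {
    /** Lower-cases the value and replaces each run of punctuation with one space. */
    static String normalize(String raw) {
        return raw.toLowerCase(Locale.ROOT)
                  .replaceAll("[^a-z0-9]+", " ")
                  .trim();
    }

    public static void main(String[] args) {
        String display = "kit<Gsfco1>";        // stored as-is, shown to the user
        String searchKey = normalize(display); // indexed only, never displayed
        System.out.println(searchKey);         // prints "kit gsfco1"
    }
}
```

The normalized form would go into an indexed, unstored field, and the original string into a stored field used only for display.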
>  
>
> Field.Store.YES = good, the data will be stored 
>
>   
Storage takes up space, so if you are ONLY going to search on a piece of 
data, and never display it, you should not store it.
> Field.Store.COMPRESS = even better, stored and compressed; why would anyone do anything else? 
>
>   
I agree.
> ========
>
> Field.Index.NO = I cannot search that data, but if I need its value for a given document (e.g., to decorate a result), I can retrieve it (use-case: maybe, the date the document was created -- but why not just make that searchable? I am having a hard time thinking of an actually useful piece of data that could go here and would not want to be one of ANALYZED or NOT_ANALYZED) 
>
>   
Correct, you use this type of data as additional information about the 
data you matched on. 
> Field.Index.ANALYZED = the normal value, I would guess, except in the special case of stuff not searchable but used to decorate results (Field.Index.NO)
>
>   
Correct.
> Field.Index.NOT_ANALYZED = I can search for this value, but it won't get analyzed, so it is searched for as the very same value I put in (the docco suggests product numbers: any other interesting use-cases anyone can suggest?) 
>
>   
It's highly useful for exact match searching.


-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012


RE: newbie seeking explanation of semantics of "Field" class

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi Paul,

> I have copied some code and it is working for me, but I am a little
> uncertain how to decide what value of Field.Index and Field.Store to
> choose in order to get the behavior I'd like. If I read the javadocs, and
> decide to ignore all the "expert" items, it looks like this:
> 
> Field.Store.NO = I'll never see that data again; I wonder why I'd do this?

If you have the data somewhere else (like in XML files), it makes no sense
to store it additionally in the index. You would normally store one field
containing the filename or identifier and use it for retrieving the original
document. On the other hand, if you want to have an index-only solution, it
may be good to store the field values in the index, but this may not be needed
for all fields. E.g. you have a field in which you index the whole document
contents and another field where you only index the title (with that you
can search only inside the title or in the whole document). As the title is
part of the whole document contents, it makes no sense to additionally store
it if it's not really needed for displaying results.

> Field.Store.YES = good, the data will be stored

Yes.

> Field.Store.COMPRESS = even better, stored and compressed; why would
> anyone do anything else?

Compressing is very counterproductive for small values (and decreases
performance). Short values like identifiers and so on mostly "compress" to
larger values than before. So only use compress if you have large document
contents where retrieval performance is not important.
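
The size effect is easy to check outside Lucene. The sketch below uses the JDK's java.util.zip.Deflater as a stand-in for deflate-style compression in general; it is an assumption that this matches how Field.Store.COMPRESS behaves closely enough to illustrate the point, and the class CompressSize and the sample strings are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustration only: measures deflate output size for a string.
class CompressSize {
    /** Returns the number of bytes produced by deflating the UTF-8 form of text. */
    static int deflatedSize(String text) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end(); // release native resources
        return out.size();
    }

    public static void main(String[] args) {
        String id = "ID-000123"; // hypothetical short identifier
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            body.append("Lucene stores and indexes document fields. ");
        }
        String large = body.toString();
        System.out.println(id.length() + " bytes -> " + deflatedSize(id));
        System.out.println(large.length() + " bytes -> " + deflatedSize(large));
    }
}
```

The short identifier comes out larger than it went in because of the fixed header and checksum overhead, while the long repetitive text shrinks considerably.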

> ========
> 
> Field.Index.NO = I cannot search that data, but if I need its value for a
> given document (e.g., to decorate a result), I can retrieve it (use-case:
> maybe, the date the document was created -- but why not just make that
> searchable? I am having a hard time thinking of an actually useful piece
> of data that could go here and would not want to be one of ANALYZED or
> NOT_ANALYZED)

E.g. we have this here, to store the original XML document. The XML
document itself does not get indexed directly; only the text contents are indexed.
For result display, I store the XML file in a stored-only field (and
compressed). By the way, you can also store binary data like images (but not
index it).

> Field.Index.ANALYZED = the normal value, I would guess, except in the
> special case of stuff not searchable but used to decorate results
> (Field.Index.NO)
> 
> Field.Index.NOT_ANALYZED = I can search for this value, but it won't get
> analyzed, so it is searched for as the very same value I put in (the docco
> suggests product numbers: any other interesting use-cases anyone can
> suggest?)

All types of identifiers or primary keys, numbers like prices, dates, ...

Uwe