
Posted to java-user@lucene.apache.org by Michael Sokolov <so...@ifactory.com> on 2012/07/11 02:54:14 UTC

storing pre-analyzed fields

I have a question about the API for storing and indexing Lucene
documents (in 3.x).

If I want to index a document by providing a TokenStream, I can do that
by calling document.add(field), where field is something I write that
derives from AbstractField and returns the TokenStream from
tokenStreamValue(), and nothing from stringValue() or readerValue().

Now if I also want to store a value for that field, do I just add a
different field with different options (e.g. stored=true, using a
normal Field)?

Do these two things conflict in any way?  Do I have to be careful about 
the order in which I do them?  Or is it just a mildly weird API with no 
lurking ill effects? :)

Also: I have been seeing various e-mails about changes to this API so I 
assume it's all different in 4.0; if you want to take this opportunity 
to explain that, please go ahead, but for now I am working with the 3.x API.

Thanks

-Mike Sokolov

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: storing pre-analyzed fields

Posted by Michael Sokolov <so...@ifactory.com>.

Uwe - thank you very much for the thorough explanation!

-Mike

On 7/11/2012 1:14 AM, Uwe Schindler wrote:




RE: storing pre-analyzed fields

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi Mike,

The order does not matter in any version of Lucene. You also don't need
to subclass AbstractField (although NumericField is an example of how to
do that); it is enough to use new Field(name, tokenStream). If you also
want to store this field, simply add a stored-only field with the *same*
name, in addition to the TokenStream one.
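
A minimal sketch of that approach, assuming Lucene 3.6 on the classpath.
The class name, the "body" field name, the in-memory RAMDirectory and the
WhitespaceAnalyzer (used here only as a stand-in source for a pre-analyzed
TokenStream) are illustrative choices, not something from the original
question:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical example class; a sketch against the Lucene 3.6 API.
class PreAnalyzedFieldExample {

    /**
     * Indexes one document whose "body" field is indexed from a TokenStream
     * and stored from a separate string, then returns the stored value of
     * the first hit for the given term (or null if nothing matches).
     */
    static String indexAndLookup(String storedValue, String textToAnalyze, String term) {
        try {
            Directory dir = new RAMDirectory();
            Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
            IndexWriter writer =
                new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_36, analyzer));

            // Stand-in for whatever produces your pre-analyzed tokens.
            TokenStream tokens = analyzer.tokenStream("body", new StringReader(textToAnalyze));

            Document doc = new Document();
            // Indexed-only: the tokens come straight from the TokenStream.
            doc.add(new Field("body", tokens));
            // Stored-only field with the same name; it is never analyzed.
            doc.add(new Field("body", storedValue, Field.Store.YES, Field.Index.NO));
            writer.addDocument(doc);
            writer.close();

            IndexReader reader = IndexReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            try {
                TopDocs hits = searcher.search(new TermQuery(new Term("body", term)), 1);
                if (hits.totalHits == 0) {
                    return null;
                }
                return searcher.doc(hits.scoreDocs[0].doc).get("body");
            } finally {
                searcher.close();
                reader.close();
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            indexAndLookup("The Quick Brown Fox", "the quick brown fox", "quick"));
    }
}
```

The stored string and the analyzed text are deliberately different here, to
show that the stored value is returned as-is and plays no part in matching.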

In Lucene 4.0 we are moving toward separating the "Document" objects used
for indexing from those returned by IndexReader/IndexSearcher, because
they are two different things and the latter contain only stored fields.
But this does not affect anything here.

In all Lucene versions, stored field values and indexed values are
completely decoupled and do not relate to each other at all. Adding a
Field as stored+indexed is just a convenience; you can also add it twice
(once as stored, once as indexed - I prefer to always do this), in any
order. The resulting index will be identical (but don't compare the
files; there will be differences in headers!).
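
To make the two variants concrete, here is a rough sketch, again assuming
Lucene 3.6. The class, method and field names are hypothetical. Both
documents should behave the same way from the point of view of searching
and of stored-field retrieval:

```java
import java.io.IOException;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical example class; a sketch against the Lucene 3.6 API.
class StoredPlusIndexedExample {

    /** Convenience form: one Field that is both stored and indexed. */
    static Document combined(String text) {
        Document doc = new Document();
        doc.add(new Field("title", text, Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }

    /** Decoupled form: one stored-only Field plus one indexed-only Field. */
    static Document split(String text) {
        Document doc = new Document();
        doc.add(new Field("title", text, Field.Store.YES, Field.Index.NO));
        doc.add(new Field("title", text, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }

    /** Indexes the document alone, returns its stored "title" if the term matches. */
    static String lookup(Document doc, String term) {
        try {
            Directory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_36, new WhitespaceAnalyzer(Version.LUCENE_36)));
            writer.addDocument(doc);
            writer.close();

            IndexReader reader = IndexReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            try {
                TopDocs hits = searcher.search(new TermQuery(new Term("title", term)), 1);
                if (hits.totalHits == 0) {
                    return null;
                }
                return searcher.doc(hits.scoreDocs[0].doc).get("title");
            } finally {
                searcher.close();
                reader.close();
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup(combined("hello world"), "world"));
        System.out.println(lookup(split("hello world"), "world"));
    }
}
```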

There is one case where order matters: fields with the same name and the
same type. Two stored fields with the same name are returned by
IndexReader/IndexSearcher in the order in which they were added, and two
indexed fields with the same name yield the same token order for e.g.
PhraseQuery or SpanQuery only if the Field order is fixed. But you can
interleave the stored and indexed Field instances as you like.
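
A small sketch of where order does show up, assuming Lucene 3.6 and an
analyzer with the default position increment gap of 0 (WhitespaceAnalyzer
here). The class name, the "body" field name and the sample strings are
made up for illustration. Stored and indexed instances are interleaved on
purpose:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical example class; a sketch against the Lucene 3.6 API.
class FieldOrderExample {

    /** Indexes a single document with one stored and one indexed "body" field per part. */
    static Directory index(String... parts) {
        try {
            Directory dir = new RAMDirectory();
            Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
            IndexWriter writer =
                new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_36, analyzer));
            Document doc = new Document();
            for (String part : parts) {
                // Interleaving stored and indexed instances is fine; only the
                // relative order within each kind is significant.
                doc.add(new Field("body", part, Field.Store.YES, Field.Index.NO));
                doc.add(new Field("body", analyzer.tokenStream("body", new StringReader(part))));
            }
            writer.addDocument(doc);
            writer.close();
            return dir;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    /** Stored "body" values of the only document, in the order Lucene returns them. */
    static String[] storedValues(Directory dir) {
        try {
            IndexReader reader = IndexReader.open(dir);
            try {
                return reader.document(0).getValues("body");
            } finally {
                reader.close();
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    /** Number of documents matching the exact two-term phrase on "body". */
    static int phraseHits(Directory dir, String first, String second) {
        try {
            IndexReader reader = IndexReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            try {
                PhraseQuery query = new PhraseQuery();
                query.add(new Term("body", first));
                query.add(new Term("body", second));
                return searcher.search(query, 1).totalHits;
            } finally {
                searcher.close();
                reader.close();
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Directory inOrder = index("quick fox", "jumps over");
        Directory reversed = index("jumps over", "quick fox");
        System.out.println(phraseHits(inOrder, "fox", "jumps"));
        System.out.println(phraseHits(reversed, "fox", "jumps"));
    }
}
```

The phrase "fox jumps" spans the boundary between the two indexed field
instances, so whether it matches depends on the order in which those
instances were added to the document.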

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



