Posted to java-user@lucene.apache.org by Matt Turner <m4...@hotmail.com> on 2009/06/25 22:32:58 UTC

Order of fields within a Document in Lucene 2.4+

The Lucene FAQ says...

What is the order of fields returned by Document.fields()?
* Fields are returned in the same order they were added to the document.
(This is now getFields(), since fields() is deprecated.)

However, I think this may no longer be the case in 2.4.

We add fields to our documents in a specific order so that our FieldSelector can return LOAD_AND_BREAK as early as possible.
Our documents typically have 50 indexed fields, but when we load results with .doc(), we know we only need 4 of them.

So our code ensures that these 4 fields are added first, and once the 4th field is loaded we break out of the selector.

This speeds us up by an order of magnitude.
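
To illustrate the early-break pattern described above, here is a minimal, self-contained sketch. It does not use Lucene: Decision, EarlyBreakSelector, LoadResult and load() are hypothetical stand-ins for FieldSelectorResult, a FieldSelector implementation and the stored-field reading loop, and the field names are invented. It assumes the reader visits stored fields in their stored order, stops as soon as the selector answers LOAD_AND_BREAK, and that each wanted field occurs once per document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EarlyBreakSketch {

    /** Hypothetical stand-in for Lucene's FieldSelectorResult values. */
    enum Decision { LOAD, LOAD_AND_BREAK, NO_LOAD }

    /** Loads the wanted fields and signals a break once the last one is seen. */
    static class EarlyBreakSelector {
        private final Set<String> wanted;
        private int seen = 0;

        EarlyBreakSelector(String... wantedFields) {
            this.wanted = new HashSet<>(Arrays.asList(wantedFields));
        }

        Decision accept(String fieldName) {
            if (!wanted.contains(fieldName)) {
                return Decision.NO_LOAD;
            }
            seen++;
            // Assumption: each wanted field occurs exactly once per document.
            return seen == wanted.size() ? Decision.LOAD_AND_BREAK : Decision.LOAD;
        }
    }

    /** What the simulated load produced and how many fields it had to visit. */
    static class LoadResult {
        final List<String> loaded = new ArrayList<>();
        int visited = 0;
    }

    /** Simulated reader loop: walks fields in stored order, honouring the selector. */
    static LoadResult load(List<String> storedOrder, EarlyBreakSelector selector) {
        LoadResult result = new LoadResult();
        for (String name : storedOrder) {
            result.visited++;
            Decision decision = selector.accept(name);
            if (decision == Decision.NO_LOAD) {
                continue;
            }
            result.loaded.add(name);
            if (decision == Decision.LOAD_AND_BREAK) {
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The 4 wanted fields are stored first, followed by 46 others.
        List<String> stored = new ArrayList<>(Arrays.asList("id", "title", "url", "date"));
        for (int i = 0; i < 46; i++) {
            stored.add("extra" + i);
        }
        LoadResult result = load(stored, new EarlyBreakSelector("id", "title", "url", "date"));
        // Prints: [id, title, url, date] after visiting 4 of 50
        System.out.println(result.loaded + " after visiting " + result.visited + " of " + stored.size());
    }
}
```

The saving depends entirely on the wanted fields being visited first; if the visit order changes, the selector still returns the right fields but has to walk past more of the document to find them.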

However, we are finding that our field selector is processing fields in alphabetical order, not order of addition. This means that we'd have to rename our fields to 'aaa..' in order to guarantee they'd be processed first.
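
The effect can be sketched without Lucene. The following self-contained example (all names and field names are hypothetical) counts how many fields a break-after-last-wanted selector would visit when fields are seen in addition order, in sorted order, and in sorted order after the 'aaa' rename mentioned above. It assumes a plain String sort, which may not match the exact comparison Lucene uses internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SortedOrderSketch {

    /** Counts how many fields are visited before every wanted field has been seen. */
    static int visitedBeforeBreak(List<String> visitOrder, Set<String> wanted) {
        int visited = 0;
        int seen = 0;
        for (String name : visitOrder) {
            visited++;
            if (wanted.contains(name) && ++seen == wanted.size()) {
                break; // corresponds to returning LOAD_AND_BREAK
            }
        }
        return visited;
    }

    /** Returns a copy of the names in natural String order. */
    static List<String> sortedCopy(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Wanted fields are added first, as in the original post.
        List<String> added = Arrays.asList(
                "title", "url", "id", "date", "author", "body", "category", "summary", "zip");
        Set<String> wanted = new HashSet<>(Arrays.asList("title", "url", "id", "date"));

        System.out.println(visitedBeforeBreak(added, wanted));             // 4
        System.out.println(visitedBeforeBreak(sortedCopy(added), wanted)); // 8

        // Workaround: prefix the wanted fields so they sort first.
        List<String> renamed = Arrays.asList(
                "aaa_title", "aaa_url", "aaa_id", "aaa_date", "author", "body", "category", "summary", "zip");
        Set<String> renamedWanted = new HashSet<>(
                Arrays.asList("aaa_title", "aaa_url", "aaa_id", "aaa_date"));
        System.out.println(visitedBeforeBreak(sortedCopy(renamed), renamedWanted)); // 4
    }
}
```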

I think, but am not sure, that this bit of code causes the problem (as spotted in http://www.mail-archive.com/java-user@lucene.apache.org/msg24105.html).
It seems to have been introduced in version 2.4 (fields are in addition order in 2.3.2).

DocFieldProcessorPerThread.java:

// If we are writing vectors then we must visit
// fields in sorted order so they are written in
// sorted order. TODO: we actually only need to
// sort the subset of fields that have vectors
// enabled; we could save [small amount of] CPU
// here.
quickSort(fields, 0, fieldCount-1);

This appears to sort all fields into alphabetical order.

Assuming that implementing the TODO would keep fields in order of addition (and sort only the term-vector fields among themselves), is it worth raising a JIRA issue to fix this?
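
One possible reading of that TODO, sketched outside Lucene: sort only the vector-enabled fields among themselves and leave every other field in the position where it was added. FieldInfo and sortVectorFieldsOnly below are hypothetical and much simpler than the real per-thread field structures; this is not the change made for any Lucene release, just an illustration of the idea under the assumption that only term-vector fields need to be visited in sorted order.

```java
import java.util.ArrayList;
import java.util.List;

public class PartialSortSketch {

    /** Hypothetical, simplified field descriptor; not a Lucene class. */
    static final class FieldInfo {
        final String name;
        final boolean hasVectors;

        FieldInfo(String name, boolean hasVectors) {
            this.name = name;
            this.hasVectors = hasVectors;
        }
    }

    /**
     * Sorts only the vector-enabled entries by name and writes them back into
     * the slots that vector-enabled entries occupied. All other entries keep
     * their original (addition-order) positions.
     */
    static void sortVectorFieldsOnly(FieldInfo[] fields) {
        List<Integer> slots = new ArrayList<>();
        List<FieldInfo> vectorFields = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].hasVectors) {
                slots.add(i);
                vectorFields.add(fields[i]);
            }
        }
        vectorFields.sort((a, b) -> a.name.compareTo(b.name));
        for (int k = 0; k < slots.size(); k++) {
            fields[slots.get(k)] = vectorFields.get(k);
        }
    }

    /** Convenience helper for printing and checking the resulting order. */
    static List<String> names(FieldInfo[] fields) {
        List<String> result = new ArrayList<>();
        for (FieldInfo f : fields) {
            result.add(f.name);
        }
        return result;
    }

    public static void main(String[] args) {
        FieldInfo[] fields = {
            new FieldInfo("id", false),
            new FieldInfo("title", false),
            new FieldInfo("body", true),
            new FieldInfo("url", false),
            new FieldInfo("abstract", true),
        };
        sortVectorFieldsOnly(fields);
        // Prints: [id, title, abstract, url, body]
        System.out.println(names(fields));
    }
}
```

With this approach the vector-enabled fields are encountered in sorted order relative to each other, while fields without vectors stay where they were added, which is what an early-break selector relies on.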

regards,

matt


RE: Order of fields within a Document in Lucene 2.4+

Posted by "Sudarsan, Sithu D." <Si...@fda.hhs.gov>.

 

I agree. With Lucene 2.4.1, doc.getFields() returns the fields in
alphabetical order, not the order in which they were added.

Sincerely,
Sithu D Sudarsan



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Order of fields within a Document in Lucene 2.4+

Posted by Michael McCandless <lu...@mikemccandless.com>.

Sorry, yes, this was my fault with the indexing speedups in 2.3
(LUCENE-843): as of 2.3, if any fields have term vectors enabled, the
fields are sorted lexicographically.  As of 2.4 (LUCENE-1301,
refactoring the indexing core), that sort happens even without term
vectors.

Hoss, I see you've opened an issue for this (LUCENE-1727);
I'll take that and fix it for 2.9.

Sorry,

Mike

On Tue, Jun 30, 2009 at 9:20 PM, Mark Miller <ma...@gmail.com> wrote:
> Yeah, I've heard rumblings about this issue before. I can't remember what
> patch changed it though - one of Mike M's I think?
>
> On Tue, Jun 30, 2009 at 8:40 PM, Chris Hostetter
> <ho...@fucit.org> wrote:
>
>>
>> Hmmm... i'm not an expert on the internals of indexing, and i don't use
>> FieldSelectors much, but this seems like a pretty big bug to me ... or at
>> the very least: a change in behavior that completely eliminates the value
>> of LOAD_AND_BREAK.
>>
>> https://issues.apache.org/jira/browse/LUCENE-1727
>>
>> -Hoss

Re: Order of fields within a Document in Lucene 2.4+

Posted by Mark Miller <ma...@gmail.com>.

Yeah, I've heard rumblings about this issue before. I can't remember what
patch changed it though - one of Mike M's I think?

On Tue, Jun 30, 2009 at 8:40 PM, Chris Hostetter
<ho...@fucit.org> wrote:

>
> Hmmm... i'm not an expert on the internals of indexing, and i don't use
> FieldSelectors much, but this seems like a pretty big bug to me ... or at
> the very least: a change in behavior that completely eliminates the value
> of LOAD_AND_BREAK.
>
> https://issues.apache.org/jira/browse/LUCENE-1727
> -Hoss

Re: Order of fields within a Document in Lucene 2.4+

Posted by Chris Hostetter <ho...@fucit.org>.

Hmmm... i'm not an expert on the internals of indexing, and i don't use 
FieldSelectors much, but this seems like a pretty big bug to me ... or at 
the very least: a change in behavior that completely eliminates the value 
of LOAD_AND_BREAK.

https://issues.apache.org/jira/browse/LUCENE-1727



-Hoss

