Posted to dev@lucene.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2007/10/22 19:09:41 UTC

Per-field collators

Hoss wrote on the user list...

> Terms when indexed are always ordered lexicographically (using
> Term.compareTo which uses String.compareTo) ... regardless of what
> field or language they are in, so "Range Queries" must do their
> comparisons lexicographically as well.
>
> Because all Terms are indexed in one continuous TermEnum, it would be
> fairly impossible to define different Collators per field at index
> time.
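
To make the quoted point concrete, here is a minimal sketch of how
String.compareTo ordering (UTF-16 code unit order) can differ from a
locale-aware java.text.Collator. The class and method names are made up
for illustration and are not part of Lucene; only the JDK calls are real.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not Lucene code.
public class TermOrderDemo {

    // Sorts with String.compareTo, i.e. by UTF-16 code unit value.
    public static List<String> byCodeUnit(List<String> terms) {
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    // Sorts with the JDK collator for the given locale.
    public static List<String> byCollator(List<String> terms, Locale locale) {
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(Collator.getInstance(locale));
        return sorted;
    }

    public static void main(String[] args) {
        // "\u00e9mile" is "emile" with an acute accent on the first letter.
        List<String> terms = List.of("apple", "Zebra", "\u00e9mile");
        // Code unit order: 'Z' (90) < 'a' (97) < U+00E9 (233).
        System.out.println(byCodeUnit(terms));
        // English collation: a < e-acute < Z.
        System.out.println(byCollator(terms, Locale.ENGLISH));
    }
}
```

Under code unit order, a range from "a" to "z" would miss both "Zebra"
and the accented term, which is why range queries are tied to whatever
order the term dictionary uses.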

If you were to implement per-field Collators, how would you go about  
it?  There's been a long-standing request for KinoSearch to implement  
arbitrary sorting.

The conclusion I reached was that you needed to have a dedicated  
TermEnum for each field, implying individual term dictionary files  
(.tis, .tii).  But maybe there's a better way.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Per-field collators

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Oct 22, 2007, at 10:09 AM, Marvin Humphrey wrote:

> The conclusion I reached was that you needed to have a dedicated  
> TermEnum for each field, implying individual term dictionary files  
> (.tis, .tii).

I realized that I needed to explain this.

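If KS allows users to supply Perl sort subs as collators, the cost
per comparison will be high, since each comparison has to invoke the
user's sub.  This doesn't scale well for large result sets, where a
search-time sort needs many comparisons.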

One solution is to move the sorting cost to index-time for individual  
fields.  Since KS has global field semantics, it's possible to  
associate a collator with a field name, and sort terms within the  
term dictionary by it.  However, using multiple collators within the  
same term dictionary is messy, because it's difficult to decide which  
one you should be using at any given point during a scan.  Using a  
dedicated TermEnum for each field cleans that up.
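
A rough sketch of the idea, in Java since that is what this list speaks:
one sorted term dictionary per field, each ordered by the collator
registered for that field name. Everything here is hypothetical
(PerFieldTermDictionary, setCollator, addTerm, terms and range are
invented names), and an in-memory TreeSet stands in for the per-field
.tis/.tii files. It is not how Lucene or KinoSearch actually store terms.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: one sorted term set per field.
public class PerFieldTermDictionary {
    private final Map<String, Comparator<? super String>> collators = new HashMap<>();
    private final Map<String, TreeSet<String>> dictionaries = new HashMap<>();

    // Associates a collator with a field name; must happen before terms are added.
    public void setCollator(String field, Comparator<? super String> collator) {
        if (dictionaries.containsKey(field)) {
            throw new IllegalStateException("field already has terms: " + field);
        }
        collators.put(field, collator);
    }

    // Adds a term to the field's own dictionary, kept sorted by that field's
    // collator. Terms the collator considers equal collapse into one entry.
    public void addTerm(String field, String text) {
        TreeSet<String> dict = dictionaries.get(field);
        if (dict == null) {
            Comparator<? super String> order = collators.get(field);
            if (order == null) {
                order = Comparator.<String>naturalOrder(); // String.compareTo
            }
            dict = new TreeSet<>(order);
            dictionaries.put(field, dict);
        }
        dict.add(text);
    }

    // Returns the field's terms in that field's collation order.
    public List<String> terms(String field) {
        TreeSet<String> dict = dictionaries.get(field);
        if (dict == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(dict);
    }

    // Inclusive range scan; only one collator is ever in play per scan.
    public List<String> range(String field, String lower, String upper) {
        TreeSet<String> dict = dictionaries.get(field);
        if (dict == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(dict.subSet(lower, true, upper, true));
    }

    public static void main(String[] args) {
        PerFieldTermDictionary index = new PerFieldTermDictionary();
        index.setCollator("title", Collator.getInstance(Locale.ENGLISH));
        for (String term : List.of("apple", "Zebra", "mango")) {
            index.addTerm("title", term); // English collation
            index.addTerm("id", term);    // default String.compareTo order
        }
        System.out.println(index.terms("title"));           // [apple, mango, Zebra]
        System.out.println(index.terms("id"));              // [Zebra, apple, mango]
        System.out.println(index.range("title", "a", "zz")); // [apple, mango, Zebra]
        System.out.println(index.range("id", "a", "zz"));    // [apple, mango]
    }
}
```

The point of the sketch is that the collation cost is paid once, when a
term is inserted, and that a scan over one field never has to decide
which collator applies.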

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


