Posted to dev@lucene.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2009/03/04 08:41:56 UTC

[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

    [ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678623#action_12678623 ] 

Uwe Schindler commented on LUCENE-1372:
---------------------------------------

For TrieRange, the proposed variant of sorting by the lowest term in the TermEnum is absolutely fine.

Sorting by the first term in the document is simply not possible. It might work if the term positions were used during array creation, but that would slow things down, and it only works with real tokenized fields, not fields like TrieRange.
TrieRange does not use String/StringIndex sorting; the ordering is done using the raw long/int values. The arrays are filled and the SortFields are instantiated using a custom FieldCache.Parser (see LUCENE-1478). So if sorting uses the lowest term (which in TrieRange is always the highest-precision one), the order will be correct.

In the current version, the results would be sorted using the last term in the TermEnum, which is the lowest-precision one. The order is then simply too imprecise, because the terms TrieRange indexes at lower precision have the lower int/long bits stripped away.
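
To illustrate why sorting on a low-precision term is too coarse, here is a minimal sketch. It is not the actual TrieRange encoding: stripLowBits is a hypothetical helper that only clears the lowest bits of a long, which is the effect described above. Two different values collapse to the same number once 8 bits are stripped, so a sort on that term can no longer tell them apart.

```java
public class TriePrecisionSketch {
    // Hypothetical stand-in for a lower-precision trie term:
    // clears the lowest `shift` bits of the value (0 <= shift < 64).
    static long stripLowBits(long value, int shift) {
        return value & ~((1L << shift) - 1);
    }

    public static void main(String[] args) {
        long a = 1000L;
        long b = 1010L;
        // Full precision (shift 0): the two values stay distinct.
        System.out.println(stripLowBits(a, 0) + " vs " + stripLowBits(b, 0));
        // Lower precision (shift 8): both values collapse to 768.
        System.out.println(stripLowBits(a, 8) + " vs " + stripLowBits(b, 8));
    }
}
```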

The "simple" proposal is enough for TrieRange. Maybe we can add an option to switch between first/last term (and make this option also available to SortFields and other places where the FieldCache is used).
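
A rough sketch of what such a switch could look like, written against a plain TreeMap instead of the real TermEnum/TermDocs API. The class name, sampleIndex(), fill() and the keepFirst flag are all made up for illustration and are not part of Lucene or of the attached patches. The sample data is the "fruit" example from the issue description.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MultiValueFillSketch {
    // Hypothetical stand-in for the term dictionary: term -> ids of the docs containing it.
    static TreeMap<String, List<Integer>> sampleIndex() {
        TreeMap<String, List<Integer>> index = new TreeMap<>();
        index.put("apple", Arrays.asList(1, 3, 4));
        index.put("banana", Arrays.asList(2, 3));
        index.put("zebra", Arrays.asList(4));
        return index;
    }

    // Walks the terms in sorted order and records one term per document.
    // keepFirst = false: later terms overwrite earlier ones (the behaviour described in the issue).
    // keepFirst = true: the lowest term seen for a document is kept.
    static String[] fill(TreeMap<String, List<Integer>> index, int maxDoc, boolean keepFirst) {
        String[] retArray = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : index.entrySet()) {
            for (int doc : entry.getValue()) {
                if (!keepFirst || retArray[doc] == null) {
                    retArray[doc] = entry.getKey();
                }
            }
        }
        return retArray;
    }

    public static void main(String[] args) {
        // Slot 0 stays null because the sample doc ids start at 1.
        System.out.println(Arrays.toString(fill(sampleIndex(), 5, false)));
        System.out.println(Arrays.toString(fill(sampleIndex(), 5, true)));
    }
}
```

With keepFirst set to false, docs 1 to 4 end up as apple, banana, banana, zebra; with keepFirst set to true they end up as apple, banana, apple, apple.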

> Proposal: introduce more sensible sorting when a doc has multiple values for a term
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-1372
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1372
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.2
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: LUCENE-1372-MultiValueSorters.patch, lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting on a field for which multiple values exist for one document. For example, imagine a field "fruit" which is added to a document multiple times, with the values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> If one sorts on the field "fruit", the loop in FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other methods in the various FieldCacheImpl caches) does the following:
>           while (termDocs.next()) {
>             retArray[termDocs.doc()] = t;
>           }
> which means that we loop over the terms in their natural order and, for each one, overwrite retArray[doc] with that term for every document containing it. Effectively, this overwriting means that a string sort in this circumstance will sort by the LAST term lexicographically, so the docs above will effectively be sorted as if they had the single values ("apple", "banana", "banana", "zebra"), which is nonintuitive. Changing this to sort on the first term in the TermEnum seems relatively trivial and low-overhead; while it's not perfect (it's not locale-aware, for example), the behaviour seems much more sensible to me. Interested to see what people think.
> Patch to follow.
