Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2021/04/27 08:37:00 UTC

[jira] [Commented] (LUCENE-8069) Allow index sorting by field length

    [ https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333040#comment-17333040 ] 

Adrien Grand commented on LUCENE-8069:
--------------------------------------

Since I was playing with the MSMarco passages dataset for other reasons, I wanted to give this change another try with the first 1000 queries from the `eval` file. Unlike the Wikipedia tasks file, queries in this dataset have many terms, often 5+, sometimes even 10+. All of them are disjunctions.

Lucene defaults:
 - avg: 11ms
 - median: 6ms
 - p90: 28ms
 - p99: 80ms

Index sorted by increasing field length:
 - avg: 7ms
 - median: 2ms
 - p90: 6ms
 - p99: 17ms

This seems to confirm that this approach could be very valuable, especially for tail latencies: p90 drops from 28ms to 6ms and p99 from 80ms to 17ms.
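
For anyone who wants to produce the same kind of summary from their own runs, here is a minimal sketch of how avg/median/p90/p99 could be computed from per-query latencies. It assumes the nearest-rank percentile definition, which may not be what was used for the numbers above. The class name, method names and sample latencies are hypothetical and are not data from this benchmark.

```java
import java.util.Arrays;

final class LatencyStats {

    /**
     * Nearest-rank percentile over an ascending-sorted array:
     * the smallest sample such that at least `percent`% of samples are <= it.
     */
    static long percentile(long[] sortedMillis, int percent) {
        // Integer ceiling of percent * n / 100, clamped to at least rank 1.
        int rank = (percent * sortedMillis.length + 99) / 100;
        return sortedMillis[Math.max(rank, 1) - 1];
    }

    static double average(long[] millis) {
        return Arrays.stream(millis).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Hypothetical per-query latencies in milliseconds (illustration only).
        long[] millis = {3, 1, 4, 1, 5, 9, 2, 6, 5, 35};
        long[] sorted = millis.clone();
        Arrays.sort(sorted);

        System.out.printf(" - avg: %.1fms%n", average(millis));
        System.out.println(" - median: " + percentile(sorted, 50) + "ms");
        System.out.println(" - p90: " + percentile(sorted, 90) + "ms");
        System.out.println(" - p99: " + percentile(sorted, 99) + "ms");
    }
}
```

With so few samples a single slow query dominates p99, which is one reason to report the median and p90 alongside it.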

> Allow index sorting by field length
> -----------------------------------
>
>                 Key: LUCENE-8069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8069
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by field length would mean we would be likely to collect best matches first. Depending on the similarity implementation, this might even allow early termination of the collection of top documents on term queries.
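
The early-termination idea in the description can be illustrated with a toy model. This is a sketch under stated assumptions, not how Lucene implements it: it uses a BM25-style score without the IDF factor (k1 = 1.2, b = 0.75), assumes documents are visited in order of non-decreasing field length, and assumes an upper bound `maxTf` on the term frequency is known in advance. All class, method and parameter names are hypothetical. Under those assumptions, no remaining document can beat the score of a document with `maxTf` occurrences at the current length, so collection can stop once that bound falls to or below the k-th best score seen so far.

```java
import java.util.PriorityQueue;

final class LengthSortedEarlyTermination {
    // Assumed BM25 parameters (illustration only).
    static final double K1 = 1.2, B = 0.75;

    /** BM25-style term score without IDF: increases with tf, decreases with length. */
    static double score(int tf, int length, double avgLength) {
        return tf * (K1 + 1) / (tf + K1 * (1 - B + B * length / avgLength));
    }

    /**
     * docs[i] = {length, tf}, ordered by non-decreasing length.
     * Returns how many documents were scored before collection stopped.
     */
    static int docsScoredForTopK(int[][] docs, int k, int maxTf, double avgLength) {
        PriorityQueue<Double> topK = new PriorityQueue<>(); // min-heap of the k best scores
        int scored = 0;
        for (int[] doc : docs) {
            int length = doc[0], tf = doc[1];
            // Remaining docs are at least this long and have tf <= maxTf,
            // so none of them can score above this bound.
            double bound = score(maxTf, length, avgLength);
            if (topK.size() == k && bound <= topK.peek()) {
                break; // the top k cannot change anymore
            }
            scored++;
            double s = score(tf, length, avgLength);
            if (topK.size() < k) {
                topK.add(s);
            } else if (s > topK.peek()) {
                topK.poll();
                topK.add(s);
            }
        }
        return scored;
    }

    public static void main(String[] args) {
        // Hypothetical postings for one term: {field length, term frequency}.
        int[][] docs = {{5, 2}, {6, 1}, {8, 2}, {20, 1}, {40, 2}, {100, 1}, {200, 2}};
        int scored = docsScoredForTopK(docs, 2, 2, 50.0);
        System.out.println("scored " + scored + " of " + docs.length + " docs");
    }
}
```

The bound only works because the score is monotonic in both inputs. Without a cap on term frequency, a BM25-style score has no length-dependent upper bound, which is one way to read the "depending on the similarity implementation" caveat.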



