Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/10/06 16:19:27 UTC

[jira] [Commented] (LUCENE-6828) Speed up requests for many rows

    [ https://issues.apache.org/jira/browse/LUCENE-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945088#comment-14945088 ] 

Adrien Grand commented on LUCENE-6828:
--------------------------------------

Since you opened this issue, I'm wondering if you have more information about the use-cases of users who request such large pages? I think some of the users who execute such requests are in fact trying to export a subset of their indexes, in which case they don't even need sorted results, so no priority queue would be required. And I'd be curious to understand more about the other ones.

Also, since you're playing with priority queues at the moment: I remember getting better results sorting with a ternary heap than with a regular binary heap, I assume because it has better cache efficiency in spite of performing more comparisons. And some people have experimented with making priority queues more cache-efficient, e.g. http://playfulprogramming.blogspot.it/2015/08/cache-optimizing-priority-queue.html
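For reference, the ternary-heap idea can be sketched as below. This is a hypothetical, minimal max-heap over a float[] (the class and layout are illustrative, not Lucene code): children of node i live at 3*i+1..3*i+3, so the tree is shallower and each sift touches fewer, more closely packed array slots.

```java
// Minimal ternary (3-ary) max-heap sketch over a float[].
// Hypothetical illustration only -- not Lucene's PriorityQueue.
final class TernaryHeap {
  private final float[] heap;
  private int size;

  TernaryHeap(int capacity) { heap = new float[capacity]; }

  void push(float v) {
    int i = size++;
    while (i > 0) {                      // sift up toward the root
      int parent = (i - 1) / 3;
      if (heap[parent] >= v) break;
      heap[i] = heap[parent];
      i = parent;
    }
    heap[i] = v;
  }

  float pop() {
    float top = heap[0];
    float last = heap[--size];
    int i = 0;
    while (true) {                       // sift down: pick largest of up to 3 children
      int first = 3 * i + 1;
      if (first >= size) break;
      int largest = first;
      for (int c = first + 1; c < Math.min(first + 3, size); c++) {
        if (heap[c] > heap[largest]) largest = c;
      }
      if (heap[largest] <= last) break;
      heap[i] = heap[largest];
      i = largest;
    }
    heap[i] = last;
    return top;
  }
}
```

The trade-off is visible in the sift-down loop: each level compares up to three children instead of two, but the heap has log3 rather than log2 depth, and sibling slots are adjacent in the array.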

> Speed up requests for many rows
> -------------------------------
>
>                 Key: LUCENE-6828
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6828
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 4.10.4, 5.3
>            Reporter: Toke Eskildsen
>            Priority: Minor
>              Labels: memory, performance
>
> Standard relevance-ranked searches for top-X results use the HitQueue class to keep track of the highest-scoring documents. The HitQueue is a binary heap of ScoreDocs and is pre-filled with sentinel objects upon creation.
> Binary heaps of Objects in Java do not scale well: The HitQueue uses 28 bytes/element and memory access is scattered due to the binary heap algorithm and the use of Objects. To make matters worse, the use of sentinel objects means that even if only a tiny number of documents match, the full number of Objects is still allocated.
> As long as the HitQueue is small (< 1000), it performs very well. If top-1M results are requested, it performs poorly and leaves 1M ScoreDocs to be garbage collected.
> An alternative is to replace the ScoreDocs with a single array of packed longs, each long holding the score and the document ID. This strategy requires only 8 bytes/element and is a lot lighter on the GC.
> Some preliminary tests have been run and published at https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/
> These indicate that a long[]-backed implementation is at least 3x faster than the vanilla HitQueue for top-1M requests.
> For smaller requests, such as top-10, the packed version also seems competitive when the number of matched documents exceeds 1M. This needs to be investigated further.
> Going forward with this idea requires some refactoring, as Lucene is currently hardwired to the abstract PriorityQueue. Before attempting this, it seems prudent to discuss whether speeding up large top-X requests has any value. Paging seems an obvious contender for requesting large result sets, but I guess the two could work in tandem, opening the door to efficient large pages.
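The packed-long idea in the description above can be sketched roughly like this. The encoding is an assumption for illustration (score bits in the upper half of the long, doc ID in the lower half; class and method names are hypothetical): for non-negative scores, Float.floatToIntBits preserves ordering, so the packed longs compare like (score, docID) pairs and a plain long[] can replace the heap of ScoreDoc objects.

```java
// Hypothetical sketch of packing a hit into a single long:
// upper 32 bits = score bits, lower 32 bits = doc ID.
// For non-negative scores the raw float bits are order-preserving,
// so packed values can be compared directly as longs.
final class PackedHits {
  static long pack(float score, int docId) {
    return (((long) Float.floatToIntBits(score)) << 32) | (docId & 0xFFFFFFFFL);
  }

  static float score(long packed) {
    return Float.intBitsToFloat((int) (packed >>> 32));  // upper 32 bits
  }

  static int docId(long packed) {
    return (int) packed;                                 // lower 32 bits
  }
}
```

At 8 bytes/element in one primitive array, this avoids both the per-ScoreDoc object overhead and the scattered pointer chasing mentioned in the description.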



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org