Posted to dev@lucene.apache.org by "Christoph Kaser (JIRA)" <ji...@apache.org> on 2019/03/12 12:27:00 UTC

[jira] [Commented] (LUCENE-8542) Provide the LeafSlice to CollectorManager.newCollector to save memory on small index slices

    [ https://issues.apache.org/jira/browse/LUCENE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790485#comment-16790485 ] 

Christoph Kaser commented on LUCENE-8542:
-----------------------------------------

Is there anything I can change or add to get this committed? Or do you think it does not make sense for the general use case of Lucene?

> Provide the LeafSlice to CollectorManager.newCollector to save memory on small index slices
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8542
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8542
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Christoph Kaser
>            Priority: Minor
>         Attachments: LUCENE-8542.patch
>
>
> I have an index consisting of 44 million documents spread across 60 segments. When I run a query against this index with a huge number of results requested (e.g. 5 million), this query uses more than 5 GB of heap if the IndexSearcher was configured to use an ExecutorService.
> (I know this kind of query is fairly unusual and it would be better to use paging and searchAfter, but our architecture does not allow this at the moment.)
> The reason for the huge memory requirement is that the search [will create a TopScoreDocCollector for each segment|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L404], each one with numHits = 5 million. This is fine for the large segments, but many of those segments are fairly small and only contain several thousand documents. This wastes a huge amount of memory for queries with large values of numHits on indices with many segments.
> Therefore, I propose to change the CollectorManager interface in the following way:
>  * Change the newCollector method to accept a LeafSlice parameter that can be used to determine the total number of documents in that slice.
>  * Maybe, in order to remain backwards compatible, this could be introduced as a new method with a default implementation that calls the old method; otherwise, it probably has to wait for Lucene 8.
>  * This can then be used to cap numHits for each TopScoreDocCollector to the LeafSlice size.
> If this is something that would make sense for you, I can try to provide a patch.
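The proposed API change and the numHits cap described above could be sketched roughly as follows. This is a hypothetical illustration using simplified stand-in types, not the attached patch: the real CollectorManager also declares a reduce() method and throws IOException, and the real IndexSearcher.LeafSlice holds an array of leaves rather than a single doc count.

```java
// Hypothetical sketch of the proposed change, using simplified stand-ins
// for Lucene's CollectorManager and IndexSearcher.LeafSlice. Names and
// signatures are assumptions for illustration, not the actual patch.
public class SliceAwareSketch {

    // Stand-in for IndexSearcher.LeafSlice, reduced here to its doc count.
    static class LeafSlice {
        final int maxDoc;
        LeafSlice(int maxDoc) { this.maxDoc = maxDoc; }
    }

    // Simplified CollectorManager with the proposed slice-aware method.
    interface CollectorManager<C> {
        C newCollector();

        // Proposed addition: the default implementation delegates to the
        // old method, so existing CollectorManager implementations keep
        // compiling and working unchanged.
        default C newCollector(LeafSlice slice) {
            return newCollector();
        }
    }

    // Cap the requested numHits to the slice size, so a 30k-document slice
    // never allocates a 5-million-entry priority queue.
    static int cappedNumHits(int numHits, LeafSlice slice) {
        return Math.min(numHits, slice.maxDoc);
    }

    public static void main(String[] args) {
        int numHits = 5_000_000;
        LeafSlice small = new LeafSlice(30_000);
        // With the cap, the collector for this slice needs only 30,000 slots.
        System.out.println(cappedNumHits(numHits, small)); // prints 30000
    }
}
```

A default method is what makes this change backwards compatible: managers that do not care about the slice are untouched, while IndexSearcher can call the slice-aware overload and slice-aware managers can size their collectors accordingly.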



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org