Posted to issues@lucene.apache.org by "Andrzej Bialecki (Jira)" <ji...@apache.org> on 2021/03/10 10:56:00 UTC

[jira] [Updated] (SOLR-15210) ParallelStream should execute hashing & filtering directly in ExportWriter

     [ https://issues.apache.org/jira/browse/SOLR-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated SOLR-15210:
------------------------------------
    Attachment:     (was: SOLR-15210.patch)

> ParallelStream should execute hashing & filtering directly in ExportWriter
> --------------------------------------------------------------------------
>
>                 Key: SOLR-15210
>                 URL: https://issues.apache.org/jira/browse/SOLR-15210
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, ParallelStream uses {{HashQParserPlugin}} to partition the work based on a hash of the {{partitionKeys}} values. Unfortunately, this filter has a high initial runtime cost: each worker has to materialize all values of {{partitionKeys}} in order to calculate their hash and decide whether a particular document belongs to that worker's partition.
> An alternative approach would be for each worker to collect and sort all documents, and only then keep the ones that belong to its own partition, just before {{ExportWriter}} writes them out. At that point the fields have to be materialized anyway, and we can also benefit from the (minimal) BytesRef caching that the FieldWriters use. On the other hand, we pay the price of sorting all documents, and we lose the query filter caching that {{HashQParserPlugin}} uses.
> This tradeoff is not obvious and should be investigated to see whether the alternative offers better performance.
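
For illustration only, the late-filtering idea in the description could be sketched in Java roughly as follows. This is not Solr code: the class PartitionSketch, both method names, and the choice of Arrays.hashCode with Math.floorMod are assumptions made for this sketch, and the hashing actually used by HashQParserPlugin may differ. Each document is reduced to an array of its partition key values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of Solr.
class PartitionSketch {

    // Decide whether a document belongs to a worker's partition by hashing
    // its partition key values. The hash function is an assumption.
    static boolean belongsToWorker(String[] partitionKeyValues, int workerId, int numWorkers) {
        int hash = Arrays.hashCode(partitionKeyValues);
        // floorMod keeps the result in [0, numWorkers) even for negative hashes.
        return Math.floorMod(hash, numWorkers) == workerId;
    }

    // Late filtering: the documents are assumed to be collected and sorted
    // already; the partition check runs only as each one is about to be written.
    static List<String[]> filterAtExport(List<String[]> sortedDocs, int workerId, int numWorkers) {
        List<String[]> written = new ArrayList<>();
        for (String[] keys : sortedDocs) {
            if (belongsToWorker(keys, workerId, numWorkers)) {
                written.add(keys);
            }
        }
        return written;
    }

    public static void main(String[] args) {
        List<String[]> docs = new ArrayList<>();
        docs.add(new String[] {"alpha", "1"});
        docs.add(new String[] {"beta", "2"});
        docs.add(new String[] {"gamma", "3"});
        docs.add(new String[] {"delta", "4"});

        int numWorkers = 3;
        for (int worker = 0; worker < numWorkers; worker++) {
            int count = filterAtExport(docs, worker, numWorkers).size();
            System.out.println("worker " + worker + " writes " + count + " docs");
        }
    }
}
```

Because every worker applies the same hash and modulus, each document is written by exactly one worker. The sketch does not model the costs discussed above (sorting all documents, losing the filter cache), which are what the proposed investigation would have to measure.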



--
This message was sent by Atlassian Jira
(v8.3.4#803005)