Posted to issues@lucene.apache.org by "Andrzej Bialecki (Jira)" <ji...@apache.org> on 2021/03/02 19:40:00 UTC

[jira] [Created] (SOLR-15210) ParallelStream should execute hashing & filtering directly in ExportWriter

Andrzej Bialecki created SOLR-15210:
---------------------------------------

             Summary: ParallelStream should execute hashing & filtering directly in ExportWriter
                 Key: SOLR-15210
                 URL: https://issues.apache.org/jira/browse/SOLR-15210
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Andrzej Bialecki
            Assignee: Andrzej Bialecki


Currently, ParallelStream uses {{HashQParserPlugin}} to partition the work based on a hash of the {{partitionKeys}} values. Unfortunately, this filter has a high initial runtime cost: each worker has to materialize the {{partitionKeys}} values of every document in order to calculate their hash and decide whether that document belongs to the worker's partition.
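
To make the cost concrete, here is a minimal sketch of a per-document partition test of this kind. It is not the {{HashQParserPlugin}} code: the class and method names are hypothetical, and {{String.hashCode()}} with a modulo over the worker count is only an assumed stand-in for whatever hash and assignment scheme Solr actually uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Solr.
class HashPartitionSketch {

    // Assumed partition rule: hash the key value and take it modulo the
    // number of workers. floorMod keeps the result non-negative.
    static boolean belongsToWorker(String partitionKeyValue, int workerId, int numWorkers) {
        return Math.floorMod(partitionKeyValue.hashCode(), numWorkers) == workerId;
    }

    // Every key value has to be read (materialized) before it can be hashed,
    // including the values of documents that end up on another worker.
    static List<String> filterForWorker(List<String> keyValues, int workerId, int numWorkers) {
        List<String> kept = new ArrayList<>();
        for (String value : keyValues) {
            if (belongsToWorker(value, workerId, numWorkers)) {
                kept.add(value);
            }
        }
        return kept;
    }
}
```

With N workers, each worker runs this loop over the full set of key values but keeps only about 1/N of them, which is where the up-front cost described above comes from.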

An alternative approach would be for the worker to collect and sort all matching documents, and only then, just before {{ExportWriter}} writes them out, keep the ones that belong to the current partition and skip the rest. At this point we have to materialize the fields anyway, and we can also benefit from the (minimal) BytesRef caching that the FieldWriters use. On the other hand, we pay the price of sorting all documents rather than only those in the worker's partition, and we also lose the query filter caching that {{HashQParserPlugin}} uses.
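
The alternative can be sketched roughly as follows. This is an illustration under assumptions, not the {{ExportWriter}} implementation: the {{Doc}} class, the {{exportPartition}} method, the single integer sort key, and the {{hashCode()}}-modulo partition rule are all hypothetical, and the BytesRef caching is not modeled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of Solr.
class ExportTimeFilterSketch {

    // Hypothetical document: one sort value and one partition key value.
    static final class Doc {
        final int sortValue;
        final String partitionKey;

        Doc(int sortValue, String partitionKey) {
            this.sortValue = sortValue;
            this.partitionKey = partitionKey;
        }
    }

    static List<Doc> exportPartition(List<Doc> matched, int workerId, int numWorkers) {
        // Step 1: sort ALL matching documents, including those that belong
        // to other workers. This is the extra cost of the approach.
        List<Doc> sorted = new ArrayList<>(matched);
        sorted.sort(Comparator.comparingInt((Doc d) -> d.sortValue));

        // Step 2: at write time the field values are materialized anyway,
        // so the partition test is applied here instead of in a query filter.
        List<Doc> written = new ArrayList<>();
        for (Doc d : sorted) {
            if (Math.floorMod(d.partitionKey.hashCode(), numWorkers) == workerId) {
                written.add(d);
            }
        }
        return written;
    }
}
```

In this sketch the hashing work moves from before collection to the write loop, so the tradeoff is the full sort and the lost filter cache against the avoided up-front materialization.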

It is not obvious which side of this tradeoff wins, so the alternative should be investigated to see whether it actually offers better performance.


