Posted to issues@lucene.apache.org by "Joel Bernstein (Jira)" <ji...@apache.org> on 2021/03/03 01:20:00 UTC

[jira] [Commented] (SOLR-15210) ParallelStream should execute hashing & filtering directly in ExportWriter

    [ https://issues.apache.org/jira/browse/SOLR-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294202#comment-17294202 ] 

Joel Bernstein commented on SOLR-15210:
---------------------------------------

Let's have the best of both worlds. We can lazily build up a bitset of documents to ignore for each worker, and then apply this bitset before the sorting stage.

Here is the basic idea:

1) In the writer thread, hash each key and decide whether the worker should send it out.
2) When a worker thread finds a key that shouldn't be sent out, add the docId to an ignore bitset for that specific worker.
3) After each run, merge the new ignore bitsets into a cached set of ignore bitsets kept per worker.
4) Before performing the sort, turn off all bits that appear in each worker's ignore bitset.

Basically, this lazily builds a set of documents per worker that should NOT be sent out. This cache warms with each run, making subsequent exports progressively faster.
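
Here's a minimal sketch of the idea, assuming Lucene's {{FixedBitSet}} and a precomputed hash of the partition key. The class and method names below are hypothetical, not the actual ExportWriter API:

{code:java}
import org.apache.lucene.util.FixedBitSet;

/** Hypothetical sketch of the lazily warmed per-worker ignore cache. */
public class WorkerIgnoreCache {
  private final FixedBitSet cachedIgnored; // survives across runs
  private final FixedBitSet runIgnored;    // misses collected during one run
  private final int workerId;
  private final int numWorkers;

  public WorkerIgnoreCache(int maxDoc, int workerId, int numWorkers) {
    this.cachedIgnored = new FixedBitSet(maxDoc);
    this.runIgnored = new FixedBitSet(maxDoc);
    this.workerId = workerId;
    this.numWorkers = numWorkers;
  }

  // Steps 1-2: in the writer thread, hash the materialized key and decide
  // whether this worker should send the doc; record misses for later.
  public boolean shouldExport(int docId, int keyHash) {
    if (Math.floorMod(keyHash, numWorkers) == workerId) {
      return true;                // belongs to this worker's partition
    }
    runIgnored.set(docId);        // step 2: remember the miss
    return false;
  }

  // Step 3: after each run, fold this run's misses into the cached set.
  public void endRun() {
    cachedIgnored.or(runIgnored);
    runIgnored.clear(0, runIgnored.length());
  }

  // Step 4: before the sort, turn off every bit already known to belong
  // to another worker's partition.
  public void applyBeforeSort(FixedBitSet candidates) {
    candidates.andNot(cachedIgnored);
  }
}
{code}

Only docs that survive {{applyBeforeSort}} enter the sort, so the sort shrinks toward the worker's true partition as the cache warms.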



> ParallelStream should execute hashing & filtering directly in ExportWriter
> --------------------------------------------------------------------------
>
>                 Key: SOLR-15210
>                 URL: https://issues.apache.org/jira/browse/SOLR-15210
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>
> Currently ParallelStream uses {{HashQParserPlugin}} to partition the work based on a hashed value of {{partitionKeys}}. Unfortunately, this filter has a high initial runtime cost because it has to materialize all values of {{partitionKeys}} on each worker in order to calculate their hash and decide whether a particular doc belongs to the worker's partition.
> The alternative approach would be for the worker to collect and sort all documents and only then filter out the ones that don't belong to the current partition, just before they are written out by {{ExportWriter}} - at this point we have to materialize the fields anyway, and we can also benefit from the (minimal) BytesRef caching that the FieldWriters use. On the other hand, we pay the price of sorting all documents, and we also lose the query filter caching that the {{HashQParserPlugin}} uses.
> This tradeoff is not obvious but should be investigated to see if it offers better performance.
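
For contrast, here is a self-contained sketch of the two strategies the issue weighs - filter-then-sort (what the {{HashQParserPlugin}} route effectively does) versus sort-then-filter in the writer. {{materializeKey}} and the integer sort below are illustrative stand-ins, not Solr APIs:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PartitionStrategies {
  // Stand-in for materializing a partitionKeys value for one doc.
  static String materializeKey(int docId) {
    return "key-" + docId;
  }

  static boolean mine(int docId, int workerId, int numWorkers) {
    return Math.floorMod(materializeKey(docId).hashCode(), numWorkers) == workerId;
  }

  // (A) Filter-then-sort: hash every doc's key up front so only this
  // worker's partition enters the sort. Pays full key materialization
  // before anything is exported.
  static List<Integer> filterThenSort(List<Integer> docs, int workerId, int n) {
    List<Integer> partition = new ArrayList<>();
    for (int docId : docs) {
      if (mine(docId, workerId, n)) {
        partition.add(docId);
      }
    }
    Collections.sort(partition); // stands in for the export sort
    return partition;
  }

  // (B) Sort-then-filter: sort everything, then drop foreign docs at
  // write time, where the fields are materialized anyway. Pays a larger
  // sort instead of an up-front hashing pass.
  static List<Integer> sortThenFilter(List<Integer> docs, int workerId, int n) {
    List<Integer> sorted = new ArrayList<>(docs);
    Collections.sort(sorted);    // all docs enter the sort
    List<Integer> out = new ArrayList<>();
    for (int docId : sorted) {
      if (mine(docId, workerId, n)) {
        out.add(docId);
      }
    }
    return out;
  }
}
{code}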



