Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/16 14:55:26 UTC

[GitHub] [arrow] alamb commented on pull request #8473: ARROW-10320 [Rust] [DataFusion] Migrated from batch iterators to batch streams.

alamb commented on pull request #8473:
URL: https://github.com/apache/arrow/pull/8473#issuecomment-710096379


   > A stream of record batches will only return the next batch after the previous one was (async) computed. Therefore, we can't parallelize, as we are blocked by the execution of the previous batch. Therefore, this PR is only useful for situations on which we are reading batches that we may need to wait for.
   
   @jorgecarleitao -- I personally think keeping the interface as `Stream<RecordBatch>` is the way to go.
   
   Among other things, this interface allows "backpressure" so that we don't end up buffering intermediate results if one operator (e.g. filter) can produce record batches faster than its consumer (e.g. aggregate) can process them.
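
As a rough illustration of the backpressure point, here is a minimal sketch that uses a plain `std` `Iterator` as a synchronous stand-in for `Stream<RecordBatch>`. The `Batch` alias, the `Filter` struct, and the `pull_n` helper are hypothetical names made up for this example and are not DataFusion APIs. The point it tries to show is that a pull-based operator only computes a batch when its consumer asks for one, so nothing piles up in between.

```rust
use std::cell::Cell;
use std::rc::Rc;

// Hypothetical stand-in for an Arrow `RecordBatch`: just a vector of values.
type Batch = Vec<i32>;

/// A pull-based "filter" operator (illustrative only).
/// It does no work until the consumer calls `next`.
struct Filter<I: Iterator<Item = Batch>> {
    input: I,
    // Counts how many batches this operator has computed so far.
    produced: Rc<Cell<usize>>,
}

impl<I: Iterator<Item = Batch>> Iterator for Filter<I> {
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        let batch = self.input.next()?;
        self.produced.set(self.produced.get() + 1);
        // Keep only the even values of the batch.
        Some(batch.into_iter().filter(|v| v % 2 == 0).collect())
    }
}

/// Pulls `n` batches from a filter over 100 available input batches and
/// returns (batches the filter computed, sum aggregated by the consumer).
fn pull_n(n: usize) -> (usize, i32) {
    let produced = Rc::new(Cell::new(0));
    let input = (0..100).map(|i| vec![i, i + 1, i + 2, i + 3]);
    let mut filter = Filter {
        input,
        produced: Rc::clone(&produced),
    };

    // The "aggregate" consumer: it decides how fast batches are requested.
    let mut sum = 0;
    for _ in 0..n {
        if let Some(batch) = filter.next() {
            sum += batch.iter().sum::<i32>();
        }
    }
    (produced.get(), sum)
}
```

Calling `pull_n(3)` computes exactly three batches even though 100 are available: the producer never runs ahead of the consumer, which is also why this shape offers no extra parallelism on its own.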
   
   If we want to add additional parallelism with a tunable cost of additional buffering, we could keep the interface of this PR and add an operator like "Buffer". Buffer would compute and store some number of `RecordBatch`es, perhaps in a channel, using separate tasks that run ahead of when a batch is requested.
   
   Then the user of DataFusion could control how much more buffering they were willing to accept in order to get more parallelism. 
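
To make the "Buffer" idea a bit more concrete, here is a minimal sketch under some loud assumptions: it uses an OS thread and `std::sync::mpsc::sync_channel` instead of async tasks and a `Stream`, and the `buffer` and `run_buffered` functions and the `Batch` alias are hypothetical names, not anything that exists in DataFusion. The `capacity` argument plays the role of the tunable buffering cost.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, Receiver};
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for an Arrow `RecordBatch`.
type Batch = Vec<i32>;

/// Hypothetical "Buffer" operator: eagerly drains `input` on a separate
/// thread and holds up to `capacity` finished batches in a bounded channel.
fn buffer<I>(input: I, capacity: usize) -> Receiver<Batch>
where
    I: Iterator<Item = Batch> + Send + 'static,
{
    let (tx, rx) = sync_channel(capacity);
    thread::spawn(move || {
        for batch in input {
            // `send` blocks while the channel is full (backpressure) and
            // returns an error once the consumer has dropped the receiver.
            if tx.send(batch).is_err() {
                break;
            }
        }
    });
    rx
}

/// Builds a counted input of `n` batches, buffers it, and returns
/// (the batches in the order received, how many the producer computed).
fn run_buffered(n: i32, capacity: usize) -> (Vec<Batch>, usize) {
    let computed = Arc::new(AtomicUsize::new(0));
    let counter = Arc::clone(&computed);
    let input = (0..n).map(move |i| {
        counter.fetch_add(1, Ordering::SeqCst);
        vec![i, i * 10]
    });
    // Iterating the receiver ends once the producer thread drops the sender.
    let out: Vec<Batch> = buffer(input, capacity).into_iter().collect();
    (out, computed.load(Ordering::SeqCst))
}
```

With this shape, a producer that is never consumed computes at most `capacity + 1` batches (the ones held in the channel plus one blocked in `send`), so the caller chooses how much memory to trade for running ahead.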
   
   I think an approach like https://github.com/apache/arrow/pull/8480 is cool, but the choice of the aggregate, for example, to compute the input as fast as possible in new tasks, even if the aggregate
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org