Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/16 14:17:58 UTC

[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8473: ARROW-10320 [Rust] [DataFusion] Migrated from batch iterators to batch streams.

jorgecarleitao edited a comment on pull request #8473:
URL: https://github.com/apache/arrow/pull/8473#issuecomment-710075328


   I am still not entirely happy with this, and I think we need a further step to make it work well, which is why I moved this back to draft.
   
   Essentially, I misinterpreted what a `Stream` is used for. A `Stream` of record batches only returns the next batch _after_ the previous one has been (asynchronously) computed. Therefore we cannot parallelize: we are blocked on the execution of the previous batch, and this PR is only useful in situations in which we are reading batches that we may need to wait for.
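   
   To make this concrete, here is a minimal, self-contained sketch (using the `futures` crate, with a plain `Vec<i64>` standing in for a real `RecordBatch`; none of this is DataFusion's actual API) of why consuming a `Stream` of batches is inherently sequential:
   
   ```rust
   use futures::executor::block_on;
   use futures::stream::{self, Stream, StreamExt};
   
   // Hypothetical stand-in for Arrow's RecordBatch, only for illustration.
   type RecordBatch = Vec<i64>;
   
   // Each `next().await` must resolve before the following batch is even
   // requested, so the stream by itself offers no batch-level parallelism.
   async fn consume(mut batches: impl Stream<Item = RecordBatch> + Unpin) {
       while let Some(batch) = batches.next().await {
           println!("got a batch with {} rows", batch.len());
       }
   }
   
   fn main() {
       let batches = stream::iter(vec![vec![1, 2, 3], vec![4, 5]]);
       block_on(consume(batches));
   }
   ```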
   
   I think that the solution is to put a `Future` as the `Item` of the `Stream`, so that we can collect all of those futures and perform `join_all().await`, thereby sending every task to the scheduler.
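   
   A hedged sketch of that pattern (again with hypothetical `RecordBatch` and `compute_batch` stand-ins, not the API actually proposed in the PR): the stream yields futures, which we first collect and then drive together, instead of awaiting each batch before asking for the next one.
   
   ```rust
   use futures::executor::block_on;
   use futures::future::join_all;
   use futures::stream::{self, StreamExt};
   
   // Hypothetical stand-in for Arrow's RecordBatch.
   type RecordBatch = Vec<i64>;
   
   // Placeholder for the real (possibly expensive) work behind one batch.
   async fn compute_batch(seed: i64) -> RecordBatch {
       vec![seed, seed + 1, seed + 2]
   }
   
   fn main() {
       // A stream whose Item is a future: it hands out the *tasks* ...
       let stream_of_futures =
           stream::iter(vec![1i64, 10, 100].into_iter().map(compute_batch));
   
       block_on(async {
           // ... so we can gather them all and drive them together with
           // `join_all`, rather than awaiting one batch at a time.
           let futures: Vec<_> = stream_of_futures.collect().await;
           let batches: Vec<RecordBatch> = join_all(futures).await;
           println!("computed {} batches", batches.len());
       });
   }
   ```
   
   Note that `join_all` on its own only interleaves the futures on one executor; actually running them on multiple threads is where spawning onto the runtime's scheduler comes in.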
   
   PR #8480 is the proposal for that. It doubles the performance of the larger projection benchmarks (purely due to scheduling / threading).
   
   Note that each PR solves a different problem: this one addresses the fact that some batches may not be available yet (and thus we should not block), while the other one addresses the fact that some operations can be parallelized (and thus we should `spawn`). I think that together they address the main issues.
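
   For the `spawn` side, a minimal sketch (assuming a Tokio runtime and the same hypothetical `compute_batch` helper; this is not the code in #8480): spawning each batch computation hands it to the runtime's scheduler, so batches can run on multiple worker threads while the caller awaits the handles.
   
   ```rust
   use futures::future::join_all;
   
   // Hypothetical stand-in for Arrow's RecordBatch.
   type RecordBatch = Vec<i64>;
   
   // Placeholder for the real work behind one batch.
   async fn compute_batch(seed: i64) -> RecordBatch {
       vec![seed, seed + 1, seed + 2]
   }
   
   #[tokio::main]
   async fn main() {
       // `tokio::spawn` sends each task to the scheduler, so the batch
       // computations can progress on multiple worker threads at once.
       let handles: Vec<_> = vec![1i64, 10, 100]
           .into_iter()
           .map(|seed| tokio::spawn(compute_batch(seed)))
           .collect();
   
       // Await all handles; each yields Result<RecordBatch, JoinError>.
       let batches: Vec<RecordBatch> = join_all(handles)
           .await
           .into_iter()
           .collect::<Result<_, _>>()
           .expect("a spawned task panicked");
   
       println!("computed {} batches", batches.len());
   }
   ```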

