Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/02/26 21:17:00 UTC
[jira] [Created] (ARROW-11800) [C++] Add unordered scan
Weston Pace created ARROW-11800:
-----------------------------------
Summary: [C++] Add unordered scan
Key: ARROW-11800
URL: https://issues.apache.org/jira/browse/ARROW-11800
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
Currently Scan generates an ordered sequence of batches, which is not always ideal. For example, consider reading four Parquet files in parallel from S3. There is no good way to predict which read will finish first. If file 2 finishes before file 1, we could start parsing the contents of file 2 immediately, but we currently do not.
Scan could then provide an option controlling whether ordering is preserved. Cases that do not care about ordering (e.g. counting rows) could take advantage of this to reduce memory pressure.
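To make the idea concrete, here is a minimal sketch of such an option. It is written in Python for brevity and is not the Arrow C++ API: `scan`, `read_file`, `count_rows`, the `preserve_order` flag and the `FILES` table (with its simulated latencies and row counts) are all hypothetical names invented for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for remote files: (name, simulated latency in seconds, row count).
FILES = [("file1", 0.20, 10), ("file2", 0.05, 20), ("file3", 0.10, 30), ("file4", 0.01, 40)]


def read_file(name, latency, rows):
    """Pretend to read a file; the sleep stands in for an S3 round trip."""
    time.sleep(latency)
    return name, rows


def scan(files, preserve_order=True):
    """Yield (name, rows) per file, either in file order or in completion order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(read_file, *f) for f in files]
        # Ordered: wait on each future in submission order, even if later ones are done.
        # Unordered: hand back whichever read finishes first.
        pending = futures if preserve_order else as_completed(futures)
        for future in pending:
            yield future.result()


def count_rows(files):
    """A consumer that does not care about ordering."""
    return sum(rows for _, rows in scan(files, preserve_order=False))
```

In this sketch the row count is the same in both modes; only the order in which results become available to the consumer differs.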
Note: This would be an optimization even for cases that do care about ordering. We could still parse / project / etc. out of order and simply reorder at the end. The only difference between unordered and ordered scans would then be the memory pressure applied.
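The "reorder at the end" step could look roughly like the following sketch, which also shows where the extra memory pressure comes from. It is illustrative Python, not Arrow code; `reorder` and the `(index, batch)` pairs are hypothetical, and it assumes each batch carries the position it would have had in the ordered sequence.

```python
def reorder(arrivals):
    """Re-sequence (index, batch) pairs that arrive out of order.

    Returns the batches in index order together with the peak number of
    batches that had to be held in memory while waiting for earlier ones.
    """
    buffered = {}    # index -> batch, processed but not yet emitted
    next_index = 0   # index of the next batch the ordered output needs
    peak = 0
    out = []
    for index, batch in arrivals:
        buffered[index] = batch
        peak = max(peak, len(buffered))
        # Emit every batch that is now contiguous with what was already emitted.
        while next_index in buffered:
            out.append(buffered.pop(next_index))
            next_index += 1
    return out, peak


# Batch 0 arrives last, so the three later batches must wait in the buffer.
late_first = [(1, "b1"), (3, "b3"), (2, "b2"), (0, "b0")]
batches, peak = reorder(late_first)
```

An unordered consumer would pass each batch along as soon as it arrived and never hold more than one, whereas here the buffer grows until the missing earlier batch shows up.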