Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/02/26 21:17:00 UTC

[jira] [Created] (ARROW-11800) [C++] Add unordered scan

Weston Pace created ARROW-11800:
-----------------------------------

             Summary: [C++] Add unordered scan
                 Key: ARROW-11800
                 URL: https://issues.apache.org/jira/browse/ARROW-11800
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


Currently, Scan generates an ordered sequence of batches, which is not always ideal. For example, consider reading four Parquet files in parallel from S3. There is no good way to determine which read will finish first. If file 2 finishes before file 1, we could start parsing the contents of file 2 immediately, but currently we do not.

Scan could then provide an option controlling whether ordering is preserved. Cases that do not care about ordering (e.g. counting rows) could take advantage of this to reduce memory pressure.

Note: This would be an optimization even for cases that do care about ordering. We could still parse, project, etc. out of order and simply reorder at the end. The only difference between unordered and ordered scans would then be the memory pressure applied.
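
To make the trade-off concrete, here is a minimal sketch of both modes. It is not the Arrow Scan API: Batch, ReorderBuffer, CountRowsUnordered and EmitInFileOrder are hypothetical names invented for illustration, and it assumes one already-processed batch per file, handed over in whatever order the reads happened to complete. The unordered consumer can use each batch as soon as it arrives, while the ordered consumer has to hold batches until every earlier file has been emitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for a processed record batch; not an Arrow type.
struct Batch {
  int file_index;  // position of the source file in the scan (0-based)
  int num_rows;
};

// Unordered consumer: each batch is used on arrival and nothing is held back.
long CountRowsUnordered(const std::vector<Batch>& arrivals) {
  long total = 0;
  for (const Batch& batch : arrivals) total += batch.num_rows;
  return total;
}

// Ordered consumer: holds batches until all earlier files have been emitted.
class ReorderBuffer {
 public:
  // Accepts a batch in arrival order and returns the batches that can now be
  // emitted in file order (possibly none).
  std::vector<Batch> Push(const Batch& batch) {
    pending_.emplace(batch.file_index, batch);
    peak_pending_ = std::max(peak_pending_, pending_.size());
    std::vector<Batch> ready;
    for (auto it = pending_.find(next_index_); it != pending_.end();
         it = pending_.find(next_index_)) {
      ready.push_back(it->second);
      pending_.erase(it);
      ++next_index_;
    }
    return ready;
  }

  // Largest number of batches held at once: a rough proxy for memory pressure.
  std::size_t peak_pending() const { return peak_pending_; }

 private:
  std::map<int, Batch> pending_;
  int next_index_ = 0;
  std::size_t peak_pending_ = 0;
};

// Feeds the arrivals through a ReorderBuffer and returns the file indices in
// the order they were emitted; *peak receives the peak number of held batches.
std::vector<int> EmitInFileOrder(const std::vector<Batch>& arrivals,
                                 std::size_t* peak) {
  ReorderBuffer buffer;
  std::vector<int> emitted;
  for (const Batch& batch : arrivals) {
    for (const Batch& ready : buffer.Push(batch)) {
      emitted.push_back(ready.file_index);
    }
  }
  *peak = buffer.peak_pending();
  return emitted;
}
```

With four files arriving in the order 2, 1, 3, 0, the ordered path cannot emit anything until file 0 arrives, so all four batches are held at once. The row count needs no buffering at all. If the files happen to arrive in order, the reorder buffer never holds more than the batch that just arrived.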



--
This message was sent by Atlassian Jira
(v8.3.4#803005)