Posted to issues@arrow.apache.org by "Maarten Breddels (Jira)" <ji...@apache.org> on 2020/07/14 16:55:00 UTC

[jira] [Created] (ARROW-9471) [C++] Scan Dataset in reverse

Maarten Breddels created ARROW-9471:
---------------------------------------

             Summary: [C++] Scan Dataset in reverse
                 Key: ARROW-9471
                 URL: https://issues.apache.org/jira/browse/ARROW-9471
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Maarten Breddels


If a dataset does not fit into the OS cache, it can be beneficial to alternate between normal and reverse 'scanning'. Even if 90% of a set of files fits into the cache, scanning the same set twice in the same order will not make use of the OS cache: with LRU-like eviction, the data needed at the start of the second scan is exactly what was evicted near the end of the first. On the other hand, if the second scan goes in reverse order, that 90% will still be in the OS cache. We use this trick in vaex, and I'd like to support it for parquet reading as well. (Is there a proper name/term for this?)
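To make the effect concrete, here is a minimal simulation, assuming the OS cache behaves like a strict LRU cache over equally sized blocks. That is a simplification (real page caches use more elaborate policies), and the names `scan_hits` and `second_pass_hits` are made up for this illustration.

```python
from collections import OrderedDict


def scan_hits(order, cache, capacity):
    """Read blocks in the given order through an LRU cache; return the hit count."""
    hits = 0
    for block in order:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return hits


def second_pass_hits(n_blocks, capacity, reverse_second):
    """Scan forward once to warm the cache, then count hits on a second scan."""
    cache = OrderedDict()
    forward = list(range(n_blocks))
    scan_hits(forward, cache, capacity)    # first pass, cold cache
    second = forward[::-1] if reverse_second else forward
    return scan_hits(second, cache, capacity)


# 100 blocks, cache holds 90 of them (the "90% fits" case).
same_order = second_pass_hits(100, 90, reverse_second=False)
reversed_order = second_pass_hits(100, 90, reverse_second=True)
print(same_order, reversed_order)
```

Under this strict-LRU assumption the second forward scan gets no hits at all, while the reversed scan hits on every block that was still cached.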

Note that since you don't want to reverse at the byte level, you may want to reverse the order in which fragments are traversed, or both fragments and row groups. Chunks that are too small (e.g. pages) could decrease performance, because most read algorithms implement read-ahead optimization for forward reads, not for reverse ones. I think doing this at the fragment level might be enough.
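A rough sketch of the two granularities, in plain Python rather than the Arrow C++ API. The `traversal` helper and the `(name, row_groups)` representation of a fragment are hypothetical, chosen only to show the ordering; each chunk is still read front to back, so read-ahead within a chunk keeps working.

```python
def traversal(fragments, reverse=False, reverse_row_groups=False):
    """Yield (fragment_name, row_group) pairs in the requested scan order.

    fragments: list of (name, [row_group_index, ...]) tuples (hypothetical layout).
    reverse: visit fragments last to first.
    reverse_row_groups: when reversing, also visit row groups last to first.
    """
    frags = reversed(fragments) if reverse else fragments
    for name, row_groups in frags:
        if reverse and reverse_row_groups:
            row_groups = reversed(row_groups)
        for rg in row_groups:
            yield name, rg


fragments = [("a.parquet", [0, 1]), ("b.parquet", [0, 1, 2])]

# Alternate direction on each pass over the dataset: even passes forward, odd passes reversed.
for pass_index in range(2):
    order = list(traversal(fragments, reverse=(pass_index % 2 == 1)))
    print(pass_index, order)
```

With fragment-level reversal only, the row groups inside each file are still visited in ascending order; passing `reverse_row_groups=True` reverses those as well.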



--
This message was sent by Atlassian Jira
(v8.3.4#803005)