Posted to issues@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/02/18 03:22:00 UTC

[jira] [Created] (ARROW-15726) [R] Support push-down projection/filtering in datasets / dplyr

Weston Pace created ARROW-15726:
-----------------------------------

             Summary: [R] Support push-down projection/filtering in datasets / dplyr
                 Key: ARROW-15726
                 URL: https://issues.apache.org/jira/browse/ARROW-15726
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Weston Pace


The following query should read a single column from the target parquet file.

{noformat}
open_dataset("lineitem.parquet") %>% select(l_tax) %>% filter(l_tax < 0.01) %>% collect()
{noformat}

Furthermore, it should push the filter down into the source node so that Parquet row-group statistics can be used to skip row groups that cannot contain matching data.

At the moment it creates the following exec plan:

{noformat}
3:SinkNode{}
  2:ProjectNode{projection=[l_tax]}
    1:FilterNode{filter=(l_tax < 0.01)}
      0:SourceNode{}
{noformat}

There is no projection or filter in the source node.  As a result we end up reading much more data from disk (the entire file) than we need to (at most a single column).
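For comparison, the single-column read can already be expressed manually with {{read_parquet()}}'s {{col_select}} argument; the goal is for the dataset/dplyr path to get this behavior automatically.  A minimal sketch (using a temporary file as a stand-in for lineitem.parquet):

```r
library(arrow)

# Stand-in for lineitem.parquet: two columns, two rows
tf <- tempfile(fileext = ".parquet")
write_parquet(data.frame(l_tax = c(0.005, 0.02),
                         l_quantity = c(1, 2)), tf)

# col_select restricts the read to the one column we need,
# instead of materializing the whole file and projecting afterwards
tbl <- read_parquet(tf, col_select = "l_tax")
names(tbl)  # "l_tax"
```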

This _could_ be fixed via heuristics in the dplyr code.  However, that may quickly get complex.  For example, the projection comes after the filter, so the pushed-down projection must include both the columns accessed by the filter and the columns accessed by the projection.  Further questions follow: can a projection be pushed down through a join (yes, it can), and how do we know which columns to apply to which source node?
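The column-union part of such a heuristic is straightforward to sketch in base R: extract the variables referenced by the (quoted) select and filter expressions and take their union as the projection to push into the scan.  {{pushdown_columns}} is a hypothetical helper, not an existing arrow/dplyr function:

```r
# Hypothetical helper: compute the set of columns the scan must
# produce so that both the filter and the final projection can run.
pushdown_columns <- function(select_expr, filter_expr) {
  union(all.vars(select_expr), all.vars(filter_expr))
}

# For the example query: select(l_tax) %>% filter(l_tax < 0.01)
cols <- pushdown_columns(quote(l_tax), quote(l_tax < 0.01))
cols  # "l_tax" -- only one column needs to be read
```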

A more complete fix would be to call into some kind of third-party optimizer (e.g. Calcite) after the plan has been created by dplyr but before it is passed to the execution engine.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)