Posted to jira@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2021/09/01 17:18:00 UTC

[jira] [Updated] (ARROW-13848) [C++] and() in a dataset filter

     [ https://issues.apache.org/jira/browse/ARROW-13848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-13848:
-----------------------------------
    Description: 
Is it expected that scanning a dataset with a filter built with {{and()}} is much slower than with a filter built with {{and_kleene()}}? Specifically, it seems that {{and()}} triggers a scan of the full dataset, whereas {{and_kleene()}} takes advantage of the fact that only one directory of the larger dataset needs to be scanned:

{code:r}
> library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

> library(dplyr)
> 
> ds <- open_dataset("~/repos/ab_store/data/taxi_parquet/", partitioning = c("year", "month"))
> 
> system.time({
+ out <- ds %>%
+     filter(arrow_and(total_amount > 100, year == 2015)) %>%
+     select(tip_amount, total_amount, passenger_count) %>%
+     collect()
+ })
   user  system elapsed 
 46.634   4.462   6.457 
> 
> system.time({
+ out <- ds %>%
+     filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>%
+     select(tip_amount, total_amount, passenger_count) %>%
+     collect()
+ })
   user  system elapsed 
  4.633   0.421   0.754 
> 
{code}

I suspect that it's scanning the whole dataset because, if I use a dataset that contains only the 2015 folder, I get similar speeds:
{code:r}
> ds <- open_dataset("~/repos/ab_store/data/taxi_parquet_2015/", partitioning = c("year", "month"))
> 
> system.time({
+ out <- ds %>%
+     filter(arrow_and(total_amount > 100, year == 2015)) %>%
+     select(tip_amount, total_amount, passenger_count) %>%
+     collect()
+ })
   user  system elapsed 
  4.549   0.404   0.576 
> 
> system.time({
+ out <- ds %>%
+     filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>%
+     select(tip_amount, total_amount, passenger_count) %>%
+     collect()
+ })
   user  system elapsed 
  4.477   0.412   0.585 
{code}
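
For reference, the two kernels differ in how they treat nulls: {{and_kleene()}} uses Kleene (three-valued) logic, where FALSE AND NULL is FALSE, while {{and()}} propagates the null, which may be why the partition guarantee can only prune directories in the {{and_kleene()}} case. A minimal sketch of that difference, assuming the {{call_function()}} and {{Scalar$create()}} helpers exported by the R bindings:

{code:r}
library(arrow)

# Kleene logic resolves FALSE AND NULL to FALSE:
call_function("and_kleene", Scalar$create(FALSE), Scalar$create(NA))
# -> false

# Plain and() propagates the null instead:
call_function("and", Scalar$create(FALSE), Scalar$create(NA))
# -> null
{code}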

This does not affect anyone who uses our default collapsing mechanism in the R package, but I bumped into it with a filter that duckdb constructed using {{and()}} instead of {{and_kleene()}}; see the sketch below.
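
For comparison, a sketch of the default spelling against the same {{ds}} as above, which the R bindings collapse with {{and_kleene()}} and which therefore takes the fast path:

{code:r}
# Separate filter conditions (or `&`) are collapsed by the R bindings
# using and_kleene(), so the year == 2015 partition pruning applies:
out <- ds %>%
  filter(total_amount > 100, year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  collect()
{code}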

> [C++] and() in a dataset filter
> -------------------------------
>
>                 Key: ARROW-13848
>                 URL: https://issues.apache.org/jira/browse/ARROW-13848
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Jonathan Keane
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)