You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2021/09/01 17:16:00 UTC
[jira] [Created] (ARROW-13848) [C++] and() in a dataset filter
Jonathan Keane created ARROW-13848:
--------------------------------------
Summary: [C++] and() in a dataset filter
Key: ARROW-13848
URL: https://issues.apache.org/jira/browse/ARROW-13848
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Jonathan Keane
Is it expected that a scanning a dataset that has a filter built with {{and()}} is much slower than a filter built with {{and_kleene()}}? Specifically, it seems that {{and()}} triggers a scan of the full dataset, where as {{and_kleene()}} takes advantage of the fact that only one directory of the larger dataset needs to be scanned:
{code:r}
> library(arrow)
Attaching package: ‘arrow’
The following object is masked from ‘package:utils’:
timestamp
> library(dplyr)
>
> ds <- open_dataset("~/repos/ab_store/data/taxi_parquet/", partitioning = c("year", "month"))
>
> system.time({
+ out <- ds %>%
+ filter(arrow_and(total_amount > 100, year == 2015)) %>%
+ select(tip_amount, total_amount, passenger_count) %>%
+ collect()
+ })
user system elapsed
46.634 4.462 6.457
>
> system.time({
+ out <- ds %>%
+ filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>%
+ select(tip_amount, total_amount, passenger_count) %>%
+ collect()
+ })
user system elapsed
4.633 0.421 0.754
>
{code}
I suspect that it's scanning the whole dataset because if I use a dataset that only has the 2015 folder, I get similar speeds:
{code:r}
> ds <- open_dataset("~/repos/ab_store/data/taxi_parquet_2015/", partitioning = c("year", "month"))
>
> system.time({
+ out <- ds %>%
+ filter(arrow_and(total_amount > 100, year == 2015)) %>%
+ select(tip_amount, total_amount, passenger_count) %>%
+ collect()
+ })
user system elapsed
4.549 0.404 0.576
>
> system.time({
+ out <- ds %>%
+ filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>%
+ select(tip_amount, total_amount, passenger_count) %>%
+ collect()
+ })
user system elapsed
4.477 0.412 0.585
{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)