Posted to jira@arrow.apache.org by "Ben Kietzman (Jira)" <ji...@apache.org> on 2021/11/16 14:02:00 UTC
[jira] [Commented] (ARROW-13848) [C++] and() in a dataset filter
[ https://issues.apache.org/jira/browse/ARROW-13848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444545#comment-17444545 ]
Ben Kietzman commented on ARROW-13848:
--------------------------------------
This issue is actually pretty subtle since it touches on the different behavior with respect to {{null}} in {{and}} vs {{and_kleene}}. Specifically, given the filter {{and_kleene(x == 1, y == 2)}} we can entirely skip a partition where {{x == 2}}, since nothing in the partition could possibly satisfy it: {{and_kleene(x == 1, y == 2) -> and_kleene(2 == 1, y == 2) -> and_kleene(false, y == 2) -> false}}. By contrast, given the filter {{and(x == 1, y == 2)}} we *can't* avoid scanning the partition: {{and(x == 1, y == 2) -> and(2 == 1, y == 2) -> and(false, y == 2) -> false}}, _unless y is null_, in which case the result is {{null}} rather than {{false}}.
In summary, the best simplification we could get out of such conjunctions would be {{and(false, any) -> if_else(is_null(any), null, false)}} (for which we'd still need to scan the partition). To me this doesn't seem worthwhile by itself.
We *could* skip the partition if we happen to also have the guarantee {{is_valid(y)}}, but we don't currently support such guarantees. See https://issues.apache.org/jira/browse/ARROW-12659
> [C++] and() in a dataset filter
> -------------------------------
>
> Key: ARROW-13848
> URL: https://issues.apache.org/jira/browse/ARROW-13848
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Jonathan Keane
> Priority: Major
> Labels: beginner, good-first-issue
>
> Is it expected that scanning a dataset with a filter built with {{and()}} is much slower than with a filter built with {{and_kleene()}}? Specifically, it seems that {{and()}} triggers a scan of the full dataset, whereas {{and_kleene()}} takes advantage of the fact that only one directory of the larger dataset needs to be scanned:
> {code:r}
> > library(arrow)
> Attaching package: ‘arrow’
> The following object is masked from ‘package:utils’:
> timestamp
> > library(dplyr)
> >
> > ds <- open_dataset("~/repos/ab_store/data/taxi_parquet/", partitioning = c("year", "month"))
> >
> > system.time({
> + out <- ds %>%
> + filter(arrow_and(total_amount > 100, year == 2015)) %>%
> + select(tip_amount, total_amount, passenger_count) %>%
> + collect()
> + })
> user system elapsed
> 46.634 4.462 6.457
> >
> > system.time({
> + out <- ds %>%
> + filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>%
> + select(tip_amount, total_amount, passenger_count) %>%
> + collect()
> + })
> user system elapsed
> 4.633 0.421 0.754
> >
> {code}
> I suspect that it's scanning the whole dataset because if I use a dataset that only has the 2015 folder, I get similar speeds:
> {code:r}
> > ds <- open_dataset("~/repos/ab_store/data/taxi_parquet_2015/", partitioning = c("year", "month"))
> >
> > system.time({
> + out <- ds %>%
> + filter(arrow_and(total_amount > 100, year == 2015)) %>%
> + select(tip_amount, total_amount, passenger_count) %>%
> + collect()
> + })
> user system elapsed
> 4.549 0.404 0.576
> >
> > system.time({
> + out <- ds %>%
> + filter(arrow_and_kleene(total_amount > 100, year == 2015)) %>%
> + select(tip_amount, total_amount, passenger_count) %>%
> + collect()
> + })
> user system elapsed
> 4.477 0.412 0.585
> {code}
> This does not impact anyone who uses our default collapsing mechanism in the R package, but I bumped into it with a filter that was constructed by duckdb using `and()` instead of `and_kleene()`.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)