You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@arrow.apache.org by David Greiss <da...@gmail.com> on 2023/05/19 02:07:31 UTC
[R] `max` in filter not supported
Hi
The base R max function is not supported when used within `filter`:
library(arrow)
tbl <- arrow_table(name = rownames(mtcars), mtcars)
tbl |>
filter(mpg > max(mpg)) |>
collect()
Warning: Expression mpg > max(mpg) not supported in Arrow; pulling data
into R
but this works:
tbl |>
summarize(x = max(mpg))
Should this be supported or am I missing something
Thanks for the help
David
Re: [R] `max` in filter not supported
Posted by Jonathan Keane <jk...@gmail.com>.
There's also the cookbook https://arrow.apache.org/cookbook/r/ which might
also be a good place for how-to examples like these. The repository to
contribute to those is https://github.com/apache/arrow-cookbook
-Jon
On Sat, May 20, 2023 at 7:31 AM Neal Richardson <ne...@gmail.com>
wrote:
> Sure, if you wanted to add to the docs, somewhere in this section is
> probably the best place:
> https://github.com/apache/arrow/blob/main/r/vignettes/data_wrangling.Rmd#L124
>
> On Fri, May 19, 2023 at 5:34 PM David Greiss <da...@gmail.com>
> wrote:
>
>> Thanks for the insight and the suggested workaround. My example was a bit
>> contrived but I am looking to filter on a grouped dataframe more analogous
>> to this:
>>
>> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>
>> tbl |>
>> group_by(cyl) |>
>> filter(mpg == max(mpg)) |>
>> collect()
>>
>> The issue that Ian referenced suggests a workaround using left_join which
>> did the trick for me:
>>
>> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>
>> tbl |>
>> left_join(tbl |>
>> group_by(cyl) |>
>> summarize(max_mpg = max(mpg))
>> ) |>
>> filter(mpg == max_mpg) |>
>> select(-max_mpg) |>
>> collect()
>>
>> If there's any interest, I'd be happy to submit a PR to document these
>> workarounds.
>>
>> Thanks again for the help and work on the package.
>>
>> David
>>
>>
>> On Fri, May 19, 2023 at 3:58 PM Ian Cook <ia...@ursacomputing.com> wrote:
>>
>>> There is an existing enhancement request for this feature at
>>> https://github.com/apache/arrow/issues/29537 but I don't think there
>>> is any work planned on this in the near future, so the workaround Neal
>>> suggested is the way to go for now.
>>>
>>> Ian
>>>
>>> On Fri, May 19, 2023 at 3:52 PM Neal Richardson
>>> <ne...@gmail.com> wrote:
>>> >
>>> > max is an aggregation, so it requires scanning all of the data.
>>> Filtering is a scalar (row by row operation), so to evaluate mpg >
>>> max(mpg), you have to pass over all of the data to compute the max, then
>>> pass through the data again to filter. This is trivial for data frames like
>>> mtcars, but imagine a dataset that can't be held in memory.
>>> >
>>> > One way query engines handle this is with window functions. If you use
>>> dbplyr, you get SQL with a window function:
>>> >
>>> > > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
>>> > <SQL>
>>> > SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
>>> > FROM (
>>> > SELECT *, MAX(mpg) OVER () AS q03
>>> > FROM mtcars
>>> > ) q01
>>> > WHERE (mpg > q03)
>>> >
>>> > Acero, the query engine in Arrow, does not currently support window
>>> functions. The easiest way for you to handle this today is probably to
>>> evaluate the max first, then pass that in to the filter:
>>> >
>>> > max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
>>> > tbl |> filter(mpg == max_mpg) |> collect()
>>> >
>>> > Neal
>>> >
>>> >
>>> > On Thu, May 18, 2023 at 10:08 PM David Greiss <da...@gmail.com>
>>> wrote:
>>> >>
>>> >> Hi
>>> >>
>>> >> The base R max function is not supported when used within `filter`:
>>> >>
>>> >> library(arrow)
>>> >> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>> >>
>>> >> tbl |>
>>> >> filter(mpg > max(mpg)) |>
>>> >> collect()
>>> >> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling
>>> data into R
>>> >>
>>> >> but this works:
>>> >>
>>> >> tbl |>
>>> >> summarize(x = max(mpg))
>>> >>
>>> >> Should this be supported or am I missing something
>>> >>
>>> >> Thanks for the help
>>> >> David
>>>
>>
Re: [R] `max` in filter not supported
Posted by Neal Richardson <ne...@gmail.com>.
Sure, if you wanted to add to the docs, somewhere in this section is
probably the best place:
https://github.com/apache/arrow/blob/main/r/vignettes/data_wrangling.Rmd#L124
On Fri, May 19, 2023 at 5:34 PM David Greiss <da...@gmail.com>
wrote:
> Thanks for the insight and the suggested workaround. My example was a bit
> contrived but I am looking to filter on a grouped dataframe more analogous
> to this:
>
> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>
> tbl |>
> group_by(cyl) |>
> filter(mpg == max(mpg)) |>
> collect()
>
> The issue that Ian referenced suggests a workaround using left_join which
> did the trick for me:
>
> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>
> tbl |>
> left_join(tbl |>
> group_by(cyl) |>
> summarize(max_mpg = max(mpg))
> ) |>
> filter(mpg == max_mpg) |>
> select(-max_mpg) |>
> collect()
>
> If there's any interest, I'd be happy to submit a PR to document these
> workarounds.
>
> Thanks again for the help and work on the package.
>
> David
>
>
> On Fri, May 19, 2023 at 3:58 PM Ian Cook <ia...@ursacomputing.com> wrote:
>
>> There is an existing enhancement request for this feature at
>> https://github.com/apache/arrow/issues/29537 but I don't think there
>> is any work planned on this in the near future, so the workaround Neal
>> suggested is the way to go for now.
>>
>> Ian
>>
>> On Fri, May 19, 2023 at 3:52 PM Neal Richardson
>> <ne...@gmail.com> wrote:
>> >
>> > max is an aggregation, so it requires scanning all of the data.
>> Filtering is a scalar (row by row operation), so to evaluate mpg >
>> max(mpg), you have to pass over all of the data to compute the max, then
>> pass through the data again to filter. This is trivial for data frames like
>> mtcars, but imagine a dataset that can't be held in memory.
>> >
>> > One way query engines handle this is with window functions. If you use
>> dbplyr, you get SQL with a window function:
>> >
>> > > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
>> > <SQL>
>> > SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
>> > FROM (
>> > SELECT *, MAX(mpg) OVER () AS q03
>> > FROM mtcars
>> > ) q01
>> > WHERE (mpg > q03)
>> >
>> > Acero, the query engine in Arrow, does not currently support window
>> functions. The easiest way for you to handle this today is probably to
>> evaluate the max first, then pass that in to the filter:
>> >
>> > max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
>> > tbl |> filter(mpg == max_mpg) |> collect()
>> >
>> > Neal
>> >
>> >
>> > On Thu, May 18, 2023 at 10:08 PM David Greiss <da...@gmail.com>
>> wrote:
>> >>
>> >> Hi
>> >>
>> >> The base R max function is not supported when used within `filter`:
>> >>
>> >> library(arrow)
>> >> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>> >>
>> >> tbl |>
>> >> filter(mpg > max(mpg)) |>
>> >> collect()
>> >> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling
>> data into R
>> >>
>> >> but this works:
>> >>
>> >> tbl |>
>> >> summarize(x = max(mpg))
>> >>
>> >> Should this be supported or am I missing something
>> >>
>> >> Thanks for the help
>> >> David
>>
>
Re: [R] `max` in filter not supported
Posted by David Greiss <da...@gmail.com>.
Thanks for the insight and the suggested workaround. My example was a bit
contrived but I am looking to filter on a grouped dataframe more analogous
to this:
tbl <- arrow_table(name = rownames(mtcars), mtcars)
tbl |>
group_by(cyl) |>
filter(mpg == max(mpg)) |>
collect()
The issue that Ian referenced suggests a workaround using left_join which
did the trick for me:
tbl <- arrow_table(name = rownames(mtcars), mtcars)
tbl |>
left_join(tbl |>
group_by(cyl) |>
summarize(max_mpg = max(mpg))
) |>
filter(mpg == max_mpg) |>
select(-max_mpg) |>
collect()
If there's any interest, I'd be happy to submit a PR to document these
workarounds.
Thanks again for the help and work on the package.
David
On Fri, May 19, 2023 at 3:58 PM Ian Cook <ia...@ursacomputing.com> wrote:
> There is an existing enhancement request for this feature at
> https://github.com/apache/arrow/issues/29537 but I don't think there
> is any work planned on this in the near future, so the workaround Neal
> suggested is the way to go for now.
>
> Ian
>
> On Fri, May 19, 2023 at 3:52 PM Neal Richardson
> <ne...@gmail.com> wrote:
> >
> > max is an aggregation, so it requires scanning all of the data.
> Filtering is a scalar (row by row operation), so to evaluate mpg >
> max(mpg), you have to pass over all of the data to compute the max, then
> pass through the data again to filter. This is trivial for data frames like
> mtcars, but imagine a dataset that can't be held in memory.
> >
> > One way query engines handle this is with window functions. If you use
> dbplyr, you get SQL with a window function:
> >
> > > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
> > <SQL>
> > SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
> > FROM (
> > SELECT *, MAX(mpg) OVER () AS q03
> > FROM mtcars
> > ) q01
> > WHERE (mpg > q03)
> >
> > Acero, the query engine in Arrow, does not currently support window
> functions. The easiest way for you to handle this today is probably to
> evaluate the max first, then pass that in to the filter:
> >
> > max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
> > tbl |> filter(mpg == max_mpg) |> collect()
> >
> > Neal
> >
> >
> > On Thu, May 18, 2023 at 10:08 PM David Greiss <da...@gmail.com>
> wrote:
> >>
> >> Hi
> >>
> >> The base R max function is not supported when used within `filter`:
> >>
> >> library(arrow)
> >> tbl <- arrow_table(name = rownames(mtcars), mtcars)
> >>
> >> tbl |>
> >> filter(mpg > max(mpg)) |>
> >> collect()
> >> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling data
> into R
> >>
> >> but this works:
> >>
> >> tbl |>
> >> summarize(x = max(mpg))
> >>
> >> Should this be supported or am I missing something
> >>
> >> Thanks for the help
> >> David
>
Re: [R] `max` in filter not supported
Posted by Ian Cook <ia...@ursacomputing.com>.
There is an existing enhancement request for this feature at
https://github.com/apache/arrow/issues/29537 but I don't think there
is any work planned on this in the near future, so the workaround Neal
suggested is the way to go for now.
Ian
On Fri, May 19, 2023 at 3:52 PM Neal Richardson
<ne...@gmail.com> wrote:
>
> max is an aggregation, so it requires scanning all of the data. Filtering is a scalar (row by row operation), so to evaluate mpg > max(mpg), you have to pass over all of the data to compute the max, then pass through the data again to filter. This is trivial for data frames like mtcars, but imagine a dataset that can't be held in memory.
>
> One way query engines handle this is with window functions. If you use dbplyr, you get SQL with a window function:
>
> > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
> <SQL>
> SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
> FROM (
> SELECT *, MAX(mpg) OVER () AS q03
> FROM mtcars
> ) q01
> WHERE (mpg > q03)
>
> Acero, the query engine in Arrow, does not currently support window functions. The easiest way for you to handle this today is probably to evaluate the max first, then pass that in to the filter:
>
> max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
> tbl |> filter(mpg == max_mpg) |> collect()
>
> Neal
>
>
> On Thu, May 18, 2023 at 10:08 PM David Greiss <da...@gmail.com> wrote:
>>
>> Hi
>>
>> The base R max function is not supported when used within `filter`:
>>
>> library(arrow)
>> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>
>> tbl |>
>> filter(mpg > max(mpg)) |>
>> collect()
>> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling data into R
>>
>> but this works:
>>
>> tbl |>
>> summarize(x = max(mpg))
>>
>> Should this be supported or am I missing something
>>
>> Thanks for the help
>> David
Re: [R] `max` in filter not supported
Posted by Neal Richardson <ne...@gmail.com>.
max is an aggregation, so it requires scanning all of the data. Filtering
is a scalar (row by row operation), so to evaluate mpg > max(mpg), you have
to pass over all of the data to compute the max, then pass through the data
again to filter. This is trivial for data frames like mtcars, but imagine a
dataset that can't be held in memory.
One way query engines handle this is with window functions. If you use
dbplyr, you get SQL with a window function:
> tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
<SQL>
SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
FROM (
SELECT *, MAX(mpg) OVER () AS q03
FROM mtcars
) q01
WHERE (mpg > q03)
Acero, the query engine in Arrow, does not currently support window
functions. The easiest way for you to handle this today is probably to
evaluate the max first, then pass that in to the filter:
max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
tbl |> filter(mpg == max_mpg) |> collect()
Neal
On Thu, May 18, 2023 at 10:08 PM David Greiss <da...@gmail.com>
wrote:
> Hi
>
> The base R max function is not supported when used within `filter`:
>
> library(arrow)
> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>
> tbl |>
> filter(mpg > max(mpg)) |>
> collect()
> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling data
> into R
>
> but this works:
>
> tbl |>
> summarize(x = max(mpg))
>
> Should this be supported or am I missing something
>
> Thanks for the help
> David
>