You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@arrow.apache.org by David Greiss <da...@gmail.com> on 2023/05/19 02:07:31 UTC

[R] `max` in filter not supported

Hi

The base R max function is not supported when used within `filter`:

library(arrow)
tbl <- arrow_table(name = rownames(mtcars), mtcars)

tbl |>
filter(mpg > max(mpg)) |>
collect()
Warning: Expression mpg > max(mpg) not supported in Arrow; pulling data
into R

but this works:

tbl |>
summarize(x = max(mpg))

Should this be supported or am I missing something

Thanks for the help
David

Re: [R] `max` in filter not supported

Posted by Jonathan Keane <jk...@gmail.com>.
There's also the cookbook https://arrow.apache.org/cookbook/r/ which might
also be a good place for how-to examples like these. The repository to
contribute to those is https://github.com/apache/arrow-cookbook

-Jon


On Sat, May 20, 2023 at 7:31 AM Neal Richardson <ne...@gmail.com>
wrote:

> Sure, if you wanted to add to the docs, somewhere in this section is
> probably the best place:
> https://github.com/apache/arrow/blob/main/r/vignettes/data_wrangling.Rmd#L124
>
> On Fri, May 19, 2023 at 5:34 PM David Greiss <da...@gmail.com>
> wrote:
>
>> Thanks for the insight and the suggested workaround. My example was a bit
>> contrived but I am looking to filter on a grouped dataframe more analogous
>> to this:
>>
>> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>
>> tbl |>
>> group_by(cyl) |>
>> filter(mpg == max(mpg)) |>
>> collect()
>>
>> The issue that Ian referenced suggests a workaround using left_join which
>> did the trick for me:
>>
>> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>
>> tbl |>
>> left_join(tbl |>
>>   group_by(cyl) |>
>>   summarize(max_mpg = max(mpg))
>> ) |>
>> filter(mpg == max_mpg) |>
>> select(-max_mpg) |>
>> collect()
>>
>> If there's any interest, I'd be happy to submit a PR to document these
>> workarounds.
>>
>> Thanks again for the help and work on the package.
>>
>> David
>>
>>
>> On Fri, May 19, 2023 at 3:58 PM Ian Cook <ia...@ursacomputing.com> wrote:
>>
>>> There is an existing enhancement request for this feature at
>>> https://github.com/apache/arrow/issues/29537 but I don't think there
>>> is any work planned on this in the near future, so the workaround Neal
>>> suggested is the way to go for now.
>>>
>>> Ian
>>>
>>> On Fri, May 19, 2023 at 3:52 PM Neal Richardson
>>> <ne...@gmail.com> wrote:
>>> >
>>> > max is an aggregation, so it requires scanning all of the data.
>>> Filtering is a scalar (row by row operation), so to evaluate mpg >
>>> max(mpg), you have to pass over all of the data to compute the max, then
>>> pass through the data again to filter. This is trivial for data frames like
>>> mtcars, but imagine a dataset that can't be held in memory.
>>> >
>>> > One way query engines handle this is with window functions. If you use
>>> dbplyr, you get SQL with a window function:
>>> >
>>> > > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
>>> > <SQL>
>>> > SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
>>> > FROM (
>>> >   SELECT *, MAX(mpg) OVER () AS q03
>>> >   FROM mtcars
>>> > ) q01
>>> > WHERE (mpg > q03)
>>> >
>>> > Acero, the query engine in Arrow, does not currently support window
>>> functions. The easiest way for you to handle this today is probably to
>>> evaluate the max first, then pass that in to the filter:
>>> >
>>> > max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
>>> > tbl |> filter(mpg == max_mpg) |> collect()
>>> >
>>> > Neal
>>> >
>>> >
>>> > On Thu, May 18, 2023 at 10:08 PM David Greiss <da...@gmail.com>
>>> wrote:
>>> >>
>>> >> Hi
>>> >>
>>> >> The base R max function is not supported when used within `filter`:
>>> >>
>>> >> library(arrow)
>>> >> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>> >>
>>> >> tbl |>
>>> >> filter(mpg > max(mpg)) |>
>>> >> collect()
>>> >> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling
>>> data into R
>>> >>
>>> >> but this works:
>>> >>
>>> >> tbl |>
>>> >> summarize(x = max(mpg))
>>> >>
>>> >> Should this be supported or am I missing something
>>> >>
>>> >> Thanks for the help
>>> >> David
>>>
>>

Re: [R] `max` in filter not supported

Posted by Neal Richardson <ne...@gmail.com>.
Sure, if you wanted to add to the docs, somewhere in this section is
probably the best place:
https://github.com/apache/arrow/blob/main/r/vignettes/data_wrangling.Rmd#L124

On Fri, May 19, 2023 at 5:34 PM David Greiss <da...@gmail.com>
wrote:

> Thanks for the insight and the suggested workaround. My example was a bit
> contrived but I am looking to filter on a grouped dataframe more analogous
> to this:
>
> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>
> tbl |>
> group_by(cyl) |>
> filter(mpg == max(mpg)) |>
> collect()
>
> The issue that Ian referenced suggests a workaround using left_join which
> did the trick for me:
>
> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>
> tbl |>
> left_join(tbl |>
>   group_by(cyl) |>
>   summarize(max_mpg = max(mpg))
> ) |>
> filter(mpg == max_mpg) |>
> select(-max_mpg) |>
> collect()
>
> If there's any interest, I'd be happy to submit a PR to document these
> workarounds.
>
> Thanks again for the help and work on the package.
>
> David
>
>
> On Fri, May 19, 2023 at 3:58 PM Ian Cook <ia...@ursacomputing.com> wrote:
>
>> There is an existing enhancement request for this feature at
>> https://github.com/apache/arrow/issues/29537 but I don't think there
>> is any work planned on this in the near future, so the workaround Neal
>> suggested is the way to go for now.
>>
>> Ian
>>
>> On Fri, May 19, 2023 at 3:52 PM Neal Richardson
>> <ne...@gmail.com> wrote:
>> >
>> > max is an aggregation, so it requires scanning all of the data.
>> Filtering is a scalar (row by row operation), so to evaluate mpg >
>> max(mpg), you have to pass over all of the data to compute the max, then
>> pass through the data again to filter. This is trivial for data frames like
>> mtcars, but imagine a dataset that can't be held in memory.
>> >
>> > One way query engines handle this is with window functions. If you use
>> dbplyr, you get SQL with a window function:
>> >
>> > > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
>> > <SQL>
>> > SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
>> > FROM (
>> >   SELECT *, MAX(mpg) OVER () AS q03
>> >   FROM mtcars
>> > ) q01
>> > WHERE (mpg > q03)
>> >
>> > Acero, the query engine in Arrow, does not currently support window
>> functions. The easiest way for you to handle this today is probably to
>> evaluate the max first, then pass that in to the filter:
>> >
>> > max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
>> > tbl |> filter(mpg == max_mpg) |> collect()
>> >
>> > Neal
>> >
>> >
>> > On Thu, May 18, 2023 at 10:08 PM David Greiss <da...@gmail.com>
>> wrote:
>> >>
>> >> Hi
>> >>
>> >> The base R max function is not supported when used within `filter`:
>> >>
>> >> library(arrow)
>> >> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>> >>
>> >> tbl |>
>> >> filter(mpg > max(mpg)) |>
>> >> collect()
>> >> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling
>> data into R
>> >>
>> >> but this works:
>> >>
>> >> tbl |>
>> >> summarize(x = max(mpg))
>> >>
>> >> Should this be supported or am I missing something
>> >>
>> >> Thanks for the help
>> >> David
>>
>

Re: [R] `max` in filter not supported

Posted by David Greiss <da...@gmail.com>.
Thanks for the insight and the suggested workaround. My example was a bit
contrived but I am looking to filter on a grouped dataframe more analogous
to this:

tbl <- arrow_table(name = rownames(mtcars), mtcars)

tbl |>
group_by(cyl) |>
filter(mpg == max(mpg)) |>
collect()

The issue that Ian referenced suggests a workaround using left_join which
did the trick for me:

tbl <- arrow_table(name = rownames(mtcars), mtcars)

tbl |>
left_join(tbl |>
  group_by(cyl) |>
  summarize(max_mpg = max(mpg))
) |>
filter(mpg == max_mpg) |>
select(-max_mpg) |>
collect()

If there's any interest, I'd be happy to submit a PR to document these
workarounds.

Thanks again for the help and work on the package.

David


On Fri, May 19, 2023 at 3:58 PM Ian Cook <ia...@ursacomputing.com> wrote:

> There is an existing enhancement request for this feature at
> https://github.com/apache/arrow/issues/29537 but I don't think there
> is any work planned on this in the near future, so the workaround Neal
> suggested is the way to go for now.
>
> Ian
>
> On Fri, May 19, 2023 at 3:52 PM Neal Richardson
> <ne...@gmail.com> wrote:
> >
> > max is an aggregation, so it requires scanning all of the data.
> Filtering is a scalar (row by row operation), so to evaluate mpg >
> max(mpg), you have to pass over all of the data to compute the max, then
> pass through the data again to filter. This is trivial for data frames like
> mtcars, but imagine a dataset that can't be held in memory.
> >
> > One way query engines handle this is with window functions. If you use
> dbplyr, you get SQL with a window function:
> >
> > > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
> > <SQL>
> > SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
> > FROM (
> >   SELECT *, MAX(mpg) OVER () AS q03
> >   FROM mtcars
> > ) q01
> > WHERE (mpg > q03)
> >
> > Acero, the query engine in Arrow, does not currently support window
> functions. The easiest way for you to handle this today is probably to
> evaluate the max first, then pass that in to the filter:
> >
> > max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
> > tbl |> filter(mpg == max_mpg) |> collect()
> >
> > Neal
> >
> >
> > On Thu, May 18, 2023 at 10:08 PM David Greiss <da...@gmail.com>
> wrote:
> >>
> >> Hi
> >>
> >> The base R max function is not supported when used within `filter`:
> >>
> >> library(arrow)
> >> tbl <- arrow_table(name = rownames(mtcars), mtcars)
> >>
> >> tbl |>
> >> filter(mpg > max(mpg)) |>
> >> collect()
> >> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling data
> into R
> >>
> >> but this works:
> >>
> >> tbl |>
> >> summarize(x = max(mpg))
> >>
> >> Should this be supported or am I missing something
> >>
> >> Thanks for the help
> >> David
>

Re: [R] `max` in filter not supported

Posted by Ian Cook <ia...@ursacomputing.com>.
There is an existing enhancement request for this feature at
https://github.com/apache/arrow/issues/29537 but I don't think there
is any work planned on this in the near future, so the workaround Neal
suggested is the way to go for now.

Ian

On Fri, May 19, 2023 at 3:52 PM Neal Richardson
<ne...@gmail.com> wrote:
>
> max is an aggregation, so it requires scanning all of the data. Filtering is a scalar (row by row operation), so to evaluate mpg > max(mpg), you have to pass over all of the data to compute the max, then pass through the data again to filter. This is trivial for data frames like mtcars, but imagine a dataset that can't be held in memory.
>
> One way query engines handle this is with window functions. If you use dbplyr, you get SQL with a window function:
>
> > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
> <SQL>
> SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
> FROM (
>   SELECT *, MAX(mpg) OVER () AS q03
>   FROM mtcars
> ) q01
> WHERE (mpg > q03)
>
> Acero, the query engine in Arrow, does not currently support window functions. The easiest way for you to handle this today is probably to evaluate the max first, then pass that in to the filter:
>
> max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
> tbl |> filter(mpg == max_mpg) |> collect()
>
> Neal
>
>
> On Thu, May 18, 2023 at 10:08 PM David Greiss <da...@gmail.com> wrote:
>>
>> Hi
>>
>> The base R max function is not supported when used within `filter`:
>>
>> library(arrow)
>> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>
>> tbl |>
>> filter(mpg > max(mpg)) |>
>> collect()
>> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling data into R
>>
>> but this works:
>>
>> tbl |>
>> summarize(x = max(mpg))
>>
>> Should this be supported or am I missing something
>>
>> Thanks for the help
>> David

Re: [R] `max` in filter not supported

Posted by Neal Richardson <ne...@gmail.com>.
max is an aggregation, so it requires scanning all of the data. Filtering
is a scalar (row by row operation), so to evaluate mpg > max(mpg), you have
to pass over all of the data to compute the max, then pass through the data
again to filter. This is trivial for data frames like mtcars, but imagine a
dataset that can't be held in memory.

One way query engines handle this is with window functions. If you use
dbplyr, you get SQL with a window function:

> tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
<SQL>
SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
FROM (
  SELECT *, MAX(mpg) OVER () AS q03
  FROM mtcars
) q01
WHERE (mpg > q03)

Acero, the query engine in Arrow, does not currently support window
functions. The easiest way for you to handle this today is probably to
evaluate the max first, then pass that in to the filter:

max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
tbl |> filter(mpg == max_mpg) |> collect()

Neal


On Thu, May 18, 2023 at 10:08 PM David Greiss <da...@gmail.com>
wrote:

> Hi
>
> The base R max function is not supported when used within `filter`:
>
> library(arrow)
> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>
> tbl |>
> filter(mpg > max(mpg)) |>
> collect()
> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling data
> into R
>
> but this works:
>
> tbl |>
> summarize(x = max(mpg))
>
> Should this be supported or am I missing something
>
> Thanks for the help
> David
>