Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/12/12 16:05:00 UTC
[jira] [Updated] (ARROW-17361) [R] dplyr::summarize fails with division when divisor is a variable

     [ https://issues.apache.org/jira/browse/ARROW-17361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dewey Dunnington updated ARROW-17361:
-------------------------------------
    Fix Version/s: 11.0.0

> [R] dplyr::summarize fails with division when divisor is a variable
> -------------------------------------------------------------------
>
>                 Key: ARROW-17361
>                 URL: https://issues.apache.org/jira/browse/ARROW-17361
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 8.0.0
>            Reporter: Oliver Reiter
>            Priority: Minor
>              Labels: aggregation, dplyr
>             Fix For: 11.0.0
>
>
> Hello,
> I found this odd behaviour when trying to compute an aggregate with dplyr::summarize: when I use a pre-defined variable as the divisor while aggregating, execution fails with an 'unsupported expression' error. When I write the value of the variable directly in the aggregation, it works.
>  
> See below:
>  
> {code:java}
> library(dplyr)
> library(arrow)
> small_dataset <- tibble::tibble(
>   ## x = rep(c("a", "b"), each = 5),
>   y = rep(1:5, 2)
> )
> ## convert "small_dataset" into a ...dataset
> tmpdir <- tempfile()
> dir.create(tmpdir)
> write_dataset(small_dataset, tmpdir)
> ## works
> open_dataset(tmpdir) %>%
>   summarize(value = sum(y) / 10) %>%
>   collect()
> ## fails
> scale_factor <- 10
> open_dataset(tmpdir) %>%
>   summarize(value = sum(y) / scale_factor) %>%
>   collect()
> #> Error: Error in summarize_eval(names(exprs)[i],
> #> exprs[[i]], ctx, length(.data$group_by_vars) > :
> #   Expression sum(y)/scale_factor is not an aggregate
> #   expression or is not supported in Arrow
> # Call collect() first to pull data into R.
>    {code}
> I was not sure how to name this issue/bug (if it is one), so if there is a clearer, more descriptive title, you're welcome to adjust it.
>  
> Thanks for your work!
>  
> Oliver
>  
> {code:java}
> > arrow_info()
> Arrow package version: 8.0.0
> Capabilities:
>                
> dataset    TRUE
> substrait FALSE
> parquet    TRUE
> json       TRUE
> s3         TRUE
> utf8proc   TRUE
> re2        TRUE
> snappy     TRUE
> gzip       TRUE
> brotli     TRUE
> zstd       TRUE
> lz4        TRUE
> lz4_frame  TRUE
> lzo       FALSE
> bz2        TRUE
> jemalloc   TRUE
> mimalloc   TRUE
> Memory:
>                   
> Allocator jemalloc
> Current   64 bytes
> Max       41.25 Kb
> Runtime:
>                         
> SIMD Level          avx2
> Detected SIMD Level avx2
> Build:
>                            
> C++ Library Version   8.0.0
> C++ Compiler            GNU
> C++ Compiler Version 12.1.0 {code}
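
A possible workaround for affected versions, offered as a hedged sketch and not as anything stated in the report: since the literal 10 works where the variable scale_factor does not, one option is to inline the variable's value with rlang's `!!` operator when the expression is captured; another is to aggregate in Arrow and do the division in R after collect(). The setup mirrors the reproduction in the report; the names `inlined` and `divided_in_r` are illustrative only, and neither option has been checked against every affected Arrow version.

```r
library(dplyr)
library(arrow)

# Same setup as the reproduction in the report.
tmpdir <- tempfile()
dir.create(tmpdir)
write_dataset(tibble::tibble(y = rep(1:5, 2)), tmpdir)
scale_factor <- 10

# Option 1 (assumption: `!!` is resolved when the expression is captured,
# so Arrow sees sum(y) / 10, the form the report says works).
inlined <- open_dataset(tmpdir) %>%
  summarize(value = sum(y) / !!scale_factor) %>%
  collect()

# Option 2: aggregate in Arrow, then divide in plain R after collect().
divided_in_r <- open_dataset(tmpdir) %>%
  summarize(value = sum(y)) %>%
  collect() %>%
  mutate(value = value / scale_factor)
```

Option 1 keeps the whole computation inside Arrow; option 2 is the route the error message itself suggests ("Call collect() first to pull data into R") and only pulls the single aggregated row into R.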



--
This message was sent by Atlassian Jira
(v8.20.10#820010)