Posted to jira@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2021/07/22 13:46:00 UTC

[jira] [Created] (ARROW-13434) [R] group_by() with an expression

Jonathan Keane created ARROW-13434:
--------------------------------------

             Summary: [R] group_by() with an expression
                 Key: ARROW-13434
                 URL: https://issues.apache.org/jira/browse/ARROW-13434
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Jonathan Keane


With dplyr, when we call group_by() with an expression, a column holding the result of that expression is added to the data frame.

{code}
> example_data %>% 
+   group_by(int < 4) %>% collect()
# A tibble: 10 x 8
# Groups:   int < 4 [3]
     int   dbl  dbl2 lgl   false chr   fct   `int < 4`
   <int> <dbl> <dbl> <lgl> <lgl> <chr> <fct> <lgl>    
 1     1   1.1     5 TRUE  FALSE a     a     TRUE     
 2     2   2.1     5 NA    FALSE b     b     TRUE     
 3     3   3.1     5 TRUE  FALSE c     c     TRUE     
 4    NA   4.1     5 FALSE FALSE d     d     NA       
 5     5   5.1     5 TRUE  FALSE e     NA    FALSE    
 6     6   6.1     5 NA    FALSE NA    NA    FALSE    
 7     7   7.1     5 NA    FALSE g     g     FALSE    
 8     8   8.1     5 FALSE FALSE h     h     FALSE    
 9     9  NA       5 FALSE FALSE i     i     FALSE    
10    10  10.1     5 NA    FALSE j     j     FALSE    
{code}

Arrow doesn't do this, however. It looks up the expression as if it were a field name and errors:

{code}
> Table$create(example_data) %>% 
+   group_by(int < 4) %>% collect()
 Error: Invalid: No match for FieldRef.Name(int < 4) in int: int32
dbl: double
dbl2: double
lgl: bool
false: bool
chr: string
fct: dictionary<values=string, indices=int8, ordered=0> 
{code}

This isn't a big deal right now since grouped aggregations aren't (quite) here yet, but once we support them, people will write code like this. This might actually be something we need or want to do in C++ rather than in the R client.

The workaround is relatively simple: compute the expression in a mutate(), then group_by() the resulting column.
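
A minimal sketch of that workaround, assuming the arrow and dplyr packages are installed. The example_data defined here is a cut-down stand-in for the data shown above (only the int and chr columns), and the column name int_lt_4 is made up for illustration:

{code}
library(dplyr)
library(arrow)

# Stand-in for example_data: only the columns needed for the illustration.
example_data <- tibble::tibble(
  int = c(1:3, NA_integer_, 5:10),
  chr = letters[1:10]
)

# Materialize the expression as a named column first, then group by that column.
# int_lt_4 is an illustrative name, not something Arrow or dplyr generates.
result <- Table$create(example_data) %>%
  mutate(int_lt_4 = int < 4) %>%
  group_by(int_lt_4) %>%
  collect()
{code}

Note that the grouping column is then named int_lt_4 rather than `int < 4`, so the result does not exactly match what dplyr produces for the expression form.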



--
This message was sent by Atlassian Jira
(v8.3.4#803005)