Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/10/31 11:05:00 UTC

[jira] [Commented] (ARROW-18195) [R] case_when bug with NA's

    [ https://issues.apache.org/jira/browse/ARROW-18195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626531#comment-17626531 ] 

Nicola Crane commented on ARROW-18195:
--------------------------------------

Thanks for reporting this [~LMendy]!  I can confirm that this is reproducible, and I've added an extended reprex below.  It appears that it happens in some very specific circumstances: when there are 65 or more total values in the input column, and at least 1 is an NA value.


{code:r}
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

# Specific conditions where this happens: a table with one NA and 64 or more non-NA values
test_df = tibble::tibble(x = c(NA, rep("foo", 64)))
test_arrow = arrow_table(test_df)

# the non-arrow version; all the final values are 1
test_df %>%
  mutate(y = case_when(x == 'foo' ~ 1, is.na(x) ~ NA_real_)) %>%
  tail()
#> # A tibble: 6 × 2
#>   x         y
#>   <chr> <dbl>
#> 1 foo       1
#> 2 foo       1
#> 3 foo       1
#> 4 foo       1
#> 5 foo       1
#> 6 foo       1

# the arrow version; the final value is NA
test_arrow %>%
  mutate(y = case_when(x == 'foo' ~ 1, is.na(x) ~ NA_real_)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 2
#>   x         y
#>   <chr> <dbl>
#> 1 foo       1
#> 2 foo       1
#> 3 foo       1
#> 4 foo       1
#> 5 foo       1
#> 6 foo      NA

# it's fine if there are fewer than 65 values in the table (even though it still contains an NA)
test_arrow[1:64,] %>%
  mutate(y = case_when(x == 'foo' ~ 1, is.na(x) ~ NA_real_)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 2
#>   x         y
#>   <chr> <dbl>
#> 1 foo       1
#> 2 foo       1
#> 3 foo       1
#> 4 foo       1
#> 5 foo       1
#> 6 foo       1

# everything is fine when the comparison is being done on doubles and return value is char
test_df2 = tibble::tibble(x = c(NA, rep(1, 64)))
test_arrow2 = arrow_table(test_df2)
test_arrow2 %>%
  mutate(y = case_when(x == 1 ~ "winning", is.na(x) ~ NA_character_)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 2
#>       x y      
#>   <dbl> <chr>  
#> 1     1 winning
#> 2     1 winning
#> 3     1 winning
#> 4     1 winning
#> 5     1 winning
#> 6     1 winning

# also breaks when source value is boolean and target value is double
test_df3 = tibble::tibble(x = c(NA, rep(TRUE, 64)))
test_arrow3 = arrow_table(test_df3)
test_arrow3 %>%
  mutate(y = case_when(x == TRUE ~ 1, is.na(x) ~ NA_real_)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 2
#>   x         y
#>   <lgl> <dbl>
#> 1 TRUE      1
#> 2 TRUE      1
#> 3 TRUE      1
#> 4 TRUE      1
#> 5 TRUE      1
#> 6 TRUE     NA

# also broken when the target is integer
test_df4 = tibble::tibble(x = c(NA, rep(TRUE, 64)))
test_arrow4 = arrow_table(test_df4)
test_arrow4 %>%
  mutate(y = case_when(x == TRUE ~ 1L, is.na(x) ~ 2L)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 2
#>   x         y
#>   <lgl> <int>
#> 1 TRUE      1
#> 2 TRUE      1
#> 3 TRUE      1
#> 4 TRUE      1
#> 5 TRUE      1
#> 6 TRUE     NA

# broken for logical to logical
test_df5 = tibble::tibble(x = c(NA, rep(TRUE, 64)))
test_arrow5 = arrow_table(test_df5)
test_arrow5 %>%
  mutate(y = case_when(x == TRUE ~ TRUE, is.na(x) ~ FALSE)) %>%
  collect() %>%
  tail()
#> # A tibble: 6 × 2
#>   x     y    
#>   <lgl> <lgl>
#> 1 TRUE  TRUE 
#> 2 TRUE  TRUE 
#> 3 TRUE  TRUE 
#> 4 TRUE  TRUE 
#> 5 TRUE  TRUE 
#> 6 TRUE  NA
{code}
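
To narrow down the length threshold, the comparison above could be wrapped in a small helper and run over several column lengths. This is only a sketch: {{case_when_matches()}} and the chosen sizes are illustrative names and values, not part of arrow, and the result will depend on the installed arrow version.

```r
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

# Hypothetical helper (not part of arrow): builds a column with one NA followed
# by `n_non_na` copies of "foo", runs the same case_when() through dplyr and
# through arrow, and returns TRUE when both give the same `y` column.
case_when_matches <- function(n_non_na) {
  df <- tibble::tibble(x = c(NA, rep("foo", n_non_na)))

  expected <- df %>%
    mutate(y = case_when(x == "foo" ~ 1, is.na(x) ~ NA_real_))

  actual <- arrow_table(df) %>%
    mutate(y = case_when(x == "foo" ~ 1, is.na(x) ~ NA_real_)) %>%
    collect()

  isTRUE(all.equal(expected$y, actual$y))
}

# Sizes chosen around the 64/65 boundary described above (an assumption about
# where the behaviour changes, to be confirmed by running this).
sizes <- c(62, 63, 64, 65, 127, 128)
data.frame(
  n_non_na = sizes,
  matches = vapply(sizes, case_when_matches, logical(1))
)
```

Rows where {{matches}} is FALSE would mark the column lengths at which the arrow result diverges from the dplyr result.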

CC [~westonpace]

> [R] case_when bug with NA's
> ---------------------------
>
>                 Key: ARROW-18195
>                 URL: https://issues.apache.org/jira/browse/ARROW-18195
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 10.0.0
>            Reporter: Lee Mendelowitz
>            Priority: Major
>         Attachments: test_issue.R
>
>
> There appears to be a bug when processing an Arrow table with NA values and using `dplyr::case_when`. A reproducible example is below: the output from arrow table processing does not match the output when processing a tibble. If the NA's are removed from the dataframe, then the outputs match.
> {noformat}
> ``` r
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #>     intersect, setdiff, setequal, union
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> library(assertthat)
> play_results = c('single', 'double', 'triple', 'home_run')
> nrows = 1000
> # Change frac_na to 0, and the result error disappears.
> frac_na = 0.05
> # Create a test dataframe with NA values
> test_df = tibble(
>         play_result = sample(play_results, nrows, replace = TRUE)
>     ) %>%
>     mutate(
>         play_result = ifelse(runif(nrows) < frac_na, NA_character_, play_result)
>     )
>     
> test_arrow = arrow_table(test_df)
> process_plays = function(df) {
>     df %>%
>         mutate(
>             avg = case_when(
>                 play_result == 'single' ~ 1,
>                 play_result == 'double' ~ 1,
>                 play_result == 'triple' ~ 1,
>                 play_result == 'home_run' ~ 1,
>                 is.na(play_result) ~ NA_real_,
>                 TRUE ~ 0
>             )
>         ) %>%
>         count(play_result, avg) %>%
>         arrange(play_result)
> }
> # Compare arrow_table result to tibble result
> result_tibble = process_plays(test_df)
> result_arrow = process_plays(test_arrow) %>% collect()
> assertthat::assert_that(identical(result_tibble, result_arrow))
> #> Error: result_tibble not identical to result_arrow
> ```
> <sup>Created on 2022-10-29 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
> {noformat}
> I have reproduced this issue both on Mac OS and Ubuntu 20.04.
>  
> {noformat}
> ```
> r$> sessionInfo()
> R version 4.2.1 (2022-06-23)
> Platform: aarch64-apple-darwin21.5.0 (64-bit)
> Running under: macOS Monterey 12.5.1
> Matrix products: default
> BLAS:   /opt/homebrew/Cellar/openblas/0.3.20/lib/libopenblasp-r0.3.20.dylib
> LAPACK: /opt/homebrew/Cellar/r/4.2.1/lib/R/lib/libRlapack.dylib
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
> other attached packages:
> [1] assertthat_0.2.1 arrow_10.0.0     dplyr_1.0.10
> loaded via a namespace (and not attached):
>  [1] compiler_4.2.1    pillar_1.8.1      highr_0.9         R.methodsS3_1.8.2 R.utils_2.12.0    tools_4.2.1       bit_4.0.4         digest_0.6.29
>  [9] evaluate_0.15     lifecycle_1.0.1   tibble_3.1.8      R.cache_0.16.0    pkgconfig_2.0.3   rlang_1.0.5       reprex_2.0.2      DBI_1.1.2
> [17] cli_3.3.0         rstudioapi_0.13   yaml_2.3.5        xfun_0.31         fastmap_1.1.0     withr_2.5.0       styler_1.8.0      knitr_1.39
> [25] generics_0.1.3    fs_1.5.2          vctrs_0.4.1       bit64_4.0.5       tidyselect_1.1.2  glue_1.6.2        R6_2.5.1          processx_3.5.3
> [33] fansi_1.0.3       rmarkdown_2.14    purrr_0.3.4       callr_3.7.0       clipr_0.8.0       magrittr_2.0.3    ellipsis_0.3.2    ps_1.7.0
> [41] htmltools_0.5.3   renv_0.16.0       utf8_1.2.2        R.oo_1.25.0
> ```
> {noformat}


