Posted to jira@arrow.apache.org by "Lee Mendelowitz (Jira)" <ji...@apache.org> on 2022/10/29 15:51:00 UTC
[jira] [Created] (ARROW-18195) R case_when bug with NA's
Lee Mendelowitz created ARROW-18195:
---------------------------------------
Summary: R case_when bug with NA's
Key: ARROW-18195
URL: https://issues.apache.org/jira/browse/ARROW-18195
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 10.0.0
Reporter: Lee Mendelowitz
There appears to be a bug when `dplyr::case_when` is applied to an Arrow table that contains NA values. A reproducible example is below: the result of processing the Arrow table does not match the result of processing the equivalent tibble. If the NAs are removed from the data frame, the outputs match.
``` r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
library(assertthat)
play_results = c('single', 'double', 'triple', 'home_run')
nrows = 1000
# Change frac_na to 0, and the result error disappears.
frac_na = 0.05
# Create a test dataframe with NA values
test_df = tibble(
  play_result = sample(play_results, nrows, replace = TRUE)
) %>%
  mutate(
    play_result = ifelse(runif(nrows) < frac_na, NA_character_, play_result)
  )
test_arrow = arrow_table(test_df)
process_plays = function(df) {
  df %>%
    mutate(
      avg = case_when(
        play_result == 'single' ~ 1,
        play_result == 'double' ~ 1,
        play_result == 'triple' ~ 1,
        play_result == 'home_run' ~ 1,
        is.na(play_result) ~ NA_real_,
        TRUE ~ 0
      )
    ) %>%
    count(play_result, avg) %>%
    arrange(play_result)
}
# Compare arrow_table result to tibble result
result_tibble = process_plays(test_df)
result_arrow = process_plays(test_arrow) %>% collect()
assertthat::assert_that(identical(result_tibble, result_arrow))
#> Error: result_tibble not identical to result_arrow
```
<sup>Created on 2022-10-29 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
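For reference, the behaviour dplyr applies to a plain tibble can be checked on a small deterministic input. The sketch below is not part of the original report: `small_df` and the value `'walk'` are hypothetical, chosen only so that every branch of the `case_when` is exercised. Because `NA == 'single'` evaluates to `NA`, and `case_when` treats an `NA` condition as not matched, the NA row falls through to the `is.na()` branch.

``` r
library(dplyr)

# Hypothetical input for illustration: one NA and one value ('walk')
# that matches none of the named plays.
small_df = tibble(play_result = c('single', NA, 'home_run', 'walk'))

expected = small_df %>%
  mutate(
    avg = case_when(
      play_result == 'single' ~ 1,
      play_result == 'double' ~ 1,
      play_result == 'triple' ~ 1,
      play_result == 'home_run' ~ 1,
      is.na(play_result) ~ NA_real_, # reached by the NA row
      TRUE ~ 0                       # reached by 'walk'
    )
  )

expected$avg
#> [1]  1 NA  1  0
```

This is the tibble-side result that the Arrow pipeline would be expected to reproduce for the same input.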
I have reproduced this issue on both macOS and Ubuntu 20.04.
```
r$> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin21.5.0 (64-bit)
Running under: macOS Monterey 12.5.1
Matrix products: default
BLAS: /opt/homebrew/Cellar/openblas/0.3.20/lib/libopenblasp-r0.3.20.dylib
LAPACK: /opt/homebrew/Cellar/r/4.2.1/lib/R/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] assertthat_0.2.1 arrow_10.0.0 dplyr_1.0.10
loaded via a namespace (and not attached):
[1] compiler_4.2.1 pillar_1.8.1 highr_0.9 R.methodsS3_1.8.2 R.utils_2.12.0 tools_4.2.1 bit_4.0.4 digest_0.6.29
[9] evaluate_0.15 lifecycle_1.0.1 tibble_3.1.8 R.cache_0.16.0 pkgconfig_2.0.3 rlang_1.0.5 reprex_2.0.2 DBI_1.1.2
[17] cli_3.3.0 rstudioapi_0.13 yaml_2.3.5 xfun_0.31 fastmap_1.1.0 withr_2.5.0 styler_1.8.0 knitr_1.39
[25] generics_0.1.3 fs_1.5.2 vctrs_0.4.1 bit64_4.0.5 tidyselect_1.1.2 glue_1.6.2 R6_2.5.1 processx_3.5.3
[33] fansi_1.0.3 rmarkdown_2.14 purrr_0.3.4 callr_3.7.0 clipr_0.8.0 magrittr_2.0.3 ellipsis_0.3.2 ps_1.7.0
[41] htmltools_0.5.3 renv_0.16.0 utf8_1.2.2 R.oo_1.25.0
```
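`identical()` only reports that the two results differ, not where. One way to narrow this down, sketched here under stated assumptions, is to list the rows present in one result but not the other. `diff_rows` is a hypothetical helper, not something from the report or from arrow, and the two tibbles below are made-up data rather than the actual outputs.

``` r
library(dplyr)

# Hypothetical helper: rows that appear in one data frame but not the other.
# dplyr joins match NA to NA by default (na_matches = "na"), so NA keys are
# compared like any other value.
diff_rows = function(a, b) {
  list(
    only_in_a = anti_join(a, b, by = names(a)),
    only_in_b = anti_join(b, a, by = names(b))
  )
}

# Made-up inputs for illustration only; NOT the results from the report.
a = tibble(
  play_result = c('double', 'single', NA),
  avg = c(1, 1, NA),
  n = c(3L, 4L, 1L)
)
b = tibble(
  play_result = c('double', 'single', NA),
  avg = c(1, 1, 0),
  n = c(3L, 4L, 1L)
)

d = diff_rows(a, b)
d$only_in_a # the NA row with avg = NA
d$only_in_b # the NA row with avg = 0
```

With the reprex objects, the equivalent call would be `diff_rows(result_tibble, result_arrow)`.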