Posted to issues@arrow.apache.org by "egillax (via GitHub)" <gi...@apache.org> on 2023/03/09 16:44:43 UTC
[GitHub] [arrow] egillax opened a new issue, #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values
egillax opened a new issue, #34519:
URL: https://github.com/apache/arrow/issues/34519
### Describe the bug, including details regarding any error messages, version, and platform.
I was testing the latest arrow develop version using [this](https://arrow.apache.org/docs/dev/r/articles/install_nightly.html#install-from-git-repository) method to install from git.
With that build, I can no longer cast columns in a dataset: the cast column comes back as ```NA``` values.
I tried both Parquet and Arrow files. The same code works with the latest version on CRAN (11.0.0.3), and it also works when using Arrow tables instead of datasets.
Reprex:
``` r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
mtcars %>% write_dataset('./mtcars/')
ds <- open_dataset('./mtcars')
ds %>% dplyr::collect()
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
ds %>% dplyr::mutate(mpg=as.numeric(mpg)) %>% dplyr::collect()
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 NA 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 NA 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 NA 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 4 NA 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 5 NA 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 6 NA 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 7 NA 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 8 NA 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 9 NA 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 10 NA 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> 11 NA 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> 12 NA 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> 13 NA 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> 14 NA 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> 15 NA 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> 16 NA 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> 17 NA 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> 18 NA 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 19 NA 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 20 NA 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 21 NA 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> 22 NA 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> 23 NA 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> 24 NA 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> 25 NA 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> 26 NA 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> 27 NA 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> 28 NA 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> 29 NA 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> 30 NA 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> 31 NA 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> 32 NA 4 121.0 109 4.11 2.780 18.60 1 1 4 2
```
<sup>Created on 2023-03-09 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
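The report says the same cast works on Arrow tables. A minimal sketch of that in-memory path, for comparison with the dataset reprex above. `arrow_table()` is the arrow R package's Table constructor; it does not appear in the original report, so its use here is an assumption about how the Table case was run:

``` r
library(arrow)
library(dplyr)

# Assumption: build an in-memory Arrow Table instead of opening an on-disk dataset.
tbl_result <- arrow_table(mtcars) %>%
  mutate(mpg = as.numeric(mpg)) %>%
  collect()

# Count NA values in the cast column.
sum(is.na(tbl_result$mpg))
```

If the Table path is unaffected, as the report describes, the count should be 0, whereas the dataset reprex above yields `NA` for every row of `mpg`.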
<details>
<summary>Arrow Info</summary>
Arrow package version: 11.0.0.9000
Capabilities:
dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 FALSE
gcs FALSE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip FALSE
brotli FALSE
zstd FALSE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 FALSE
jemalloc FALSE
mimalloc TRUE
To reinstall with more optional capabilities enabled, see
https://arrow.apache.org/docs/r/articles/install.html
Memory:
Allocator mimalloc
Current 13.31 Kb
Max 46.31 Mb
Runtime:
SIMD Level avx2
Detected SIMD Level avx2
Build:
C++ Library Version 12.0.0-SNAPSHOT
C++ Compiler GNU
C++ Compiler Version 12.2.0
Git ID b679a96d426f4df1a2d15d452f312c968cdfc8f6
</details>
<details>
<summary>sessionInfo</summary>
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=nl_NL.UTF-8 LC_NAME=nl_NL.UTF-8
[9] LC_ADDRESS=nl_NL.UTF-8 LC_TELEPHONE=nl_NL.UTF-8 LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=nl_NL.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_11.0.0.9000 dplyr_1.0.10 PatientLevelPrediction_6.2.0.9000
loaded via a namespace (and not attached):
[1] pkgload_1.3.2 bit64_4.0.5 jsonlite_1.8.4 DatabaseConnector_6.0.0 R.utils_2.12.2
[6] shiny_1.7.4 assertthat_0.2.1 highr_0.10 blob_1.2.3 remotes_2.4.2
[11] yaml_2.3.6 sessioninfo_1.2.2 pillar_1.8.1 RSQLite_2.2.18 lattice_0.20-45
[16] glue_1.6.2 reticulate_1.26 digest_0.6.31 promises_1.2.0.1 htmltools_0.5.4
[21] httpuv_1.6.8 Matrix_1.5-1 R.oo_1.25.0 clipr_0.8.0 pkgconfig_2.0.3
[26] devtools_2.4.5 purrr_1.0.1 xtable_1.8-4 processx_3.8.0 later_1.3.0
[31] ParallelLogger_3.0.1 tibble_3.1.8 styler_1.9.0 generics_0.1.3 usethis_2.1.6
[36] ellipsis_0.3.2 cachem_1.0.6 withr_2.5.0 cli_3.6.0 magrittr_2.0.3
[41] crayon_1.5.2 mime_0.12 memoise_2.0.1 evaluate_0.20 ps_1.7.2
[46] R.methodsS3_1.8.2 Andromeda_1.0.0 fs_1.5.2 fansi_1.0.3 R.cache_0.16.0
[51] pkgbuild_1.4.0 SqlRender_1.12.0 profvis_0.3.7 tools_4.2.2 data.table_1.14.4
[56] prettyunits_1.1.1 lifecycle_1.0.3 stringr_1.5.0 reprex_2.0.2 callr_3.7.3
[61] compiler_4.2.2 rlang_1.0.6 grid_4.2.2 rstudioapi_0.14 htmlwidgets_1.6.1
[66] miniUI_0.1.1.1 rmarkdown_2.19 DBI_1.1.3 R6_2.5.1 knitr_1.41
[71] fastmap_1.1.0 bit_4.0.4 utf8_1.2.2 stringi_1.7.12 rJava_1.0-6
[76] parallel_4.2.2 Rcpp_1.0.9 vctrs_0.5.1 png_0.1-7 urlchecker_1.0.1
[81] tidyselect_1.2.0 FeatureExtraction_3.2.0 xfun_0.36
</details>
### Component(s)
R
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nealrichardson commented on issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values
Posted by "nealrichardson (via GitHub)" <gi...@apache.org>.
nealrichardson commented on issue #34519:
URL: https://github.com/apache/arrow/issues/34519#issuecomment-1470415515
I can take a look.
[GitHub] [arrow] westonpace closed issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values
Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace closed issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values
URL: https://github.com/apache/arrow/issues/34519
[GitHub] [arrow] eitsupi commented on issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values
Posted by "eitsupi (via GitHub)" <gi...@apache.org>.
eitsupi commented on issue #34519:
URL: https://github.com/apache/arrow/issues/34519#issuecomment-1465212660
I can reproduce this on 5b2fbade23eda9bc95b1e3854b19efff177cd0bd.
(Installed with the libarrow nightly binary arrow-11.0.0.100000193 on Ubuntu 22.04.)
[GitHub] [arrow] thisisnic commented on issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values
Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34519:
URL: https://github.com/apache/arrow/issues/34519#issuecomment-1469780866
Thanks for reporting this @egillax; I can confirm that this issue is not present in 11.0.0.3, but is present on the dev build. I'll investigate further.
[GitHub] [arrow] thisisnic commented on issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values
Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34519:
URL: https://github.com/apache/arrow/issues/34519#issuecomment-1469802891
A little more investigation of the specific circumstances in which it does and does not occur:
``` r
library(arrow)
library(dplyr)
# no problem when replacing column with self when there is just 1 column
df <- tibble::tibble(x = 1:10)
tf <- tempfile()
dir.create(tf)
write_dataset(df, tf)
open_dataset(tf) %>%
  mutate(x = as.numeric(x)) %>%
  collect()
#> # A tibble: 10 × 1
#> x
#> <dbl>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
#> 10 10
# NA values when there are 2 columns
df <- tibble::tibble(x = 1:10, y = 1:10)
tf <- tempfile()
dir.create(tf)
write_dataset(df, tf)
open_dataset(tf) %>%
  mutate(x = as.numeric(x)) %>%
  collect()
#> # A tibble: 10 × 2
#> x y
#> <dbl> <int>
#> 1 NA 1
#> 2 NA 2
#> 3 NA 3
#> 4 NA 4
#> 5 NA 5
#> 6 NA 6
#> 7 NA 7
#> 8 NA 8
#> 9 NA 9
#> 10 NA 10
# works fine if we're creating a brand new column
open_dataset(tf) %>%
  mutate(z = as.numeric(x)) %>%
  collect()
#> # A tibble: 10 × 3
#> x y z
#> <int> <int> <dbl>
#> 1 1 1 1
#> 2 2 2 2
#> 3 3 3 3
#> 4 4 4 4
#> 5 5 5 5
#> 6 6 6 6
#> 7 7 7 7
#> 8 8 8 8
#> 9 9 9 9
#> 10 10 10 10
# works fine if we're replacing a different column
open_dataset(tf) %>%
  mutate(y = as.numeric(x)) %>%
  collect()
#> # A tibble: 10 × 2
#> x y
#> <int> <dbl>
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 4
#> 5 5 5
#> 6 6 6
#> 7 7 7
#> 8 8 8
#> 9 9 9
#> 10 10 10
# works fine with in-memory datasets when replacing existing columns
InMemoryDataset$create(df) %>%
  mutate(x = as.numeric(x)) %>%
  collect()
#> # A tibble: 10 × 2
#> x y
#> <dbl> <int>
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 4
#> 5 5 5
#> 6 6 6
#> 7 7 7
#> 8 8 8
#> 9 9 9
#> 10 10 10
```
Given it works with 11.0.0.3 and not the dev version of the R package, and there are very few R code changes since 11.0.0.3, I'm inclined to think that this could be something happening at the C++ level. I'll try to narrow it down to the PR which caused this change.
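The failing case above (replacing an existing column in a multi-column on-disk dataset) can be folded into a single check that is convenient to rerun while testing different builds. This is only a sketch: the helper name `cast_matches_in_memory` is hypothetical, and it assumes the data frame has an integer column `x` to cast and a column `y` to sort on, as in the examples above.

``` r
library(arrow)
library(dplyr)

# Hypothetical helper: TRUE when casting an existing column of an on-disk
# dataset gives the same values as the same cast done by dplyr in memory.
cast_matches_in_memory <- function(df) {
  tf <- tempfile()
  dir.create(tf)
  on.exit(unlink(tf, recursive = TRUE), add = TRUE)
  write_dataset(df, tf)

  # Reference result computed on the plain data frame.
  expected <- df %>%
    mutate(x = as.numeric(x)) %>%
    arrange(y)

  # Same cast pushed through the on-disk dataset; sorted after collecting
  # so that row order cannot affect the comparison.
  actual <- open_dataset(tf) %>%
    mutate(x = as.numeric(x)) %>%
    collect() %>%
    arrange(y)

  isTRUE(all.equal(as.data.frame(actual), as.data.frame(expected)))
}

cast_matches_in_memory(tibble::tibble(x = 1:10, y = 1:10))
```

On a build that shows the behaviour described in this issue, the dataset result would contain `NA` in `x` and the helper would return `FALSE`.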
[GitHub] [arrow] thisisnic commented on issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values
Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34519:
URL: https://github.com/apache/arrow/issues/34519#issuecomment-1470297241
I've managed to narrow it down to #33770 which is where it first broke. CC @nealrichardson