Posted to issues@arrow.apache.org by "egillax (via GitHub)" <gi...@apache.org> on 2023/03/09 16:44:43 UTC

[GitHub] [arrow] egillax opened a new issue, #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values

egillax opened a new issue, #34519:
URL: https://github.com/apache/arrow/issues/34519

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I was testing the latest arrow develop version using [this](https://arrow.apache.org/docs/dev/r/articles/install_nightly.html#install-from-git-repository) method to install from git. 
   
   With that build I can no longer cast columns in a dataset; the cast column comes back as ```NA``` values.
   
   I tried both parquet and arrow files. The same code works with the latest version on CRAN (11.0.0.3), and with arrow tables instead of datasets.
   
   Reprex:
   
   ``` r
   library(dplyr)
   #> 
   #> Attaching package: 'dplyr'
   #> The following objects are masked from 'package:stats':
   #> 
   #>     filter, lag
   #> The following objects are masked from 'package:base':
   #> 
   #>     intersect, setdiff, setequal, union
   library(arrow)
   #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   
   mtcars %>% write_dataset('./mtcars/')
   ds <- open_dataset('./mtcars')
   
   ds %>% dplyr::collect()
   #>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
   #> 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
   #> 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
   #> 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
   #> 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
   #> 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
   #> 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
   #> 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
   #> 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
   #> 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
   #> 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
   #> 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
   #> 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
   #> 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
   #> 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
   #> 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
   #> 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
   #> 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
   #> 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
   #> 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
   #> 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
   #> 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
   #> 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
   #> 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
   #> 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
   #> 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
   #> 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
   #> 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
   #> 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
   #> 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
   #> 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
   #> 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
   #> 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
   
   ds %>% dplyr::mutate(mpg=as.numeric(mpg)) %>% dplyr::collect()
   #>    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
   #> 1   NA   6 160.0 110 3.90 2.620 16.46  0  1    4    4
   #> 2   NA   6 160.0 110 3.90 2.875 17.02  0  1    4    4
   #> 3   NA   4 108.0  93 3.85 2.320 18.61  1  1    4    1
   #> 4   NA   6 258.0 110 3.08 3.215 19.44  1  0    3    1
   #> 5   NA   8 360.0 175 3.15 3.440 17.02  0  0    3    2
   #> 6   NA   6 225.0 105 2.76 3.460 20.22  1  0    3    1
   #> 7   NA   8 360.0 245 3.21 3.570 15.84  0  0    3    4
   #> 8   NA   4 146.7  62 3.69 3.190 20.00  1  0    4    2
   #> 9   NA   4 140.8  95 3.92 3.150 22.90  1  0    4    2
   #> 10  NA   6 167.6 123 3.92 3.440 18.30  1  0    4    4
   #> 11  NA   6 167.6 123 3.92 3.440 18.90  1  0    4    4
   #> 12  NA   8 275.8 180 3.07 4.070 17.40  0  0    3    3
   #> 13  NA   8 275.8 180 3.07 3.730 17.60  0  0    3    3
   #> 14  NA   8 275.8 180 3.07 3.780 18.00  0  0    3    3
   #> 15  NA   8 472.0 205 2.93 5.250 17.98  0  0    3    4
   #> 16  NA   8 460.0 215 3.00 5.424 17.82  0  0    3    4
   #> 17  NA   8 440.0 230 3.23 5.345 17.42  0  0    3    4
   #> 18  NA   4  78.7  66 4.08 2.200 19.47  1  1    4    1
   #> 19  NA   4  75.7  52 4.93 1.615 18.52  1  1    4    2
   #> 20  NA   4  71.1  65 4.22 1.835 19.90  1  1    4    1
   #> 21  NA   4 120.1  97 3.70 2.465 20.01  1  0    3    1
   #> 22  NA   8 318.0 150 2.76 3.520 16.87  0  0    3    2
   #> 23  NA   8 304.0 150 3.15 3.435 17.30  0  0    3    2
   #> 24  NA   8 350.0 245 3.73 3.840 15.41  0  0    3    4
   #> 25  NA   8 400.0 175 3.08 3.845 17.05  0  0    3    2
   #> 26  NA   4  79.0  66 4.08 1.935 18.90  1  1    4    1
   #> 27  NA   4 120.3  91 4.43 2.140 16.70  0  1    5    2
   #> 28  NA   4  95.1 113 3.77 1.513 16.90  1  1    5    2
   #> 29  NA   8 351.0 264 4.22 3.170 14.50  0  1    5    4
   #> 30  NA   6 145.0 175 3.62 2.770 15.50  0  1    5    6
   #> 31  NA   8 301.0 335 3.54 3.570 14.60  0  1    5    8
   #> 32  NA   4 121.0 109 4.11 2.780 18.60  1  1    4    2
   ```
   
   <sup>Created on 2023-03-09 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
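
   For comparison, the in-memory table path that the report says is unaffected can be sketched as follows. This is only an illustrative check, not part of the original reprex; the variable names `tbl` and `res` are made up for the example.

   ``` r
   library(arrow)
   library(dplyr)

   # Build an in-memory Arrow Table rather than opening an on-disk dataset
   tbl <- arrow_table(mtcars)

   # Same in-place cast as in the reprex, applied to the Table
   res <- tbl %>%
     mutate(mpg = as.numeric(mpg)) %>%
     collect()

   # TRUE here would mean the cast introduced NA values
   anyNA(res$mpg)
   ```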
   
   <details>
    <summary>Arrow Info</summary>
    Arrow package version: 11.0.0.9000
   
   Capabilities:
                  
   dataset    TRUE
   substrait FALSE
   parquet    TRUE
   json       TRUE
   s3        FALSE
   gcs       FALSE
   utf8proc   TRUE
   re2        TRUE
   snappy     TRUE
   gzip      FALSE
   brotli    FALSE
   zstd      FALSE
   lz4        TRUE
   lz4_frame  TRUE
   lzo       FALSE
   bz2       FALSE
   jemalloc  FALSE
   mimalloc   TRUE
   
   To reinstall with more optional capabilities enabled, see
      https://arrow.apache.org/docs/r/articles/install.html
   
   Memory:
                     
   Allocator mimalloc
   Current   13.31 Kb
   Max       46.31 Mb
   
   Runtime:
                           
   SIMD Level          avx2
   Detected SIMD Level avx2
   
   Build:
                                                                
   C++ Library Version                           12.0.0-SNAPSHOT
   C++ Compiler                                              GNU
   C++ Compiler Version                                   12.2.0
   Git ID               b679a96d426f4df1a2d15d452f312c968cdfc8f6
    </details>
   
   <details>
   <summary>sessionInfo</summary>
   R version 4.2.2 (2022-10-31)
   Platform: x86_64-pc-linux-gnu (64-bit)
   Running under: Ubuntu 22.10
   
   Matrix products: default
   BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.1
   LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.1
   
   locale:
    [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=nl_NL.UTF-8           LC_COLLATE=en_US.UTF-8       
    [5] LC_MONETARY=nl_NL.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=nl_NL.UTF-8          LC_NAME=nl_NL.UTF-8          
    [9] LC_ADDRESS=nl_NL.UTF-8        LC_TELEPHONE=nl_NL.UTF-8      LC_MEASUREMENT=nl_NL.UTF-8    LC_IDENTIFICATION=nl_NL.UTF-8
   
   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base     
   
   other attached packages:
   [1] arrow_11.0.0.9000                 dplyr_1.0.10                      PatientLevelPrediction_6.2.0.9000
   
   loaded via a namespace (and not attached):
    [1] pkgload_1.3.2           bit64_4.0.5             jsonlite_1.8.4          DatabaseConnector_6.0.0 R.utils_2.12.2         
    [6] shiny_1.7.4             assertthat_0.2.1        highr_0.10              blob_1.2.3              remotes_2.4.2          
   [11] yaml_2.3.6              sessioninfo_1.2.2       pillar_1.8.1            RSQLite_2.2.18          lattice_0.20-45        
   [16] glue_1.6.2              reticulate_1.26         digest_0.6.31           promises_1.2.0.1        htmltools_0.5.4        
   [21] httpuv_1.6.8            Matrix_1.5-1            R.oo_1.25.0             clipr_0.8.0             pkgconfig_2.0.3        
   [26] devtools_2.4.5          purrr_1.0.1             xtable_1.8-4            processx_3.8.0          later_1.3.0            
   [31] ParallelLogger_3.0.1    tibble_3.1.8            styler_1.9.0            generics_0.1.3          usethis_2.1.6          
   [36] ellipsis_0.3.2          cachem_1.0.6            withr_2.5.0             cli_3.6.0               magrittr_2.0.3         
   [41] crayon_1.5.2            mime_0.12               memoise_2.0.1           evaluate_0.20           ps_1.7.2               
   [46] R.methodsS3_1.8.2       Andromeda_1.0.0         fs_1.5.2                fansi_1.0.3             R.cache_0.16.0         
   [51] pkgbuild_1.4.0          SqlRender_1.12.0        profvis_0.3.7           tools_4.2.2             data.table_1.14.4      
   [56] prettyunits_1.1.1       lifecycle_1.0.3         stringr_1.5.0           reprex_2.0.2            callr_3.7.3            
   [61] compiler_4.2.2          rlang_1.0.6             grid_4.2.2              rstudioapi_0.14         htmlwidgets_1.6.1      
   [66] miniUI_0.1.1.1          rmarkdown_2.19          DBI_1.1.3               R6_2.5.1                knitr_1.41             
   [71] fastmap_1.1.0           bit_4.0.4               utf8_1.2.2              stringi_1.7.12          rJava_1.0-6            
   [76] parallel_4.2.2          Rcpp_1.0.9              vctrs_0.5.1             png_0.1-7               urlchecker_1.0.1       
   [81] tidyselect_1.2.0        FeatureExtraction_3.2.0 xfun_0.36      
   </details>
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] nealrichardson commented on issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values

Posted by "nealrichardson (via GitHub)" <gi...@apache.org>.
nealrichardson commented on issue #34519:
URL: https://github.com/apache/arrow/issues/34519#issuecomment-1470415515

   I can take a look.


[GitHub] [arrow] westonpace closed issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace closed issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values
URL: https://github.com/apache/arrow/issues/34519


[GitHub] [arrow] eitsupi commented on issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values

Posted by "eitsupi (via GitHub)" <gi...@apache.org>.
eitsupi commented on issue #34519:
URL: https://github.com/apache/arrow/issues/34519#issuecomment-1465212660

   I can reproduce this on 5b2fbade23eda9bc95b1e3854b19efff177cd0bd.
   (Installed with the libarrow nightly binary arrow-11.0.0.100000193 on Ubuntu 22.04.)


[GitHub] [arrow] thisisnic commented on issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34519:
URL: https://github.com/apache/arrow/issues/34519#issuecomment-1469780866

   Thanks for reporting this @egillax; I can confirm that this issue is not present in 11.0.0.3, but is present on the dev build. I'll investigate further.


[GitHub] [arrow] thisisnic commented on issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34519:
URL: https://github.com/apache/arrow/issues/34519#issuecomment-1469802891

   A little more investigation of the specific circumstances in which it does and does not occur:
   
   ``` r
   library(arrow)
   library(dplyr)
   
   # no problem when replacing column with self when there is just 1 column
   df <- tibble::tibble(x = 1:10) 
   tf <- tempfile()
   dir.create(tf)
   write_dataset(df, tf)
   
   open_dataset(tf) %>%
     mutate(x = as.numeric(x)) %>%
     collect()
   #> # A tibble: 10 × 1
   #>        x
   #>    <dbl>
   #>  1     1
   #>  2     2
   #>  3     3
   #>  4     4
   #>  5     5
   #>  6     6
   #>  7     7
   #>  8     8
   #>  9     9
   #> 10    10
   
   # NA values when there are 2 columns
   df <- tibble::tibble(x = 1:10, y = 1:10) 
   tf <- tempfile()
   dir.create(tf)
   
   write_dataset(df, tf)
   
   open_dataset(tf) %>%
     mutate(x = as.numeric(x)) %>%
     collect()
   #> # A tibble: 10 × 2
   #>        x     y
   #>    <dbl> <int>
   #>  1    NA     1
   #>  2    NA     2
   #>  3    NA     3
   #>  4    NA     4
   #>  5    NA     5
   #>  6    NA     6
   #>  7    NA     7
   #>  8    NA     8
   #>  9    NA     9
   #> 10    NA    10
   
   # works fine if we're creating a brand new column
   open_dataset(tf) %>%
     mutate(z = as.numeric(x)) %>%
     collect()
   #> # A tibble: 10 × 3
   #>        x     y     z
   #>    <int> <int> <dbl>
   #>  1     1     1     1
   #>  2     2     2     2
   #>  3     3     3     3
   #>  4     4     4     4
   #>  5     5     5     5
   #>  6     6     6     6
   #>  7     7     7     7
   #>  8     8     8     8
   #>  9     9     9     9
   #> 10    10    10    10
   
   # works fine if we're replacing a different column
   open_dataset(tf) %>%
     mutate(y = as.numeric(x)) %>%
     collect()
   #> # A tibble: 10 × 2
   #>        x     y
   #>    <int> <dbl>
   #>  1     1     1
   #>  2     2     2
   #>  3     3     3
   #>  4     4     4
   #>  5     5     5
   #>  6     6     6
   #>  7     7     7
   #>  8     8     8
   #>  9     9     9
   #> 10    10    10
   
   # works fine with in-memory datasets when replacing existing columns
   InMemoryDataset$create(df) %>%
     mutate(x = as.numeric(x)) %>%
     collect()
   #> # A tibble: 10 × 2
   #>        x     y
   #>    <dbl> <int>
   #>  1     1     1
   #>  2     2     2
   #>  3     3     3
   #>  4     4     4
   #>  5     5     5
   #>  6     6     6
   #>  7     7     7
   #>  8     8     8
   #>  9     9     9
   #> 10    10    10
   ```
   
   Given that it works with 11.0.0.3 but not with the dev version of the R package, and that there have been very few R code changes since 11.0.0.3, I suspect the cause is at the C++ level. I'll try to narrow it down to the PR that introduced the change.
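
   The failing case above (a file-backed dataset with two columns, one of which is replaced by a cast of itself) could be wrapped into a small check to run against different builds. This is a rough sketch only: `cast_in_place_gives_na()` is a hypothetical helper name, not an arrow function, and it assumes the two-column example is enough to trigger the behaviour.

   ``` r
   library(arrow)
   library(dplyr)

   # Hypothetical helper (not part of arrow): returns TRUE if casting a column
   # in place on a file-backed dataset produces any NA values.
   cast_in_place_gives_na <- function() {
     df <- data.frame(x = 1:10, y = 1:10)

     # Write the data to a temporary directory and clean it up on exit
     path <- tempfile()
     dir.create(path)
     on.exit(unlink(path, recursive = TRUE), add = TRUE)
     write_dataset(df, path)

     # Replace x with a cast of itself, as in the failing example
     res <- open_dataset(path) %>%
       mutate(x = as.numeric(x)) %>%
       collect()

     anyNA(res$x)
   }

   cast_in_place_gives_na()
   ```

   Running the same helper on the CRAN release and on a dev build would show whether a given build is affected.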


[GitHub] [arrow] thisisnic commented on issue #34519: [R] Casting columns using dplyr::mutate in arrow datasets results in NA values

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34519:
URL: https://github.com/apache/arrow/issues/34519#issuecomment-1470297241

   I've narrowed it down to #33770, which is where it first broke.  CC @nealrichardson 

