Posted to jira@arrow.apache.org by "Nelson Areal (Jira)" <ji...@apache.org> on 2022/01/17 15:33:00 UTC

[jira] [Commented] (ARROW-15201) [R] Problem counting number of records of a parquet dataset created using Spark

    [ https://issues.apache.org/jira/browse/ARROW-15201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477283#comment-17477283 ] 

Nelson Areal commented on ARROW-15201:
--------------------------------------

The problem also occurs when counting the rows of a dataset backed by a single parquet file, even when that file was created by Arrow itself (write_parquet).
{code:r}
# Packages as in the original report: arrow 6.0.1, duckdb 0.3.1-1
library(arrow)
library(dplyr)
library(duckdb)

test_df <- tibble(a = 1:10e6)

write_parquet(test_df, sink = "test.parquet")

test_arrow_ds <- open_dataset(sources = "test.parquet")

# Works as expected
system.time(
  test_arrow_ds %>% 
    to_duckdb() %>% 
    count() 
)
#  user  system elapsed 
#  0.048   0.058   0.153 

# The following hangs the process at 100% CPU usage and exhausts all available memory
test_arrow_ds %>% 
  count() %>% 
  collect()

{code}
This suggests the bug affects counting the rows of any parquet-backed dataset, not only datasets written by Spark.
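
As a possible workaround for single-file datasets (a sketch only; I believe the arrow package exposes the Parquet footer metadata through ParquetFileReader, but I have not verified this across versions), the row count can be read from the file metadata without scanning the data:
{code:r}
library(arrow)

# Sketch of a workaround: read the row count from the Parquet footer
# metadata instead of scanning the data, bypassing the hanging
# count() %>% collect() path. Assumes the single file written above.
reader <- ParquetFileReader$create("test.parquet")
reader$num_rows
# [1] 10000000

# Or load the file as an Arrow Table (reads the data into memory):
nrow(read_parquet("test.parquet", as_data_frame = FALSE))
# [1] 10000000
{code}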

> [R] Problem counting number of records of a parquet dataset created using Spark
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-15201
>                 URL: https://issues.apache.org/jira/browse/ARROW-15201
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 6.0.1
>            Reporter: Nelson Areal
>            Priority: Major
>
> When I open a dataset of parquet files created by Spark, I cannot get a count of the number of records; the process hangs at 100% CPU usage.
> If I use DuckDB (to_duckdb) to perform the count, the operation completes as expected.
> The example below reproduces the problem:
> {code:r}
> library(tidyverse) # v 1.3.1
> library(arrow) # v 6.0.1
> library(duckdb) # v 0.3.1-1
> library(sparklyr) # v 1.7.3
> # Using Spark: 3.0.0, but the same occurs when using Spark 2.4
> sc <- spark_connect(master = "local")
> # Create a simple data frame and save it to parquet using Spark
> test_df <- tibble(a = 1:10e6)
> test_spark_tbl <- copy_to(sc, test_df)
> spark_write_parquet(test_spark_tbl, path="test")
> test_arrow_ds <- open_dataset(sources = "test")
> # This works as expected
> system.time(
>   test_arrow_ds %>% 
>     to_duckdb() %>% 
>     count() 
> )
> #  user  system elapsed 
> #  0.039   0.040   0.065 
> # The following will hang the process with 100% CPU usage 
> test_arrow_ds %>% 
>   count() %>% 
>   collect()
> {code}
>  
> The session information:
> {noformat}
> R version 4.1.2 (2021-11-01)
> Platform: x86_64-apple-darwin17.0 (64-bit)
> Running under: macOS Monterey 12.1
> Matrix products: default
> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> other attached packages:
>  [1] sparklyr_1.7.3  duckdb_0.3.1-1  DBI_1.1.2       arrow_6.0.1    
>  [5] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4    
>  [9] readr_2.1.1     tidyr_1.1.4     tibble_3.1.6    ggplot2_3.3.5  
> [13] tidyverse_1.3.1
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.7        lubridate_1.8.0   forge_0.2.0       rprojroot_2.0.2  
>  [5] assertthat_0.2.1  digest_0.6.29     utf8_1.2.2        R6_2.5.1         
>  [9] cellranger_1.1.0  backports_1.4.1   reprex_2.0.1      evaluate_0.14    
> [13] httr_1.4.2        pillar_1.6.4      rlang_0.4.12      readxl_1.3.1     
> [17] rstudioapi_0.13   blob_1.2.2        rmarkdown_2.11    htmlwidgets_1.5.4
> [21] r2d3_0.2.5        bit_4.0.4         munsell_0.5.0     broom_0.7.10     
> [25] compiler_4.1.2    modelr_0.1.8      xfun_0.29         pkgconfig_2.0.3  
> [29] base64enc_0.1-3   htmltools_0.5.2   tidyselect_1.1.1  fansi_0.5.0      
> [33] crayon_1.4.2      tzdb_0.2.0        dbplyr_2.1.1      withr_2.4.3      
> [37] grid_4.1.2        jsonlite_1.7.2    gtable_0.3.0      lifecycle_1.0.1  
> [41] magrittr_2.0.1    scales_1.1.1      cli_3.1.0         stringi_1.7.6    
> [45] fs_1.5.2          xml2_1.3.3        ellipsis_0.3.2    generics_0.1.1   
> [49] vctrs_0.3.8       tools_4.1.2       bit64_4.0.5       glue_1.6.0       
> [53] hms_1.1.1         fastmap_1.1.0     yaml_2.2.1        colorspace_2.0-2 
> [57] rvest_1.0.2       knitr_1.37        haven_2.4.3      
> {noformat}
> I can also reproduce this on a Linux machine.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)