Posted to jira@arrow.apache.org by "Nelson Areal (Jira)" <ji...@apache.org> on 2021/12/24 10:39:00 UTC

[jira] [Created] (ARROW-15201) Problem counting number of records of a parquet dataset created using Spark

Nelson Areal created ARROW-15201:
------------------------------------

             Summary: Problem counting number of records of a parquet dataset created using Spark
                 Key: ARROW-15201
                 URL: https://issues.apache.org/jira/browse/ARROW-15201
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 6.0.1
            Reporter: Nelson Areal


When I open a dataset of Parquet files created by Spark, I cannot get a count of the number of records: the process hangs with 100% CPU usage.

If I use DuckDB (via to_duckdb()) to perform the count, the operation completes as expected.

The example below reproduces the problem:
{code:r}
library(tidyverse) # v 1.3.1
library(arrow) # v 6.0.1
library(duckdb) # v 0.3.1-1
library(sparklyr) # v 1.7.3

# Using Spark: 3.0.0, but the same occurs when using Spark 2.4
sc <- spark_connect(master = "local")

# Create a simple data frame and save it to parquet using Spark
test_df <- tibble(a = 1:10e6)
test_spark_tbl <- copy_to(sc, test_df)
spark_write_parquet(test_spark_tbl, path="test")

test_arrow_ds <- open_dataset(sources = "test")

# This works as expected
system.time(
  test_arrow_ds %>% 
    to_duckdb() %>% 
    count() 
)
#  user  system elapsed 
#  0.039   0.040   0.065 


# The following will hang the process with 100% CPU usage 
test_arrow_ds %>% 
  count() %>% 
  collect()
{code}
 
The session information:
{noformat}
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sparklyr_1.7.3  duckdb_0.3.1-1  DBI_1.1.2       arrow_6.0.1    
 [5] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4    
 [9] readr_2.1.1     tidyr_1.1.4     tibble_3.1.6    ggplot2_3.3.5  
[13] tidyverse_1.3.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7        lubridate_1.8.0   forge_0.2.0       rprojroot_2.0.2  
 [5] assertthat_0.2.1  digest_0.6.29     utf8_1.2.2        R6_2.5.1         
 [9] cellranger_1.1.0  backports_1.4.1   reprex_2.0.1      evaluate_0.14    
[13] httr_1.4.2        pillar_1.6.4      rlang_0.4.12      readxl_1.3.1     
[17] rstudioapi_0.13   blob_1.2.2        rmarkdown_2.11    htmlwidgets_1.5.4
[21] r2d3_0.2.5        bit_4.0.4         munsell_0.5.0     broom_0.7.10     
[25] compiler_4.1.2    modelr_0.1.8      xfun_0.29         pkgconfig_2.0.3  
[29] base64enc_0.1-3   htmltools_0.5.2   tidyselect_1.1.1  fansi_0.5.0      
[33] crayon_1.4.2      tzdb_0.2.0        dbplyr_2.1.1      withr_2.4.3      
[37] grid_4.1.2        jsonlite_1.7.2    gtable_0.3.0      lifecycle_1.0.1  
[41] magrittr_2.0.1    scales_1.1.1      cli_3.1.0         stringi_1.7.6    
[45] fs_1.5.2          xml2_1.3.3        ellipsis_0.3.2    generics_0.1.1   
[49] vctrs_0.3.8       tools_4.1.2       bit64_4.0.5       glue_1.6.0       
[53] hms_1.1.1         fastmap_1.1.0     yaml_2.2.1        colorspace_2.0-2 
[57] rvest_1.0.2       knitr_1.37        haven_2.4.3      
{noformat}
I can also reproduce this on a Linux machine.


