Posted to jira@arrow.apache.org by "Nelson Areal (Jira)" <ji...@apache.org> on 2022/01/17 15:33:00 UTC
[jira] [Commented] (ARROW-15201) [R] Problem counting number of records of a parquet dataset created using Spark
[ https://issues.apache.org/jira/browse/ARROW-15201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477283#comment-17477283 ]
Nelson Areal commented on ARROW-15201:
--------------------------------------
The problem also occurs when counting the number of rows of a dataset consisting of a single Parquet file, even when that file was created with Arrow itself (the write_parquet function).
{code:r}
test_df <- tibble(a = 1:10e6)
write_parquet(test_df, sink="test.parquet")
test_arrow_ds <- open_dataset(sources = "test.parquet")
# Works as expected
system.time(
  test_arrow_ds %>%
    to_duckdb() %>%
    count()
)
# user system elapsed
# 0.048 0.058 0.153
# The following will hang the process at 100% CPU usage and exhaust all available memory
test_arrow_ds %>%
  count() %>%
  collect()
{code}
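Until this is fixed, one possible workaround (a sketch, not verified against the exact affected setup above) is to ask for the row count via nrow() on the Dataset, which is answered from the Parquet file metadata rather than going through the dplyr count()/collect() path that hangs:

{code:r}
library(arrow)
library(dplyr)

# Small stand-in for the 10-million-row example above
test_df <- data.frame(a = 1:100)
write_parquet(test_df, sink = "test.parquet")
ds <- open_dataset(sources = "test.parquet")

# nrow() on a Dataset reads the row count from file metadata,
# avoiding the count() %>% collect() query that triggers the hang
nrow(ds)
{code}

The to_duckdb() route shown above remains another workaround for counts inside a larger dplyr pipeline.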
This suggests the bug occurs whenever counting the rows of a dataset backed by Parquet files, regardless of how the files were written.
> [R] Problem counting number of records of a parquet dataset created using Spark
> -------------------------------------------------------------------------------
>
> Key: ARROW-15201
> URL: https://issues.apache.org/jira/browse/ARROW-15201
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 6.0.1
> Reporter: Nelson Areal
> Priority: Major
>
> When I open a dataset of Parquet files created by Spark, I cannot get a count of the number of records; the process hangs at 100% CPU usage.
> If I use DuckDB (to_duckdb) to perform the count, the operation completes as expected.
> The example below reproduces the problem:
> {code:r}
> library(tidyverse) # v 1.3.1
> library(arrow) # v 6.0.1
> library(duckdb) # v 0.3.1-1
> library(sparklyr) # v 1.7.3
> # Using Spark: 3.0.0, but the same occurs when using Spark 2.4
> sc <- spark_connect(master = "local")
> # Create a simple data frame and save it to parquet using Spark
> test_df <- tibble(a = 1:10e6)
> test_spark_tbl <- copy_to(sc, test_df)
> spark_write_parquet(test_spark_tbl, path="test")
> test_arrow_ds <- open_dataset(sources = "test")
> # This works as expected
> system.time(
>   test_arrow_ds %>%
>     to_duckdb() %>%
>     count()
> )
> # user system elapsed
> # 0.039 0.040 0.065
> # The following will hang the process with 100% CPU usage
> test_arrow_ds %>%
>   count() %>%
>   collect()
> {code}
>
> The session information:
> {noformat}
> R version 4.1.2 (2021-11-01)
> Platform: x86_64-apple-darwin17.0 (64-bit)
> Running under: macOS Monterey 12.1
> Matrix products: default
> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] sparklyr_1.7.3 duckdb_0.3.1-1 DBI_1.1.2 arrow_6.0.1
> [5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
> [9] readr_2.1.1 tidyr_1.1.4 tibble_3.1.6 ggplot2_3.3.5
> [13] tidyverse_1.3.1
> loaded via a namespace (and not attached):
> [1] Rcpp_1.0.7 lubridate_1.8.0 forge_0.2.0 rprojroot_2.0.2
> [5] assertthat_0.2.1 digest_0.6.29 utf8_1.2.2 R6_2.5.1
> [9] cellranger_1.1.0 backports_1.4.1 reprex_2.0.1 evaluate_0.14
> [13] httr_1.4.2 pillar_1.6.4 rlang_0.4.12 readxl_1.3.1
> [17] rstudioapi_0.13 blob_1.2.2 rmarkdown_2.11 htmlwidgets_1.5.4
> [21] r2d3_0.2.5 bit_4.0.4 munsell_0.5.0 broom_0.7.10
> [25] compiler_4.1.2 modelr_0.1.8 xfun_0.29 pkgconfig_2.0.3
> [29] base64enc_0.1-3 htmltools_0.5.2 tidyselect_1.1.1 fansi_0.5.0
> [33] crayon_1.4.2 tzdb_0.2.0 dbplyr_2.1.1 withr_2.4.3
> [37] grid_4.1.2 jsonlite_1.7.2 gtable_0.3.0 lifecycle_1.0.1
> [41] magrittr_2.0.1 scales_1.1.1 cli_3.1.0 stringi_1.7.6
> [45] fs_1.5.2 xml2_1.3.3 ellipsis_0.3.2 generics_0.1.1
> [49] vctrs_0.3.8 tools_4.1.2 bit64_4.0.5 glue_1.6.0
> [53] hms_1.1.1 fastmap_1.1.0 yaml_2.2.1 colorspace_2.0-2
> [57] rvest_1.0.2 knitr_1.37 haven_2.4.3
> {noformat}
> I can also reproduce this on a Linux machine.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)