Posted to jira@arrow.apache.org by "Nelson Areal (Jira)" <ji...@apache.org> on 2021/12/24 10:39:00 UTC
[jira] [Created] (ARROW-15201) Problem counting number of records of a parquet dataset created using Spark
Nelson Areal created ARROW-15201:
------------------------------------
Summary: Problem counting number of records of a parquet dataset created using Spark
Key: ARROW-15201
URL: https://issues.apache.org/jira/browse/ARROW-15201
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 6.0.1
Reporter: Nelson Areal
When I open a dataset of parquet files created by Spark I cannot get a count of the number of records, the process hangs with 100% CPU usage.
If I use DuckDB (to_duckdb) to perform the count, the operation completes as expected.
The example below reproduces the problem:
{code:r}
library(tidyverse) # v 1.3.1
library(arrow) # v 6.0.1
library(duckdb) # v 0.3.1-1
library(sparklyr) # v 1.7.3
# Using Spark: 3.0.0, but the same occurs when using Spark 2.4
sc <- spark_connect(master = "local")
# Create a simple data frame and save it to parquet using Spark
test_df <- tibble(a = 1:10e6)
test_spark_tbl <- copy_to(sc, test_df)
spark_write_parquet(test_spark_tbl, path="test")
test_arrow_ds <- open_dataset(sources = "test")
# This works as expected
system.time(
  test_arrow_ds %>%
    to_duckdb() %>%
    count()
)
#   user  system elapsed
#  0.039   0.040   0.065
# The following will hang the process with 100% CPU usage
test_arrow_ds %>%
  count() %>%
  collect()
{code}
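As a workaround until the hang is resolved, the row count can be computed per file without going through the Dataset count path. This is a sketch, not part of the original report: it assumes the {{test}} directory written by Spark above, and uses {{read_parquet()}}'s {{col_select}} argument to read only the first column of each file so the full dataset is never materialised:

{code:r}
library(arrow) # v 6.0.1
library(purrr)

# Hypothetical workaround: count rows file-by-file and sum.
# Assumes the "test" directory created by spark_write_parquet() above.
files <- list.files("test", pattern = "\\.parquet$", full.names = TRUE)

# Reading a single column keeps memory use low; nrow() is still the
# full per-file row count.
n_rows <- sum(map_dbl(files, ~ nrow(read_parquet(.x, col_select = 1))))
n_rows
{code}

This side-steps the Arrow query engine entirely, so it only demonstrates that the data itself is readable; it does not explain why {{count() %>% collect()}} hangs.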
The session information:
{noformat}
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sparklyr_1.7.3 duckdb_0.3.1-1 DBI_1.1.2 arrow_6.0.1
[5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
[9] readr_2.1.1 tidyr_1.1.4 tibble_3.1.6 ggplot2_3.3.5
[13] tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 lubridate_1.8.0 forge_0.2.0 rprojroot_2.0.2
[5] assertthat_0.2.1 digest_0.6.29 utf8_1.2.2 R6_2.5.1
[9] cellranger_1.1.0 backports_1.4.1 reprex_2.0.1 evaluate_0.14
[13] httr_1.4.2 pillar_1.6.4 rlang_0.4.12 readxl_1.3.1
[17] rstudioapi_0.13 blob_1.2.2 rmarkdown_2.11 htmlwidgets_1.5.4
[21] r2d3_0.2.5 bit_4.0.4 munsell_0.5.0 broom_0.7.10
[25] compiler_4.1.2 modelr_0.1.8 xfun_0.29 pkgconfig_2.0.3
[29] base64enc_0.1-3 htmltools_0.5.2 tidyselect_1.1.1 fansi_0.5.0
[33] crayon_1.4.2 tzdb_0.2.0 dbplyr_2.1.1 withr_2.4.3
[37] grid_4.1.2 jsonlite_1.7.2 gtable_0.3.0 lifecycle_1.0.1
[41] magrittr_2.0.1 scales_1.1.1 cli_3.1.0 stringi_1.7.6
[45] fs_1.5.2 xml2_1.3.3 ellipsis_0.3.2 generics_0.1.1
[49] vctrs_0.3.8 tools_4.1.2 bit64_4.0.5 glue_1.6.0
[53] hms_1.1.1 fastmap_1.1.0 yaml_2.2.1 colorspace_2.0-2
[57] rvest_1.0.2 knitr_1.37 haven_2.4.3
{noformat}
I can also reproduce this on a Linux machine.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)