Posted to jira@arrow.apache.org by "John Sheffield (Jira)" <ji...@apache.org> on 2021/09/02 15:24:00 UTC

[jira] [Created] (ARROW-13865) Writing moderate-size parquet files of nested dataframes from R slows down/process hangs

John Sheffield created ARROW-13865:
--------------------------------------

             Summary: Writing moderate-size parquet files of nested dataframes from R slows down/process hangs
                 Key: ARROW-13865
                 URL: https://issues.apache.org/jira/browse/ARROW-13865
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 5.0.0
            Reporter: John Sheffield
         Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png

I observed a significant slowdown in parquet writes while writing moderate-size nested dataframes from R; ultimately the process hangs for minutes without completing. I have reproduced the issue on macOS and Ubuntu so far.

An example:

```
# Nested frame: 5000 rows, each holding two list elements of 1000 doubles
testdf <- dplyr::tibble(
  id = uuid::UUIDgenerate(n = 5000),
  l1 = lapply(1:5000, function(x) runif(1000)),
  l2 = lapply(1:5000, function(x) rnorm(1000))
)

# Same data unnested into 5,000,000 rows of plain columns
testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))

# This works
arrow::write_parquet(testdf_long, "testdf_long.parquet")

# This write does not complete within a few minutes in my testing, but throws no errors
arrow::write_parquet(testdf, "testdf.parquet")
```

I don't know the cause, but the slowdown is closely tied to the row count:

```
# Screenshot attached: 12 ms, 56 ms, and 680 ms for 1, 10, and 100 rows respectively.
microbenchmark::microbenchmark(
  arrow::write_parquet(testdf[1, ], "testdf.parquet"),
  arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
  arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
  times = 5
)
```
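To see how the write time scales across more row counts than the three above, a sweep along the following lines could be used. This is only a sketch under stated assumptions: `time_writes` is a hypothetical helper (not part of arrow), and `demo` is a smaller base-R stand-in for `testdf` with a single list column, so the absolute numbers will differ from the ones reported here.

```r
# Hypothetical helper (not an arrow function): time one write per row count.
# write_fn is any function called as write_fn(data_frame, path).
time_writes <- function(df, row_counts, write_fn) {
  seconds <- vapply(row_counts, function(n) {
    path <- tempfile(fileext = ".parquet")
    on.exit(unlink(path), add = TRUE)  # remove the temp file after each write
    system.time(write_fn(df[seq_len(n), , drop = FALSE], path))[["elapsed"]]
  }, numeric(1))
  data.frame(rows = row_counts, seconds = seconds)
}

# Stand-in for the nested frame: 100 rows, one list column of 1000 doubles each.
n <- 100
demo <- data.frame(id = seq_len(n))
demo$l1 <- lapply(seq_len(n), function(i) runif(1000))

# Only run the arrow writes if the package is installed.
if (requireNamespace("arrow", quietly = TRUE)) {
  timings <- time_writes(demo, c(1, 10, 100), arrow::write_parquet)
  # If cost were linear in rows, ms_per_row would stay roughly flat.
  timings$ms_per_row <- 1000 * timings$seconds / timings$rows
  print(timings)
}
```

A rising `ms_per_row` column would suggest the per-row cost itself grows with the row count, which is what the three timings above hint at.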

I'm using the CRAN release of arrow 5.0.0 in both cases. The sessionInfo() for Ubuntu is:
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] arrow_5.0.0

And sessionInfo for MacOS is:
R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] arrow_5.0.0


