Posted to jira@arrow.apache.org by "John Sheffield (Jira)" <ji...@apache.org> on 2021/09/02 15:25:00 UTC

[jira] [Updated] (ARROW-13865) Writing moderate-size parquet files of nested dataframes from R slows down/process hangs

     [ https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sheffield updated ARROW-13865:
-----------------------------------
    Description: 
I observed a significant slowdown in Parquet writes (ultimately the process just hangs for minutes without completing) while writing moderate-size nested dataframes from R. I have reproduced the issue on macOS and Ubuntu so far.

An example:

```
testdf <- dplyr::tibble(
  id = uuid::UUIDgenerate(n = 5000),
  l1 = lapply(1:5000, function(x) runif(1000)),
  l2 = lapply(1:5000, function(x) rnorm(1000))
)

testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))

# This works
arrow::write_parquet(testdf_long, "testdf_long.parquet")

# This write does not complete within a few minutes in my testing but throws no errors
arrow::write_parquet(testdf, "testdf.parquet")
```

I can't say why, but the slowdown is closely tied to row count:

```
# Screenshot attached; 12ms, 56ms, and 680ms respectively.
microbenchmark::microbenchmark(
  arrow::write_parquet(testdf[1, ], "testdf.parquet"),
  arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
  arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
  times = 5
)
```

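A rough extension of the benchmark above (a sketch, not from the original report; the row counts and `times` values are chosen arbitrarily, and kept small enough to avoid the hang) makes the scaling easier to see by printing the per-row cost:

```r
library(microbenchmark)

# Rebuild a smaller version of the nested tibble from the repro above
# (100 rows instead of 5000, so the timings stay in the millisecond range).
testdf <- dplyr::tibble(
  id = uuid::UUIDgenerate(n = 100),
  l1 = lapply(1:100, function(x) runif(1000)),
  l2 = lapply(1:100, function(x) rnorm(1000))
)

# Median write time (in ms) for slices of increasing row count.
sizes <- c(1, 10, 50, 100)
median_ms <- sapply(sizes, function(n) {
  res <- microbenchmark(
    arrow::write_parquet(testdf[seq_len(n), ], tempfile(fileext = ".parquet")),
    times = 3
  )
  median(res$time) / 1e6  # microbenchmark reports nanoseconds
})

# If write time were linear in rows, ms_per_row would be roughly constant;
# with the behavior reported here, it grows with the number of rows.
print(data.frame(rows = sizes, median_ms = median_ms, ms_per_row = median_ms / sizes))
```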
I'm using arrow 5.0.0 from CRAN in both cases. sessionInfo() for Ubuntu:
 R version 4.0.5 (2021-03-31)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Ubuntu 20.04.3 LTS

Matrix products: default
 BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C 
 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] arrow_5.0.0

And sessionInfo() for macOS:
 R version 4.0.1 (2020-06-06)
 Platform: x86_64-apple-darwin17.0 (64-bit)
 Running under: macOS Catalina 10.15.7

Matrix products: default
 BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
 LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
 [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] arrow_5.0.0


> Writing moderate-size parquet files of nested dataframes from R slows down/process hangs
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-13865
>                 URL: https://issues.apache.org/jira/browse/ARROW-13865
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 5.0.0
>            Reporter: John Sheffield
>            Priority: Major
>         Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)