Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/12/01 00:54:00 UTC
[jira] [Updated] (ARROW-14736) [C++][R] Opening a multi-file dataset and writing a re-partitioned version of it fails
[ https://issues.apache.org/jira/browse/ARROW-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace updated ARROW-14736:
--------------------------------
Labels: dataset (was: )
> [C++][R] Opening a multi-file dataset and writing a re-partitioned version of it fails
> -------------------------------------------------------------------------------------
>
> Key: ARROW-14736
> URL: https://issues.apache.org/jira/browse/ARROW-14736
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Affects Versions: 6.0.0
> Environment: M1 Mac, macOS Monterey 12.0.1, 16 GB RAM
> R 4.1.1, {arrow} R package 6.0.0.2 (release) & 6.0.0.9000 (dev)
> Reporter: Dragoș Moldovan-Grünfeld
> Priority: Major
> Labels: dataset
> Attachments: image-2021-11-17-14-43-37-127.png, image-2021-11-17-14-54-42-747.png, image-2021-11-17-14-55-08-597.png
>
>
> Attempting to open a multi-file dataset and write a re-partitioned version of it fails; it appears the data is collected into memory before it is written. This happens for both wide and long data.
> Steps to reproduce the issue:
> 1. Create a large dataset (100k columns, 300k rows): generate one partition's worth of data (15k rows), write it to disk as a single Parquet file, then make 20 copies of it. Each file has a footprint of roughly 7.5GB.
> {code:r}
> library(arrow)
> library(dplyr)
> library(fs)
> rows <- 300000
> cols <- 100000
> partitions <- 20
> wide_df <- as.data.frame(
>   matrix(
>     sample(1:32767, rows * cols / partitions, replace = TRUE),
>     ncol = cols
>   )
> )
> schem <- sapply(colnames(wide_df), function(nm) {int16()})
> schem <- do.call(schema, schem)
> wide_tab <- Table$create(wide_df, schema = schem)
> write_parquet(wide_tab, "~/Documents/arrow_playground/wide.parquet")
> fs::dir_create("~/Documents/arrow_playground/wide_ds")
> for (i in seq_len(partitions)) {
>   file.copy(
>     "~/Documents/arrow_playground/wide.parquet",
>     glue::glue("~/Documents/arrow_playground/wide_ds/wide-{i-1}.parquet")
>   )
> }
> ds_wide <- open_dataset("~/Documents/arrow_playground/wide_ds/")
> {code}
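> Note that opening the dataset itself succeeds. As a sanity check (these calls are my illustration, not part of the original report), one would expect a small lazy query to stream without loading everything, which points at the write path specifically:
> {code:r}
> # Sanity check (illustrative, not from the original report): the dataset
> # opens lazily and a small streaming query should succeed, suggesting the
> # memory blow-up is specific to write_dataset().
> nrow(ds_wide)  # row count comes from file metadata; no data is loaded
> ds_wide %>%
>   select(V1) %>%
>   head(5) %>%
>   collect()    # pulls only a handful of rows into memory
> {code}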
> All the following steps fail:
> 2. Creating and writing a partitioned version of {{ds_wide}}.
> {code:r}
> ds_wide %>%
>   mutate(grouper = round(V1 / 1024)) %>%
>   write_dataset("~/Documents/arrow_playground/partitioned",
>                 partitioning = "grouper",
>                 format = "parquet")
> {code}
> 3. Writing a non-partitioned dataset:
> {code:r}
> ds_wide %>%
>   write_dataset("~/Documents/arrow_playground/partitioned",
>                 format = "parquet")
> {code}
> 4. Creating the partitioning variable first and then attempting to write:
> {code:r}
> ds2 <- ds_wide %>%
>   mutate(grouper = round(V1 / 1024))
> ds2 %>%
>   write_dataset("~/Documents/arrow_playground/partitioned",
>                 partitioning = "grouper",
>                 format = "parquet")
> {code}
> 5. Attempting to write to CSV:
> {code:r}
> ds_wide %>%
>   write_dataset("~/Documents/arrow_playground/csv_writing/test.csv",
>                 format = "csv")
> {code}
> None of the failures seem to originate in R code, and they all result in similar behaviour: the R session consumes increasing amounts of RAM until it crashes.
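> A possible stopgap, sketched below (my hypothetical workaround, not part of the original report, and assuming each source file fits in RAM on its own), is to bypass the dataset writer and repartition one source file at a time, keeping peak memory near a single file's footprint:
> {code:r}
> # Hypothetical workaround sketch (untested at this scale): repartition file
> # by file with read_parquet()/write_parquet() instead of calling
> # write_dataset() on the whole multi-file dataset. The output layout below
> # mimics Hive-style partitioning; out_dir is an illustrative path.
> out_dir <- "~/Documents/arrow_playground/partitioned_manual"
> files <- fs::dir_ls("~/Documents/arrow_playground/wide_ds/", glob = "*.parquet")
> for (f in files) {
>   df <- read_parquet(f) %>%
>     mutate(grouper = round(V1 / 1024))
>   for (g in unique(df$grouper)) {
>     dir <- fs::path(out_dir, paste0("grouper=", g))
>     fs::dir_create(dir)
>     write_parquet(filter(df, grouper == g),
>                   fs::path(dir, fs::path_file(f)))
>   }
> }
> {code}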
--
This message was sent by Atlassian Jira
(v8.20.1#820001)