Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/09/13 19:42:00 UTC

[jira] [Updated] (ARROW-12321) [R][C++] Arrow opens too many files at once when writing a dataset

     [ https://issues.apache.org/jira/browse/ARROW-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-12321:
------------------------------------
    Fix Version/s:     (was: 6.0.0)

> [R][C++] Arrow opens too many files at once when writing a dataset
> ------------------------------------------------------------------
>
>                 Key: ARROW-12321
>                 URL: https://issues.apache.org/jira/browse/ARROW-12321
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 3.0.0
>            Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>            Assignee: Weston Pace
>            Priority: Minor
>              Labels: query-engine
>
> _Related to:_ https://issues.apache.org/jira/browse/ARROW-12315
> Please see https://drive.google.com/drive/folders/1e7WB36FPYzvdtm46dgAEEFAKAWDQs-e1?usp=sharing where I added the raw data and the output.
> This works:
> {code:java}
> library(data.table)
> library(dplyr)
> library(arrow)
> d <- fread(
>         input = "01-raw-data/sitc-rev2/parquet/type-C_r-ALL_ps-2019_freq-A_px-S2_pub-20210216_fmt-csv_ex-20210227.csv",
>         colClasses = list(
>           character = "Commodity Code",
>           numeric = c("Trade Value (US$)", "Qty", "Netweight (kg)")
>         ))
> d <- d %>%
>   mutate(
>     `Reporter ISO` = case_when(
>       `Reporter ISO` %in% c(NA, "", " ") ~ "0-unspecified",
>       TRUE ~ `Reporter ISO`
>     ),
>     `Partner ISO` = case_when(
>       `Partner ISO` %in% c(NA, "", " ") ~ "0-unspecified",
>       TRUE ~ `Partner ISO`
>     )
>   )
> # d %>%
> #   select(Year, `Reporter ISO`, `Partner ISO`) %>%
> #   distinct() %>%
> #   dim()
> d %>%
>   group_by(Year, `Reporter ISO`) %>%
>   write_dataset("parquet", hive_style = FALSE, max_partitions = 1024L)
> {code}
> But if I add an additional partitioning column and increase max_partitions to 12808 (exactly the number of partitions needed), I get this error:
> {code:java}
> d %>%
>   group_by(Year, `Reporter ISO`, `Partner ISO`) %>%
>   write_dataset("parquet", hive_style = FALSE, max_partitions = 12808)
> Error: IOError: Failed to open local file '/media/pacha/pacha_backup/tradestatistics/yearly-datasets-arrow/01-raw-data/sitc-rev2/parquet/2019/SEN/MOZ/part-5353.parquet'. Detail: [errno 24] Too many open files
> {code}
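
Errno 24 (EMFILE) means the process reached its limit on open file descriptors. The soft limit is often 1024 on Linux, which is an assumption about the reporter's machine and not something the report states. A minimal sketch of how one might count the partitions a grouping would produce before calling write_dataset(), so the number can be compared against that limit. The toy data frame and its values are hypothetical; only the column names come from the report.

```r
library(dplyr)

# Hypothetical stand-in for the trade data: column names mirror the
# report, the rows are made up for illustration.
toy <- tibble(
  Year = c(2019, 2019, 2019, 2019),
  `Reporter ISO` = c("SEN", "SEN", "CHL", "CHL"),
  `Partner ISO` = c("MOZ", "USA", "MOZ", "MOZ")
)

# Each distinct combination of partition keys maps to its own directory,
# and therefore to at least one output file.
n_partitions <- toy %>%
  distinct(Year, `Reporter ISO`, `Partner ISO`) %>%
  nrow()

n_partitions
```

If the count is well above the descriptor limit reported by `ulimit -n`, a writer that keeps one file open per partition can be expected to fail in the way shown above.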


