Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/07/13 02:11:00 UTC

[jira] [Commented] (ARROW-12321) [R][C++] Arrow opens too many files at once when writing a dataset

    [ https://issues.apache.org/jira/browse/ARROW-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379517#comment-17379517 ] 

Weston Pace commented on ARROW-12321:
-------------------------------------

I started working on an implementation here, but things got a bit tricky with S3.  My plan was to keep an LRU cache of write queues.  When a queue expired, it would close its output stream and then simply reopen the file in append mode when it was ready to write more (or do nothing if the dataset write had finished).
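
Here is a rough sketch of that idea.  All names are illustrative, not actual Arrow C++ API; Close()/ReopenAppend() stand in for closing an OutputStream and reopening the file with append=true:

{code:cpp}
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Stand-in for a per-file write queue. Close()/ReopenAppend() represent
// closing the underlying OutputStream and reopening the file in append mode.
struct WriteQueue {
  bool open = false;
  void Close() { open = false; }
  void ReopenAppend() { open = true; }
};

class LruWriteQueues {
 public:
  explicit LruWriteQueues(std::size_t max_open) : max_open_(max_open) {}

  // Fetch the queue for `path`, expiring the least-recently-used open queue
  // first if the open-file budget is exhausted.
  WriteQueue& Get(const std::string& path) {
    WriteQueue& q = queues_[path];  // default-constructs a closed queue
    if (q.open) {
      open_order_.remove(path);  // O(n), fine for a sketch
    } else {
      if (!open_order_.empty() && open_order_.size() >= max_open_) {
        queues_[open_order_.back()].Close();  // expire the LRU queue
        open_order_.pop_back();
      }
      q.ReopenAppend();  // the very first open would use create mode instead
    }
    open_order_.push_front(path);  // mark most-recently-used
    return q;
  }

 private:
  std::size_t max_open_;
  std::list<std::string> open_order_;  // most-recently-used first
  std::unordered_map<std::string, WriteQueue> queues_;
};
{code}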

 

Unfortunately, for S3 there is no append equivalent.  On the other hand, I could add the concept of "Pause"/"Unpause" to an output stream.  For standard streams, pause and unpause would simply mean close and open-append.  For S3, pause and unpause would be no-ops (the S3 output stream already sends small independent messages, so there are no OS resources held open and nothing to pause).  However, this adds complexity to the output stream interface.
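
To make that concrete, a pausable stream over a local filesystem might look roughly like this (again just a sketch, not Arrow's actual OutputStream interface); the S3 implementation of Pause/Unpause would simply be empty:

{code:cpp}
#include <fstream>
#include <string>
#include <utility>

// Illustrative only. Pause releases the OS file handle; Unpause reopens the
// same file in append mode. An S3-backed stream would implement both as
// no-ops since it holds no OS resources between uploaded parts.
class PausableFileStream {
 public:
  explicit PausableFileStream(std::string path)
      : path_(std::move(path)), out_(path_, std::ios::binary) {}

  void Write(const char* data, std::streamsize n) { out_.write(data, n); }

  void Pause() { out_.close(); }

  void Unpause() { out_.open(path_, std::ios::binary | std::ios::app); }

 private:
  std::string path_;
  std::ofstream out_;
};
{code}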

 

So I think we have options:

A) Leave OutputStream unchanged; it's up to the user to ensure they set max_open_files > max_partitions if they are using S3 (and if they don't, they get an error when the writer tries to open a file in append mode, which is illegal on an S3 filesystem).  A rough sketch of this option from the caller's side follows below.

B) Add pause/unpause to the output stream interface

 

Now that I've written all this out, it seems A is probably the clear choice, but I'll leave this note here for future reference.
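
For reference, option A from the caller's side would look something like the sketch below.  The max_open_files field is hypothetical (it is what this work would add); the other write options shown exist in the C++ dataset API today, and the filesystem/scanner setup plus real partitioning configuration are elided:

{code:cpp}
#include <memory>
#include <utility>

#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/filesystem/s3fs.h>

// Sketch only: max_open_files is the hypothetical knob discussed above.
arrow::Status WriteToS3(std::shared_ptr<arrow::dataset::Scanner> scanner,
                        std::shared_ptr<arrow::fs::S3FileSystem> s3fs) {
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  arrow::dataset::FileSystemDatasetWriteOptions options;
  options.filesystem = s3fs;
  options.base_dir = "my-bucket/dataset";  // illustrative path
  options.basename_template = "part-{i}.parquet";
  options.file_write_options = format->DefaultWriteOptions();
  // Real code would configure a Directory/HivePartitioning; Default() just
  // keeps the sketch short.
  options.partitioning = arrow::dataset::Partitioning::Default();
  options.max_partitions = 12808;
  // With option A the user must keep the open-file budget at least as large
  // as the partition count on S3, because append-reopen is impossible there:
  // options.max_open_files = 16384;  // hypothetical field
  return arrow::dataset::FileSystemDataset::Write(options, std::move(scanner));
}
{code}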

> [R][C++] Arrow opens too many files at once when writing a dataset
> ------------------------------------------------------------------
>
>                 Key: ARROW-12321
>                 URL: https://issues.apache.org/jira/browse/ARROW-12321
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 3.0.0
>            Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>            Assignee: Weston Pace
>            Priority: Major
>             Fix For: 6.0.0
>
>
> _Related to:_ https://issues.apache.org/jira/browse/ARROW-12315
> Please see https://drive.google.com/drive/folders/1e7WB36FPYzvdtm46dgAEEFAKAWDQs-e1?usp=sharing where I added the raw data and the output.
> This works:
> {code:r}
> library(data.table)
> library(dplyr)
> library(arrow)
> d <- fread(
>         input = "01-raw-data/sitc-rev2/parquet/type-C_r-ALL_ps-2019_freq-A_px-S2_pub-20210216_fmt-csv_ex-20210227.csv",
>         colClasses = list(
>           character = "Commodity Code",
>           numeric = c("Trade Value (US$)", "Qty", "Netweight (kg)")
>         ))
> d <- d %>%
>   mutate(
>     `Reporter ISO` = case_when(
>       `Reporter ISO` %in% c(NA, "", " ") ~ "0-unspecified",
>       TRUE ~ `Reporter ISO`
>     ),
>     `Partner ISO` = case_when(
>       `Partner ISO` %in% c(NA, "", " ") ~ "0-unspecified",
>       TRUE ~ `Partner ISO`
>     )
>   )
> # d %>%
> #   select(Year, `Reporter ISO`, `Partner ISO`) %>%
> #   distinct() %>%
> #   dim()
> d %>%
>   group_by(Year, `Reporter ISO`) %>%
>   write_dataset("parquet", hive_style = F, max_partitions = 1024L)
> {code}
> But if I add an additional column for partitioning and increase the max partitions to 12808 (to match exactly the number of partitions needed), I get the error:
> {code:r}
> d %>%
>   group_by(Year, `Reporter ISO`) %>%
>   write_dataset("parquet", hive_style = F, max_partitions = 12808)
> Error: IOError: Failed to open local file '/media/pacha/pacha_backup/tradestatistics/yearly-datasets-arrow/01-raw-data/sitc-rev2/parquet/2019/SEN/MOZ/part-5353.parquet'. Detail: [errno 24] Too many open files
> {code}


