Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/12/12 15:51:00 UTC
[jira] [Commented] (ARROW-17442) [R] Add append option to write_parquet
[ https://issues.apache.org/jira/browse/ARROW-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646181#comment-17646181 ]
Dewey Dunnington commented on ARROW-17442:
------------------------------------------
Unfortunately, the Parquet format makes it difficult to support appending, and I don't believe this is on any roadmap for the foreseeable future.
I believe that, instead of appending, the pattern that Arrow enables is to write multiple files and then use {{open_dataset()}} to query them lazily. An example:
{code:R}
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)
parquet_file1 <- tempfile()
parquet_file2 <- tempfile()
write_parquet(nycflights13::flights[1:1000, ], parquet_file1)
# instead of append, write a new file
write_parquet(nycflights13::flights[1001:2000, ], parquet_file2)
# ...then query them both using open_dataset
open_dataset(c(parquet_file1, parquet_file2)) |>
  filter(month == 1) |>
  collect()
#> # A tibble: 2,000 × 19
#> year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 2013 1 1 517 515 2 830 819 11 UA
#> 2 2013 1 1 533 529 4 850 830 20 UA
#> 3 2013 1 1 542 540 2 923 850 33 AA
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
#> 6 2013 1 1 554 558 -4 740 728 12 UA
#> 7 2013 1 1 555 600 -5 913 854 19 B6
#> 8 2013 1 1 557 600 -3 709 723 -14 EV
#> 9 2013 1 1 557 600 -3 838 846 -8 B6
#> 10 2013 1 1 558 600 -2 753 745 8 AA
#> # … with 1,990 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
{code}
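If the batches arrive over time, the same pattern can be wrapped in a small helper that drops each batch into a shared directory as a new file. A minimal sketch, assuming a hypothetical helper {{append_batch()}} (not part of arrow) and made-up example data; it only relies on {{write_parquet()}} and {{open_dataset()}}:

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of arrow): "append" a batch by writing it
# as a new, uniquely named Parquet file inside dataset_dir.
append_batch <- function(batch, dataset_dir) {
  dir.create(dataset_dir, showWarnings = FALSE, recursive = TRUE)
  # tempfile() only generates a unique path inside dataset_dir
  path <- tempfile(pattern = "part-", tmpdir = dataset_dir, fileext = ".parquet")
  write_parquet(batch, path)
  invisible(path)
}

# Made-up example data standing in for two batches of the same table
batch1 <- data.frame(id = 1:3, month = c(1L, 1L, 2L))
batch2 <- data.frame(id = 4:6, month = c(1L, 2L, 2L))

dataset_dir <- tempfile()
append_batch(batch1, dataset_dir)
append_batch(batch2, dataset_dir)

# Open the whole directory as one dataset and query it lazily
result <- open_dataset(dataset_dir) |>
  filter(month == 1) |>
  collect()
```

Because each batch gets its own uniquely named file, existing files are never rewritten, and {{open_dataset()}} on the directory picks up every batch written so far.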
> [R] Add append option to write_parquet
> --------------------------------------
>
> Key: ARROW-17442
> URL: https://issues.apache.org/jira/browse/ARROW-17442
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Lagu
> Priority: Major
>
> Hi, the Parquet format helps a lot with handling data. I think it would be great to add an option to write data by appending it to a particular file; this is necessary when working with a lot of data.
>
> https://arrow.apache.org/docs/r/reference/write_parquet.html
>
> Thx!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)