Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/12/12 15:51:00 UTC

[jira] [Commented] (ARROW-17442) [R] Add append option to write_parquet

    [ https://issues.apache.org/jira/browse/ARROW-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646181#comment-17646181 ] 

Dewey Dunnington commented on ARROW-17442:
------------------------------------------

Unfortunately, the Parquet format makes it difficult to support appending, and I don't believe this is on any roadmap for the foreseeable future.

I believe that instead of appending, the pattern that Arrow enables is writing multiple files and then using {{open_dataset()}} to query them lazily. An example:

{code:R}
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)

parquet_file1 <- tempfile()
parquet_file2 <- tempfile()

write_parquet(nycflights13::flights[1:1000, ], parquet_file1)

# instead of append, write a new file
write_parquet(nycflights13::flights[1001:2000, ], parquet_file2)

# ...then query them both using open_dataset()
open_dataset(c(parquet_file1, parquet_file2)) |> 
  filter(month == 1) |> 
  collect()
#> # A tibble: 2,000 × 19
#>     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#>  1  2013     1     1      517        515       2     830     819      11 UA     
#>  2  2013     1     1      533        529       4     850     830      20 UA     
#>  3  2013     1     1      542        540       2     923     850      33 AA     
#>  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
#>  5  2013     1     1      554        600      -6     812     837     -25 DL     
#>  6  2013     1     1      554        558      -4     740     728      12 UA     
#>  7  2013     1     1      555        600      -5     913     854      19 B6     
#>  8  2013     1     1      557        600      -3     709     723     -14 EV     
#>  9  2013     1     1      557        600      -3     838     846      -8 B6     
#> 10  2013     1     1      558        600      -2     753     745       8 AA     
#> # … with 1,990 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
{code}
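
When batches arrive over time, the same pattern can be applied to a directory instead of an explicit vector of file paths. The following is only a minimal sketch under stated assumptions: {{append_batch()}} is a hypothetical helper (not part of arrow), the data frames are made up for illustration, and every file written into the directory is assumed to share the same schema. Each "append" writes a new, uniquely named Parquet file, and {{open_dataset()}} then treats the directory as one dataset.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Directory that plays the role of the "appendable" file
dataset_dir <- tempfile()
dir.create(dataset_dir)

# Hypothetical helper (not an arrow function): "append" a batch by writing
# it to a new, uniquely named Parquet file inside the dataset directory.
append_batch <- function(batch, dir) {
  path <- tempfile(pattern = "part-", tmpdir = dir, fileext = ".parquet")
  write_parquet(batch, path)
  invisible(path)
}

# Made-up example batches with identical column names and types
batch1 <- data.frame(id = 1:10, value = as.numeric(1:10))
batch2 <- data.frame(id = 11:20, value = as.numeric(11:20))

append_batch(batch1, dataset_dir)
append_batch(batch2, dataset_dir)

# Query all files in the directory as a single dataset
combined <- open_dataset(dataset_dir) |>
  arrange(id) |>
  collect()

nrow(combined)
```

Existing files are never rewritten in this approach, so adding a batch costs roughly the same regardless of how much data is already in the directory. The trade-off is that many very small files can slow down later queries, so it may be worth batching writes where possible.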


> [R] Add append option to write_parquet
> --------------------------------------
>
>                 Key: ARROW-17442
>                 URL: https://issues.apache.org/jira/browse/ARROW-17442
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Lagu
>            Priority: Major
>
> Hi, the Parquet format helps a lot with handling data. I think it would be great to add an option to write data by appending it to a particular file; this is necessary when working with a lot of data.
>  
> https://arrow.apache.org/docs/r/reference/write_parquet.html
>  
> Thx!


