Posted to issues@arrow.apache.org by "Uwe Korn (Jira)" <ji...@apache.org> on 2020/05/15 08:48:00 UTC

[jira] [Comment Edited] (ARROW-8810) Append to parquet file?

    [ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108087#comment-17108087 ] 

Uwe Korn edited comment on ARROW-8810 at 5/15/20, 8:47 AM:
-----------------------------------------------------------

Generally, you should treat Parquet files as immutable. If you want to change their contents, it is almost always simpler and faster to rewrite them completely or (much better) to write a second file and treat a directory of Parquet files as a single dataset (see the sketch after this list). This comes down to two major properties:
 * Values in a Parquet file are encoded and compressed. Thus they don't have a fixed size per row/value; in some cases a column chunk of a million values may be stored in just 64 bytes.
 * The metadata that contains all essential information, e.g. where row groups start and what schema the data has, is stored at the end of the file (the footer). In particular, the trailing eight bytes are needed: a four-byte footer length followed by the four-byte magic "PAR1", which together give the start position of the footer.
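
For example, a minimal sketch in R (hypothetical file and data frame names; assumes a recent arrow release plus dplyr): instead of appending, write each new batch of rows as its own file into one directory and read that directory back as a single dataset.

{code:r}
library(arrow)
library(dplyr)

dir.create("sales_parquet", showWarnings = FALSE)

# Each batch of new rows becomes its own immutable file.
write_parquet(batch1_df, "sales_parquet/part-0.parquet")
write_parquet(batch2_df, "sales_parquet/part-1.parquet")   # the "appended" rows

# All files in the directory are read back as one logical dataset.
ds <- open_dataset("sales_parquet")
all_rows <- ds %>% collect()
{code}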

Technically, you could still write code that appends to an existing Parquet file, but this has the following drawbacks:
 * Writing wouldn't be faster than writing to a second, separate file. It would probably be even slower, as the existing metadata would need to be deserialized and serialized again with only slight modifications.
 * Reading wouldn't be faster than reading from a second file, even when doing it sequentially.
 * While appending to a Parquet file, the file would be unreadable.
 * If your process crashes during the write, all existing data in the Parquet file would be lost.
 * It would give users the impression that they could efficiently insert rows one by one into a file. With a columnar format that can only leverage its encoding and compression techniques on large chunks of rows, this would generate massive overhead; the usual workaround is to buffer rows and write them in large batches (see the sketch after this list).
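
As a rough illustration of the last point, a sketch (hypothetical names, plain data.frame batches) of that workaround: buffer incoming rows in memory and only write a Parquet file once a large chunk has accumulated, so the columnar encodings get to work on many rows at once.

{code:r}
library(arrow)

buffer <- list()
part <- 0L

add_rows <- function(new_rows) {
  buffer[[length(buffer) + 1]] <<- new_rows
  # Flush once roughly a million buffered rows have accumulated (arbitrary threshold).
  if (sum(vapply(buffer, nrow, integer(1))) >= 1e6) {
    chunk <- do.call(rbind, buffer)
    write_parquet(chunk, sprintf("sales_parquet/part-%d.parquet", part))
    part <<- part + 1L
    buffer <<- list()
  }
}
{code}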

Still, if one were to implement this, it would work as follows (see the footer-locating sketch after this list):
 # Read in the footer/metadata of the existing file.
 # Seek to the start position of the existing footer and overwrite it with the new data.
 # Merge (or rather concatenate) the existing metadata with the newly computed metadata and write it at the end of the file.
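
As a sketch of what step 1 has to do to locate the existing footer (hypothetical file name, base R only): a Parquet file ends with the Thrift-encoded file metadata, a four-byte little-endian length of that metadata, and the four-byte magic "PAR1", so the footer's start position can be computed from the trailing eight bytes.

{code:r}
path <- "sales_parquet/part-0.parquet"
size <- file.size(path)

con <- file(path, "rb")
seek(con, size - 8)
footer_len <- readBin(con, what = integer(), size = 4, endian = "little")
magic <- readChar(con, nchars = 4)         # should be "PAR1"
close(con)

# This is the position an appender would seek to before overwriting the old
# footer with new row groups (step 2) and then writing the merged metadata.
footer_start <- size - 8 - footer_len
{code}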

If you look at how a completely fresh Parquet file is written, the process is identical, except that no existing metadata needs to be read in and overwritten.

With newer Arrow releases, there will be better support for Parquet datasets in R; I'll leave it to [~npr] or [~jorisvandenbossche] to link to the right docs.



> Append to parquet file?
> -----------------------
>
>                 Key: ARROW-8810
>                 URL: https://issues.apache.org/jira/browse/ARROW-8810
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Carl Boettiger
>            Priority: Major
>
> Is it possible to append new rows to an existing .parquet file using the R client's arrow::write_parquet(), in a manner similar to the `append=TRUE` argument in text-based output formats like write.table()? 
>  
> Apologies as this is perhaps more a question of documentation or user interface, or maybe just my ignorance. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)