Posted to dev@parquet.apache.org by "Uwe L. Korn (JIRA)" <ji...@apache.org> on 2017/06/16 16:43:00 UTC

[jira] [Commented] (PARQUET-1022) Append mode in parquet-cpp

    [ https://issues.apache.org/jira/browse/PARQUET-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052116#comment-16052116 ] 

Uwe L. Korn commented on PARQUET-1022:
--------------------------------------

What do you mean by "the original last rowgroup may not be complete"? Each RowGroup has a number of rows specified in its metadata, and exactly this number of rows must be in that RowGroup once the file is written to disk; otherwise we end up with invalid files.

Concerning the general approach: writing a Parquet file would initially produce a file of the pattern {{MRRF}} (taking 2 RowGroups as an example), where M is the 4 magic bytes {{PAR1}}, each R is a RowGroup, and F is the footer including the metadata.

To implement an append mode, you would then read F from the existing file, write the new RowGroup at the position where F currently starts, and then write out the new footer F' with its modified metadata, giving {{MRRRF'}}.
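To make the byte layout concrete, here is a minimal sketch of that rewrite on an in-memory buffer. It relies on the fact that a Parquet file ends with the serialized footer metadata, followed by a 4-byte little-endian footer length and the trailing magic {{PAR1}}. The names {{FooterLocation}}, {{LocateFooter}} and {{AppendRowGroup}} are hypothetical helpers for illustration, not parquet-cpp API, and the RowGroup and footer bytes are treated as opaque blobs. In a real implementation F' would be the re-serialized Thrift metadata, including the new RowGroup's row count and column chunk offsets.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Magic bytes at the start and end of every Parquet file.
constexpr char kMagic[4] = {'P', 'A', 'R', '1'};

// Hypothetical helper type: where the serialized footer metadata lives.
struct FooterLocation {
  std::size_t offset;    // byte offset at which the footer metadata starts
  std::uint32_t length;  // size of the footer metadata in bytes
};

// Tail of the file: [footer metadata][4-byte little-endian length]["PAR1"].
FooterLocation LocateFooter(const std::vector<std::uint8_t>& file) {
  // Smallest possible file: leading magic + length field + trailing magic.
  if (file.size() < 12) throw std::runtime_error("file too small");
  if (std::memcmp(file.data(), kMagic, 4) != 0 ||
      std::memcmp(file.data() + file.size() - 4, kMagic, 4) != 0) {
    throw std::runtime_error("missing PAR1 magic");
  }
  const std::uint8_t* p = file.data() + file.size() - 8;
  const std::uint32_t len =
      static_cast<std::uint32_t>(p[0]) |
      (static_cast<std::uint32_t>(p[1]) << 8) |
      (static_cast<std::uint32_t>(p[2]) << 16) |
      (static_cast<std::uint32_t>(p[3]) << 24);
  if (len > file.size() - 12) throw std::runtime_error("bad footer length");
  return {file.size() - 8 - len, len};
}

// Turns M R R F into M R R R F': everything before the old footer is kept,
// the new RowGroup goes where F started, and the new footer follows it.
std::vector<std::uint8_t> AppendRowGroup(
    const std::vector<std::uint8_t>& file,
    const std::vector<std::uint8_t>& new_row_group,
    const std::vector<std::uint8_t>& new_footer) {
  const FooterLocation loc = LocateFooter(file);
  std::vector<std::uint8_t> out(
      file.begin(), file.begin() + static_cast<std::ptrdiff_t>(loc.offset));
  out.insert(out.end(), new_row_group.begin(), new_row_group.end());
  out.insert(out.end(), new_footer.begin(), new_footer.end());
  // Footer length, little-endian, then the trailing magic.
  const std::uint32_t n = static_cast<std::uint32_t>(new_footer.size());
  for (int i = 0; i < 4; ++i) {
    out.push_back(static_cast<std::uint8_t>((n >> (8 * i)) & 0xFF));
  }
  out.insert(out.end(), kMagic, kMagic + 4);
  return out;
}
```

Note that the existing RowGroups are never touched: their bytes and their row counts stay exactly as they were, so there is no "incomplete" last RowGroup to fill up. Only the footer is replaced.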

Still, a small note of caution: normally, appends to Parquet datasets are simply done by creating additional Parquet files. Most tools that ingest Parquet are built such that they treat multiple files and a single Parquet file nearly identically. Code-wise, this should be much easier for you to handle. But if this append mode is really helpful for you, it would also be really helpful for us to understand why you need it instead of just writing a new file!

> Append mode in parquet-cpp
> --------------------------
>
>                 Key: PARQUET-1022
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1022
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-cpp
>    Affects Versions: cpp-1.1.0
>            Reporter: yugu
>
> As said, currently trying to work out a append feature for parquet files in c++.
> (been searching through repo etc, can't find example tho..)
> Current solution is to (assume no schema changes that is):
> Read in metadata
> Change metadata based on appended rows+ original rows
> Append a new row group (or multiple row group writer)
> Write the new rows.
> ---
> The problem is that, if approached this way, the original last row group may not be completely filled. Was wondering if there is a fix or I'm using the api wrong...
> Thanks ! : D


