Posted to dev@parquet.apache.org by "David Lee (JIRA)" <ji...@apache.org> on 2019/08/06 22:04:00 UTC

[jira] [Commented] (PARQUET-1626) [C++] Ability to concat parquet files

    [ https://issues.apache.org/jira/browse/PARQUET-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901509#comment-16901509 ] 

David Lee commented on PARQUET-1626:
------------------------------------

I'm appending RowGroups using pyarrow today.

Open a new parquet file.

For each row group in File 1:

    Read the row group and write it to the new file.

For each row group in File 2:

    Read the row group and write it to the new file.

Close the new parquet file.

There is no need to touch the metadata by hand, since the statistics are stored per row group and are written out again with each row group.

I usually generate parquet files of 30 to 40 MB each and merge them afterwards so that the merged files match the HDFS block size.
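
Deciding which small files go into which merged file can be sketched as a simple greedy grouping by size. This is an assumption about how one might plan the merges, not a description of an existing tool: plan_merge_batches is a hypothetical helper, and the 128 MB target in the usage example is just a commonly used HDFS block size, so substitute whatever your cluster is configured with. The merged file size will only approximate the sum of the input sizes.

```python
def plan_merge_batches(file_sizes, target_bytes):
    """Greedily group (path, size_in_bytes) pairs into batches.

    Each batch's total size stays at or below target_bytes, unless a
    single file is already larger than the target, in which case it
    forms a batch on its own. Hypothetical helper for illustration.
    """
    batches = []
    current = []
    current_size = 0
    for path, size in file_sizes:
        # Start a new batch when adding this file would overshoot the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current = []
            current_size = 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


MB = 1024 * 1024
# Four files of 35 MB each, with an assumed 128 MB block size.
example = [("a.parquet", 35 * MB), ("b.parquet", 35 * MB),
           ("c.parquet", 35 * MB), ("d.parquet", 35 * MB)]
batches = plan_merge_batches(example, 128 * MB)
```

Each resulting batch is a list of paths that can then be merged into one output file with the row-group loop.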

> [C++] Ability to concat parquet files 
> --------------------------------------
>
>                 Key: PARQUET-1626
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1626
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-cpp
>    Affects Versions: cpp-1.3.1
>            Reporter: nileema shingte
>            Priority: Major
>              Labels: features
>
> Ability to concat the parquet files is something we've wanted for some time too. When we generate parquet files partitioned by an expression, we often end up with tiny files and would like to add a post-processing step to concat these files together.
> Is there a plan to add this ability to the library any time soon? 
> If not, it would be great if someone can provide a somewhat detailed pseudocode (expanding on what [~xhochy] mentioned in the comment in PARQUET-1022) as a guideline for conditions/scenarios that need to be handled with extra care, so we can contribute this as a PR. 


