Posted to dev@parquet.apache.org by "Gabor Szadovszky (Jira)" <ji...@apache.org> on 2019/10/01 10:57:00 UTC

[jira] [Commented] (PARQUET-1670) parquet-tools merge extremely slow with block-option

    [ https://issues.apache.org/jira/browse/PARQUET-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941737#comment-16941737 ] 

Gabor Szadovszky commented on PARQUET-1670:
-------------------------------------------

This is a tough problem. You are right that concatenating the row groups as they are does not help solve the issue. On the other hand, re-building the row groups (at least in the naive way) requires reading back all the values and re-encoding them, which takes time. You may come up with smarter solutions (currently not implemented), like writing the pages without decoding them, but then you have to handle the dictionaries, which is really problematic. (I cannot see any smart solution for dictionaries.)
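
For illustration, a minimal sketch (not part of parquet-tools) of what the naive rebuild looks like with the parquet-mr example Group API follows; the file paths and row-group size are made up, and all inputs are assumed to share the same schema. Every record is decoded and re-encoded, which is where the time and memory go:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class NaiveMerge {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path("merged.parquet");        // illustrative output path
    Path[] inputs = { new Path("part-0.parquet"), // illustrative input paths
                      new Path("part-1.parquet") };

    // Take the schema from the first input; assumes all inputs share it.
    MessageType schema;
    try (ParquetFileReader r =
        ParquetFileReader.open(HadoopInputFile.fromPath(inputs[0], conf))) {
      schema = r.getFooter().getFileMetaData().getSchema();
    }

    try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(out)
        .withConf(conf)
        .withType(schema)
        .withRowGroupSize(128 * 1024 * 1024)      // aim for large row groups
        .build()) {
      for (Path in : inputs) {
        try (ParquetReader<Group> reader =
            ParquetReader.builder(new GroupReadSupport(), in).withConf(conf).build()) {
          Group record;
          while ((record = reader.read()) != null) {
            writer.write(record);                 // decode + re-encode every value
          }
        }
      }
    }
  }
}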

Long story short, we do not have any tool that can merge parquet files into one both quickly and correctly. I think the best you can do is to use an existing engine (Spark, Hive, etc.) and re-build the whole table, or just its last partitions.
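
As a rough sketch of that engine route, a compaction job with Spark's Java API could look like the following; the paths and the coalesce factor are invented for the example:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class RewriteWithSpark {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("compact-parquet")
        .getOrCreate();

    // Read the many small files of one partition.
    Dataset<Row> df = spark.read().parquet("hdfs:///data/table/dt=2019-09-30/");

    df.coalesce(1)                                 // one output file => larger row groups
      .write()
      .mode(SaveMode.Overwrite)
      .parquet("hdfs:///data/table_compacted/dt=2019-09-30/");

    spark.stop();
  }
}

Writing a single output file per partition lets the Parquet writer build full-size row groups, at the cost of re-encoding everything, which is the same work a correct merge would have to do anyway.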

> parquet-tools merge extremely slow with block-option
> ----------------------------------------------------
>
>                 Key: PARQUET-1670
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1670
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Alexander Gunkel
>            Priority: Major
>
> parquet-tools merge is extremely time- and memory-consuming when used with block-option.
>  
> The merge function builds a bigger file out of several smaller parquet files. Used without the block option it just concatenates the files into a bigger one without building larger row groups, which does not help with query-performance issues. With the block option, parquet-tools builds bigger row groups, which improves query performance, but the merge process itself is extremely slow and memory-consuming.
>  
> Consider a case in which you have many small parquet files, e.g. 1000 files of 100kb each. Merging them into one file fails on my machine because even 20GB of memory is not enough for the process (even though the total amount of data, as well as the resulting file, should be smaller than 100MB).
>  
> Different situation: Consider having 100 files of size 1MB. Then merging them is possible with 20GB of RAM, but it takes almost half an hour to process, which is too much for many use cases.
>  
> Is there any possibility to accelerate the merge and reduce its memory consumption?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)