Posted to dev@parquet.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/08/25 01:30:00 UTC

[jira] [Commented] (PARQUET-1115) Warn users when misusing parquet-tools merge

    [ https://issues.apache.org/jira/browse/PARQUET-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584533#comment-17584533 ] 

ASF GitHub Bot commented on PARQUET-1115:
-----------------------------------------

NickCrews commented on PR #433:
URL: https://github.com/apache/parquet-mr/pull/433#issuecomment-1226667307

   It might be nice if we actually suggested an alternative instead of just saying "don't do this."
   
   You can see my solution at https://gist.github.com/NickCrews/7a47ef4083160011e8e533531d73428c.




> Warn users when misusing parquet-tools merge
> --------------------------------------------
>
>                 Key: PARQUET-1115
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1115
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Zoltan Ivanfi
>            Assignee: Nándor Kollár
>            Priority: Major
>             Fix For: 1.10.0
>
>
> To prevent users from using {{parquet-tools merge}} in scenarios where its use is not practical, we should describe its limitations in the help text of this command. Additionally, we should add a warning to the output of the merge command if the size of the original row groups is below a threshold.
> Reasoning:
> Many users are tempted to use the new {{parquet-tools merge}} functionality because they want to achieve good performance, and historically that has been associated with large Parquet files. However, in practice Hive performance won't change significantly after using {{parquet-tools merge}}, but Impala performance will be much worse. The reason is that good performance results not from large files but from large row groups (up to the HDFS block size).
> However, {{parquet-tools merge}} does not merge row groups; it just places them one after the other. It was intended to be used for Parquet files that are already arranged in row groups of the desired size. When used to merge many small files, the resulting file will still contain small row groups, and one loses most of the advantages of larger files (the only one that remains is that it takes a single HDFS operation to read them).
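
The difference described above can be illustrated with a small model. This is only a sketch and not the parquet-tools implementation: each file is represented as a list of row group sizes in MiB, and the function names (concat_merge, rewrite_merge, small_row_groups) and the 64 and 128 MiB figures are hypothetical choices made for the example. It also assumes sizes simply add up when data is rewritten, which real files with encoding and compression will not match exactly.

```python
from typing import List


def concat_merge(files: List[List[int]]) -> List[int]:
    """Model of a merge that places row groups one after the other.

    Row group sizes are carried over unchanged, so small inputs
    produce a large file that still has small row groups.
    """
    return [size for row_groups in files for size in row_groups]


def rewrite_merge(files: List[List[int]], target_mib: int) -> List[int]:
    """Model of rewriting the data into row groups of a target size.

    Simplifying assumption: the total size is preserved and split
    into chunks of target_mib, with one smaller trailing chunk.
    """
    if target_mib <= 0:
        raise ValueError("target_mib must be positive")
    total = sum(size for row_groups in files for size in row_groups)
    full, rest = divmod(total, target_mib)
    return [target_mib] * full + ([rest] if rest else [])


def small_row_groups(row_groups: List[int], threshold_mib: int) -> List[int]:
    """Return the row groups below the threshold (the case worth a warning)."""
    return [size for size in row_groups if size < threshold_mib]


if __name__ == "__main__":
    # 300 hypothetical input files, each holding a single 1 MiB row group.
    inputs = [[1] for _ in range(300)]

    appended = concat_merge(inputs)
    rewritten = rewrite_merge(inputs, target_mib=128)

    print(len(appended), "row groups after appending")
    print(rewritten, "after rewriting")
    print(len(small_row_groups(appended, 64)), "appended row groups below 64 MiB")
    print(len(small_row_groups(rewritten, 64)), "rewritten row groups below 64 MiB")
```

In this model, appending leaves every input row group below the threshold, while rewriting leaves at most the final partial row group small. That is the situation the proposed warning is meant to flag.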



--
This message was sent by Atlassian Jira
(v8.20.10#820010)