Posted to dev@parquet.apache.org by "Ekaterina Galieva (JIRA)" <ji...@apache.org> on 2018/08/15 23:41:00 UTC

[jira] [Created] (PARQUET-1381) Add merge blocks command to parquet-tools

Ekaterina Galieva created PARQUET-1381:
------------------------------------------

             Summary: Add merge blocks command to parquet-tools
                 Key: PARQUET-1381
                 URL: https://issues.apache.org/jira/browse/PARQUET-1381
             Project: Parquet
          Issue Type: New Feature
          Components: parquet-mr
    Affects Versions: 1.10.0
            Reporter: Ekaterina Galieva
             Fix For: 1.10.1


The current implementation of the merge command in parquet-tools doesn't merge row groups; it just places them one after another. Add an API and a command option to merge small blocks into larger ones, up to a specified size limit.
h6. Implementation details:

Blocks are not reordered, so that any predicate pushdown optimizations present in the input are not broken.
Blocks are not split to fit the upper bound exactly.
This is an intentional performance optimization.
It makes it possible to form new blocks by copying the full content of smaller blocks column by column, rather than row by row.
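
The grouping rule implied by these details can be sketched as a single greedy pass over the block sizes in input order. This is a minimal sketch, not the parquet-tools implementation: the class {{BlockMergeSketch}} and the method {{planMergedSizes}} are hypothetical names, sizes are plain numbers in the same units as the examples in this issue, and the rule that a block is closed as soon as adding the next input block would exceed the limit is an assumption inferred from those examples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: plans output block sizes only, it does not touch Parquet files.
final class BlockMergeSketch {

    // Walks the input blocks in their original order (no reordering) and
    // accumulates whole blocks (no splitting) until the next one would
    // push the running total above maxSize.
    static List<Long> planMergedSizes(List<Long> inputSizes, long maxSize) {
        List<Long> merged = new ArrayList<>();
        long current = 0;      // size accumulated for the block being built
        boolean open = false;  // true once the block being built has content
        for (long size : inputSizes) {
            if (open && current + size > maxSize) {
                merged.add(current);  // close the block before it overflows
                current = 0;
                open = false;
            }
            current += size;  // a single oversized block is kept as-is
            open = true;
        }
        if (open) {
            merged.add(current);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Block sizes of all input files, concatenated in file order.
        System.out.println(planMergedSizes(Arrays.asList(128L, 35L, 128L, 40L, 120L), 256));
        System.out.println(planMergedSizes(Arrays.asList(128L, 35L, 40L, 120L, 6L), 128));
    }
}
```

Under these assumptions the two calls print {{[163, 168, 120]}} and {{[128, 75, 126]}}. A limit of 128 reproduces the {{merge -b}} outputs in the examples in this issue, which suggests 128 as the default limit, but the issue does not state the default explicitly.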
h6. Examples:
 # Input files with block sizes:
{code:java}
[128 | 35], [128 | 40], [120]{code}
Expected output file block sizes:
{{merge}}
{code:java}
[128 | 35 | 128 | 40 | 120]
{code}
{{merge -b}}
{code:java}
[128 | 35 | 128 | 40 | 120]
{code}
{{merge -b -l 256}}
{code:java}
[163 | 168 | 120]
{code}

 # Input files with block sizes:
{code:java}
[128 | 35], [40], [120], [6] {code}
Expected output file block sizes:
{{merge}}
{code:java}
[128 | 35 | 40 | 120 | 6] 
{code}
{{merge -b}}
{code:java}
[128 | 75 | 126] 
{code}
{{merge -b -l 256}}
{code:java}
[203 | 126]{code}


