Posted to dev@parquet.apache.org by "Julien Le Dem (JIRA)" <ji...@apache.org> on 2015/02/06 20:32:35 UTC

[jira] [Created] (PARQUET-183) Special case empty columns to store 0 pages and no column chunks in the footer

Julien Le Dem created PARQUET-183:
-------------------------------------

             Summary: Special case empty columns to store 0 pages and no column chunks in the footer
                 Key: PARQUET-183
                 URL: https://issues.apache.org/jira/browse/PARQUET-183
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
            Reporter: Julien Le Dem


Currently, when a column is empty, each row group still contains one page for it that encodes the repetition and definition levels for that row group. These levels are as many 0s as there are rows in the row group (the row count is stored in the row group metadata). Because the values are RLE-encoded, the page ends up being very small.
However, when a very sparse dataset has a lot of columns, we end up with a lot of empty column chunks (a column chunk is the data for a given column in a given row group). The metadata could become much smaller if empty column chunks were omitted, since the metadata of an empty column chunk can be derived from the row count of the corresponding row group.
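
To give a rough sense of the sizes involved, here is a minimal Python sketch (not parquet-mr code). It assumes the run-length form of Parquet's RLE/bit-packing hybrid as I understand it: a varint header holding the run length shifted left by one, followed by the repeated value padded to whole bytes. Page headers and any length prefix are ignored. The helper names and the column and row group counts are made up for illustration.

```python
def varint(n: int) -> bytes:
    """Unsigned LEB128 encoding of a non-negative integer."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def rle_run_of_zeros(num_rows: int, bit_width: int = 1) -> bytes:
    """Encode num_rows zeros as a single RLE run (assumed layout, see lead-in)."""
    value_bytes = (bit_width + 7) // 8  # repeated value padded to whole bytes
    return varint(num_rows << 1) + (0).to_bytes(value_bytes, "little")


def empty_chunk_entries(empty_columns: int, row_groups: int) -> int:
    """Footer entries spent on empty column chunks: one per column per row group."""
    return empty_columns * row_groups


# The levels themselves are tiny, even for a large row group...
levels = rle_run_of_zeros(1_000_000)
print(len(levels), "bytes of levels for 1,000,000 rows")

# ...but the number of footer entries grows with columns x row groups
# (hypothetical numbers).
print(empty_chunk_entries(empty_columns=9_900, row_groups=50), "empty chunk entries")
```

The point of the sketch is that the page data is not the problem; the per-chunk metadata multiplied across columns and row groups is.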

I propose the following:
When a column chunk is empty, do not write any page to it.
Do not add the column chunk metadata in the footer for such empty columns.
A column chunk is empty if, when writing the row group to disk, it has only one page and that page's repetition levels (rl) and definition levels (dl) are all 0s, i.e. the column is completely empty.
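
The write-side rule could be sketched as follows. This is an illustrative Python model, not the parquet-mr writer: `Page`, `is_empty_chunk` and `build_row_group_metadata` are hypothetical names, and the `max_definition_level` guard is my own assumption (a required top-level column also has only 0 levels yet holds real values, so it should not be treated as empty).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Page:
    repetition_levels: List[int]
    definition_levels: List[int]


def is_empty_chunk(pages: List[Page], max_definition_level: int) -> bool:
    """True if the chunk is one page whose rl and dl are all 0s."""
    if max_definition_level == 0:
        # Assumption: with no definition levels, 0s do not mean null.
        return False
    if len(pages) != 1:
        return False
    page = pages[0]
    return (all(level == 0 for level in page.repetition_levels)
            and all(level == 0 for level in page.definition_levels))


def build_row_group_metadata(num_rows: int,
                             chunks: Dict[str, List[Page]],
                             max_dl: Dict[str, int]) -> dict:
    """Keep footer entries only for column chunks that are not empty."""
    kept = [path for path, pages in chunks.items()
            if not is_empty_chunk(pages, max_dl[path])]
    return {"num_rows": num_rows, "columns": sorted(kept)}


# Two optional columns in a 3-row row group; "b" is entirely null.
chunks = {
    "a": [Page([0, 0, 0], [1, 0, 1])],
    "b": [Page([0, 0, 0], [0, 0, 0])],
}
print(build_row_group_metadata(3, chunks, {"a": 1, "b": 1}))
```

Here only column "a" would get a column chunk entry in the footer; "b" stays in the schema but writes no page and no metadata.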
When reading the dataset:
 - the column is present in the schema.
 - if there is no column chunk in the footer for a given row group, we can simply replace the rls and dls with infinite streams of 0s.
 - in predicate push down, any stats information can be replaced by a null count equal to the number of rows in the row group.
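
The read-side behaviour above might look like this in a simplified Python model. The dictionary layout of a row group and the function names are hypothetical, chosen only to illustrate the idea; parquet-mr's actual reader and filter APIs are not shown.

```python
from itertools import islice, repeat


def levels_for(row_group: dict, column: str):
    """Return (rl, dl) iterators; a missing chunk yields endless 0s."""
    chunk = row_group["chunks"].get(column)
    if chunk is None:
        return repeat(0), repeat(0)
    return iter(chunk["rl"]), iter(chunk["dl"])


def stats_for(row_group: dict, column: str) -> dict:
    """Return stored stats, or synthesize all-null stats for a missing chunk."""
    chunk = row_group["chunks"].get(column)
    if chunk is None:
        return {"null_count": row_group["num_rows"], "min": None, "max": None}
    return chunk["stats"]


def can_skip_for_equality(row_group: dict, column: str) -> bool:
    """A 'column = <non-null literal>' filter cannot match an all-null chunk."""
    return stats_for(row_group, column)["null_count"] == row_group["num_rows"]


# Row group with 4 rows; column "b" is in the schema but has no chunk.
row_group = {
    "num_rows": 4,
    "chunks": {
        "a": {"rl": [0, 0, 0, 0], "dl": [1, 1, 0, 1],
              "stats": {"null_count": 1, "min": 2, "max": 9}},
    },
}

rl, dl = levels_for(row_group, "b")
print(list(islice(dl, row_group["num_rows"])))
print(stats_for(row_group, "b"))
print(can_skip_for_equality(row_group, "b"))
```

The reader consumes only as many levels as the row count requires, so an infinite stream is safe, and the synthesized null count lets a filter drop the row group without touching any data.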

This will help in cases where we have huge schemas in which only a small subset of columns is actually populated. The file data will then look as if we had declared a schema containing only the columns that actually have data in them. Only the schema in the footer will mention the empty columns.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)