Posted to dev@parquet.apache.org by "Nandor Kollar (JIRA)" <ji...@apache.org> on 2018/08/23 15:27:00 UTC

[jira] [Updated] (PARQUET-1401) RowGroup offset and total compressed size fields

     [ https://issues.apache.org/jira/browse/PARQUET-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nandor Kollar updated PARQUET-1401:
-----------------------------------
    Labels: pull-request-available  (was: )

> RowGroup offset and total compressed size fields
> ------------------------------------------------
>
>                 Key: PARQUET-1401
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1401
>             Project: Parquet
>          Issue Type: Sub-task
>          Components: parquet-cpp, parquet-format
>            Reporter: Gidon Gershinsky
>            Assignee: Gidon Gershinsky
>            Priority: Major
>              Labels: pull-request-available
>
> Spark uses the filterFileMetaData* methods in the ParquetMetadataConverter class, which calculate the offset and the total compressed size of a RowGroup's data.
> The offset calculation is done by extracting the ColumnMetaData of the first column and using its offset fields.
> The total compressed size calculation is done by looping over all column chunks in the RowGroup and summing up the size values from each chunk's ColumnMetaData. Both calculations are sketched below.
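>
> For illustration, a minimal Java sketch of both calculations, assuming the Thrift-generated classes from parquet-format (the class and method names here are hypothetical; the real logic lives in ParquetMetadataConverter):
>
>     import org.apache.parquet.format.ColumnChunk;
>     import org.apache.parquet.format.ColumnMetaData;
>     import org.apache.parquet.format.RowGroup;
>
>     public class RowGroupStats {
>       // Offset of a RowGroup = start of its first column chunk.
>       static long rowGroupOffset(RowGroup rowGroup) {
>         ColumnMetaData first = rowGroup.getColumns().get(0).getMeta_data();
>         long offset = first.getData_page_offset();
>         // A dictionary page, if present, precedes the data pages.
>         if (first.isSetDictionary_page_offset()
>             && first.getDictionary_page_offset() < offset) {
>           offset = first.getDictionary_page_offset();
>         }
>         return offset;
>       }
>
>       // Total compressed size = sum over all column chunks.
>       static long rowGroupCompressedSize(RowGroup rowGroup) {
>         long total = 0;
>         for (ColumnChunk chunk : rowGroup.getColumns()) {
>           total += chunk.getMeta_data().getTotal_compressed_size();
>         }
>         return total;
>       }
>     }
>
> Both methods depend on every chunk's ColumnMetaData being readable, which is exactly what breaks for hidden columns, as described next.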
> If one or more columns are hidden (encrypted with a key unavailable to the reader), these calculations can't be performed, because the column metadata is protected. 
>  
> However, these calculations don't actually need the individual column values: the results pertain to the whole RowGroup, not to specific columns.
> Therefore, we will define two new optional fields in the RowGroup Thrift structure:
>  
>     optional i64 file_offset
>     optional i64 total_compressed_size
>  
> and calculate/set them upon file writing. Spark will then be able to query a file with hidden columns (provided, of course, that the query itself doesn't need the hidden columns: it either works with a masked version of them or reads only the columns with available keys).
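>
> A writer-side sketch of that step, to sit next to the hypothetical helpers above (setter names follow the Thrift-generated Java API):
>
>     // When finalizing a RowGroup in the footer, populate the new fields
>     // so that readers never need to open per-column metadata.
>     static void fillRowGroupFields(RowGroup rowGroup) {
>       rowGroup.setFile_offset(RowGroupStats.rowGroupOffset(rowGroup));
>       rowGroup.setTotal_compressed_size(
>           RowGroupStats.rowGroupCompressedSize(rowGroup));
>     }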
>  
> These values can be set only for encrypted files, or for all files (to let readers skip the per-column loop). I've tested this; it works fine with Spark writers and readers.
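>
> On the read path, the filter methods can then prefer the new fields and fall back to the old per-column loop for files written before this change (again only a sketch, reusing the helpers above):
>
>     static long offsetOf(RowGroup rowGroup) {
>       return rowGroup.isSetFile_offset()
>           ? rowGroup.getFile_offset()
>           : RowGroupStats.rowGroupOffset(rowGroup);   // pre-change files
>     }
>
>     static long compressedSizeOf(RowGroup rowGroup) {
>       return rowGroup.isSetTotal_compressed_size()
>           ? rowGroup.getTotal_compressed_size()
>           : RowGroupStats.rowGroupCompressedSize(rowGroup);
>     }
>
> For encrypted files with hidden columns the fallback is never reached, since for those files the writer is expected to always set the new fields.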
>  
> I've also checked for other references to ColumnMetaData fields in parquet-mr. There are none; therefore, this is the only change we need in parquet.thrift to handle hidden columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)