Posted to dev@parquet.apache.org by "Nandor Kollar (JIRA)" <ji...@apache.org> on 2018/08/23 15:27:00 UTC
[jira] [Updated] (PARQUET-1401) RowGroup offset and total compressed size fields
[ https://issues.apache.org/jira/browse/PARQUET-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nandor Kollar updated PARQUET-1401:
-----------------------------------
Labels: pull-request-available (was: )
> RowGroup offset and total compressed size fields
> ------------------------------------------------
>
> Key: PARQUET-1401
> URL: https://issues.apache.org/jira/browse/PARQUET-1401
> Project: Parquet
> Issue Type: Sub-task
> Components: parquet-cpp, parquet-format
> Reporter: Gidon Gershinsky
> Assignee: Gidon Gershinsky
> Priority: Major
> Labels: pull-request-available
>
> Spark uses the filterFileMetaData* methods in the ParquetMetadataConverter class, which calculate the offset and the total compressed size of a RowGroup's data.
> The offset is calculated by extracting the ColumnMetaData of the first column and using its offset fields.
> The total compressed size is calculated by running a loop over all column chunks in the RowGroup and summing up the size values from each chunk's ColumnMetaData.
> If one or more columns are hidden (encrypted with a key unavailable to the reader), these calculations can't be performed, because the column metadata is protected.
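The current calculation can be sketched as follows. This is a standalone illustration with simplified stand-in types (ColumnChunk here is hypothetical), not the real parquet-mr ColumnChunkMetaData API:

```java
// Sketch of how a RowGroup's offset and total compressed size are currently
// derived from per-column metadata. ColumnChunk is a hypothetical stand-in
// holding only the two fields the calculation touches.
import java.util.Arrays;
import java.util.List;

public class RowGroupStats {
    static final class ColumnChunk {
        final long fileOffset;      // where this chunk's data starts in the file
        final long compressedSize;  // compressed size of this chunk
        ColumnChunk(long fileOffset, long compressedSize) {
            this.fileOffset = fileOffset;
            this.compressedSize = compressedSize;
        }
    }

    // RowGroup offset = offset of its first column chunk.
    static long rowGroupOffset(List<ColumnChunk> columns) {
        return columns.get(0).fileOffset;
    }

    // Total compressed size = sum over all column chunks in the RowGroup.
    static long totalCompressedSize(List<ColumnChunk> columns) {
        long total = 0;
        for (ColumnChunk c : columns) {
            total += c.compressedSize;
        }
        return total;
    }

    public static void main(String[] args) {
        List<ColumnChunk> rg = Arrays.asList(
                new ColumnChunk(4, 100),
                new ColumnChunk(104, 250));
        System.out.println(rowGroupOffset(rg));       // 4
        System.out.println(totalCompressedSize(rg));  // 350
    }
}
```

Both helpers must read every (or at least the first) column's metadata, which is exactly what becomes impossible when that metadata is encrypted.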
>
> But: these calculations don't really need the individual column values. The results pertain to the whole RowGroup, not specific columns.
> Therefore, we will define two new optional fields in the RowGroup Thrift structure:
>
> optional i64 file_offset
> optional i64 total_compressed_size
>
> and calculate/set them upon file writing. Then, Spark will be able to query a file with hidden columns (provided, of course, that the query itself doesn't need the hidden columns: it either works with a masked version of them or reads only columns with available keys).
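The proposed scheme can be sketched like this. The RowGroup type below is a hypothetical stand-in mirroring the two new optional fields, not the real parquet-format Thrift bindings:

```java
// Sketch of the proposal: the writer sets the two new optional RowGroup-level
// fields, so a reader can obtain offset and total compressed size without
// opening any per-column ColumnMetaData (which may be encrypted).
import java.util.OptionalLong;

public class RowGroupFields {
    static final class RowGroup {
        // Mirrors the proposed optional i64 fields in parquet.thrift.
        OptionalLong fileOffset = OptionalLong.empty();
        OptionalLong totalCompressedSize = OptionalLong.empty();
    }

    // Writer side: fill the fields from values the writer already knows.
    static RowGroup write(long firstChunkOffset, long[] chunkSizes) {
        RowGroup rg = new RowGroup();
        long total = 0;
        for (long s : chunkSizes) {
            total += s;
        }
        rg.fileOffset = OptionalLong.of(firstChunkOffset);
        rg.totalCompressedSize = OptionalLong.of(total);
        return rg;
    }

    // Reader side: use the RowGroup-level fields directly; no per-column
    // metadata is touched, so hidden (encrypted) columns pose no problem.
    static long[] read(RowGroup rg) {
        return new long[] {
                rg.fileOffset.getAsLong(),
                rg.totalCompressedSize.getAsLong()};
    }

    public static void main(String[] args) {
        RowGroup rg = write(4, new long[] {100, 250});
        long[] r = read(rg);
        System.out.println(r[0] + " " + r[1]); // 4 350
    }
}
```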
>
> These values can be set only for encrypted files (or for all files, to skip the loop upon reading). I've tested this; it works fine in Spark writers and readers.
>
> I've also checked other references to ColumnMetaData fields in parquet-mr. There are none; therefore, this is the only change we need in parquet.thrift to handle hidden columns.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)