Posted to dev@parquet.apache.org by "Zoltan Ivanfi (JIRA)" <ji...@apache.org> on 2017/06/29 16:33:01 UTC
[jira] [Assigned] (PARQUET-291) Difference between parquet-mr implementation and parquet-format documentation
[ https://issues.apache.org/jira/browse/PARQUET-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zoltan Ivanfi reassigned PARQUET-291:
-------------------------------------
Assignee: Zoltan Ivanfi
> Difference between parquet-mr implementation and parquet-format documentation
> -----------------------------------------------------------------------------
>
> Key: PARQUET-291
> URL: https://issues.apache.org/jira/browse/PARQUET-291
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format, parquet-mr
> Affects Versions: 1.6.1
> Reporter: Konstantin Shaposhnikov
> Assignee: Zoltan Ivanfi
>
> Documentation at https://github.com/apache/parquet-format/blob/master/src/thrift/parquet.thrift
> {noformat}
> struct ColumnChunk {
>   /** File where column data is stored. If not set, assumed to be same file as
>     * metadata. This path is relative to the current file.
>     **/
>   1: optional string file_path
>   /** Byte offset in file_path to the ColumnMetaData **/
>   2: required i64 file_offset
>   ...
> {noformat}
> and https://github.com/apache/parquet-format
> {noformat}
> 4-byte magic number "PAR1"
> <Column 1 Chunk 1 + Column Metadata>
> <Column 2 Chunk 1 + Column Metadata>
> ...
> {noformat}
> suggests that ColumnChunk data should be followed by ColumnMetaData.
> However, it looks like parquet-mr doesn't write ColumnMetaData after the column chunk data at all, and instead populates ColumnChunk.file_offset with the offset of the first data page.
> From *ParquetMetadataConverter.java:153*:
> {code}
> for (ColumnChunkMetaData columnMetaData : columns) {
>   ColumnChunk columnChunk = new ColumnChunk(columnMetaData.getFirstDataPageOffset()); // verify this is the right offset
>   columnChunk.file_path = block.getPath(); // they are in the same file for now
> {code}
> Is it a bug in parquet-mr or in the documentation?
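
To make the two readings of file_offset concrete, here is a minimal sketch in plain Java. It does not use parquet-mr or the generated Thrift classes. The class name, method names, and the 100-byte chunk size are all hypothetical, and it assumes a chunk with no dictionary page that starts right after the 4-byte magic number, so the first data page offset is also the start of the chunk.

```java
// Hypothetical sketch: none of these names come from parquet-mr or parquet-format.
final class FileOffsetSketch {

    /** Length in bytes of the "PAR1" magic number at the start of the file. */
    static final long MAGIC_LENGTH = 4;

    /**
     * One reading of the documentation: ColumnMetaData is written right after
     * the chunk's pages, and file_offset points at that metadata.
     */
    static long documentedFileOffset(long firstDataPageOffset, long chunkByteLength) {
        return firstDataPageOffset + chunkByteLength;
    }

    /** Behaviour described in the report: file_offset is the first data page offset. */
    static long observedFileOffset(long firstDataPageOffset) {
        return firstDataPageOffset;
    }

    public static void main(String[] args) {
        long firstDataPageOffset = MAGIC_LENGTH; // assumed: chunk starts right after "PAR1"
        long chunkByteLength = 100;              // assumed total size of the chunk's pages

        System.out.println("documented: "
                + documentedFileOffset(firstDataPageOffset, chunkByteLength));
        System.out.println("observed:   " + observedFileOffset(firstDataPageOffset));
    }
}
```

Under these assumptions the two values differ by the full length of the chunk, so a reader that follows the documentation and a writer that behaves as in the snippet above would disagree about where file_offset points.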