Posted to dev@parquet.apache.org by "Zoltan Ivanfi (JIRA)" <ji...@apache.org> on 2017/06/29 16:33:01 UTC

[jira] [Assigned] (PARQUET-291) Difference between parquet-mr implementation and parquet-format documentation

     [ https://issues.apache.org/jira/browse/PARQUET-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Ivanfi reassigned PARQUET-291:
-------------------------------------

    Assignee: Zoltan Ivanfi

> Difference between parquet-mr implementation and parquet-format documentation
> -----------------------------------------------------------------------------
>
>                 Key: PARQUET-291
>                 URL: https://issues.apache.org/jira/browse/PARQUET-291
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format, parquet-mr
>    Affects Versions: 1.6.1
>            Reporter: Konstantin Shaposhnikov
>            Assignee: Zoltan Ivanfi
>
> Documentation at https://github.com/apache/parquet-format/blob/master/src/thrift/parquet.thrift
> {noformat}
> struct ColumnChunk {
>   /** File where column data is stored.  If not set, assumed to be same file as
>     * metadata.  This path is relative to the current file.
>     **/
>   1: optional string file_path
>   /** Byte offset in file_path to the ColumnMetaData **/
>   2: required i64 file_offset
> ...
> {noformat}
> and https://github.com/apache/parquet-format
> {noformat}
> 4-byte magic number "PAR1"
> <Column 1 Chunk 1 + Column Metadata>
> <Column 2 Chunk 1 + Column Metadata>
> ...
> {noformat}
> suggest that each ColumnChunk's data should be followed by its ColumnMetaData.
> However, it looks like parquet-mr doesn't write ColumnMetaData after the column data at all, and instead populates ColumnChunk.file_offset with the offset of the first data page.
> From *ParquetMetadataConverter.java:153*:
> {code}
>     for (ColumnChunkMetaData columnMetaData : columns) {
>       ColumnChunk columnChunk = new ColumnChunk(columnMetaData.getFirstDataPageOffset()); // verify this is the right offset
>       columnChunk.file_path = block.getPath(); // they are in the same file for now
> {code}
> Is this a bug in parquet-mr or in the documentation?
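
The two readings of ColumnChunk.file_offset described above can be contrasted with a small, self-contained sketch. It does not use parquet-mr or the Thrift-generated classes; the class name, method names, and the sample numbers are hypothetical and exist only for illustration. It also assumes that the chunk's pages are contiguous, that no dictionary page precedes the first data page, and that, under the documented layout, ColumnMetaData would be written immediately after the last page.

```java
public class FileOffsetSketch {

    // Reading 1 (documentation): file_offset is the byte offset of the
    // ColumnMetaData, assumed here to sit directly after the chunk's pages.
    static long offsetPerDocumentation(long firstDataPageOffset, long totalCompressedSize) {
        return firstDataPageOffset + totalCompressedSize;
    }

    // Reading 2 (behaviour reported for parquet-mr in the quoted snippet):
    // file_offset is simply the offset of the first data page.
    static long offsetPerParquetMr(long firstDataPageOffset, long totalCompressedSize) {
        return firstDataPageOffset;
    }

    public static void main(String[] args) {
        // Hypothetical chunk: starts right after the 4-byte "PAR1" magic
        // and occupies 1000 bytes of page data.
        long firstDataPageOffset = 4L;
        long totalCompressedSize = 1000L;

        System.out.println("documented layout: file_offset = "
                + offsetPerDocumentation(firstDataPageOffset, totalCompressedSize));
        System.out.println("parquet-mr (as reported): file_offset = "
                + offsetPerParquetMr(firstDataPageOffset, totalCompressedSize));
    }
}
```

Under these assumptions, a reader that follows the documentation and a writer that behaves as reported would disagree about where file_offset points by exactly the size of the chunk's page data.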



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)