Posted to dev@parquet.apache.org by "Zoltan Ivanfi (JIRA)" <ji...@apache.org> on 2017/06/29 16:54:00 UTC

[jira] [Commented] (PARQUET-291) Difference between parquet-mr implementation and parquet-format documentation

    [ https://issues.apache.org/jira/browse/PARQUET-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068604#comment-16068604 ] 

Zoltan Ivanfi commented on PARQUET-291:
---------------------------------------

As we discussed in the Parquet sync-up yesterday, these fields were added for a planned feature that didn't get implemented. We should deprecate these fields. I will update parquet.thrift accordingly.

> Difference between parquet-mr implementation and parquet-format documentation
> -----------------------------------------------------------------------------
>
>                 Key: PARQUET-291
>                 URL: https://issues.apache.org/jira/browse/PARQUET-291
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format, parquet-mr
>    Affects Versions: 1.6.1
>            Reporter: Konstantin Shaposhnikov
>            Assignee: Zoltan Ivanfi
>
> Documentation at https://github.com/apache/parquet-format/blob/master/src/thrift/parquet.thrift
> {noformat}
> struct ColumnChunk {
>   /** File where column data is stored.  If not set, assumed to be same file as
>     * metadata.  This path is relative to the current file.
>     **/
>   1: optional string file_path
>   /** Byte offset in file_path to the ColumnMetaData **/
>   2: required i64 file_offset
> ...
> {noformat}
> and https://github.com/apache/parquet-format
> {noformat}
> 4-byte magic number "PAR1"
> <Column 1 Chunk 1 + Column Metadata>
> <Column 2 Chunk 1 + Column Metadata>
> ...
> {noformat}
> suggests that the data of each ColumnChunk should be followed by its ColumnMetaData.
> However, it looks like parquet-mr doesn't write ColumnMetaData after the column chunks at all and instead populates ColumnChunk.file_offset with the offset of the first data page:
> from *ParquetMetadataConverter.java:153*:
> {code}
>     for (ColumnChunkMetaData columnMetaData : columns) {
>       ColumnChunk columnChunk = new ColumnChunk(columnMetaData.getFirstDataPageOffset()); // verify this is the right offset
>       columnChunk.file_path = block.getPath(); // they are in the same file for now
> {code}
>  Is it a bug in parquet-mr or in the documentation?
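
To make the discrepancy concrete, here is a minimal sketch of the two readings of ColumnChunk.file_offset. It is not parquet-mr code: the class ColumnChunkOffsets, its methods, and the byte counts are hypothetical, and it assumes that the documented layout means ColumnMetaData is serialized directly after the chunk's last page and that the chunk has no dictionary page before its first data page.

```java
// Hypothetical illustration only: none of these names exist in parquet-mr
// or parquet-format.
final class ColumnChunkOffsets {

    /** Length of the leading magic number "PAR1". */
    static final int MAGIC_LENGTH = 4;

    /**
     * Reading suggested by the parquet.thrift comment: file_offset points at
     * the ColumnMetaData, assumed here to start right after the chunk's pages.
     */
    static long documentedFileOffset(long firstDataPageOffset, long totalPageBytes) {
        return firstDataPageOffset + totalPageBytes;
    }

    /**
     * Behaviour of the quoted parquet-mr snippet: file_offset is simply the
     * offset of the first data page, so the page size plays no role.
     */
    static long observedFileOffset(long firstDataPageOffset, long totalPageBytes) {
        return firstDataPageOffset;
    }

    public static void main(String[] args) {
        // Assumption: the first column chunk starts directly after the magic
        // number and its pages occupy 100 bytes.
        long firstDataPageOffset = MAGIC_LENGTH;
        long totalPageBytes = 100;

        System.out.println("documented file_offset: "
                + documentedFileOffset(firstDataPageOffset, totalPageBytes));
        System.out.println("observed file_offset:   "
                + observedFileOffset(firstDataPageOffset, totalPageBytes));
    }
}
```

Under these assumptions the two values differ by exactly the size of the chunk's pages, so a reader that followed the documented meaning would seek past the data it was looking for.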


