Posted to dev@parquet.apache.org by "Timothy Miller (Jira)" <ji...@apache.org> on 2022/04/20 20:38:00 UTC

[jira] [Comment Edited] (PARQUET-2139) Bogus file offset for ColumnMetaData written to ColumnChunk metadata of single parquet files

    [ https://issues.apache.org/jira/browse/PARQUET-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525272#comment-17525272 ] 

Timothy Miller edited comment on PARQUET-2139 at 4/20/22 8:37 PM:
------------------------------------------------------------------

I just noticed that the file_offset field in ColumnChunk is "required." So there are a few possible mistakes:
 # The description of the metadata structures is wrong and this really is supposed to be a pointer to the data page (PageHeader), or
 # There's a completely separate bug where the parquet writer fails to store an extra copy of the ColumnMetaData in the file right before the PageHeader, or
 # The offset should point to where the unique copy of ColumnMetaData is already going to be found in the file footer, although that seems like it would be really hard to calculate.

In any case, there's an inconsistency: the metadata definition specifies an offset to ColumnMetaData, but a PageHeader is placed at that offset instead.

I'm going to go check out the reader and see what it does with this field. My guess is that it doesn't use the field at all, which is why this discrepancy is never a problem.
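
To make the three possibilities concrete, here is a minimal Python sketch that classifies where a ColumnChunk's file_offset lands relative to the other offsets recorded in the footer. The ColumnChunkInfo class, the classify_file_offset function, and the footer_start parameter are hypothetical names made up for illustration; only the field names file_offset, data_page_offset, and dictionary_page_offset come from parquet.thrift. It says nothing about what the parquet-mr reader or writer actually does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnChunkInfo:
    """Hypothetical container for the offsets discussed above."""
    file_offset: int                               # ColumnChunk field 2
    data_page_offset: int                          # ColumnMetaData field 9
    dictionary_page_offset: Optional[int] = None   # ColumnMetaData field 11


def classify_file_offset(chunk: ColumnChunkInfo, footer_start: int) -> str:
    """Say which structure file_offset appears to point at.

    footer_start is an assumed input: the byte offset at which the
    serialized FileMetaData begins.
    """
    page_offsets = {chunk.data_page_offset}
    if chunk.dictionary_page_offset is not None:
        page_offsets.add(chunk.dictionary_page_offset)

    if chunk.file_offset in page_offsets:
        return "page header"   # possibility 1: it points at a page
    if chunk.file_offset >= footer_start:
        return "footer"        # possibility 3: it points into the footer
    return "unknown"           # could be a separate copy (possibility 2)


# file_offset and data_page_offset are taken from the dump in the issue
# description; footer_start=1000 is an arbitrary illustrative value.
print(classify_file_offset(ColumnChunkInfo(4, 4), footer_start=1000))
```

With the values from the issue, the sketch reports "page header", which matches the observation that a PageHeader sits at offset 4.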



> Bogus file offset for ColumnMetaData written to ColumnChunk metadata of single parquet files
> --------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2139
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2139
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: Timothy Miller
>            Priority: Major
>
> In an effort to understand the parquet format better, I've so far written my own Thrift parser, and upon examining the output, I noticed something peculiar.
> To begin with, check out the definition for ColumnChunk here: [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift]
> You'll notice that element 2 of the struct, if present, is supposed to be a file offset to a redundant copy of the ColumnMetaData.
> Next, have a look at the file called "modified.parquet" attached to https://issues.apache.org/jira/browse/PARQUET-2069. When I dump the metadata at the end of the file, I get this:
> Struct(FileMetaData):
>      1: i32(version) = I32(1)
>      2: List(SchemaElement schema):
>           ...
>      3: i64(num_rows) = I64(1)
>      4: List(RowGroup row_groups):
>         1: Struct(RowGroup row_groups):
>            1: List(ColumnChunk columns):
>               1: Struct(ColumnChunk columns):
>                  2: i64(file_offset) = I64(4)
>                  3: Struct(ColumnMetaData meta_data):
>                     1: Type(type) = I32(6) = BYTE_ARRAY
>                     2: List(Encoding encodings):
>                        1: Encoding(encodings) = I32(0) = PLAIN
>                        2: Encoding(encodings) = I32(3) = RLE
>                     3: List(string path_in_schema):
>                        1: string(path_in_schema) = Binary("destination_addresses")
>                        2: string(path_in_schema) = Binary("array")
>                        3: string(path_in_schema) = Binary("element")
>                     4: CompressionCodec(codec) = I32(0) = UNCOMPRESSED
>                     5: i64(num_values) = I64(6)
>                     6: i64(total_uncompressed_size) = I64(197)
>                     7: i64(total_compressed_size) = I64(197)
>                     9: i64(data_page_offset) = I64(4)
> As you can see, element 2 of the ColumnChunk indicates that there is another copy of the ColumnMetaData at offset 4 of the file. But then we see that element 9 of the ColumnMetaData shown above indicates that the data page offset is ALSO 4, where we should find a Thrift encoding of a PageHeader structure. Obviously, both structures can't be in the same place, and in fact a PageHeader is what is located there.
> Based on what I'm seeing here, I believe that element 2 of ColumnChunk should be omitted entirely in this scenario, so as not to falsely indicate that there is another copy of the ColumnMetaData at this location in the file when in fact something else is present.
> It may take me a while to locate the offending code, but I thought I'd go ahead and point this out before I set off to investigate.
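
The overlap described in the issue can be checked with a few lines of arithmetic. The sketch below uses the values from the dump (file_offset = 4, data_page_offset = 4, total_compressed_size = 197) together with the fact that a Parquet file begins with the 4-byte magic "PAR1". The function names are hypothetical, and the sketch assumes the chunk has no dictionary page, which is consistent with the dump listing only PLAIN and RLE encodings and no dictionary_page_offset.

```python
MAGIC = b"PAR1"  # 4-byte magic at the start of every Parquet file


def chunk_byte_range(data_page_offset: int, total_compressed_size: int):
    """Return the half-open range [start, end) covered by the chunk's pages.

    Assumes no dictionary page, so the chunk starts at its first data page.
    total_compressed_size includes the page headers.
    """
    return data_page_offset, data_page_offset + total_compressed_size


def offset_collides_with_pages(file_offset: int, byte_range) -> bool:
    """True if file_offset falls inside the bytes occupied by the pages."""
    start, end = byte_range
    return start <= file_offset < end


pages = chunk_byte_range(data_page_offset=4, total_compressed_size=197)
print(pages)                                  # (4, 201)
print(pages[0] == len(MAGIC))                 # True: pages start right after the magic
print(offset_collides_with_pages(4, pages))   # True: file_offset lands on page data
```

Under these assumptions, bytes 4 through 200 belong to the column chunk's pages, so a file_offset of 4 cannot also be the start of a separate ColumnMetaData structure.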


