Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2020/11/10 18:59:00 UTC

[jira] [Resolved] (ARROW-10353) [C++] Parquet decompresses DataPageV2 pages even if is_compressed==0

     [ https://issues.apache.org/jira/browse/ARROW-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-10353.
------------------------------------
    Resolution: Fixed

Issue resolved by pull request 8629
[https://github.com/apache/arrow/pull/8629]

> [C++] Parquet decompresses DataPageV2 pages even if is_compressed==0
> --------------------------------------------------------------------
>
>                 Key: ARROW-10353
>                 URL: https://issues.apache.org/jira/browse/ARROW-10353
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Jan Finis
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> According to the parquet-format specification, DataPageV2 pages have an is_compressed flag. Even if the column chunk has a compression codec set, a page is compressed only if this flag is true (this presumably allows a writer to leave individual pages uncompressed when compression would not save space).
> Here is the relevant excerpt from parquet.thrift describing the semantics of the is_compressed flag in a DataPageV2:
> /** whether the values are compressed.
>  Which means the section of the page between
>  definition_levels_byte_length + repetition_levels_byte_length + 1 and compressed_page_size (included)
>  is compressed with the compression_codec.
>  If missing it is considered compressed */
>  7: optional bool is_compressed = 1;
>  
> It seems that the Apache Parquet C++ library (I haven't checked the implementations in other languages, which might have the same bug) disregards this flag entirely and decompresses the page in all cases if a decompressor is set for the column chunk.
> The erroneous code is in column_reader.cc: 
> std::shared_ptr<Page> SerializedPageReader::NextPage() 
> This method first decompresses the page if a decompressor is set, and only afterwards distinguishes whether the page is a DataPageV2 with an is_compressed flag. Thus, even if the flag is set to 0, the page is decompressed anyway.
> The method that should use the is_compressed flag but doesn't is:
> std::shared_ptr<Buffer> SerializedPageReader::DecompressPage
> This method doesn't look at the is_compressed flag at all.
>  
> The reason this bug probably doesn't show up in any unit test is that the write implementation seems to make the same mistake: it always compresses the page, even if the page has its is_compressed flag set to false.
>  
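
The read-side behaviour the description asks for can be sketched as follows. This is a minimal, self-contained illustration and not the code from the Arrow pull request: DataPageHeaderV2Sketch, Decompressor and ReadDataPageV2Body are hypothetical names invented for this example, and the codec is modelled as a plain callback. It assumes, following the quoted parquet.thrift comment, that the level bytes at the front of the page are stored uncompressed and only the values section after them is passed to the codec.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified stand-in for the DataPageV2 header fields that
// matter here; the real header type is generated from parquet.thrift.
struct DataPageHeaderV2Sketch {
  int32_t definition_levels_byte_length = 0;
  int32_t repetition_levels_byte_length = 0;
  bool is_compressed = true;  // per the spec, a missing flag means "compressed"
};

using Bytes = std::vector<uint8_t>;

// Hypothetical callback standing in for the column chunk's codec.
// An empty std::function means the column chunk is uncompressed.
using Decompressor = std::function<Bytes(const uint8_t* data, size_t size)>;

// Returns the page body with the values section decompressed only when the
// column chunk has a codec AND the page header marks the page as compressed.
Bytes ReadDataPageV2Body(const DataPageHeaderV2Sketch& header, const Bytes& page,
                         const Decompressor& decompressor) {
  if (header.definition_levels_byte_length < 0 ||
      header.repetition_levels_byte_length < 0) {
    throw std::invalid_argument("negative level length in page header");
  }
  const int64_t levels_length =
      static_cast<int64_t>(header.definition_levels_byte_length) +
      header.repetition_levels_byte_length;
  if (static_cast<uint64_t>(levels_length) > page.size()) {
    throw std::invalid_argument("level lengths exceed page size");
  }

  // The check the issue says is missing: honour is_compressed before
  // touching the codec.
  if (!decompressor || !header.is_compressed) {
    return page;  // page is stored as-is; nothing to decompress
  }

  // Copy the level bytes through unchanged, then decompress only the values.
  Bytes out(page.begin(), page.begin() + levels_length);
  const Bytes values = decompressor(page.data() + levels_length,
                                    page.size() - static_cast<size_t>(levels_length));
  out.insert(out.end(), values.begin(), values.end());
  return out;
}
```

With this shape, a page whose header has is_compressed set to false is returned byte-for-byte even when the column chunk declares a codec, which is the case the description says the reader gets wrong.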


