You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2020/10/20 14:17:00 UTC

[jira] [Commented] (ARROW-10353) Arrow Parquet Cpp decompresses DataPageV2 pages even if is_compressed==0

    [ https://issues.apache.org/jira/browse/ARROW-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217628#comment-17217628 ] 

Antoine Pitrou commented on ARROW-10353:
----------------------------------------

Thanks for the report. Do you want to submit a pR?

> Arrow Parquet Cpp decompresses DataPageV2 pages even if is_compressed==0
> ------------------------------------------------------------------------
>
>                 Key: ARROW-10353
>                 URL: https://issues.apache.org/jira/browse/ARROW-10353
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Jan Finis
>            Priority: Major
>             Fix For: 3.0.0
>
>
> According to the parquet-format specification, DataPageV2 pages have an is_compressed flag. Even if the column chunk has a decompression codec set, the page is only compressed if this flag is true (this likely enables not compressing some pages where the compression wouldn't save memory).
> Here is the relevant excerpt from parquet.thrift describing the semantics of the is_compressed flag in a DataPageV2:
> _/** whether the values are compressed._
>  _Which means the section of the page between_
>  _definition_levels_byte_length + repetition_levels_byte_length + 1 and compressed_page_size (included)_
>  _is compressed with the compression_codec._
>  _If missing it is considered compressed */_
>  _7: optional bool is_compressed = 1;_
>  
> It seems that the apache parquet cpp library (haven't checked other languages but might have the bug as well) totally disregards this flag and decompresses the page in all cases if a decompressor is set for the column chunk.
> The erroneous code is in column_reader.cc: 
> std::shared_ptr<Page> SerializedPageReader::NextPage() 
> This method first decompresses the page if there is a decompressor set and only then does a case distinction on whether this page is a DataPageV2 and has the is_compressed flag. Thus, even if the page would have this flag set to 0, the page would be decompressed anyway.
> The method that should use the is_compressed flag but doesn't is:
> std::shared_ptr<Buffer> SerializedPageReader::DecompressPage
> This method doesn't look at the is_compressed flag at all.
>  
> The reason why this bug probably doesn't show in any unit test is that the write implementation seems to do the same mistake: It always compresses the page, even if the page has its is_compressed flag set to false.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)