You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2020/11/10 16:14:00 UTC
[jira] [Assigned] (ARROW-10353) [C++] Parquet decompresses
DataPageV2 pages even if is_compressed==0
[ https://issues.apache.org/jira/browse/ARROW-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou reassigned ARROW-10353:
--------------------------------------
Assignee: Antoine Pitrou
> [C++] Parquet decompresses DataPageV2 pages even if is_compressed==0
> --------------------------------------------------------------------
>
> Key: ARROW-10353
> URL: https://issues.apache.org/jira/browse/ARROW-10353
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Jan Finis
> Assignee: Antoine Pitrou
> Priority: Major
> Fix For: 3.0.0
>
>
> According to the parquet-format specification, DataPageV2 pages have an is_compressed flag. Even if the column chunk has a decompression codec set, the page is only compressed if this flag is true (this likely enables not compressing some pages where the compression wouldn't save memory).
> Here is the relevant excerpt from parquet.thrift describing the semantics of the is_compressed flag in a DataPageV2:
> _/** whether the values are compressed._
> _Which means the section of the page between_
> _definition_levels_byte_length + repetition_levels_byte_length + 1 and compressed_page_size (included)_
> _is compressed with the compression_codec._
> _If missing it is considered compressed */_
> _7: optional bool is_compressed = 1;_
>
> It seems that the apache parquet cpp library (haven't checked other languages but might have the bug as well) totally disregards this flag and decompresses the page in all cases if a decompressor is set for the column chunk.
> The erroneous code is in column_reader.cc:
> std::shared_ptr<Page> SerializedPageReader::NextPage()
> This method first decompresses the page if there is a decompressor set and only then does a case distinction on whether this page is a DataPageV2 and has the is_compressed flag. Thus, even if the page would have this flag set to 0, the page would be decompressed anyway.
> The method that should use the is_compressed flag but doesn't is:
> std::shared_ptr<Buffer> SerializedPageReader::DecompressPage
> This method doesn't look at the is_compressed flag at all.
>
> The reason why this bug probably doesn't show in any unit test is that the write implementation seems to do the same mistake: It always compresses the page, even if the page has its is_compressed flag set to false.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)