Posted to issues@arrow.apache.org by "Jan Finis (Jira)" <ji...@apache.org> on 2020/10/20 11:34:00 UTC

[jira] [Created] (ARROW-10353) Arrow Parquet Cpp decompresses DataPageV2 pages even if is_compressed==0

Jan Finis created ARROW-10353:
---------------------------------

             Summary: Arrow Parquet Cpp decompresses DataPageV2 pages even if is_compressed==0
                 Key: ARROW-10353
                 URL: https://issues.apache.org/jira/browse/ARROW-10353
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Jan Finis


According to the parquet-format specification, DataPageV2 pages have an is_compressed flag. Even if the column chunk has a compression codec set, a page is only compressed if this flag is true (this likely allows a writer to leave individual pages uncompressed when compression would not save space).

Here is the relevant excerpt from parquet.thrift describing the semantics of the is_compressed flag in a DataPageV2:

  /** whether the values are compressed.
  Which means the section of the page between
  definition_levels_byte_length + repetition_levels_byte_length + 1 and compressed_page_size (included)
  is compressed with the compression_codec.
  If missing it is considered compressed */
  7: optional bool is_compressed = 1;

 

It seems that the Apache Parquet C++ library (I haven't checked the implementations in other languages, but they might have the same bug) completely disregards this flag and decompresses the page in all cases if a decompressor is set for the column chunk.

The erroneous code is in column_reader.cc: 

std::shared_ptr<Page> SerializedPageReader::NextPage() 


This method first decompresses the page if a decompressor is set, and only afterwards distinguishes whether the page is a DataPageV2 and reads its is_compressed flag. Thus, even if the flag is set to 0, the page is decompressed anyway.
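
The intended order of checks could be sketched roughly as follows. This is not the actual Arrow code: PageType, PageHeader, Buffer, Decompressor, NeedsDecompression and ReadPayload are hypothetical, simplified stand-ins chosen for illustration. The sketch also treats the whole payload as one unit, whereas in a real DataPageV2 only the values section after the repetition and definition levels is compressed.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical, simplified stand-ins; NOT the real Arrow/Parquet types.
enum class PageType { DATA_PAGE, DATA_PAGE_V2, DICTIONARY_PAGE };

struct PageHeader {
  PageType type;
  // Mirrors the optional thrift field: if missing, the page counts as compressed.
  std::optional<bool> is_compressed;
};

using Buffer = std::vector<uint8_t>;
using Decompressor = std::function<Buffer(const Buffer&)>;

// Decide whether the payload has to go through the codec.
// The flag is consulted BEFORE any decompression takes place.
inline bool NeedsDecompression(const PageHeader& header, bool has_decompressor) {
  if (!has_decompressor) return false;  // column chunk is uncompressed
  if (header.type == PageType::DATA_PAGE_V2) {
    // Only DataPageV2 carries the flag; absent means compressed.
    return header.is_compressed.value_or(true);
  }
  return true;  // other page types are always compressed when a codec is set
}

// Return the page payload, decompressing only when required.
inline Buffer ReadPayload(const PageHeader& header, const Buffer& raw,
                          const Decompressor& decompressor) {
  if (NeedsDecompression(header, static_cast<bool>(decompressor))) {
    return decompressor(raw);
  }
  return raw;  // is_compressed == false: pass the bytes through untouched
}
```

With this ordering, a DataPageV2 whose is_compressed flag is false is passed through unchanged even when the column chunk has a codec set, while a page with the flag missing is still treated as compressed.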


--
This message was sent by Atlassian Jira
(v8.3.4#803005)