Posted to dev@arrow.apache.org by "TP Boudreau (Jira)" <ji...@apache.org> on 2020/03/16 04:57:00 UTC

[jira] [Created] (ARROW-8127) [C++] [Parquet] Incorrect column chunk metadata for multipage batch writes

TP Boudreau created ARROW-8127:
----------------------------------

             Summary: [C++] [Parquet] Incorrect column chunk metadata for multipage batch writes
                 Key: ARROW-8127
                 URL: https://issues.apache.org/jira/browse/ARROW-8127
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: TP Boudreau
            Assignee: TP Boudreau
         Attachments: multipage-batch-write.cc

When writing through a buffered column writer with PLAIN encoding, if the size of the batch supplied for writing exceeds the writer's page size, the resulting file has an incorrect data_page_offset in its column chunk metadata. This causes an exception to be thrown when the file is read back (the file appears to the reader to be too short).

For example, the attached code writes a single batch of 262145 Int32 values (1048580 bytes, i.e. 4 bytes more than the default page size of 1048576 bytes) through a buffered writer with PLAIN encoding. Reading the resulting file fails with the error: "Tried reading 1048678 bytes starting at position 1048633 from file but only got 333".
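
The attachment is not inlined in this archived message; as a rough illustration only (not the attached file itself), a minimal write along these lines exercises the scenario described, assuming the parquet-cpp writer API of that era. The file name, column name, and values are arbitrary.

// Sketch of a reproduction; names and values are illustrative.
#include <memory>
#include <vector>

#include <arrow/io/file.h>
#include <parquet/api/writer.h>
#include <parquet/exception.h>

int main() {
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  std::shared_ptr<arrow::io::FileOutputStream> sink;
  PARQUET_ASSIGN_OR_THROW(sink, arrow::io::FileOutputStream::Open("multipage.parquet"));

  // Single required INT32 column.
  parquet::schema::NodeVector fields;
  fields.push_back(PrimitiveNode::Make("f0", parquet::Repetition::REQUIRED,
                                       parquet::Type::INT32));
  auto schema = std::static_pointer_cast<GroupNode>(
      GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

  // PLAIN encoding, dictionary disabled, default 1048576-byte page size.
  parquet::WriterProperties::Builder builder;
  builder.disable_dictionary()->encoding(parquet::Encoding::PLAIN);
  auto writer = parquet::ParquetFileWriter::Open(sink, schema, builder.build());

  // Buffered row group writer -- the code path described above.
  parquet::RowGroupWriter* rg = writer->AppendBufferedRowGroup();
  auto* col = static_cast<parquet::Int32Writer*>(rg->column(0));

  // 262145 values * 4 bytes = 1048580 bytes, 4 bytes over the default page
  // size, so the column spills onto a second data page.
  std::vector<int32_t> values(262145, 1);
  col->WriteBatch(static_cast<int64_t>(values.size()), nullptr, nullptr, values.data());

  writer->Close();
  return 0;
}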

The error is caused by the second page write tripping the conditional at https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L302 in the serialized in-memory writer that the buffered writer wraps.

The fix builds the metadata with offsets taken from the terminal sink rather than the in-memory buffered sink. A PR is coming.
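
For anyone wanting to observe the bad offset directly (or to spot-check the fix once merged), the column chunk metadata can be inspected after writing. A small sketch, assuming the "multipage.parquet" file produced by the sketch above; the footer itself reads fine, only the data pages are misreferenced:

#include <iostream>
#include <parquet/api/reader.h>

int main() {
  // Open the file written above and look at the first column chunk's metadata.
  auto reader = parquet::ParquetFileReader::OpenFile("multipage.parquet");
  auto metadata = reader->metadata();
  auto rg_meta = metadata->RowGroup(0);
  auto cc_meta = rg_meta->ColumnChunk(0);

  // With the bug, data_page_offset points past where the pages actually start,
  // so offset + total_compressed_size can exceed the real file length.
  std::cout << "data_page_offset:      " << cc_meta->data_page_offset() << "\n";
  std::cout << "total_compressed_size: " << cc_meta->total_compressed_size() << "\n";
  return 0;
}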



--
This message was sent by Atlassian Jira
(v8.3.4#803005)