Posted to issues@arrow.apache.org by "Andrew Lamb (Jira)" <ji...@apache.org> on 2020/08/18 22:25:00 UTC

[jira] [Created] (ARROW-9790) [Rust] [Parquet] ParquetFileArrowReader fails to decode all pages if batches fall exactly on row group boundaries

Andrew Lamb created ARROW-9790:
----------------------------------

             Summary: [Rust] [Parquet] ParquetFileArrowReader fails to decode all pages if batches fall exactly on row group boundaries
                 Key: ARROW-9790
                 URL: https://issues.apache.org/jira/browse/ARROW-9790
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Andrew Lamb
            Assignee: Andrew Lamb
         Attachments: parquet_file_arrow_reader.zip

If the Parquet input file has multiple row groups and the {{batch_size}} passed to {{ParquetFileArrowReader}} makes a batch end exactly on a boundary between two row groups, not all rows are read.

{code}
+-----+
| RG1 |
|     |
+-----+  <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
+-----+
| RG2 |
|     |
+-----+
{code}

A reproducer is attached. The {{ParquetFileArrowReader}} should read 20 values regardless of the batch size. However, with batch sizes such as {{5}} or {{3}}, where a batch ends exactly on a boundary between row groups, not all the rows are read.
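
To make the boundary condition concrete, here is a minimal sketch (not the attached reproducer, and not using the parquet crate) that checks whether any batch ends exactly on an interior row group boundary. It assumes the 20 rows are split evenly into 4 row groups of 5 rows each, which the report does not state explicitly; the function name {{first_boundary_hit}} is hypothetical.

```rust
/// Returns the first row offset at which a batch ends exactly on an
/// interior row group boundary, or `None` if no batch does.
///
/// Assumption (not from the report): all row groups hold the same
/// number of rows, `rows_per_group`.
fn first_boundary_hit(
    total_rows: usize,
    rows_per_group: usize,
    batch_size: usize,
) -> Option<usize> {
    if rows_per_group == 0 || batch_size == 0 {
        return None;
    }
    (1..)
        .map(|n| n * batch_size)
        // Only batch ends strictly before the end of the file count;
        // the end of the last row group is not an interior boundary.
        .take_while(|&end| end < total_rows)
        .find(|&end| end % rows_per_group == 0)
}

fn main() {
    // 20 rows, assumed to be 4 row groups of 5 rows each.
    for &batch_size in &[100, 7, 5, 3] {
        println!(
            "batch_size {} : {:?}",
            batch_size,
            first_boundary_hit(20, 5, batch_size)
        );
    }
}
```

Under that assumption, batch sizes 100 and 7 never end on an interior boundary, while 5 does so at row 5 and 3 does so at row 15. This only models where batches end; it says nothing about how the reader behaves internally.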

To run the reproducer, decompress the attachment [^parquet_file_arrow_reader.zip] and run {{cargo run}}.

The actual output is as follows:

{code}
wrote 20 rows in 4 row groups to /tmp/repro.parquet
Size when reading with batch_size 100 : 20
Size when reading with batch_size 7 : 20
Size when reading with batch_size 5 : 5
{code}

The expected output is as follows (should always read 20 rows, regardless of the batch size):
{code}
wrote 20 rows in 4 row groups to /tmp/repro.parquet
Size when reading with batch_size 100 : 20
Size when reading with batch_size 7 : 20
Size when reading with batch_size 5 : 20
{code}
