Posted to jira@arrow.apache.org by "Andrew Lamb (Jira)" <ji...@apache.org> on 2022/07/08 18:39:00 UTC
[jira] [Commented] (ARROW-9790) [Rust] [Parquet] ParquetFileArrowReader fails to decode all pages if batches fall exactly on row group boundaries
[ https://issues.apache.org/jira/browse/ARROW-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564399#comment-17564399 ]
Andrew Lamb commented on ARROW-9790:
------------------------------------
See also https://github.com/apache/arrow-rs/issues/2025
> [Rust] [Parquet] ParquetFileArrowReader fails to decode all pages if batches fall exactly on row group boundaries
> -----------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-9790
> URL: https://issues.apache.org/jira/browse/ARROW-9790
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust
> Reporter: Andrew Lamb
> Assignee: Andrew Lamb
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
>
> Attachments: parquet_file_arrow_reader.zip
>
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> While reading a parquet file with row groups of 100,000 rows into RecordBatches using {{ParquetFileArrowReader}} with a batch size of 60,000, I started seeing this error after 300,000 rows had been read successfully:
> {code}
> ParquetError("Parquet error: Not all children array length are the same!")
> {code}
> Upon investigation, I found that when reading with {{ParquetFileArrowReader}}, if the parquet input file has multiple row groups and a batch happens to end exactly at the end of a row group (for Int or Float columns), no subsequent row groups are read.
> Visually:
> {code}
> +-----+
> | RG1 |
> |     |
> +-----+ <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
> +-----+
> | RG2 |
> |     |
> +-----+
> {code}
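> The failure mode above can be sketched in a few lines of plain Rust. This is a simplified, self-contained model of a batched reader (hypothetical names; it is *not* the actual {{ParquetFileArrowReader}} code): four row groups of 5 rows each, and a buggy variant that treats a batch ending exactly on a row-group boundary as end-of-file instead of advancing to the next group:
> {code}
> // Simulate reading 4 row groups of 5 rows each in batches of `batch_size`.
> // If `buggy` is true, a batch that ends exactly on a row group boundary
> // terminates the read early, mirroring the reported behavior.
> fn read_all(batch_size: usize, buggy: bool) -> usize {
>     let row_groups = [5usize, 5, 5, 5]; // 20 rows total
>     let (mut group, mut pos, mut total) = (0, 0, 0);
>     loop {
>         let mut batch = 0;
>         while batch < batch_size && group < row_groups.len() {
>             // take as many rows as the current group and batch allow
>             let take = (row_groups[group] - pos).min(batch_size - batch);
>             batch += take;
>             pos += take;
>             if pos == row_groups[group] {
>                 // buggy: batch boundary == group boundary looks like EOF
>                 if buggy && batch == batch_size {
>                     return total + batch;
>                 }
>                 group += 1; // correct: advance to the next row group
>                 pos = 0;
>             }
>         }
>         total += batch;
>         if batch == 0 {
>             return total;
>         }
>     }
> }
> {code}
> With {{buggy = true}} this model reproduces the observed counts: batch sizes 100 and 7 return 20 rows, while batch size 5 returns only 5 because the very first batch ends exactly at the first row group boundary.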
> A reproducer is attached. 20 values should be read by the {{ParquetFileArrowReader}} regardless of the batch size. However, when using batch sizes such as {{5}} or {{3}} (which fall on a boundary between row groups) not all the rows are read.
> To run the reproducer, decompress the attachment [^parquet_file_arrow_reader.zip] and do {{cargo run}}
> The output is as follows:
> {code}
> wrote 20 rows in 4 row groups to /tmp/repro.parquet
> Size when reading with batch_size 100 : 20
> Size when reading with batch_size 7 : 20
> Size when reading with batch_size 5 : 5
> {code}
> The expected output is as follows (should always read 20 rows, regardless of the batch size):
> {code}
> wrote 20 rows in 4 row groups to /tmp/repro.parquet
> Size when reading with batch_size 100 : 20
> Size when reading with batch_size 7 : 20
> Size when reading with batch_size 5 : 20
> {code}
> h2. Workaround
> Use a batch size that does not fall exactly on row group boundaries
--
This message was sent by Atlassian Jira
(v8.20.10#820010)