Posted to jira@arrow.apache.org by "Andrew Lamb (Jira)" <ji...@apache.org> on 2022/07/08 18:39:00 UTC
[jira] [Commented] (ARROW-9790) [Rust] [Parquet] ParquetFileArrowReader fails to decode all pages if batches fall exactly on row group boundaries
[ https://issues.apache.org/jira/browse/ARROW-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564399#comment-17564399 ]
Andrew Lamb commented on ARROW-9790:
------------------------------------
See also https://github.com/apache/arrow-rs/issues/2025
> [Rust] [Parquet] ParquetFileArrowReader fails to decode all pages if batches fall exactly on row group boundaries
> -----------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-9790
> URL: https://issues.apache.org/jira/browse/ARROW-9790
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust
> Reporter: Andrew Lamb
> Assignee: Andrew Lamb
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
>
> Attachments: parquet_file_arrow_reader.zip
>
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> While reading a parquet file with row groups of 100,000 rows into RecordBatches using {{ParquetFileArrowReader}} with a batch size of 60,000, I started seeing this error after 300,000 rows had been read successfully:
> {code}
> ParquetError("Parquet error: Not all children array length are the same!")
> {code}
> Upon investigation, I found that when reading with {{ParquetFileArrowReader}}, if the parquet input file has multiple row groups and a batch happens to end exactly at the end of a row group (for Int or Float columns), no subsequent row groups are read.
> Visually:
> {code}
> +-----+
> | RG1 |
> |     |
> +-----+ <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
> +-----+
> | RG2 |
> |     |
> +-----+
> {code}
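> The failure mode above can be sketched in a few lines of plain Rust. This is a simplified, self-contained model of a batched reader (hypothetical names; it is *not* the actual {{ParquetFileArrowReader}} code): four row groups of 5 rows each, and a buggy variant that treats a batch ending exactly on a row-group boundary as end-of-file instead of advancing to the next group:
> {code}
> // Simulate reading 4 row groups of 5 rows each in batches of `batch_size`.
> // If `buggy` is true, a batch that ends exactly on a row group boundary
> // terminates the read early, mirroring the reported behavior.
> fn read_all(batch_size: usize, buggy: bool) -> usize {
>     let row_groups = [5usize, 5, 5, 5]; // 20 rows total
>     let (mut group, mut pos, mut total) = (0, 0, 0);
>     loop {
>         let mut batch = 0;
>         while batch < batch_size && group < row_groups.len() {
>             // take as many rows as the current group and batch allow
>             let take = (row_groups[group] - pos).min(batch_size - batch);
>             batch += take;
>             pos += take;
>             if pos == row_groups[group] {
>                 // buggy: batch boundary == group boundary looks like EOF
>                 if buggy && batch == batch_size {
>                     return total + batch;
>                 }
>                 group += 1; // correct: advance to the next row group
>                 pos = 0;
>             }
>         }
>         total += batch;
>         if batch == 0 {
>             return total;
>         }
>     }
> }
> {code}
> With {{buggy = true}} this model reproduces the observed counts: batch sizes 100 and 7 return 20 rows, while batch size 5 returns only 5 because the very first batch ends exactly at the first row group boundary.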
> A reproducer is attached. 20 values should be read by the {{ParquetFileArrowReader}} regardless of the batch size. However, when using batch sizes such as {{5}} or {{3}} (which fall on a boundary between row groups) not all the rows are read.
> To run the reproducer, decompress the attachment [^parquet_file_arrow_reader.zip] and do {{cargo run}}
> The output is as follows:
> {code}
> wrote 20 rows in 4 row groups to /tmp/repro.parquet
> Size when reading with batch_size 100 : 20
> Size when reading with batch_size 7 : 20
> Size when reading with batch_size 5 : 5
> {code}
> The expected output is as follows (should always read 20 rows, regardless of the batch size):
> {code}
> wrote 20 rows in 4 row groups to /tmp/repro.parquet
> Size when reading with batch_size 100 : 20
> Size when reading with batch_size 7 : 20
> Size when reading with batch_size 5 : 20
> {code}
> h2. Workaround
> Use a batch size that does not fall exactly on row group boundaries
--
This message was sent by Atlassian Jira
(v8.20.10#820010)