Posted to jira@arrow.apache.org by "Tim Wilson (Jira)" <ji...@apache.org> on 2022/07/07 18:35:00 UTC

[jira] [Created] (ARROW-17007) [Rust][Parquet] array reader for list columns fails to decode if batches fall on row group boundaries

Tim Wilson created ARROW-17007:
----------------------------------

             Summary: [Rust][Parquet] array reader for list columns fails to decode if batches fall on row group boundaries
                 Key: ARROW-17007
                 URL: https://issues.apache.org/jira/browse/ARROW-17007
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet, Rust
            Reporter: Tim Wilson


This appears to be a variant of ARROW-9790, but specific to list columns. It affects the latest released version (17.0.0) of the Rust `arrow` and `parquet` crates.
{code:rust}
use arrow::array::{Int32Builder, ListBuilder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::{ArrowReader, ArrowWriter, ParquetFileArrowReader};
use parquet::file::properties::WriterProperties;
use parquet::file::reader::SerializedFileReader;
use std::error::Error;
use std::sync::Arc;
use tempfile::NamedTempFile;

fn main() -> Result<(), Box<dyn Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("int", DataType::Int32, false),
        Field::new(
            "list",
            DataType::List(Box::new(Field::new("item", DataType::Int32, true))),
            false,
        ),
    ]));
    let temp_file = NamedTempFile::new()?;

    // Force small row groups so the 20 rows written below span
    // multiple row group boundaries.
    let mut writer = ArrowWriter::try_new(
        temp_file.reopen()?,
        schema.clone(),
        Some(
            WriterProperties::builder()
                .set_max_row_group_size(8)
                .build(),
        ),
    )?;
    // Write two batches of 10 rows each (20 rows total).
    for _ in 0..2 {
        let mut int_builder = Int32Builder::new(10);
        let mut list_builder = ListBuilder::new(Int32Builder::new(10));
        for i in 0..10 {
            int_builder.append_value(i)?;
            list_builder.append(true)?;
        }
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![
                Arc::new(int_builder.finish()),
                Arc::new(list_builder.finish()),
            ],
        )?;
        writer.write(&batch)?;
    }
    writer.close()?;

    let file_reader = Arc::new(SerializedFileReader::new(temp_file.reopen()?)?);
    let mut file_reader = ParquetFileArrowReader::new(file_reader);
    // Read back with a batch size of 8, matching the row group size,
    // so every batch boundary falls on a row group boundary.
    let mut record_reader = file_reader.get_record_reader(8)?;
    assert_eq!(8, record_reader.next().unwrap()?.num_rows());
    assert_eq!(8, record_reader.next().unwrap()?.num_rows());
    assert_eq!(4, record_reader.next().unwrap()?.num_rows());
    Ok(())
}
{code}
This fails with `Error: ParquetError("Parquet error: Not all children array length are the same!")`.
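To see why the reader batches land exactly on row group boundaries here, consider how the writer chunks rows. The sketch below is illustrative arithmetic only (not the `parquet` crate's internal code, and the helper `row_group_sizes` is a hypothetical name): assuming the writer flushes a row group whenever `max_row_group_size` rows have accumulated and flushes the remainder on close, two batches of 10 rows with a limit of 8 yield row groups of 8, 8, and 4 rows, which the reader's batch size of 8 then matches exactly.
{code:rust}
// Illustrative sketch of the writer's chunking, assuming row groups are
// flushed whenever `max_row_group_size` rows have accumulated, and any
// remainder is flushed on close(). Not the parquet crate's actual logic.
fn row_group_sizes(batch_sizes: &[usize], max_row_group_size: usize) -> Vec<usize> {
    let mut groups = Vec::new();
    let mut buffered = 0;
    for &batch in batch_sizes {
        buffered += batch;
        // Flush a full row group each time the buffer reaches the limit.
        while buffered >= max_row_group_size {
            groups.push(max_row_group_size);
            buffered -= max_row_group_size;
        }
    }
    if buffered > 0 {
        groups.push(buffered); // remaining rows flushed on close()
    }
    groups
}

fn main() {
    // Two written batches of 10 rows with max_row_group_size = 8
    // produce row groups of 8, 8, and 4 rows.
    let groups = row_group_sizes(&[10, 10], 8);
    assert_eq!(groups, vec![8, 8, 4]);
    // Reading back with batch size 8 yields batches of 8, 8, and 4 rows,
    // so every reader batch boundary coincides with a row group boundary --
    // the condition under which the list array reader fails above.
    println!("{:?}", groups);
}
{code}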



--
This message was sent by Atlassian Jira
(v8.20.10#820010)