Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/11 05:40:31 UTC

[GitHub] [arrow-rs] nevi-me opened a new issue #282: Nested list levels not calculated correctly if list has 0 length element

nevi-me opened a new issue #282:
URL: https://github.com/apache/arrow-rs/issues/282


   **Describe the bug**
   
   First documented in https://github.com/apache/arrow-rs/pull/270#issuecomment-836762589.
   
   Writing some combinations of nested Arrow data to Parquet triggers a bounds error in the level calculations.
   The most likely cause is that we do not correctly distinguish an empty list slot from a null list slot: the error is triggered around the logic that handles this distinction.
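   
   To make the empty-vs-null distinction concrete, here is a minimal standalone sketch of level calculation for a single nullable list of nullable items. It is not the arrow-rs implementation: `list_levels` and its parameters are hypothetical names, and the level values assume the standard Parquet 3-level list encoding (max definition level 3, max repetition level 1).
   
   ```rust
   /// Hypothetical helper (not part of arrow-rs): computes Parquet-style
   /// definition and repetition levels for one nullable list of nullable items.
   ///
   /// Assumed definition levels: 0 = null list, 1 = empty list,
   /// 2 = null item, 3 = non-null item.
   fn list_levels(
       offsets: &[usize],
       list_valid: &[bool],
       item_valid: &[bool],
   ) -> (Vec<i16>, Vec<i16>) {
       let mut def = Vec::new();
       let mut rep = Vec::new();
       for (slot, window) in offsets.windows(2).enumerate() {
           let (start, end) = (window[0], window[1]);
           if !list_valid[slot] {
               // Null list slot: one level entry, no child values consumed.
               def.push(0);
               rep.push(0);
           } else if start == end {
               // Empty list slot: also one level entry, but a different
               // definition level from the null case.
               def.push(1);
               rep.push(0);
           } else {
               for i in start..end {
                   def.push(if item_valid[i] { 3 } else { 2 });
                   // The first item starts a new record; the rest repeat.
                   rep.push(if i == start { 0 } else { 1 });
               }
           }
       }
       (def, rep)
   }
   
   fn main() {
       // Logical data: [[a, null], null, [], [b]]
       let offsets = [0, 2, 2, 2, 3];
       let list_valid = [true, false, true, true];
       let item_valid = [true, false, true];
       let (def, rep) = list_levels(&offsets, &list_valid, &item_valid);
       println!("def = {:?}", def);
       println!("rep = {:?}", rep);
   }
   ```
   
   In this sketch the level vectors have 5 entries while the child array has only 3 values. Any code that indexes the child array by level position, or the levels by child position, goes out of bounds as soon as an empty or null slot appears, which may be similar to what happens here.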
   
   **To Reproduce**
   
   Try the below test:
   
   ```rust
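   use std::fs::File;
   use std::sync::Arc;
   
   use arrow::datatypes::{DataType, Field, Schema};
   use parquet::arrow::ArrowWriter;
   
   #[test]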
   fn test_write_ipc_nested_lists() {
       let fields = vec![Field::new(
           "list_a",
           DataType::List(Box::new(Field::new(
               "list_b",
               DataType::List(Box::new(Field::new(
                   "struct_c",
                   DataType::Struct(vec![
                       Field::new("prim_d", DataType::Boolean, true),
                       Field::new(
                           "list_e",
                           DataType::LargeList(Box::new(Field::new(
                               "string_f",
                               DataType::LargeUtf8,
                               true,
                           ))),
                           false,
                       ),
                   ]),
                   true,
               ))),
               false,
           ))),
           true,
       )];
       let schema = Arc::new(Schema::new(fields));
       // making this nullable guarantees that one of the list items will be empty, triggering the error
       let batch = arrow::util::data_gen::create_random_batch(schema, 3, 0.35, 0.6).unwrap();
   
       // write ipc (to read in pyarrow, and write parquet from pyarrow)
       let file = File::create("arrow_nested_random.arrow").unwrap();
       let mut writer =
           arrow::ipc::writer::FileWriter::try_new(file, batch.schema().as_ref()).unwrap();
       writer.write(&batch).unwrap();
       writer.finish().unwrap();
   
       let file = File::create("arrow_nested_random_rust.parquet").unwrap();
       let mut writer =
           ArrowWriter::try_new(file.try_clone().unwrap(), batch.schema(), None)
               .expect("Unable to write file");
   
       // this will trigger the error in question
       writer.write(&batch).unwrap();
       writer.close().unwrap();
   }
   ```
   
   **Expected behavior**
   
   The Parquet file should be written without error, and pyarrow or Spark should be able to read the data back correctly.
   
   **Additional context**
   
   Not sure
   

