Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/31 00:49:58 UTC

[GitHub] [arrow-rs] hohav opened a new issue #385: Crash when writing Parquet with non-nullable ListArray

hohav opened a new issue #385:
URL: https://github.com/apache/arrow-rs/issues/385


   Possibly related: #282, #270.
   
   Minimal reproducing code [here](https://github.com/hohav/arrow-parquet-list-test).
   
   Trying to write a Parquet file containing a variable-length array with non-nullable items results in this panic:
   
   ```
   thread 'main' panicked at 'assertion failed: `(left == right)`
     left: `1`,
    right: `0`', .../parquet-4.1.0/src/util/bit_util.rs:332:9
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me commented on issue #385: Panic when writing Parquet from non-nullable ListArray

Posted by GitBox <gi...@apache.org>.
nevi-me commented on issue #385:
URL: https://github.com/apache/arrow-rs/issues/385#issuecomment-869797646


   #270 fixed the initial behaviour that you observed with the panics, so we now roundtrip correctly even though the file is technically incorrect. This works because we count the nulls independently from the definition levels, instead of relying on what the metadata says.
   
   The issue is with the column writer at https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer.rs#L471.
   
   It effectively says "if a value is not populated, then it's null", which is incorrect in the empty-list case: an empty list has no value to populate, but it is not a null item.
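
To make the distinction concrete, here is a minimal sketch in plain Rust (no arrow or parquet crates). It is not the code in `writer.rs`; the function names are hypothetical and the level numbering assumes a schema with an optional list and optional items. It compares the quoted rule with a count that only looks at items that are themselves null, for the list `[[1], [], [2]]`.

```rust
/// Definition levels assumed here, for the schema
/// `optional group (LIST) { repeated group list { optional int32 item } }`:
/// 0 = list is null, 1 = list is empty, 2 = item is null, 3 = item is present.
const MAX_DEF_LEVEL: i16 = 3;

/// Flatten nested data into one definition level per leaf slot.
/// (Hypothetical helper, for illustration only.)
fn definition_levels(lists: &[Option<Vec<Option<i32>>>]) -> Vec<i16> {
    let mut levels = Vec::new();
    for list in lists {
        match list {
            None => levels.push(0),
            Some(items) if items.is_empty() => levels.push(1),
            Some(items) => {
                for item in items {
                    let level = if item.is_some() {
                        MAX_DEF_LEVEL
                    } else {
                        MAX_DEF_LEVEL - 1
                    };
                    levels.push(level);
                }
            }
        }
    }
    levels
}

/// The quoted rule: every slot without a populated value counts as null.
fn naive_null_count(levels: &[i16]) -> usize {
    levels.iter().filter(|&&l| l < MAX_DEF_LEVEL).count()
}

/// Counts only the slots where the item itself is null.
fn item_null_count(levels: &[i16]) -> usize {
    levels.iter().filter(|&&l| l == MAX_DEF_LEVEL - 1).count()
}

fn main() {
    // [[1], [], [2]]: two one-element lists around an empty list.
    let data = vec![Some(vec![Some(1)]), Some(vec![]), Some(vec![Some(2)])];
    let levels = definition_levels(&data);
    println!("levels = {:?}", levels); // levels = [3, 1, 3]
    println!(
        "naive = {}, item-only = {}",
        naive_null_count(&levels),
        item_null_count(&levels)
    ); // naive = 1, item-only = 0
}
```

The naive rule reports one null because the empty list occupies a slot with no value, even though no item is null.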



[GitHub] [arrow-rs] nevi-me commented on issue #385: Panic when writing Parquet from non-nullable ListArray

Posted by GitBox <gi...@apache.org>.
nevi-me commented on issue #385:
URL: https://github.com/apache/arrow-rs/issues/385#issuecomment-869554282


   Hi @hohav, I missed this. Thanks for looking further; I'll take a look at it.



[GitHub] [arrow-rs] hohav commented on issue #385: Panic when writing Parquet from non-nullable ListArray

Posted by GitBox <gi...@apache.org>.
hohav commented on issue #385:
URL: https://github.com/apache/arrow-rs/issues/385#issuecomment-869812493


   Thanks for taking a look. I'm still seeing the initial panic when I update to latest master of arrow-rs, so I don't think #270 fixed it unfortunately.
   
   But I think there's something else going on, because I get the same crash from `parquet cat` even when I remove the empty list. And if I pass `false` to `try_from_iter_with_nullable`, then `parquet meta` tells me every element is null, even for a list like `[[1], [2]]` (and `parquet cat` still crashes). Repro code [here](https://github.com/hohav/arrow-parquet-list-test/tree/v3).
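
For reference, item nullability changes the maximum definition level of the leaf column, which is the value a slot's level must reach to count as populated. The sketch below is plain Rust with hypothetical names, not arrow-rs code, and it is not a diagnosis of the report above. It only applies the Parquet rule that each optional or repeated node on the path to a leaf adds one level.

```rust
/// Repetition type of a Parquet schema node.
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Maximum definition level of a leaf: one level for every optional or
/// repeated node on the path from the root to the leaf.
/// (Hypothetical helper, for illustration only.)
fn max_definition_level(path: &[Repetition]) -> i16 {
    path.iter()
        .filter(|r| !matches!(**r, Repetition::Required))
        .count() as i16
}

fn main() {
    use Repetition::*;
    // optional group values (LIST) { repeated group list { optional int32 item } }
    let nullable_items = [Optional, Repeated, Optional];
    // The same list, but with `required int32 item`.
    let required_items = [Optional, Repeated, Required];
    println!("{}", max_definition_level(&nullable_items)); // 3
    println!("{}", max_definition_level(&required_items)); // 2
}
```

With required items, a present item sits at level 2 rather than 3, so a writer and a reader that disagree on the maximum level would also disagree on which slots are null.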



[GitHub] [arrow-rs] hohav commented on issue #385: Panic when writing Parquet from non-nullable ListArray

Posted by GitBox <gi...@apache.org>.
hohav commented on issue #385:
URL: https://github.com/apache/arrow-rs/issues/385#issuecomment-869250321


   I think there may be a more fundamental issue with `ListArray`. I created a new version of my repro [here](https://github.com/hohav/arrow-parquet-list-test/tree/v2), where I create a very simple ListArray: `[[1], [], [2]]`. I can successfully write this to a Parquet file using `ArrowWriter`, but then `parquet meta` shows incorrect information:
   ```
   $ parquet meta test.parquet 
   
   File path:  test.parquet
   Created by: parquet-rs version 5.0.0-SNAPSHOT (build de62168a4f428e3c334e1cfa5c5db23272f313d7)
   Properties:
     ARROW:schema: /////7gAAAAQAAAAAAAKAA4ADAALAAQACgAAABQAAAAAAAABBAAKAAwAAAAIAAQACgAAAAgAAAAIAAAAAAAAAAEAAAAEAAAA3P///xwAAAAMAAAAAAABDFwAAAABAAAAHAAAAAQABAAEAAAAEAAUABAADgAPAAQAAAAIABAAAAAYAAAAIAAAAAAAAQIcAAAACAAMAAQACwAIAAAAIAAAAAAAAAEAAAAABAAAAGl0ZW0AAAAABgAAAHZhbHVlcwAA
   Schema:
   message arrow_schema {
     optional group values (LIST) {
       repeated group list {
         optional int32 item;
       }
     }
   }
   
   
   Row group 0:  count: 3  23.67 B records  start: 4  total: 71 B
   --------------------------------------------------------------------------------
                     type      encodings count     avg size   nulls   min / max
   values.list.item  INT32     _ RR_     3         23.67 B    1       "1" / "2"
   ```
   Notice `nulls 1`, which AFAICT is incorrect: there are no null items, only one empty list. And `parquet cat` fails entirely:
   ```
   $ parquet cat test.parquet 
   Unknown error
   java.lang.RuntimeException: Failed on record 0
   	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
   	at org.apache.parquet.cli.Main.run(Main.java:155)
   	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
   	at org.apache.parquet.cli.Main.main(Main.java:185)
   Caused by: java.lang.ClassCastException: optional int32 item is not a group
   	at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
   	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
   	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
   	at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
   	at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:539)
   	at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:489)
   	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
   	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:137)
   	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:91)
   	at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
   	at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
   	at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:185)
   	at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
   	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
   	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
   	at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:344)
   	at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
   	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
   	... 3 more
   ```

