Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/31 00:49:58 UTC
[GitHub] [arrow-rs] hohav opened a new issue #385: Crash when writing Parquet with non-nullable ListArray
hohav opened a new issue #385:
URL: https://github.com/apache/arrow-rs/issues/385
Possibly related: #282, #270.
Minimal reproducing code [here](https://github.com/hohav/arrow-parquet-list-test).
Trying to write a Parquet file containing a variable-length array with non-nullable items results in this panic:
```
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `1`,
right: `0`', .../parquet-4.1.0/src/util/bit_util.rs:332:9
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] nevi-me commented on issue #385: Panic when writing Parquet from non-nullable ListArray
Posted by GitBox <gi...@apache.org>.
nevi-me commented on issue #385:
URL: https://github.com/apache/arrow-rs/issues/385#issuecomment-869797646
#270 fixed the initial behaviour that you observed with the panics, so we correctly roundtrip even though the file is technically incorrect. We do this because we independently count the nulls from the definition levels, instead of relying on what the metadata says.
The issue is with the column writer at https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer.rs#L471.
It effectively says "if a value is not populated, then it's null", which is incorrect in the empty-list case.
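To illustrate the point above, here is a minimal, std-only sketch. It is not the code in `writer.rs`; it assumes the 3-level schema `optional group values (LIST) { repeated group list { optional int32 item; } }`, where the definition levels are 0 for a null list, 1 for an empty list, 2 for a null item and 3 for a present item. The `Row` alias and all function names are hypothetical, and `nulls_item_only` is only one possible reading of "null item", not necessarily what the writer should do.

```rust
/// One row of a list column: `None` is a null list, `Some(vec)` is a present
/// (possibly empty) list whose items may themselves be null.
type Row = Option<Vec<Option<i32>>>;

// Definition levels for the assumed schema (3 is the maximum level).
const NULL_LIST: i16 = 0;
const EMPTY_LIST: i16 = 1;
const NULL_ITEM: i16 = 2;
const PRESENT_ITEM: i16 = 3;

/// Flatten rows into one definition level per leaf slot.
fn definition_levels(rows: &[Row]) -> Vec<i16> {
    let mut levels = Vec::new();
    for row in rows {
        match row {
            None => levels.push(NULL_LIST),
            Some(items) if items.is_empty() => levels.push(EMPTY_LIST),
            Some(items) => {
                for item in items {
                    levels.push(if item.is_some() { PRESENT_ITEM } else { NULL_ITEM });
                }
            }
        }
    }
    levels
}

/// The rule as described above: any slot without a value counts as null.
fn nulls_if_not_populated(levels: &[i16]) -> usize {
    levels.iter().filter(|&&l| l < PRESENT_ITEM).count()
}

/// Hypothetical alternative: count only slots where the item itself is null.
fn nulls_item_only(levels: &[i16]) -> usize {
    levels.iter().filter(|&&l| l == NULL_ITEM).count()
}

fn main() {
    // [[1], [], [2]]: no null items, one empty list.
    let rows: Vec<Row> = vec![Some(vec![Some(1)]), Some(vec![]), Some(vec![Some(2)])];
    let levels = definition_levels(&rows);
    println!("definition levels:  {:?}", levels);
    println!("not-populated rule: {}", nulls_if_not_populated(&levels));
    println!("item-only rule:     {}", nulls_item_only(&levels));
}
```

For `[[1], [], [2]]` the levels are `[3, 1, 3]`: the "not populated" rule reports 1 null because of the empty list, while the item-only rule reports 0.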
[GitHub] [arrow-rs] nevi-me commented on issue #385: Panic when writing Parquet from non-nullable ListArray
Posted by GitBox <gi...@apache.org>.
nevi-me commented on issue #385:
URL: https://github.com/apache/arrow-rs/issues/385#issuecomment-869554282
Hi @hohav, I missed this; thanks for looking further. I'll take a look at this.
[GitHub] [arrow-rs] hohav commented on issue #385: Panic when writing Parquet from non-nullable ListArray
Posted by GitBox <gi...@apache.org>.
hohav commented on issue #385:
URL: https://github.com/apache/arrow-rs/issues/385#issuecomment-869812493
Thanks for taking a look. I'm still seeing the initial panic when I update to the latest master of arrow-rs, so unfortunately I don't think #270 fixed it.
But I think there's something else going on, because I get the same crash from `parquet cat` even when I remove the empty list. And if I pass `false` to `try_from_iter_with_nullable`, then `parquet meta` tells me every element is null, even for a list like `[[1], [2]]` (and `parquet cat` still crashes). Repro code [here](https://github.com/hohav/arrow-parquet-list-test/tree/v3).
[GitHub] [arrow-rs] hohav commented on issue #385: Panic when writing Parquet from non-nullable ListArray
Posted by GitBox <gi...@apache.org>.
hohav commented on issue #385:
URL: https://github.com/apache/arrow-rs/issues/385#issuecomment-869250321
I think there may be a more fundamental issue with `ListArray`. I created a new version of my repro [here](https://github.com/hohav/arrow-parquet-list-test/tree/v2), where I create a very simple ListArray: `[[1], [], [2]]`. I can successfully write this to a Parquet file using `ArrowWriter`, but then `parquet meta` shows incorrect information:
```
$ parquet meta test.parquet
File path: test.parquet
Created by: parquet-rs version 5.0.0-SNAPSHOT (build de62168a4f428e3c334e1cfa5c5db23272f313d7)
Properties:
ARROW:schema: /////7gAAAAQAAAAAAAKAA4ADAALAAQACgAAABQAAAAAAAABBAAKAAwAAAAIAAQACgAAAAgAAAAIAAAAAAAAAAEAAAAEAAAA3P///xwAAAAMAAAAAAABDFwAAAABAAAAHAAAAAQABAAEAAAAEAAUABAADgAPAAQAAAAIABAAAAAYAAAAIAAAAAAAAQIcAAAACAAMAAQACwAIAAAAIAAAAAAAAAEAAAAABAAAAGl0ZW0AAAAABgAAAHZhbHVlcwAA
Schema:
message arrow_schema {
  optional group values (LIST) {
    repeated group list {
      optional int32 item;
    }
  }
}
Row group 0:  count: 3  23.67 B records  start: 4  total: 71 B
--------------------------------------------------------------------------------
                  type      encodings count     avg size   nulls   min / max
values.list.item  INT32     _ RR_     3         23.67 B    1       "1" / "2"
```
Notice `nulls 1`, which AFAICT is incorrect: there are no null items, only one empty list. And `parquet cat` fails entirely:
```
$ parquet cat test.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0
    at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
    at org.apache.parquet.cli.Main.run(Main.java:155)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.parquet.cli.Main.main(Main.java:185)
Caused by: java.lang.ClassCastException: optional int32 item is not a group
    at org.apache.parquet.schema.Type.asGroupType(Type.java:248)
    at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
    at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:228)
    at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:74)
    at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:539)
    at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:489)
    at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:293)
    at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:137)
    at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:91)
    at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
    at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:142)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:185)
    at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
    at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
    at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:344)
    at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
    at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
    ... 3 more
```