Posted to dev@parquet.apache.org by "Hailu, Andreas [Engineering]" <An...@gs.com> on 2021/05/14 16:09:05 UTC
AvroParquetOutputFormat - Unable to Write Arrays with Null Elements
Hi folks, I'm using v1.11.1 of the parquet-mr library in a Java application that takes Avro records and writes them to Parquet files using AvroParquetOutputFormat. Some of the Avro records have array-typed fields that contain null elements, e.g. ["Foo", "Bar", null, "Baz"]. Here's an example Avro schema:
{
  "type": "record",
  "name": "NullLists",
  "namespace": "com.test",
  "fields": [
    {
      "name": "KeyID",
      "type": "string"
    },
    {
      "name": "NullableList",
      "type": [
        "null",
        {
          "type": "array",
          "items": [
            "null",
            "string"
          ]
        }
      ],
      "default": null
    }
  ]
}
I'm trying to write the following record:
{
  "KeyID": "0",
  "NullableList": [
    "foo",
    null,
    "baz"
  ]
}
I thought I could use the 3-level list writer to support this. However, it results in the following exception:
Caused by: java.lang.ClassCastException: repeated binary array (STRING) is not a group
    at org.apache.parquet.schema.Type.asGroupType(Type.java:250)
    at org.apache.parquet.avro.AvroWriteSupport$ThreeLevelListWriter.writeCollection(AvroWriteSupport.java:612)
    at org.apache.parquet.avro.AvroWriteSupport$ListWriter.writeList(AvroWriteSupport.java:397)
    at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:355)
    at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278)
    at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
    at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
    at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
Is this kind of record supported? I have also tried setting the "parquet.avro.add-list-element-records" option to false, with no luck.
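
For context, here is a rough sketch of why the list layout matters for null elements. It assumes the standard Parquet LIST layouts: a 2-level list (a `repeated binary array` directly under the list group, the shape named in the exception above) and a 3-level list (a repeated group wrapping an optional element). The class and method names are hypothetical and are not part of parquet-mr. The code only models definition and repetition levels with plain Java and does not touch the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in parquet-mr.
final class ListLevels {

    // 3-level layout (assumed):
    //   optional group NullableList (LIST) {
    //     repeated group list { optional binary element (STRING); }
    //   }
    // Definition levels: 0 = list is null, 1 = list is empty,
    //                    2 = element is null, 3 = element is present.
    static List<Integer> threeLevelDefinitionLevels(List<String> values) {
        List<Integer> levels = new ArrayList<>();
        if (values == null) { levels.add(0); return levels; }
        if (values.isEmpty()) { levels.add(1); return levels; }
        for (String v : values) {
            levels.add(v == null ? 2 : 3);
        }
        return levels;
    }

    // 2-level layout (assumed):
    //   optional group NullableList (LIST) { repeated binary array (STRING); }
    // Definition levels: 0 = list is null, 1 = list is empty, 2 = element is present.
    // There is no level left over to mark a null element.
    static List<Integer> twoLevelDefinitionLevels(List<String> values) {
        List<Integer> levels = new ArrayList<>();
        if (values == null) { levels.add(0); return levels; }
        if (values.isEmpty()) { levels.add(1); return levels; }
        for (String v : values) {
            if (v == null) {
                throw new IllegalArgumentException("2-level list cannot encode a null element");
            }
            levels.add(2);
        }
        return levels;
    }

    // Repetition levels are the same in both layouts:
    // 0 starts a new list, 1 continues the current one.
    static List<Integer> repetitionLevels(List<String> values) {
        List<Integer> levels = new ArrayList<>();
        if (values == null || values.isEmpty()) { levels.add(0); return levels; }
        for (int i = 0; i < values.size(); i++) {
            levels.add(i == 0 ? 0 : 1);
        }
        return levels;
    }

    public static void main(String[] args) {
        // The record from this message; Arrays.asList permits null elements.
        List<String> nullableList = Arrays.asList("foo", null, "baz");
        System.out.println(threeLevelDefinitionLevels(nullableList)); // prints [3, 2, 3]
        System.out.println(repetitionLevels(nullableList));           // prints [0, 1, 1]
        try {
            twoLevelDefinitionLevels(nullableList);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions, the record above is only representable when both the Parquet schema and the writer use the 3-level layout. The exception message suggests the schema was still in the 2-level shape while the 3-level writer was active.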
____________
Andreas Hailu
Data Lake Engineering | Goldman Sachs & Co.
RE: AvroParquetOutputFormat - Unable to Write Arrays with Null Elements
Posted by "Hailu, Andreas [Engineering]" <An...@gs.com>.
I was able to get something working locally. I'll open a JIRA and have a PR once I have sufficient tests in place.
// ah