Posted to dev@parquet.apache.org by "Hailu, Andreas [Engineering]" <An...@gs.com> on 2021/05/14 16:09:05 UTC

AvroParquetOutputFormat - Unable to Write Arrays with Null Elements

Hi folks, I'm using v1.11.1 of the parquet-mr library in a Java application that takes Avro records and writes them to Parquet files using the AvroParquetOutputFormat. Some of the Avro records have array-type fields that contain null elements, e.g. [ "Foo", "Bar", null, "Baz" ]. Here's an example Avro schema:

{
  "type": "record",
  "name": "NullLists",
  "namespace": "com.test",
  "fields": [
    {
      "name": "KeyID",
      "type": "string"
    },
    {
      "name": "NullableList",
      "type": [
        "null",
        {
            "type": "array",
            "items": [
                "null",
                "string"
            ]
        }
      ],
      "default": null
    }
  ]
}

I'm trying to write the following record:

{
  "KeyID": "0",
  "NullableList": [
    "foo",
    null,
    "baz"
  ]
}

I thought I could use the 3-level list writer to support this, but it results in the following exception:

Caused by: java.lang.ClassCastException: repeated binary array (STRING) is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:250)
        at org.apache.parquet.avro.AvroWriteSupport$ThreeLevelListWriter.writeCollection(AvroWriteSupport.java:612)
        at org.apache.parquet.avro.AvroWriteSupport$ListWriter.writeList(AvroWriteSupport.java:397)
        at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:355)
        at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278)
        at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
        at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)

Is this kind of record supported? I have also tried setting the "parquet.avro.add-list-element-records" option to false, with no luck.
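
For readers following the thread, here is a minimal sketch of why the list layout matters for null elements. It is not parquet-mr code: the class ListLayoutSketch and its two methods are hypothetical names used only for illustration, and the schema text in the comments is an assumption inferred from the "repeated binary array (STRING)" in the exception and from the Parquet LIST specification. It models a 2-level list (the repeated field is the value itself) against a 3-level list (a repeated group wrapping an optional element).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical model of the two list layouts; none of this is parquet-mr API.
public class ListLayoutSketch {

    // Assumed 2-level layout, matching the type named in the exception:
    //   optional group NullableList (LIST) {
    //     repeated binary array (STRING);
    //   }
    // Every repeated slot is the value itself, so a null has nowhere to go.
    static List<String> toTwoLevel(List<String> values) {
        List<String> slots = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            String value = values.get(i);
            if (value == null) {
                throw new IllegalArgumentException(
                        "null element at index " + i + " has no slot in a 2-level list");
            }
            slots.add(value);
        }
        return slots;
    }

    // Assumed 3-level layout, per the Parquet LIST specification:
    //   optional group NullableList (LIST) {
    //     repeated group list {
    //       optional binary element (STRING);
    //     }
    //   }
    // Every repeated slot is a group whose single field may be absent.
    static List<Optional<String>> toThreeLevel(List<String> values) {
        List<Optional<String>> slots = new ArrayList<>();
        for (String value : values) {
            slots.add(Optional.ofNullable(value));
        }
        return slots;
    }

    public static void main(String[] args) {
        // The NullableList value from the record in this message.
        List<String> nullableList = Arrays.asList("foo", null, "baz");

        // The 3-level model keeps three slots, the middle one empty.
        System.out.println(toThreeLevel(nullableList));

        // The 2-level model rejects the same input.
        try {
            toTwoLevel(nullableList);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Read against this model, the ClassCastException suggests that the 3-level writer was handed a schema in the 2-level shape: it expected the repeated field to be a group and found a repeated primitive instead.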

____________

Andreas Hailu
Data Lake Engineering | Goldman Sachs & Co.


RE: AvroParquetOutputFormat - Unable to Write Arrays with Null Elements

Posted by "Hailu, Andreas [Engineering]" <An...@gs.com>.
I was able to get something working locally. I'll open a JIRA and have a PR once I have sufficient tests in place.

// ah
