Posted to dev@parquet.apache.org by "Benoit Lacelle (JIRA)" <ji...@apache.org> on 2018/01/29 09:13:00 UTC

[jira] [Created] (PARQUET-1202) Add differentiation of nested records with the same name

Benoit Lacelle created PARQUET-1202:
---------------------------------------

             Summary: Add differentiation of nested records with the same name
                 Key: PARQUET-1202
                 URL: https://issues.apache.org/jira/browse/PARQUET-1202
             Project: Parquet
          Issue Type: Bug
          Components: parquet-avro
    Affects Versions: 1.8.2, 1.7.0
            Reporter: Benoit Lacelle


Hello,

While reading back a Parquet file produced with Spark, it appears the schema produced by Parquet-Avro is not valid.

Consider the following simple piece of code:

{code}
ParquetReader<GenericRecord> reader =
    AvroParquetReader.<GenericRecord>builder(new org.apache.hadoop.fs.Path(path.toUri())).build();
System.out.println(reader.read().getSchema());
{code}

I get a stack trace like:

{code}
Exception in thread "main" org.apache.avro.SchemaParseException: Can't redefine: value
       at org.apache.avro.Schema$Names.put(Schema.java:1128)
       at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
       at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
       at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
       at org.apache.avro.Schema$MapSchema.toJson(Schema.java:833)
       at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
       at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
       at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
       at org.apache.avro.Schema.toString(Schema.java:324)
       at org.apache.avro.Schema.toString(Schema.java:314)
{code}

 

The issue seems the same as the one reported in:

[https://www.bountysource.com/issues/22823013-spark-avro-fails-to-save-df-with-nested-records-having-the-same-name]

 

It has been fixed in spark-avro in:

[https://github.com/databricks/spark-avro/pull/73]

In our case, the parquet schema looks like:

{code}

message spark_schema {
  optional group calculatedobjectinfomap (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional group value {
        optional int64 calcobjid;
        optional int64 calcobjparentid;
        optional binary portfolioname (UTF8);
        optional binary portfolioscheme (UTF8);
        optional binary calcobjtype (UTF8);
        optional binary calcobjmnemonic (UTF8);
        optional binary calcobinstrumentype (UTF8);
        optional int64 calcobjectqty;
        optional binary calcobjboid (UTF8);
        optional binary analyticalfoldermnemonic (UTF8);
        optional binary calculatedidentifier (UTF8);
        optional binary calcobjlevel (UTF8);
        optional binary calcobjboidscheme (UTF8);
      }
    }
  }
  optional group riskfactorinfomap (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional group value {
        optional binary riskfactorname (UTF8);
        optional binary riskfactortype (UTF8);
        optional binary riskfactorrole (UTF8);
      }
    }
  }
}

{code}
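We indeed have 2 Map fields, each with a nested value group named 'value' but with different contents. The name 'value' is the default used by org.apache.spark.sql.types.MapType. When converted to Avro, both groups become records named 'value', and Avro refuses to define the same record name twice.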
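
To illustrate the collision and one possible way to differentiate the two records, here is a minimal plain-Java sketch. It does not use the Avro or parquet-avro APIs: the class NestedRecordNames and the methods register, qualifiedName and collides are hypothetical, and the rule of qualifying a record name with the path of its enclosing fields is only an assumption about how a fix could look, not the actual parquet-avro behaviour.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NestedRecordNames {

    /**
     * Toy model of a schema name registry: a full name may be registered
     * again only with the same field list, otherwise it is a redefinition.
     */
    public static void register(Map<String, List<String>> seen,
                                String fullName, List<String> fields) {
        List<String> previous = seen.putIfAbsent(fullName, fields);
        if (previous != null && !previous.equals(fields)) {
            throw new IllegalStateException("Can't redefine: " + fullName);
        }
    }

    /** Hypothetical rule: prefix the record name with its enclosing field path. */
    public static String qualifiedName(List<String> parentPath, String recordName) {
        return parentPath.isEmpty()
                ? recordName
                : String.join(".", parentPath) + "." + recordName;
    }

    /** Registers the two 'value' records from the schema above (fields abbreviated). */
    public static boolean collides(boolean qualify) {
        Map<String, List<String>> seen = new HashMap<>();
        String first = qualify
                ? qualifiedName(Arrays.asList("calculatedobjectinfomap"), "value")
                : "value";
        String second = qualify
                ? qualifiedName(Arrays.asList("riskfactorinfomap"), "value")
                : "value";
        try {
            register(seen, first, Arrays.asList("calcobjid", "calcobjparentid"));
            register(seen, second, Arrays.asList("riskfactorname", "riskfactortype"));
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("bare names collide: " + collides(false));
        System.out.println("qualified names collide: " + collides(true));
    }
}
```

With bare names, both records are registered as 'value' and the second registration is rejected. With path-qualified names they become 'calculatedobjectinfomap.value' and 'riskfactorinfomap.value' and no longer clash.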

The fix does not seem trivial given the current parquet-avro code, so I doubt I will be able to craft a valid PR without some guidance.


Thanks,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)