Posted to dev@parquet.apache.org by "Benoit Lacelle (JIRA)" <ji...@apache.org> on 2018/01/29 09:13:00 UTC
[jira] [Created] (PARQUET-1202) Add differentiation of nested records with the same name
Benoit Lacelle created PARQUET-1202:
---------------------------------------
Summary: Add differentiation of nested records with the same name
Key: PARQUET-1202
URL: https://issues.apache.org/jira/browse/PARQUET-1202
Project: Parquet
Issue Type: Bug
Components: parquet-avro
Affects Versions: 1.8.2, 1.7.0
Reporter: Benoit Lacelle
Hello,
While reading back a Parquet file produced with Spark, it appears the schema produced by Parquet-Avro is not valid.
Consider the following simple piece of code:
{code}
ParquetReader<GenericRecord> reader =
AvroParquetReader.<GenericRecord>builder(new org.apache.hadoop.fs.Path(path.toUri())).build();
System.out.println(reader.read().getSchema());
{code}
I get a stack trace like:
{code}
Exception in thread "main" org.apache.avro.SchemaParseException: Can't redefine: value
    at org.apache.avro.Schema$Names.put(Schema.java:1128)
    at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
    at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
    at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
    at org.apache.avro.Schema$MapSchema.toJson(Schema.java:833)
    at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
    at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
    at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
    at org.apache.avro.Schema.toString(Schema.java:324)
    at org.apache.avro.Schema.toString(Schema.java:314)
{code}
The issue seems the same as the one reported in:
[https://www.bountysource.com/issues/22823013-spark-avro-fails-to-save-df-with-nested-records-having-the-same-name]
It has been fixed in spark-avro in:
[https://github.com/databricks/spark-avro/pull/73]
In our case, the parquet schema looks like:
{code}
message spark_schema {
  optional group calculatedobjectinfomap (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional group value {
        optional int64 calcobjid;
        optional int64 calcobjparentid;
        optional binary portfolioname (UTF8);
        optional binary portfolioscheme (UTF8);
        optional binary calcobjtype (UTF8);
        optional binary calcobjmnemonic (UTF8);
        optional binary calcobinstrumentype (UTF8);
        optional int64 calcobjectqty;
        optional binary calcobjboid (UTF8);
        optional binary analyticalfoldermnemonic (UTF8);
        optional binary calculatedidentifier (UTF8);
        optional binary calcobjlevel (UTF8);
        optional binary calcobjboidscheme (UTF8);
      }
    }
  }
  optional group riskfactorinfomap (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional group value {
        optional binary riskfactorname (UTF8);
        optional binary riskfactortype (UTF8);
        optional binary riskfactorrole (UTF8);
      }
    }
  }
}
{code}
We indeed have two Map fields whose value groups are both named 'value', but with different contents. The name 'value' is defaulted in org.apache.spark.sql.types.MapType.
The fix does not look trivial given the current parquet-avro code, so I doubt I will be able to craft a valid PR without some direction.
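To make the collision concrete, here is a minimal plain-Java sketch. It is a simplified model of a name registry, not Avro's or parquet-avro's actual code. The class NestedRecordNames and the methods register and qualify are hypothetical names introduced for illustration. Prefixing the record name with the enclosing field path is only an assumption about one possible way to differentiate the two records.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NestedRecordNames {

    /**
     * Registers a named record. A second registration under the same full
     * name is accepted only if the field list is identical; otherwise it is
     * rejected, mimicking the "Can't redefine" failure seen above.
     */
    public static void register(Map<String, List<String>> names,
                                String fullName, List<String> fields) {
        List<String> previous = names.putIfAbsent(fullName, fields);
        if (previous != null && !previous.equals(fields)) {
            throw new IllegalStateException("Can't redefine: " + fullName);
        }
    }

    /** Hypothetical naming rule: prefix the record name with its parent field path. */
    public static String qualify(String parentPath, String recordName) {
        return parentPath.isEmpty() ? recordName : parentPath + "." + recordName;
    }

    public static void main(String[] args) {
        // Abbreviated field lists taken from the two 'value' groups in the schema.
        List<String> calcFields = Arrays.asList("calcobjid", "calcobjparentid");
        List<String> riskFields =
                Arrays.asList("riskfactorname", "riskfactortype", "riskfactorrole");

        // Both records registered under the bare name 'value': the second one collides.
        Map<String, List<String>> flat = new HashMap<>();
        register(flat, "value", calcFields);
        try {
            register(flat, "value", riskFields);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // Names qualified by the enclosing map field: no collision.
        Map<String, List<String>> qualified = new HashMap<>();
        register(qualified, qualify("calculatedobjectinfomap", "value"), calcFields);
        register(qualified, qualify("riskfactorinfomap", "value"), riskFields);
        System.out.println(qualified.size());
    }
}
```

In this model the first registry rejects the second 'value' record, while the path-qualified registry ends up holding two distinct entries.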
Thanks,