Posted to dev@pig.apache.org by "Koji Noguchi (JIRA)" <ji...@apache.org> on 2017/02/24 21:12:44 UTC
[jira] [Commented] (PIG-5115) Builtin AvroStorage generates
incorrect avro schema when the same pig field name appears in the alias
[ https://issues.apache.org/jira/browse/PIG-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883527#comment-15883527 ]
Koji Noguchi commented on PIG-5115:
-----------------------------------
Sorry that no one has updated this JIRA.
I'm not familiar with AvroStorage, but if no one picks it up soon, I'll try to review it next week.
> Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
> -----------------------------------------------------------------------------------------------------
>
> Key: PIG-5115
> URL: https://issues.apache.org/jira/browse/PIG-5115
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.17.0
> Reporter: Anyi Li
> Assignee: Anyi Li
> Fix For: 0.17.0
>
> Attachments: PIG-5115.patch
>
>
> Pig ResourceSchema allows the same field name to be used with different types, as long as the fields are not at the same level. Take a Pig schema like
> {quote}
> data: {col1: (col2: (col1_data: chararray)),col2: {col2: (col2_data: chararray)}}
> {quote}
> Although _col2_ is defined twice, the two definitions are not at the same level, so this is a perfectly valid Pig schema.
> However, when AvroStorage translates it, it throws an exception:
> {noformat}
> Can't redefine: col2
> at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:64)
> at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
> at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
> at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
> at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
> at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
> at org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
> at org.apache.pig.PigServer$Graph.compile(PigServer.java:1767)
> at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1443)
> at org.apache.pig.PigServer.execute(PigServer.java:1356)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at org.apache.pig.Main.run(Main.java:631)
> at org.apache.pig.Main.main(Main.java:177)
> Caused by: org.apache.avro.SchemaParseException: Can't redefine: col2
> at org.apache.avro.Schema$Names.put(Schema.java:1042)
> at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:511)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:626)
> at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:737)
> at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:814)
> at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:648)
> at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:635)
> at org.apache.avro.Schema.toString(Schema.java:297)
> at org.apache.avro.Schema.toString(Schema.java:287)
> at org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:442)
> at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:433)
> at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:54)
> ... 18 more
> {noformat}
> This is caused by a bug in the AvroStorageSchemaConversionUtilities class, which uses the tuple name both as the GenericRecord name and as the name of the field that wraps the record.
> As a result, it tries to produce an Avro schema like the following:
> {noformat}
> {
>   "type": "record",
>   "name": "data",
>   "fields": [
>     {
>       "name": "col1",
>       "type": {
>         "type": "record",
>         "name": "col1_1",
>         "fields": [
>           {
>             "name": "col2",
>             "type": {
>               "type": "record",
>               "name": "col2",
>               "fields": [
>                 {
>                   "name": "col1_data",
>                   "type": "string"
>                 }
>               ]
>             }
>           }
>         ]
>       }
>     },
>     {
>       "name": "col2",
>       "type": {
>         "type": "array",
>         "items": {
>           "type": "record",
>           "name": "col2",
>           "fields": [
>             {
>               "name": "col2_data",
>               "type": "string"
>             }
>           ]
>         }
>       }
>     }
>   ]
> }
> {noformat}
> But according to the Avro 1.7.7 spec ([https://avro.apache.org/docs/1.7.7/spec.html#Names]), this schema is invalid: the record name _col2_ is defined once inside _col1_ and then defined again, with different fields, as the item type of the array. A name may only be defined once unless the fullname (namespace + name) is unique.
> Since AvroStorageSchemaConversionUtilities generates an Avro record whenever the Pig schema is a tuple, we need a way to generate a unique _recordName_.
> {code}
> public static Schema resourceSchemaToAvroSchema(final ResourceSchema rs,
>     String recordName, final String recordNameSpace,
>     final Map<String, List<Schema>> definedRecordNames,
>     final Boolean doubleColonsToDoubleUnderscores) throws IOException {
>   if (rs == null) {
>     return null;
>   }
>   recordName = toAvroName(recordName, doubleColonsToDoubleUnderscores);
>   List<Schema.Field> fields = new ArrayList<Schema.Field>();
>   Schema newSchema = Schema.createRecord(
>       recordName, null, recordNameSpace, false);
> {code}
> The AvroStorage class from piggybank solved this problem by defining a static method that generates a unique _recordName_. We can implement a similar method for the builtin AvroStorage.
>
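For illustration only, the idea of generating a unique _recordName_ could be sketched roughly as below. This is not the attached patch and not the piggybank implementation; the class name UniqueRecordNames, the method next, and the "_N" suffix scheme are all assumptions made up for this example. It only uses the JDK and simply remembers which names have already been handed out.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical helper (not part of Pig or Avro): hands out record names
 * that never repeat within one schema conversion.
 */
final class UniqueRecordNames {
    // Every name returned so far.
    private final Set<String> used = new HashSet<>();

    /**
     * Returns baseName if it has not been handed out yet; otherwise
     * appends "_1", "_2", ... until an unused name is found.
     */
    String next(String baseName) {
        String candidate = baseName;
        int suffix = 0;
        // Set.add returns false when the candidate is already taken.
        while (!used.add(candidate)) {
            suffix++;
            candidate = baseName + "_" + suffix;
        }
        return candidate;
    }
}
```

With a helper like this, the second tuple called col2 would become a record named col2_1 while the wrapping field could still be called col2. Whether a suffix, a namespace, or some other scheme is the right choice for the builtin AvroStorage is left open here.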
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)