Posted to dev@avro.apache.org by "Julien Muller (Created) (JIRA)" <ji...@apache.org> on 2011/10/11 12:03:11 UTC

[jira] [Created] (AVRO-923) Avro-MapRed: Provide a fallback using avro beans instead of schema in job configuration

Avro-MapRed: Provide a fallback using avro beans instead of schema in job configuration
---------------------------------------------------------------------------------------

                 Key: AVRO-923
                 URL: https://issues.apache.org/jira/browse/AVRO-923
             Project: Avro
          Issue Type: Improvement
          Components: java
    Affects Versions: 1.5.4
         Environment: any
            Reporter: Julien Muller
             Fix For: 1.6.0


The current implementation of Avro MapRed is designed to be used with a programmatically built JobConf. While it is possible to use a job.xml file instead, it is pretty painful, since you have to copy/paste all the schemas for input and output. This is error-prone and time-consuming. Also, any update to a bean requires copying and pasting the schema again (with JobConf, a simple recompile would be enough).

One backward-compatible way to improve this would be to introduce new keys in AvroJob that reference the actual Avro bean class. This can be implemented as a fallback.

New keys would be created, each serving as a fallback for an existing key (existing key > new key):
- avro.input.schema > avro.input.class
- avro.map.output.schema > avro.map.output.class
- avro.output.schema > avro.output.class


Only 3 methods would be impacted in AvroJob:
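- getInputSchema(Configuration job) throws Exception {
	// Implement a fallback like
	String s = job.get(INPUT_SCHEMA);
	if (s != null) return Schema.parse(s);
	// No schema string configured: read the static SCHEMA$ field of the bean class.
	// In generated classes SCHEMA$ is already a Schema, so it is returned as-is rather than cast to String and parsed.
	return (Schema) Class.forName(job.get(INPUT_CLASS)).getDeclaredField("SCHEMA$").get(null);
  }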
- getMapOutputSchema()
- getOutputSchema()

Also, it would be more consistent to add new setters. This is not mandatory, since in this use case the new keys are set directly in the job configuration file, not through AvroJob.
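
The fallback described above can be sketched in a self-contained form. This is only an illustration under stated assumptions: a plain Map stands in for Hadoop's Configuration, ExampleBean stands in for a generated Avro class, and the schema is returned as its JSON text rather than as an org.apache.avro.Schema so that the sketch has no dependencies. The names SchemaFallback, ExampleBean and inputSchemaJson are hypothetical.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Stand-in for an Avro-generated bean (assumption). Generated classes expose a
// static SCHEMA$ field; here it is a String so the sketch needs no Avro jar.
class ExampleBean {
    public static final String SCHEMA$ =
        "{\"type\":\"record\",\"name\":\"ExampleBean\",\"fields\":[]}";
}

class SchemaFallback {
    static final String INPUT_SCHEMA = "avro.input.schema";
    static final String INPUT_CLASS = "avro.input.class"; // key proposed in this issue

    // The explicit schema key always wins; the class key is consulted only as a fallback.
    static String inputSchemaJson(Map<String, String> job) {
        String json = job.get(INPUT_SCHEMA);
        if (json != null) {
            return json;
        }
        String className = job.get(INPUT_CLASS);
        if (className == null) {
            throw new IllegalStateException(
                "neither " + INPUT_SCHEMA + " nor " + INPUT_CLASS + " is set");
        }
        try {
            Field field = Class.forName(className).getDeclaredField("SCHEMA$");
            // Static field, so the instance argument is null.
            return String.valueOf(field.get(null));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot read SCHEMA$ from " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> job = new HashMap<>();
        job.put(INPUT_CLASS, ExampleBean.class.getName());
        System.out.println(inputSchemaJson(job));
    }
}
```

Because the schema key is checked first, existing job.xml files that already carry a schema string would behave exactly as before.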

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-923) Avro-MapRed: Provide a fallback using avro beans instead of schema in job configuration

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125213#comment-13125213 ] 

Doug Cutting commented on AVRO-923:
-----------------------------------

It's slightly riskier to get the schema from the runtime than from the job, in particular the map output schema.  If different versions of code are somehow run on different nodes, then different map output schemas could be used, which would create havoc, since the schema does not travel with the map output data.  When the schema is in the job.xml, there's very little chance of a lack of coordination, since the framework distributes the same job.xml to every task.  If the schema comes from the runtime, there's some chance that different versions of classes could be installed on different nodes.

Another concern is that not all schemas have a class that defines them.  For example, one might have jobs whose inputs or outputs are "bytes" or "string" or Pair<"string","bytes">, etc.

These are the reasons that schema-in-job.xml is the required and preferred means of specification.  However there may be cases where it's preferable to additionally support specification of schemas via a specific class, as suggested in this issue.

A JobConf can be programmatically constructed.  Why is it so painful to insert the schema there as a part of your job creation/submission pipeline?  I'd like to better understand why that's so difficult before we add a new mechanism, since any added mechanism has the potential to create bugs and user confusion.
                
> Avro-MapRed: Provide a fallback using avro beans instead of schema in job configuration
> ---------------------------------------------------------------------------------------
>
>                 Key: AVRO-923
>                 URL: https://issues.apache.org/jira/browse/AVRO-923
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.4
>         Environment: any
>            Reporter: Julien Muller
>             Fix For: 1.6.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h


[jira] [Commented] (AVRO-923) Avro-MapRed: Provide a fallback using avro beans instead of schema in job configuration

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125964#comment-13125964 ] 

Doug Cutting commented on AVRO-923:
-----------------------------------

> it seems to me this risk is already taken for other parameters such as "avro.mapper". For the case of schemas though there is a second check that occurs when the input file schema does not match the compiled schema.

The input schema is not what I was most concerned about, but rather the map output schema.  If different tasks somehow got a different map output schema, it would result in strange, hard-to-debug i/o exceptions.  We require that the map output schema is constant across all tasks in a job for things to work correctly.  Of course, it's not always possible to prohibit folks from creating erroneous situations. We should try to discourage that, but we don't want to overly limit functionality in the process.

> It can also be described with xml files

What I meant was that the xml files can be programmatically constructed.  They should ideally not be constructed with cut and paste, but should use the same source for schemas as the Java code that's getting re-generated to build the new version of the jar file.  Perhaps you can refer to the schemas with an external entity definition in the XML that fetches the appropriate version? 

{code}
<!DOCTYPE job [
<!ENTITY schemaX SYSTEM "http://svn.foo.com/project/trunk/schemas/x.avsc">
]>
<job>
 ... &schemaX; ...
</job>
{code}
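
As an alternative to an external entity, the idea of constructing the XML programmatically could look roughly like the following build-time step. It is a sketch under assumptions: the class JobXmlSketch and its methods are hypothetical, the output uses the Hadoop-style configuration/property/name/value layout, and the schema text is passed in as a string (in practice it would presumably be read from the same .avsc file that the Java classes are generated from).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class JobXmlSketch {
    // Escape the characters that would otherwise break XML element content.
    // Schema JSON is full of double quotes, so this step matters.
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Render the given properties as a configuration document.
    static String toJobXml(Map<String, String> properties) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\"?>\n<configuration>\n");
        for (Map.Entry<String, String> e : properties.entrySet()) {
            xml.append("  <property>\n")
               .append("    <name>").append(escapeXml(e.getKey())).append("</name>\n")
               .append("    <value>").append(escapeXml(e.getValue())).append("</value>\n")
               .append("  </property>\n");
        }
        return xml.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        // Placeholder schema text; a real build would load it from the schema file.
        props.put("avro.input.schema", "{\"type\":\"string\"}");
        System.out.print(toJobXml(props));
    }
}
```

Generating the file this way keeps a single source for each schema, so the job.xml and the compiled classes are rebuilt from the same text.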

                


[jira] [Commented] (AVRO-923) Avro-MapRed: Provide a fallback using avro beans instead of schema in job configuration

Posted by "Julien Muller (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125714#comment-13125714 ] 

Julien Muller commented on AVRO-923:
------------------------------------

Answers to the previous comment:

- It's slightly riskier to get the schema from the runtime than from the job
> This is correct, but it seems to me this risk is already taken for other parameters such as "avro.mapper". For the case of schemas though there is a second check that occurs when the input file schema does not match the compiled schema.

- not all schemas have a class that defines them
> If the schema is a primitive type (e.g. long or string), I don't see any value in using the proposed mechanism. It seems to me this feature would only apply to complex schemas that may be updated regularly. If the schema is a Pair based on simple or complex type, we would still be able to generate the associated Avro bean. Not sure about the usual usage.

- Why is it so painful to insert the schema there as a part of your job
> Let's say you have a schema used in 100 different jobs in 20 workflows. Changing a field to nullable implies modifying all these workflows and their test runs, with a risk of copy/paste errors. As the schema is not human-readable (compared to a class name), it is hard to identify all the places where your schema is used (and in what version). We encountered this about 3 times over a 6-month period. If we were using a programmatically constructed JobConf, this would be a simple recompilation of the jobs.
A side effect is that we have to maintain separate specifications of our workflows, where the flows would otherwise be self-explanatory.

- A JobConf can be programmatically constructed
This is totally correct. It can also be described with xml files, and the whole point is to improve support for this second case. When using Avro as part of a global solution, together with Hadoop and Oozie, we can have separate responsibilities: developers implement Business Objects (Avro) and MapReduce, and architects design the workflow pipelines in xml files.

- any added mechanism has the potential to create bugs and user confusion
I try to address user confusion by still allowing the usage of "avro.input.schema", and only falling back to "avro.input.class". A way to improve this would be to put this mechanism behind the scenes, and add an additional signature setInputSchema(JobConf job, Class c).
This still would need to be improved to something like:
setInputSchema(JobConf job, Class<? extends SpecificRecord> c), but getSchema() is an instance method and there is no simple way to ensure the SCHEMA$ field would be present. 
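
A class-based setter of this kind might be sketched as follows. It is an illustration only, with these assumptions: a Map stands in for JobConf, SampleRecord stands in for a generated Avro class, and the names SubmitTimeSchema and SampleRecord are hypothetical. Since the compiler cannot guarantee that a class has a static SCHEMA$ field, the sketch checks for it at runtime and fails early.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashMap;
import java.util.Map;

// Stand-in for an Avro-generated class (assumption); SCHEMA$ is a String here
// so that the sketch does not need the Avro jar.
class SampleRecord {
    public static final String SCHEMA$ =
        "{\"type\":\"record\",\"name\":\"SampleRecord\",\"fields\":[]}";
}

class SubmitTimeSchema {
    static final String INPUT_SCHEMA = "avro.input.schema";

    // Resolve the class to its schema text when the job is being set up and
    // store that text under the existing schema key.
    static void setInputSchema(Map<String, String> job, Class<?> c) {
        try {
            Field field = c.getDeclaredField("SCHEMA$");
            if (!Modifier.isStatic(field.getModifiers())) {
                throw new IllegalArgumentException(c.getName() + ".SCHEMA$ is not static");
            }
            job.put(INPUT_SCHEMA, String.valueOf(field.get(null)));
        } catch (ReflectiveOperationException e) {
            // Raised when the field is missing or not accessible.
            throw new IllegalArgumentException(
                c.getName() + " has no readable static SCHEMA$ field", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> job = new HashMap<>();
        setInputSchema(job, SampleRecord.class);
        System.out.println(job.get(INPUT_SCHEMA));
    }
}
```

In this variant the class is resolved once, on the submitting side, and the resulting schema string is what ends up in the job configuration. It does not help when the keys are written by hand in an xml file, which is the case the fallback keys target.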

Another approach would be to drop the requirement to set the schema in the xml configuration entirely: the AvroMapper knows the input schema as it is compiled with it, and the RecordReader knows the schema of the underlying data. If anything needs to match, it should be these two, rather than either of them matching an external schema string. Not sure if there is a technical limitation to this approach.
                
