Posted to user@avro.apache.org by Julien Muller <ju...@ezako.com> on 2011/10/10 14:41:51 UTC

Avro mapred: How to avoid schema specification in job.xml?

Hello,

I have been using avro with hadoop and oozie for months now and I am very
happy with the results.

The only point I see as a limitation now is that we specify Avro schemas in
workflow.xml (job.xml):
- avro.input.schema
- avro.output.schema
Since this info is already provided in Mapper/Reducer signatures, I see this
as redundant. The schema is also present in all my serialized files, which
means that the schema is specified in 3 different places.

From an operational point of view, this is a pain, since any schema
modification (let's say a simple optional field added) forces me to update
many job files. This task is very error-prone, and since we have a large
number of jobs, it generates a lot of work.

The only solution I see now would be a find/replace in the build script,
but I hope to find a better one, either by providing some generic schemas
to the job file or by deactivating schema validation in the job.
Any help will be appreciated!

-- 
Julien Muller

Re: Avro mapred: How to avoid schema specification in job.xml?

Posted by Julien Muller <ju...@ezako.com>.

Hello,

I followed your advice and filed a JIRA issue:
https://issues.apache.org/jira/browse/AVRO-923
My first idea was to implement a custom HadoopMapper, but the actual code
for this is in static methods of AvroJob. The impact is that I would have to
add many custom classes (Mapper, Reducer, RecordReader, AvroJob ...).
It is actually a very good idea to implement a fallback in AvroJob, since it
would take only a dozen lines of code.

The route I might take is to build my own version of Avro MapRed based on
1.5.4, but this has several drawbacks, including problems during version
updates and job modifications once the feature is actually implemented.

--
Julien Muller


Re: Avro mapred: How to avoid schema specification in job.xml?

Posted by Scott Carey <sc...@apache.org>.

On 10/10/11 11:41 AM, "Julien Muller" <ju...@ezako.com> wrote:

> Hello,
> 
> Thanks for your answer, let me try to clarify my context a bit:
> 
>> I'm not all that familiar with how Oozie interacts with Avro.
> Let's get oozie out of the picture. I use job.xml files to configure Jobs.
> This means I do not have any JobConf object and I cannot use AvroJob.
> Therefore I directly write the job properties (as what AvroJob outputs).
> 
>> The Job must set its avro.input.schema and avro.output.schema properties;
>> this can be done in code (see the unit tests in the Avro mapred project for
>> examples),
> The solution I have now is basically based on the Avro mapred unit tests. But
> in my context, it is not an option to code (using the $SCHEMA property) at the
> job configuration level.
> where you code:
>     AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
> I have to copy the entire schema into the job.xml file. And I have to update
> it every time my schema gets updated.
> I hope I can find a better solution.

I suppose that in AvroJob we could transmit only the class name in a
property, and use that to look up the schema for generated classes using
reflection.  Could you do something similar?  I don't think it is possible
to avoid configuring at least some sort of pointer to where the schema is.
This could be via a property, or if you already have the job class, an
annotation on that class.
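
As a rough sketch of the class-name idea above: the property name
avro.input.schema.class and the class SchemaByClassName are hypothetical, and
ExampleRecord is a stand-in so the example runs without Avro on the classpath.
Avro-generated specific classes carry a public static SCHEMA$ field, which is
what the lookup relies on. With real Avro the field holds an
org.apache.avro.Schema, whose toString() is the schema JSON.

```java
import java.util.Properties;

// Stand-in for an Avro-generated class. Real generated classes expose a
// public static SCHEMA$ field of type org.apache.avro.Schema.
class ExampleRecord {
    public static final Object SCHEMA$ =
        "{\"type\":\"record\",\"name\":\"ExampleRecord\",\"fields\":[]}";
}

class SchemaByClassName {
    // Hypothetical property name, chosen for this sketch only.
    static final String INPUT_SCHEMA_CLASS = "avro.input.schema.class";

    // Resolve the schema JSON from the class named in the configuration.
    static String schemaJson(Properties conf) {
        String className = conf.getProperty(INPUT_SCHEMA_CLASS);
        if (className == null) {
            throw new IllegalStateException(INPUT_SCHEMA_CLASS + " is not set");
        }
        try {
            Class<?> recordClass = Class.forName(className);
            // Static field, so no instance is needed for the read.
            Object schema = recordClass.getField("SCHEMA$").get(null);
            return String.valueOf(schema);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No schema found on " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(INPUT_SCHEMA_CLASS, ExampleRecord.class.getName());
        System.out.println(schemaJson(conf));
    }
}
```

With this approach the job file would carry only a class name, so adding an
optional field to the schema would not require touching the job file at all.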

> 
>> and if you are using SpecificRecords and DataFiles the schema is available to
>> the code where necessary.
> I am not sure what you mean here. I am using SpecificRecords and would like to
> avoid specifying avro.input.schema, since this info is already here in the
> specific record.

Potentially the AvroMapper / AvroReducer could have a fall-back for
obtaining the schema if the property is not set: reflection on a class name
or an annotation.  If this looks like it is an enhancement request for Avro
(or a bug) please file a JIRA ticket.  Thanks!
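
The annotation variant of that fall-back might look roughly like the sketch
below. Everything except the avro.input.schema property name is an assumption
made for illustration: the @InputSchemaFrom annotation, the SchemaFallback
class, and the PageView / PageViewMapper stand-ins are hypothetical and are not
part of Avro.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Properties;

// Hypothetical annotation naming the record class whose schema the job reads.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface InputSchemaFrom {
    Class<?> value();
}

// Stand-in for an Avro-generated class with its static SCHEMA$ field.
class PageView {
    public static final Object SCHEMA$ = "\"string\"";
}

// Stand-in for a mapper class that points at its input record type.
@InputSchemaFrom(PageView.class)
class PageViewMapper {
}

class SchemaFallback {
    static String resolveInputSchema(Properties conf, Class<?> mapperClass) {
        // 1. An explicitly configured schema always wins.
        String explicit = conf.getProperty("avro.input.schema");
        if (explicit != null) {
            return explicit;
        }
        // 2. Otherwise follow the annotation on the mapper class.
        InputSchemaFrom pointer = mapperClass.getAnnotation(InputSchemaFrom.class);
        if (pointer == null) {
            throw new IllegalStateException(
                "avro.input.schema is not set and " + mapperClass.getName()
                + " has no @InputSchemaFrom annotation");
        }
        try {
            return String.valueOf(pointer.value().getField("SCHEMA$").get(null));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                "No SCHEMA$ field on " + pointer.value().getName(), e);
        }
    }

    public static void main(String[] args) {
        // No property set, so the schema comes from the annotation.
        System.out.println(resolveInputSchema(new Properties(), PageViewMapper.class));
    }
}
```

Checking the property first keeps existing job files working unchanged, and
the annotation is consulted only when the property is absent.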


Re: Avro mapred: How to avoid schema specification in job.xml?

Posted by Julien Muller <ju...@ezako.com>.

Hello,

Thanks for your answer, let me try to clarify my context a bit:

> I'm not all that familiar with how Oozie interacts with Avro.
>
Let's get oozie out of the picture. I use job.xml files to configure Jobs.
This means I do not have any JobConf object and I cannot use AvroJob.
Therefore I directly write the job properties (as what AvroJob outputs).

> The Job must set its avro.input.schema and avro.output.schema properties;
> this can be done in code (see the unit tests in the Avro mapred project for
> examples),
>
The solution I have now is basically based on the Avro mapred unit tests.
But in my context, it is not an option to code (using the $SCHEMA property)
at the job configuration level.
where you code:
    AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
I have to copy the entire schema into the job.xml file. And I have to update
it every time my schema gets updated.
I hope I can find a better solution.

> and if you are using SpecificRecords and DataFiles the schema is available
> to the code where necessary.
>
I am not sure what you mean here. I am using SpecificRecords and would like
to avoid specifying avro.input.schema, since this info is already here in
the specific record.

Thanks,

Julien Muller


Re: Avro mapred: How to avoid schema specification in job.xml?

Posted by Scott Carey <sc...@apache.org>.

I'm not all that familiar with how Oozie interacts with Avro.

The Job must set its avro.input.schema and avro.output.schema properties;
this can be done in code (see the unit tests in the Avro mapred project for
examples), and if you are using SpecificRecords and DataFiles the schema is
available to the code where necessary.


