Posted to user@mrunit.apache.org by Florian Froese <f....@gmail.com> on 2013/12/17 16:25:40 UTC

MRUnit Avro support

Hey guys!
I noticed that MRUnit uses the same Serialization for adding data as input and as output. This works fine if the key and value types are Writable.
But I am currently using Avro-based types (e.g. AvroKey<Long>), following the example at http://stackoverflow.com/questions/15230482/mrunit-with-avro-nullpointerexception-in-serialization .
This works fine if either the input or the output uses Avro types. But if both input and output are Avro-based, only one schema can be used (the one registered under "avro.serialization.key.writer.schema" and "avro.serialization.value.writer.schema").

So if, for example, you have a Mapper<AvroKey<String>, AvroValue<String>, AvroKey<String>, AvroValue<Long>>, the output serialization will fail because it tries to decode a Long as a String.
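
To make this concrete, here is roughly what my test setup looks like. MyAvroMapper is a placeholder for the real mapper, and the configuration follows the Stack Overflow workaround; whichever writer schema is registered here gets used for both input and output:

import org.apache.avro.Schema;
import org.apache.avro.hadoop.io.AvroSerialization;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;

public class MyAvroMapperTest {

  // MyAvroMapper is hypothetical: a
  // Mapper<AvroKey<String>, AvroValue<String>, AvroKey<String>, AvroValue<Long>>.
  private MapDriver<AvroKey<String>, AvroValue<String>,
                    AvroKey<String>, AvroValue<Long>> driver;

  @Before
  public void setUp() {
    driver = MapDriver.newMapDriver(new MyAvroMapper());
    Configuration conf = driver.getConfiguration();
    // Register AvroSerialization next to the default serializations.
    AvroSerialization.addToConfiguration(conf);
    // There is only one writer-schema slot per key and per value, so these
    // String schemas also get applied to the AvroValue<Long> output.
    AvroSerialization.setKeyWriterSchema(conf, Schema.create(Schema.Type.STRING));
    AvroSerialization.setValueWriterSchema(conf, Schema.create(Schema.Type.STRING));
  }
}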

Normally with Avro, the input and output schemas are given by the properties "avro.schema.input.key" and "avro.schema.output.key".
The AvroSerialization class from Avro only uses the "avro.serialization.key.writer.schema" schemas; the actual input and output are handled by AvroKeyValueInputFormat / AvroKeyValueOutputFormat.
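
For comparison, this is roughly how a regular Avro job keeps the two sides apart (a sketch against the Hadoop 2 / avro-mapred mapreduce API):

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyValueInputFormat;
import org.apache.avro.mapreduce.AvroKeyValueOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class AvroJobSetup {
  public static Job configure() throws IOException {
    Job job = Job.getInstance();
    job.setInputFormatClass(AvroKeyValueInputFormat.class);
    job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
    // AvroJob stores these under separate properties ("avro.schema.input.key",
    // "avro.schema.output.key" and their .value counterparts), so input and
    // output schemas never collide.
    AvroJob.setInputKeySchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setInputValueSchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.LONG));
    return job;
  }
}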

Adding a custom AvroSerialization class that carries the second schema to "io.serializations" would not solve the problem either, since both serialization classes would accept AvroKey and AvroValue, and the SerializationFactory simply picks the first one that accepts the class.

Are you planning to add Avro support to the MRUnit API, beyond the workaround of defining the schemas as config properties?
(e.g. a method withAvroKeyInputSchema(Schema schema))

I would suggest providing an API for Avro and swapping "avro.serialization.key.writer.schema" out for the corresponding input/output schema whenever new values are added. That way only the addInput() and addOutput() methods would have to change, although this is admittedly a hack.
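
To illustrate, the driver API I have in mind would look something like this (none of these withAvro* methods exist today; this is just the proposal):

// Hypothetical fluent API: each withAvro* call would stash its schema, and
// addInput()/addOutput() would swap the right one into
// "avro.serialization.key.writer.schema" (and the value counterpart)
// before copying the records.
driver.withAvroKeyInputSchema(Schema.create(Schema.Type.STRING))
      .withAvroValueInputSchema(Schema.create(Schema.Type.STRING))
      .withAvroKeyOutputSchema(Schema.create(Schema.Type.STRING))
      .withAvroValueOutputSchema(Schema.create(Schema.Type.LONG))
      .withInput(new AvroKey<String>("in"), new AvroValue<String>("put"))
      .withOutput(new AvroKey<String>("out"), new AvroValue<Long>(1L));
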
Do you have suggestions for providing Avro support in a cleaner way without huge effort?

Would you be interested in integrating Avro support into MRUnit?
Do you have any recommendations on how to proceed?
Should I create a JIRA issue and suggest an implementation?

Best regards
Florian

Re: MRUnit Avro support

Posted by Brock Noland <br...@cloudera.com>.
I think to get around that you can use a separate configuration for the output:

http://mrunit.apache.org/documentation/javadocs/1.0.0/org/apache/hadoop/mrunit/TestDriver.html#withOutputSerializationConfiguration(org.apache.hadoop.conf.Configuration)
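
Something along these lines (an untested sketch, reusing the hypothetical driver from your mail): keep the input schemas on the driver's own configuration and put the output schemas on a second configuration passed to withOutputSerializationConfiguration():

import org.apache.avro.Schema;
import org.apache.avro.hadoop.io.AvroSerialization;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;

public class AvroSchemaConfig {
  // Configure a driver whose input is String/String and output String/Long.
  public static void configure(
      MapDriver<AvroKey<String>, AvroValue<String>,
                AvroKey<String>, AvroValue<Long>> driver) {
    // Input side: String writer schemas on the driver's own configuration.
    Configuration inConf = driver.getConfiguration();
    AvroSerialization.addToConfiguration(inConf);
    AvroSerialization.setKeyWriterSchema(inConf, Schema.create(Schema.Type.STRING));
    AvroSerialization.setValueWriterSchema(inConf, Schema.create(Schema.Type.STRING));

    // Output side: a copy of that configuration with the Long value schema,
    // used only when copying the expected output records.
    Configuration outConf = new Configuration(inConf);
    AvroSerialization.setValueWriterSchema(outConf, Schema.create(Schema.Type.LONG));
    driver.withOutputSerializationConfiguration(outConf);
  }
}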


On Tue, Dec 17, 2013 at 10:48 AM, Florian Froese <f....@gmail.com> wrote:

> The problem starts when adding values to the driver.
> It is with the copy() method in TestDriver.java .
> Since there is only one Serialization for input and output, only one
> schema can be defined for key and value.
> Thus if there are different avroschemas for input and output an error is
> thrown since it cannot serialize the object with the wrong schema.
>
> Cheers
> Florian
>
>
>
> On 17.12.2013, at 17:23, Brock Noland <br...@cloudera.com> wrote:
>
> I have added an answer to SO:
> http://stackoverflow.com/questions/15230482/mrunit-with-avro-nullpointerexception-in-serialization/20639370#20639370
>
> Please checkout this JIRA:
> https://issues.apache.org/jira/browse/MRUNIT-181 and let me know if that
> works for you.
>
> Cheers!
>
>
> On Tue, Dec 17, 2013 at 9:25 AM, Florian Froese <f....@gmail.com>wrote:
>
>> Hey guys!
>> I noticed that MRUnit uses the same Serialization for adding data as
>> input and as output. This works fine if the key and value types are
>> Writable.
>> But I currently used avro based types (ex. AvroKey<Long> ) by following
>> the example of
>> http://stackoverflow.com/questions/15230482/mrunit-with-avro-nullpointerexception-in-serialization
>>  .
>> This works fine if either input or the output are avro types. But if
>> both, input and output are avro based, only one schema can be used ( the
>> one registered in “avro.serialization.key.writer.schema” and
>> “avro.serialization.value.writer.schema”).
>>
>> So if e.g. you have a mapper Mapper<AvroKey<String>,
>> AvroValue<String>,AvroKey<String>,AvroValue<Long>> the output serialization
>> will fail since it tries to decode a Long as a String.
>>
>> Normally using Avro the input and output schemas are given by the
>> properties "avro.schema.input.key" and "avro.schema.output.key" .
>> The AvroSerialization class from avro only uses the
>> “avro.serialization.key.writer.schema” schemas. Input and output is done by
>> the AvroKeyValueInputFormat / OutputFormat.
>>
>> Adding a custom AvroSerialization that takes the schemas class to
>> "io.serializations" would not solve the problem since both classes would
>> accept AvroKey and AvroValue.
>>
>> Are you planing to include Avro support into the MRUnit API? Other than
>> the workaround defining the schemas as config properties?
>> (e.g. a  method withAvroKeyInputSchema(Schema schema) )
>>
>> I would suggest to provide an API for avro and to switch out
>> “avro.serialization.key.writer.schema” by the according input/output
>> schemas when adding new values. This way only the addInput() and
>> addOutput() methods have to be changed. Although this is a hack.
>> Do you have suggestions that provide avro support in a cleaner way
>> without huge effort?
>>
>> Would you be interested to integrate avro support into MRUnit?
>> Do you have any recommendations on how to proceed?
>> Should I create a JIRA issue and suggest an implementation?
>>
>> Best regards
>> Florian
>>
>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>
>
>


-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org

Re: MRUnit Avro support

Posted by Florian Froese <f....@gmail.com>.
The problem starts when adding values to the driver.
It is in the copy() method in TestDriver.java.
Since there is only one Serialization for input and output, only one schema can be defined for the key and one for the value.
Thus, if there are different Avro schemas for input and output, an error is thrown because the object cannot be serialized with the wrong schema.
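
Roughly, copy() round-trips each key and value through whatever serialization the SerializationFactory resolves for its class, along these lines (a simplified sketch, not the exact MRUnit code):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.serializer.Deserializer;
import org.apache.hadoop.io.serializer.SerializationFactory;
import org.apache.hadoop.io.serializer.Serializer;

public class CopyExample {
  // Serialize and immediately deserialize an object. With Avro types, the
  // factory resolves AvroSerialization, which reads the single writer schema
  // from the configuration -- no matter if the object is an input or an output.
  @SuppressWarnings("unchecked")
  public static <T> T copy(T object, Configuration conf) throws IOException {
    SerializationFactory factory = new SerializationFactory(conf);
    Class<T> clazz = (Class<T>) object.getClass();
    Serializer<T> serializer = factory.getSerializer(clazz);
    Deserializer<T> deserializer = factory.getDeserializer(clazz);
    DataOutputBuffer out = new DataOutputBuffer();
    serializer.open(out);
    serializer.serialize(object);
    serializer.close();
    DataInputBuffer in = new DataInputBuffer();
    in.reset(out.getData(), out.getLength());
    deserializer.open(in);
    T copy = deserializer.deserialize(null);
    deserializer.close();
    return copy;
  }
}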

Cheers
Florian




Re: MRUnit Avro support

Posted by Brock Noland <br...@cloudera.com>.
I have added an answer to SO:
http://stackoverflow.com/questions/15230482/mrunit-with-avro-nullpointerexception-in-serialization/20639370#20639370

Please check out this JIRA:
https://issues.apache.org/jira/browse/MRUNIT-181 and let me know if
that works for you.

Cheers!


-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org