Posted to user@mrunit.apache.org by Jacob Metcalf <ja...@hotmail.com> on 2012/05/09 09:15:27 UTC

Deserializer used for both Map and Reducer context.write()


I am trying to integrate Avro-1.7 (specifically the new MR2 extensions), MRUnit-0.9.0 and Hadoop-0.23. Assuming I have not made any mistakes, my question is: should MRUnit be using the Serialization factory when I call context.write() in a reducer?
I am using MapReduceDriver and my mapper has output signature:
             <AvroKey<SpecificKey1>,AvroValue<SpecificValue1>> 
My reducer has a different output signature:
             <AvroKey<SpecificValue2>, Null>. 
I am using Avro specific serialization so I set my Avro schemas like this:
    AvroSerialization.addToConfiguration( configuration );
    AvroSerialization.setKeyReaderSchema( configuration, SpecificKey1.SCHEMA$ );
    AvroSerialization.setKeyWriterSchema( configuration, SpecificKey1.SCHEMA$ );
    AvroSerialization.setValueReaderSchema( configuration, SpecificValue1.SCHEMA$ );
    AvroSerialization.setValueWriterSchema( configuration, SpecificValue1.SCHEMA$ );
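(For what it is worth, these setters simply store the schema JSON in the Configuration under keys of the form avro.serialization.[key|value].[writer|reader].schema, so only one reader/writer schema pair can be configured for the keys and one for the values at any time.)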
My understanding of Avro MR is that the Serialization class is intended to be invoked between the map and reduce phases.
However my test fails at the reduce stage. Debugging, I realised the mock reducer context is using the serializer to copy objects:
    https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockContextWrapper.java
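As far as I can tell that copy amounts to a serialise/deserialise round trip through Hadoop's SerializationFactory, roughly like the sketch below (the helper name roundTripCopy is mine, not MRUnit's actual code):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.serializer.Deserializer;
    import org.apache.hadoop.io.serializer.SerializationFactory;
    import org.apache.hadoop.io.serializer.Serializer;

    // Copy an object by serialising and deserialising it, which is in effect
    // what org.apache.hadoop.mrunit.internal.io.Serialization does. The
    // factory picks whichever serialization in "io.serializations" accepts
    // the class (here AvroSerialization), and that serialization reads its
    // schema from the Configuration -- hence the failure when the reducer
    // output class does not match the configured schema.
    public static <T> T roundTripCopy(final T obj, final Configuration conf)
            throws IOException {
        final SerializationFactory factory = new SerializationFactory(conf);
        @SuppressWarnings("unchecked")
        final Class<T> cls = (Class<T>) obj.getClass();
        final Serializer<T> serializer = factory.getSerializer(cls);
        final Deserializer<T> deserializer = factory.getDeserializer(cls);
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        serializer.open(out);
        serializer.serialize(obj);
        serializer.close();
        deserializer.open(new ByteArrayInputStream(out.toByteArray()));
        final T copy = deserializer.deserialize(null);
        deserializer.close();
        return copy;
    }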

Looking at the AvroSerialization object it only expects one set of schemas:
   http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java?view=markup

So when my reducer tries to write SpecificValue2 to the context, MRUnit's mock then tries to serialise SpecificValue2 with SpecificValue1.SCHEMA$ and as a result fails.
I have not yet debugged Hadoop itself, but I did read some comments (which I since cannot locate) which say that the Serialization class is typically not used for the output of the reduce stage. My limited understanding is that the OutputFormat (e.g. AvroKeyOutputFormat) will act as the deserializer when you are running in Hadoop.
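If that is right, then in a real job the reducer output schema travels with the job/OutputFormat setup rather than with io.serializations, along these lines (a sketch using the new Avro MR2 API; untested):

    // The reducer output schema is handed to the OutputFormat via the job
    // configuration, independently of AvroSerialization's schema settings.
    Job job = Job.getInstance( configuration );
    AvroJob.setOutputKeySchema( job, SpecificValue2.SCHEMA$ );
    job.setOutputFormatClass( AvroKeyOutputFormat.class );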
I can spend some time distilling my code into a simple example but wondered if anyone had any pointers - or an Avro + MR2 + MRUnit example.
Jacob



RE: MRUNIT-114: should support calling configure on sorting and grouping comparators if applicable

Posted by Jacob Metcalf <ja...@hotmail.com>.
Jim
Yes I did, thanks for putting it in JIRA. 
For testing a reduce side join I quickly wrote the attached clone of the MapReduceDriver to simulate MultipleInputs. However extending it beyond two inputs would call for a different approach, possibly using your new InputFormats code. I would also like to be able to perform MultipleOutputs.
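In outline the simulation does something like the sketch below; OrderMapper, CustomerMapper and JoinReducer are placeholder names, not the attached code itself:

    // Simulate MultipleInputs with stock MRUnit drivers: run each input
    // through its own MapDriver, group the merged intermediate pairs by key,
    // then replay each group through a ReduceDriver. A faithful version
    // would apply the grouping and sorting comparators at the merge step.
    List<Pair<Text, Text>> intermediate = new ArrayList<Pair<Text, Text>>();
    intermediate.addAll(MapDriver.newMapDriver(new OrderMapper())
        .withInput(new LongWritable(0), new Text("order,42")).run());
    intermediate.addAll(MapDriver.newMapDriver(new CustomerMapper())
        .withInput(new LongWritable(0), new Text("customer,42")).run());

    SortedMap<Text, List<Text>> grouped = new TreeMap<Text, List<Text>>();
    for (Pair<Text, Text> pair : intermediate) {
        if (!grouped.containsKey(pair.getFirst())) {
            grouped.put(pair.getFirst(), new ArrayList<Text>());
        }
        grouped.get(pair.getFirst()).add(pair.getSecond());
    }

    List<Pair<Text, Text>> results = new ArrayList<Pair<Text, Text>>();
    for (Map.Entry<Text, List<Text>> entry : grouped.entrySet()) {
        results.addAll(ReduceDriver.newReduceDriver(new JoinReducer())
            .withInput(entry.getKey(), entry.getValue()).run());
    }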
One of the things I like about MRUnit is that you can perform traditional unit testing without reliance on a Hadoop install. Apart from making builds faster and more reliable, it also means I can develop on Windows without getting Hadoop running under Cygwin. However I can see you have to draw a line, as it would be a never-ending task replicating all the functionality, especially with all the differences in the two sets of APIs.

Let me know if there is anything else you want me to test.
Jacob
> Date: Tue, 5 Jun 2012 22:37:31 -0400
> From: donofrio111@gmail.com
> To: user@mrunit.apache.org
> Subject: Fwd: MRUNIT-114: should support calling configure on sorting and grouping comparators if applicable
> 
> Jacob,
> 
> Did you get the below email?
> 
> -------- Original Message --------
> Subject: MRUNIT-114: should support calling configure on sorting and 
> grouping comparators if applicable
> Date: Tue, 29 May 2012 22:46:27 -0400
> From: Jim Donofrio <do...@gmail.com>
> To: user@mrunit.apache.org
> 
> Yes that is a bug, nice catch, I created MRUNIT-114.
> 
> How are you testing your multiple inputs code, using the LocalJobRunner?
> I think MRUnit could test multiple inputs by using the inputformat to
> read the data from each of the paths into a mapdriver and then
> consolidating the output into one list to pass to one reduce driver.
> However, I get nervous that MRUnit will eventually be reimplementing the
> LocalJobRunner. Ideally MRUnit would only use the LocalJobRunner by
> taking only a JobConf as input and would just provide support code to
> make it easier to use by creating the input files and reading the output
> files.
> 
> 
> We fail to use ReflectionUtils.newInstance on the 2 comparators, it
> would be better if most of these classes only took classes as input
> instead of the actual object
> 
>    public void setKeyGroupingComparator(
>        final RawComparator<K2> groupingComparator) {
>      keyGroupComparator = ReflectionUtils.newInstance(
>          returnNonNull(groupingComparator).getClass(), getConfiguration());
>    }
> 
>    public void setKeyOrderComparator(
>        final RawComparator<K2> orderComparator) {
>      keyValueOrderComparator = ReflectionUtils.newInstance(
>          returnNonNull(orderComparator).getClass(), getConfiguration());
>    }
> 
> https://issues.apache.org/jira/browse/MRUNIT-114
> 
> On 05/28/2012 01:40 PM, Jacob Metcalf wrote:
> > Yes either would work.
> >
> > On another subject (maybe I need a new thread/JIRA for this) is it
> > intentional that the configuration is not applied to the sorting and
> > grouping comparators?
> >
> > I am writing my own multiple input MR driver to test a reduce side
> > join and had to do:
> >
> >             // Configure grouping and sorting comparators
> >    if (keyGroupComparator instanceof Configured ) {
> >        ((Configured)keyGroupComparator).setConf( configuration );
> >    }
> >    if (keyValueOrderComparator instanceof Configured) {
> >        ((Configured)keyValueOrderComparator).setConf( configuration );
> >    }
> >
> > To get the config applied even though they are Configured objects.
> >
> > Jacob
> >

Fwd: MRUNIT-114: should support calling configure on sorting and grouping comparators if applicable

Posted by Jim Donofrio <do...@gmail.com>.
Jacob,

Did you get the below email?

-------- Original Message --------
Subject: MRUNIT-114: should support calling configure on sorting and 
grouping comparators if applicable
Date: Tue, 29 May 2012 22:46:27 -0400
From: Jim Donofrio <do...@gmail.com>
To: user@mrunit.apache.org

Yes that is a bug, nice catch, I created MRUNIT-114.

How are you testing your multiple inputs code, using the LocalJobRunner?
I think MRUnit could test multiple inputs by using the inputformat to
read the data from each of the paths into a mapdriver and then
consolidating the output into one list to pass to one reduce driver.
However, I get nervous that MRUnit will eventually be reimplementing the
LocalJobRunner. Ideally MRUnit would only use the LocalJobRunner by
taking only a JobConf as input and would just provide support code to
make it easier to use by creating the input files and reading the output
files.


We fail to use ReflectionUtils.newInstance on the 2 comparators, it
would be better if most of these classes only took classes as input
instead of the actual object

   public void setKeyGroupingComparator(
       final RawComparator<K2> groupingComparator) {
     keyGroupComparator = ReflectionUtils.newInstance(
         returnNonNull(groupingComparator).getClass(), getConfiguration());
   }

   public void setKeyOrderComparator(
       final RawComparator<K2> orderComparator) {
     keyValueOrderComparator = ReflectionUtils.newInstance(
         returnNonNull(orderComparator).getClass(), getConfiguration());
   }

https://issues.apache.org/jira/browse/MRUNIT-114

On 05/28/2012 01:40 PM, Jacob Metcalf wrote:
> Yes either would work.
>
> On another subject (maybe I need a new thread/JIRA for this) is it
> intentional that the configuration is not applied to the sorting and
> grouping comparators?
>
> I am writing my own multiple input MR driver to test a reduce side
> join and had to do:
>
>             // Configure grouping and sorting comparators
>    if (keyGroupComparator instanceof Configured ) {
>        ((Configured)keyGroupComparator).setConf( configuration );
>    }
>    if (keyValueOrderComparator instanceof Configured) {
>        ((Configured)keyValueOrderComparator).setConf( configuration );
>    }
>
> To get the config applied even though they are Configured objects.
>
> Jacob
>

MRUNIT-114: should support calling configure on sorting and grouping comparators if applicable

Posted by Jim Donofrio <do...@gmail.com>.
Yes that is a bug, nice catch, I created MRUNIT-114.

How are you testing your multiple inputs code, using the LocalJobRunner? 
I think MRUnit could test multiple inputs by using the inputformat to
read the data from each of the paths into a mapdriver and then 
consolidating the output into one list to pass to one reduce driver. 
However, I get nervous that MRUnit will eventually be reimplementing the 
LocalJobRunner. Ideally MRUnit would only use the LocalJobRunner by 
taking only a JobConf as input and would just provide support code to 
make it easier to use by creating the input files and reading the output 
files.


We fail to use ReflectionUtils.newInstance on the 2 comparators, it 
would be better if most of these classes only took classes as input 
instead of the actual object

   public void setKeyGroupingComparator(
       final RawComparator<K2> groupingComparator) {
     keyGroupComparator = ReflectionUtils.newInstance(
         returnNonNull(groupingComparator).getClass(), getConfiguration());
   }

   public void setKeyOrderComparator(
       final RawComparator<K2> orderComparator) {
     keyValueOrderComparator = ReflectionUtils.newInstance(
         returnNonNull(orderComparator).getClass(), getConfiguration());
   }
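For what it is worth, the attraction of routing construction through ReflectionUtils.newInstance is that it also pushes the Configuration into the new instance whenever it implements Configurable, e.g. (a sketch; MyGroupComparator is a placeholder class name):

   // newInstance constructs the comparator and, because ReflectionUtils
   // calls setConf on Configurable instances, hands it the Configuration,
   // so no manual instanceof check is needed.
   RawComparator<Text> comparator = ReflectionUtils.newInstance(
       MyGroupComparator.class, getConfiguration());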

https://issues.apache.org/jira/browse/MRUNIT-114

On 05/28/2012 01:40 PM, Jacob Metcalf wrote:
> Yes either would work.
>
> On another subject (maybe I need a new thread/JIRA for this) is it 
> intentional that the configuration is not applied to the sorting and 
> grouping comparators?
>
> I am writing my own multiple input MR driver to test a reduce side 
> join and had to do:
>
>             // Configure grouping and sorting comparators
>    if (keyGroupComparator instanceof Configured ) {
>        ((Configured)keyGroupComparator).setConf( configuration );
>    }
>    if (keyValueOrderComparator instanceof Configured) {
>        ((Configured)keyValueOrderComparator).setConf( configuration );
>    }
>
> To get the config applied even though they are Configured objects.
>
> Jacob
>

RE: Deserializer used for both Map and Reducer context.write()

Posted by Jacob Metcalf <ja...@hotmail.com>.
Yes either would work.
On another subject (maybe I need a new thread/JIRA for this) is it intentional that the configuration is not applied to the sorting and grouping comparators?
I am writing my own multiple input MR driver to test a reduce side join and had to do:

            // Configure grouping and sorting comparators
            if (keyGroupComparator instanceof Configured) {
                ((Configured) keyGroupComparator).setConf( configuration );
            }
            if (keyValueOrderComparator instanceof Configured) {
                ((Configured) keyValueOrderComparator).setConf( configuration );
            }
To get the config applied even though they are Configured objects.
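Incidentally, Hadoop's own helper would be an equivalent shortcut here, since it performs the Configurable check internally (a sketch, assuming the comparators are in scope):

            // ReflectionUtils.setConf applies the Configuration to any object
            // implementing Configurable, which Configured subclasses do.
            ReflectionUtils.setConf( keyGroupComparator, configuration );
            ReflectionUtils.setConf( keyValueOrderComparator, configuration );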
Jacob

> Date: Tue, 22 May 2012 22:36:26 -0400
> From: donofrio111@gmail.com
> To: user@mrunit.apache.org
> Subject: Fwd: Re: Deserializer used for both Map and Reducer context.write()
> 
> Or maybe if creating a Pair is annoying we could instead do:
> 
> public interface Copier<K, V> {
> 
>    public K copyKey(K key);
> 
>    public V copyValue(V value);
> 
> }
> 
> -------- Original Message --------
> Subject: Re: Deserializer used for both Map and Reducer context.write()
> Date: Tue, 22 May 2012 22:17:02 -0400
> From: Jim Donofrio <do...@gmail.com>
> To: user@mrunit.apache.org
> 
> Ok I understand now. The outputformat could work for you but doesnt
> because I call Serialization.copy in order to use runTest or have run()
> return all the outputs in a list.
> 
> How about if I provide both solutions as overloaded methods since
> sometimes an alternative JobConf will be easier while other times a
> cloner object will be easier. I think maybe we should use a different
> term such as copier to avoid confusion with Java's Cloneable.
> 
> So for MapDriver for example there would be:
> 
> public MapDriver<K1, V1, K2, V2> withOutputFormat(
>     final Class<? extends OutputFormat> outputFormatClass,
>     final Class<? extends InputFormat> inputFormatClass)
> 
> public MapDriver<K1, V1, K2, V2> withOutputFormat(
>     final Class<? extends OutputFormat> outputFormatClass,
>     final Class<? extends InputFormat> inputFormatClass,
>     JobConf inputFormatOnlyJobConf)
> 
> public MapDriver<K1, V1, K2, V2> withOutputFormat(
>     final Class<? extends OutputFormat> outputFormatClass,
>     final Class<? extends InputFormat> inputFormatClass,
>     Copier copier)
> 
> Copier would be an interface:
> 
> public interface Copier<K, V> {
> 
>    public Pair<K, V> copy(K key, V value);
> 
> }
> 
> What do you think?
> 
> On 05/22/2012 04:34 PM, Jacob Metcalf wrote:
> > Jim
> >
> > My last example is just to show other people who are stuck with this
> > how to use unions to solve it.
> >
> > I had not used an output format in it because it made no difference to
> > my issue. Let me try to explain with the code:
> >
> > Line 162:
> > https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockMapreduceOutputFormat.java
> >
> >
> > while (recordReader.nextKeyValue()) {
> >   outputs.add(new Pair<K, V>(
> >       serialization.copy(recordReader.getCurrentKey()),
> >       serialization.copy(recordReader.getCurrentValue())));
> > }
> >
> > Line 48
> > https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/output/MockOutputCollector.java
> >
> > public void collect(final K key, final V value) throws IOException {
> >   collectedOutputs.add(new Pair<K, V>(
> >       serialization.copy(key), serialization.copy(value)));
> > }
> >
> > Both of these use MRUnit's Serialization class to clone the output
> > objects so it does not matter whether I configure an output format.
> > These in turn will use AvroSerialization to attempt to clone the
> > object. In my Hadoop job my map output schema Room is different to my
> > reducer output schema House, AvroSerialization is only configured to
> > be able to serialize the mapper output Room and it all works. However
> > if I attempt the same with the unit test, MRUnit's Serialization class
> > attempts to clone a House using the Room schema and it blows up.
> >
> > You suggested adding a JobConf, that would work but would require
> > people to write code to clone the conf and change the Schema.
> > Alternatively I suggested since the user has to write code anyway you
> > could allow them to pass in a Cloner object. For Avro this would be a
> > simple one liner calling Avro's deepCopy:
> > http://avro.apache.org/docs/1.6.0/api/java/org/apache/avro/generic/GenericData.html#deepCopy(org.apache.avro.Schema,
> > java.lang.Object)
> > <http://avro.apache.org/docs/1.6.0/api/java/org/apache/avro/generic/GenericData.html#deepCopy%28org.apache.avro.Schema%2c%20java.lang.Object%29>
> >
> > Hope all that makes sense !
> >
> > Jacob
> >
> >
> > > Date: Sun, 20 May 2012 21:39:12 -0400
> > > From: donofrio111@gmail.com
> > > To: mrunit-user@incubator.apache.org
> > > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > >
> > > Sorry for the delay. So you are suggesting to provide an option for an
> > > alternative conf that only the inputformat uses? So we could
> > > have withOutputFormat(outputformat, inputformat) and
> > > withOutputFormat(outputformat, inputformat, jobconf)?
> > >
> > > I am confused why your example doesnt use withOutputFormat, is that
> > > because you are doing your own verification with run() instead of
> > > calling runTest?
> > >
> > > On 05/13/2012 03:43 PM, Jacob Metcalf wrote:
> > > >
> > > > The InputFormat works fine - but it is configured separately to
> > > > AvroSerialization which MRUnit's MockMapreduceOutputFormat.java is
> > > > effectively using to clone. Garret Wu's new MR2
> > > > AvroKeyValueInputFormat and AvroKeyValueOutputFormat pick up their
> > > > configuration from "avro.schema.[input|output].[key|value]". Whereas
> > > > AvroSerialization, which is typically only used on the shuffle, picks
> > > > up its configuration from
> > > > "avro.serialization.[key|value].[writer|reader].schema".
> > > >
> > > > In the case of MRUnit I see
> > > > org.apache.hadoop.mrunit.internal.io.Serialization already has a
> > > > copyWithConf(). So you could have users provide a separate optional
> > > > config to withOutputFormat(). It would take a few comments to explain
> > > > and users would have to be careful to keep the configs separate !
> > > >
> > > > ---
> > > >
> > > > For anyone who has trouble with this in future (3) worked and was
> > > > pretty easy. I found that you can get Avro to support multiple
> > schemas
> > > > through unions: https://issues.apache.org/jira/browse/AVRO-127. In my
> > > > case it was a matter of doing this:
> > > >
> > > > AvroJob.setMapOutputValueSchema( job, Schema.createUnion(
> > > > Lists.newArrayList( Room.SCHEMA$, House.SCHEMA$ )));
> > > >
> > > > Then breaking with convention and storing the Avro output of the
> > > > reducer in the value. For completeness I have attached an example
> > > > which works on both MRUnit and Hadoop 0.23 but you will need to
> > obtain
> > > > and build: com.odiago.avro:odiago-avro:1.0.7-SNAPSHOT
> > > >
> > > > Jacob
> > > >
> > > >
> > > > > Date: Sun, 13 May 2012 10:50:16 -0400
> > > > > From: donofrio111@gmail.com
> > > > > To: mrunit-user@incubator.apache.org
> > > > > Subject: Re: Deserializer used for both Map and Reducer
> > context.write()
> > > > >
> > > > > Yes I agree 3 is a bad idea, you shouldnt have to change your
> > code to
> > > > > work with a unit test.
> > > > >
> > > > > Ideally AvroSerialization would already support this and you wouldnt
> > > > > have to do 4.
> > > > >
> > > > > I am not sure I want to do 2 either, it is just more code users
> > have to
> > > > > write to use MRUnit.
> > > > >
> > > > >
> > > > > MRUnit doesnt really use serialization to clone in the reducer.
> > After I
> > > > > write the output out with the outputformat I need some way to
> > bring the
> > > > > objects back in so that I can use our existing validation
> > methods. The
> > > > > simplest way to do this that I thought of that used existing hadoop
> > > > > concepts was to have the user set an inputformat as if they were
> > using
> > > > > the mapper in another map reduce job to read the output of this
> > > > > mapreduce job that you are testing. How do you usually read the
> > output
> > > > > of an Avro job, maybe I just need to allow you to set an alternative
> > > > > JobConf that only gets used by the InputFormat since you say that
> > > > > AvroSerialization only supports one key and value?
> > > > >
> > > > > On 05/13/2012 08:25 AM, Jacob Metcalf wrote:
> > > > > >
> > > > > > No, thanks for looking at it. My next step was to attempt to get my
> > > > > > example running on a Pseudo-distributed cluster. This took me
> > a while
> > > > > > as I am only a Hadoop beginner and had problems with my
> > > > > > HADOOP_CLASSPATH but it now all works. This proved to me that
> > Hadoop
> > > > > > does not use AvroSerialization in the Reducer Output stage.
> > > > > >
> > > > > > I understand why MRUnit needs to make copies but:
> > > > > >
> > > > > > * It appears AvroSerialization can only be configured to serialize
> > > > > > one key class and one value schema.
> > > > > > * It appears it is only expecting to be used in the mapper phase.
> > > > > > * I configure it to serialize Room (output of mapper stage)
> > > > > > * So it gets a shock when MRUnit sends it a House (output of
> > reducer
> > > > > > stage)
> > > > > >
> > > > > >
> > > > > > I have thought of a number of ways round this both on the
> > MRUnit side
> > > > > > and my side:
> > > > > >
> > > > > > 1. MRUnit could check to see if objects support
> > > > > > Serializable/Cloneable and utilise these in preference.
> > > > > > Unfortunately I don't think Avro generated classes do implement
> > > > > > these, but Protobuf does.
> > > > > >
> > > > > > 2. withOutputFormat() could take an optional object with interface
> > > > > > e.g. "Cloner" which users pass in. You may not want Avro
> > > > > > dependencies in MRUnit but it is fairly easy for people to write a
> > > > > > concrete Cloner for Avro see:
> > > > > > https://issues.apache.org/jira/browse/AVRO-964
> > > > > >
> > > > > > 3. I think I should be able to use an Avro union
> > > > > > http://avro.apache.org/docs/1.6.3/spec.html#Unions of Room and
> > > > > > House to make AvroSerialization able to handle both classes. This
> > > > > > however is complicating my message format just to support MRUnit
> > > > > > so probably not a good long term solution.
> > > > > >
> > > > > > 4. It may be possible to write an AvroSerialization class
> > capable of
> > > > > > handling any Avro generated class. The problem is Avro wraps
> > > > > > everything in AvroKey and AvroValue so the problem is that when
> > > > > > Serialization.accept is called you have lost the specific class
> > > > > > information through erasure. So if I went down this path I could
> > > > > > end up having to write my own version of Avro MR
> > > > > >
> > > > > >
> > > > > > Let me know if you are interested in option (2) in which case
> > I will
> > > > > > help test. If not I will play around with (3) and (4).
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > Jacob
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Date: Sat, 12 May 2012 11:09:07 -0400
> > > > > > > From: donofrio111@gmail.com
> > > > > > > To: mrunit-user@incubator.apache.org
> > > > > > > Subject: Re: Deserializer used for both Map and Reducer
> > > > context.write()
> > > > > > >
> > > > > > > Sorry for the delay I havent had a chance to look at this
> > too much.
> > > > > > >
> > > > > > > Yes you are correct that I need to use mrunit's Serialization
> > > > class to
> > > > > > > copy the objects because the RecordReader's will reuse objects.
> > > > The old
> > > > > > > mapred RecordReader interface has createKey and createValue
> > methods
> > > > > > > which create a new instance for me but the mapreduce api
> > removed
> > > > these
> > > > > > > methods so I am forced to copy them.
> > > > > > >
> > > > > > > The configuration gets passed down to AvroSerialization so the
> > > > schema
> > > > > > > should be available for reducer output.
> > > > > > >
> > > > > > > On 05/10/2012 07:13 PM, Jacob Metcalf wrote:
> > > > > > > > Jim
> > > > > > > >
> > > > > > > > Unfortunately this did not fix my issue but at least I can
> > now
> > > > attach
> > > > > > > > a unit test. The test is made up as below:
> > > > > > > >
> > > > > > > > - I used Avro 1.6.3 so you did not have to build 1.7. The
> > > > > > > > AvroSerialization class is slightly different but still has
> > > > the same
> > > > > > > > problem.
> > > > > > > >
> > > > > > > > - I managed to get MRUNIT-1.0.0, thanks for putting that on
> > > > the repo.
> > > > > > > >
> > > > > > > > - I could not use the new MR2 AvroKeyFileOutput from Avro 1.7
> > > > as it
> > > > > > > > tries to use HDFS (which is what I am trying to avoid
> > through the
> > > > > > > > excellent MRUNIT). Instead I Mocked out my own
> > > > > > > > in MockAvroFormats.java. This could do with some improvement
> > > > but it
> > > > > > > > demonstrates the problem.
> > > > > > > >
> > > > > > > > - I have a Room and House class which you will see get code
> > > > generated
> > > > > > > > from the Avro schema file.
> > > > > > > >
> > > > > > > > - I have a mapper which takes text and outputs Room and a
> > reducer
> > > > > > > > which takes <Long,List<Room>> and outputs a House.
> > > > > > > >
> > > > > > > >
> > > > > > > > The first test noOutputFormatTest() demonstrates my original
> > > > problem.
> > > > > > > > Trying to re-use the serializer for the output of the
> > reducer at
> > > > > > > > MockOutputCollector:49 causes the exception:
> > > > > > > >
> > > > > > > > java.lang.ClassCastException: net.jacobmetcalf.avro.House
> > cannot
> > > > > > > > be cast to java.lang.Long
> > > > > > > >
> > > > > > > > Because the AvroSerialization is configured for the output
> > of the
> > > > > > > > Mapper so is expecting to be sent a Long in the key but here
> > > > is being
> > > > > > > > sent a House.
> > > > > > > >
> > > > > > > > The second test withOutputFormatTest() results in the same
> > > > exception.
> > > > > > > > But this time from MockMapreduceOutputFormat.java:162. I
> > > > assume you
> > > > > > > > are forced to clone here because the InputFormat may be
> > > > re-using its
> > > > > > > > objects?
> > > > > > > >
> > > > > > > > The heart of the problem is AvroSerialization retrieves
> > its schema
> > > > > > > > through the configuration. So my guess is that it can only
> > ever be
> > > > > > > > used for the shuffle. But I am happy to cross post this on
> > the
> > > > Avro
> > > > > > > > board to see if I am doing something wrong.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > Jacob
> > > > > > > >
> > > > > > > >
> > > > > > > > > Date: Thu, 10 May 2012 08:57:36 -0400
> > > > > > > > > From: donofrio111@gmail.com
> > > > > > > > > To: mrunit-user@incubator.apache.org
> > > > > > > > > Subject: Re: Deserializer used for both Map and Reducer
> > > > > > context.write()
> > > > > > > > >
> > > > > > > > > In revision 1336519 I checked in my initial work for MRUNIT-101. I
> > > > > > > > > still need to do some cleaning up and adding the javadoc but the
> > > > > > > > > feature is there and tested. I reconfigured our jenkins setup to
> > > > > > > > > publish snapshots to Nexus so you should see a
> > > > > > > > > 1.0.0-incubating-SNAPSHOT mrunit jar in apache's Nexus repository.
> > > > > > > > > I dont think this gets replicated so you will have to add apache's
> > > > > > > > > repository to your settings.xml if you are using maven.
> > > > > > > > >
> > > > > > > > > @Test
> > > > > > > > > public void testOutputFormatWithMismatchInOutputClasses() {
> > > > > > > > >   final MapReduceDriver driver = this.driver;
> > > > > > > > >   driver.withOutputFormat(TextOutputFormat.class, TextInputFormat.class);
> > > > > > > > >   driver.withInput(new Text("a"), new LongWritable(1));
> > > > > > > > >   driver.withInput(new Text("a"), new LongWritable(2));
> > > > > > > > >   driver.withOutput(new LongWritable(), new Text("a\t3"));
> > > > > > > > >   driver.runTest();
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > You can look at
> > > > > > > > > org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java [1] to see
> > > > > > > > > how to use the outputformat. Just call withOutputFormat on the driver
> > > > > > > > > with the outputformat you want to use and the inputformat you want to
> > > > > > > > > read that output back into the output list. The Serialization class is
> > > > > > > > > used after the inputformat to copy the inputs into a list so make sure
> > > > > > > > > to set io.serializations because the mapreduce api RecordReader does
> > > > > > > > > not have createKey and createValue methods. Let me know if that does
> > > > > > > > > not work for Avro.
> > > > > > > > >
> > > > > > > > > When I get to MultipleOutputs MRUNIT-13 in the next few days it
> > > > > > > > > will be implemented with a similar api except you will also need
> > > > > > > > > to specify the name of the output collector.
> > > > > > > > >
> > > > > > > > > [1]:
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > http://svn.apache.org/viewvc/incubator/mrunit/trunk/src/test/java/org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java?view=markup
> > > > > > > > >
> > > > > > > > > On 05/09/2012 03:07 PM, Jacob Metcalf wrote:
> > > > > > > > > > Jim, Brock
> > > > > > > > > >
> > > > > > > > > > Thanks for getting back to me so quickly, and yes I
> > suspect
> > > > > > MR-101 is
> > > > > > > > > > the answer.
> > > > > > > > > >
> > > > > > > > > > The key thing I wanted to establish is whether:
> > > > > > > > > >
> > > > > > > > > > 1) The "contract" is that the Serialization concrete
> > > > > > implementations
> > > > > > > > > > listed in "io.serializations" should only ever be used for
> > > > > > > > serializing
> > > > > > > > > > mapper output in the shuffle stage.
> > > > > > > > > >
> > > > > > > > > > 2) OR I am doing something very wrong with Avro - for
> > > > example I
> > > > > > > > > > should only be using the same schema for map and reduce
> > > > output.
> > > > > > > > > >
> > > > > > > > > > Assuming (1) is correct then MR-101 would make a big
> > > > > > difference, as
> > > > > > > > > > long as you could avoid using the serializer to clone the
> > > > > > output of
> > > > > > > > > > the reducer. I am guessing you would use the concrete
> > > > > > OutputFormat to
> > > > > > > > > > serialize the reducer output to a stream and then the
> > unit
> > > > tester
> > > > > > > > > > would need to deserialize themselves to assert the
> > output?
> > > > But
> > > > > > what
> > > > > > > > > > would people who just want to stick to asserting based
> > on the
> > > > > > reducer
> > > > > > > > > > output do?
> > > > > > > > > >
> > > > > > > > > > I will try and boil my issue down to a canned example
> > over
> > > > the
> > > > > > next
> > > > > > > > > > few days. If you are interested in Avro they are
> > working on
> > > > > > > > > > integrating Garret Wu's MR2 extensions in 1.7 and
> > there is
> > > > a test
> > > > > > > > case
> > > > > > > > > > here:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestWordCount.java?view=markup
> >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I am happy to test MR-101 for you if you let me know
> > when its
> > > > > > > > available.
> > > > > > > > > >
> > > > > > > > > > Regards
> > > > > > > > > >
> > > > > > > > > > Jacob
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: brock@cloudera.com
> > > > > > > > > > > Date: Wed, 9 May 2012 09:17:42 -0500
> > > > > > > > > > > Subject: Re: Deserializer used for both Map and Reducer
> > > > > > > > context.write()
> > > > > > > > > > > To: mrunit-user@incubator.apache.org
> > > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > As Jim says, I wonder if MRUNIT-101 will help. Would it be
> > > > > > > > > > > possible to share the exception/error you saw? If you have
> > > > > > > > > > > time, I'd enjoy seeing a small example of the code in question
> > > > > > > > > > > so we can add that to our test suite.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Brock
> > > > > > > > > > >
> > > > > > > > > > > On Wed, May 9, 2012 at 8:02 AM, Jim Donofrio
> > > > > > > > <do...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > I am not too familiar with Avro, maybe someone else can
> > > > > > respond
> > > > > > > > but
> > > > > > > > > > if the
> > > > > > > > > > > > AvroKeyOutputFormat does the serialization then
> > > > MRUNIT-101 [1]
> > > > > > > > > > should fix
> > > > > > > > > > > > your problem. I am just finishing this JIRA up, it
> > > > works under
> > > > > > > > > > Hadoop 1+, I
> > > > > > > > > > > > am having issues with TaskAttemptContext and
> > JobContext
> > > > > > > > changing from
> > > > > > > > > > > > classes to interfaces in the mapreduce api in
> > Hadoop 0.23.
> > > > > > > > > > > >
> > > > > > > > > > > > I should resolve this over the next few days. In the
> > > > > > meantime if
> > > > > > > > > > you can
> > > > > > > > > > > > post your code I can test against it. It may also be
> > > > worth
> > > > > > the
> > > > > > > > MRUnit
> > > > > > > > > > > > project exploring having Jenkins deploy a snapshot to
> > > > > > Nexus so
> > > > > > > > you can
> > > > > > > > > > > > easily test against the trunk without having to build
> > > > it or
> > > > > > > > > > download the jar
> > > > > > > > > > > > from Jenkins.
> > > > > > > > > > > >
> > > > > > > > > > > > [1]: https://issues.apache.org/jira/browse/MRUNIT-101
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On 05/09/2012 03:15 AM, Jacob Metcalf wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> I am trying to integrate Avro-1.7 (specifically the
> > > > new MR2
> > > > > > > > > > extensions),
> > > > > > > > > > > >> MRUnit-0.9.0 and Hadoop-0.23. Assuming I have not
> > > > made any
> > > > > > > > > > mistakes my
> > > > > > > > > > > >> question is should MRUnit be using the Serialization
> > > > factory
> > > > > > > > when
> > > > > > > > > > I call
> > > > > > > > > > > >> context.write() in a reducer.
> > > > > > > > > > > >>
> > > > > > > > > > > >> I am using MapReduceDriver and my mapper has output
> > > > > > signature:
> > > > > > > > > > > >>
> > > > > > > > > > > >> <AvroKey<SpecificKey1>,AvroValue<SpecificValue1>>
> > > > > > > > > > > >>
> > > > > > > > > > > >> My reducer has a different output signature:
> > > > > > > > > > > >>
> > > > > > > > > > > >> <AvroKey<SpecificValue2>, Null>.
> > > > > > > > > > > >>
> > > > > > > > > > > >> I am using Avro specific serialization so I set
> > my Avro
> > > > > > schemas
> > > > > > > > > > like this:
> > > > > > > > > > > >>
> > > > > > > > > > > >> AvroSerialization.addToConfiguration(
> > configuration );
> > > > > > > > > > > >> AvroSerialization.setKeyReaderSchema(configuration,
> > > > > > > > > > SpecificKey1.SCHEMA$
> > > > > > > > > > > >> );
> > > > > > > > > > > >> AvroSerialization.setKeyWriterSchema(configuration,
> > > > > > > > > > SpecificKey1.SCHEMA$
> > > > > > > > > > > >> );
> > > > > > > > > > > >> AvroSerialization.setValueReaderSchema(configuration,
> > > > > > > > > > > >> SpecificValue1.SCHEMA$ );
> > > > > > > > > > > >> AvroSerialization.setValueWriterSchema(configuration,
> > > > > > > > > > > >> SpecificValue1.SCHEMA$ );
> > > > > > > > > > > >>
> > > > > > > > > > > >> My understanding of Avro MR is that the Serialization
> > > > > > class is
> > > > > > > > > > intended to
> > > > > > > > > > > >> be invoked between the map and reduce phase.
> > > > > > > > > > > >>
> > > > > > > > > > > >> However my test fails at reduce stage. Debugging I
> > > > realised
> > > > > > > > the mock
> > > > > > > > > > > >> reducer context is using the serializer to copy
> > objects:
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockContextWrapper.java
> > > > > > > > > > > >>
> > > > > > > > > > > >> Looking at the AvroSerialization object it only
> > > > expects one
> > > > > > > > set of
> > > > > > > > > > > >> schemas:
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java?view=markup
> > > > > > > > > > > >>
> > > > > > > > > > > >> So when my reducer tries to write SpecificValue2
> > to the
> > > > > > context,
> > > > > > > > > > MRUnit's
> > > > > > > > > > > >> mock then tries to serialise SpecificValue2 with
> > > > > > Value1.SCHEMA$
> > > > > > > > > > and as a
> > > > > > > > > > > >> result fails.
> > > > > > > > > > > >>
> > > > > > > > > > > >> I have not yet debugged Hadoop itself but I did read some
> > > > > > comments
> > > > > > > > > > (which I
> > > > > > > > > > > >> since cannot locate) which say that the
> > Serialization
> > > > > > class is
> > > > > > > > > > typically
> > > > > > > > > > > >> not used for the output of the reduce stage. My
> > limited
> > > > > > > > > > understanding is
> > > > > > > > > > > >> that the OutputFormat (e.g. AvroKeyOutputFormat)
> > will
> > > > act
> > > > > > as the
> > > > > > > > > > > >> deserializer when you are running in Hadoop.
> > > > > > > > > > > >>
> > > > > > > > > > > >> I can spend some time distilling my code into a
> > simple
> > > > > > > > example but
> > > > > > > > > > > >> wondered if anyone had any pointers - or an Avro
> > + MR2 +
> > > > > > MRUnit
> > > > > > > > > > example.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Jacob
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Apache MRUnit - Unit testing MapReduce -
> > > > > > > > > > http://incubator.apache.org/mrunit/

Fwd: Re: Deserializer used for both Map and Reducer context.write()

Posted by Jim Donofrio <do...@gmail.com>.
Or maybe if creating a Pair is annoying we could instead do:

public interface Copier<K, V> {

   public K copyKey(K key);

   public V copyValue(V value);

}
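As a data point, an Avro-flavoured implementation would indeed be a couple of one-liners over Avro's deepCopy, along these lines (a hypothetical sketch; AvroCopier is my name, not an Avro or MRUnit class):

    import org.apache.avro.specific.SpecificData;
    import org.apache.avro.specific.SpecificRecord;

    // Hypothetical Copier backed by Avro's deepCopy, which clones a record
    // against its own schema rather than one fixed in the Configuration.
    public class AvroCopier<K extends SpecificRecord, V extends SpecificRecord>
            implements Copier<K, V> {

        @SuppressWarnings("unchecked")
        public K copyKey(final K key) {
            return (K) SpecificData.get().deepCopy(key.getSchema(), key);
        }

        @SuppressWarnings("unchecked")
        public V copyValue(final V value) {
            return (V) SpecificData.get().deepCopy(value.getSchema(), value);
        }
    }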

-------- Original Message --------
Subject: Re: Deserializer used for both Map and Reducer context.write()
Date: Tue, 22 May 2012 22:17:02 -0400
From: Jim Donofrio <do...@gmail.com>
To: user@mrunit.apache.org

Ok I understand now. The outputformat could work for you but doesnt
because I call Serialization.copy in order to use runTest or have run()
return all the outputs in a list.

How about if I provide both solutions as overloaded methods since
sometimes an alternative JobConf will be easier while other times a
cloner object will be easier. I think maybe we should use a different
term such as copier to avoid confusion with Java's Cloneable.

So for MapDriver for example there would be:

public MapDriver<K1, V1, K2, V2> withOutputFormat(
    final Class<? extends OutputFormat> outputFormatClass,
    final Class<? extends InputFormat> inputFormatClass)

public MapDriver<K1, V1, K2, V2> withOutputFormat(
    final Class<? extends OutputFormat> outputFormatClass,
    final Class<? extends InputFormat> inputFormatClass,
    JobConf inputFormatOnlyJobConf)

public MapDriver<K1, V1, K2, V2> withOutputFormat(
    final Class<? extends OutputFormat> outputFormatClass,
    final Class<? extends InputFormat> inputFormatClass,
    Copier copier)

Copier would be an interface:

public interface Copier<K, V> {

   public Pair<K, V> copy(K key, V value);

}

What do you think?

On 05/22/2012 04:34 PM, Jacob Metcalf wrote:
> Jim
>
> My last example is just to show other people who are stuck with this
> how to use unions to solve it.
>
> I had not used an output format in it because it made no difference to
> my issue. Let me try to explain with the code:
>
> Line 162:
> https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockMapreduceOutputFormat.java
>
>
> while (recordReader.nextKeyValue()) {
>   outputs.add(new Pair<K, V>(
>       serialization.copy(recordReader.getCurrentKey()),
>       serialization.copy(recordReader.getCurrentValue())));
> }
>
> Line 48
> https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/output/MockOutputCollector.java
>
> public void collect(final K key, final V value) throws IOException {
>   collectedOutputs.add(new Pair<K, V>(
>       serialization.copy(key), serialization.copy(value)));
> }
>
> Both of these use MRUnit's Serialization class to clone the output
> objects so it does not matter whether I configure an output format.
> These in turn will use AvroSerialization to attempt to clone the
> object. In my Hadoop job my map output schema Room is different to my
> reducer output schema House, AvroSerialization is only configured to
> be able to serialize the mapper output Room and it all works. However
> if I attempt the same with the unit test, MRUnit's Serialization class
> attempts to clone a House using the Room schema and it blows up.
>
> You suggested adding a JobConf, that would work but would require
> people to write code to clone the conf and change the Schema.
> Alternatively I suggested since the user has to write code anyway you
> could allow them to pass in a Cloner object. For Avro this would be a
> simple one liner calling Avro's deepCopy:
> http://avro.apache.org/docs/1.6.0/api/java/org/apache/avro/generic/GenericData.html#deepCopy(org.apache.avro.Schema,
> java.lang.Object)
> <http://avro.apache.org/docs/1.6.0/api/java/org/apache/avro/generic/GenericData.html#deepCopy%28org.apache.avro.Schema%2c%20java.lang.Object%29>
>
> Hope all that makes sense !
>
> Jacob
>
>
> > Date: Sun, 20 May 2012 21:39:12 -0400
> > From: donofrio111@gmail.com
> > To: mrunit-user@incubator.apache.org
> > Subject: Re: Deserializer used for both Map and Reducer context.write()
> >
> > Sorry for the delay. So you are suggesting to provide an option for an
> > alternative conf that only the inputformat uses? So we could
> > have withOutputFormat(outputformat, inputformat) and
> > withOutputFormat(outputformat, inputformat, jobconf)?
> >
> > I am confused why your example doesnt use withOutputFormat, is that
> > because you are doing your own verification with run() instead of
> > calling runTest?
> >
> > On 05/13/2012 03:43 PM, Jacob Metcalf wrote:
> > >
> > > The InputFormat works fine - but it is configured separately to
> > > AvroSerialization which MRUnit's MockMapreduceOutputFormat.java is
> > > effectively using to clone. Garret Wu's new MR2
> > > AvroKeyValueInputFormat and AvroKeyValueOutputFormat pick up their
> > > configuration from "avro.schema.[input|output].[key|value]". Whereas
> > > AvroSerialization, which is typically only used on the shuffle, picks
> > > up its configuration from
> > > "avro.serialization.[key|value].[writer|reader].schema".
> > >
> > > In the case of MRUnit I see
> > > org.apache.hadoop.mrunit.internal.io.Serialization already has a
> > > copyWithConf(). So you could have users provide a separate optional
> > > config to withOutputFormat(). It would take a few comments to explain
> > > and users would have to be careful to keep the configs separate !
> > >
> > > ---
> > >
> > > For anyone who has trouble with this in future (3) worked and was
> > > pretty easy. I found that you can get Avro to support multiple
> schemas
> > > through unions: https://issues.apache.org/jira/browse/AVRO-127. In my
> > > case it was a matter of doing this:
> > >
> > > AvroJob.setMapOutputValueSchema( job, Schema.createUnion(
> > > Lists.newArrayList( Room.SCHEMA$, House.SCHEMA$ )));
> > >
> > > Then breaking with convention and storing the Avro output of the
> > > reducer in the value. For completeness I have attached an example
> > > which works on both MRUnit and Hadoop 0.23 but you will need to
> obtain
> > > and build: com.odiago.avro:odiago-avro:1.0.7-SNAPSHOT
> > >
> > > Jacob
> > >
> > >
> > > > Date: Sun, 13 May 2012 10:50:16 -0400
> > > > From: donofrio111@gmail.com
> > > > To: mrunit-user@incubator.apache.org
> > > > Subject: Re: Deserializer used for both Map and Reducer
> context.write()
> > > >
> > > > Yes I agree 3 is a bad idea, you shouldnt have to change your
> code to
> > > > work with a unit test.
> > > >
> > > > Ideally AvroSerialization would already support this and you wouldnt
> > > > have to do 4.
> > > >
> > > > I am not sure I want to do 2 either, it is just more code users
> have to
> > > > write to use MRUnit.
> > > >
> > > >
> > > > MRUnit doesnt really use serialization to clone in the reducer.
> After I
> > > > write the output out with the outputformat I need some way to
> bring the
> > > > objects back in so that I can use our existing validation
> methods. The
> > > > simplest way to do this that I thought of that used existing hadoop
> > > > concepts was to have the user set an inputformat as if they were
> using
> > > > the mapper in another map reduce job to read the output of this
> > > > mapreduce job that you are testing. How do you usually read the
> output
> > > > of an Avro job, maybe I just need to allow you to set an alternative
> > > > JobConf that only gets used by the InputFormat since you say that
> > > > AvroSerialization only supports one key and value?
> > > >
> > > > On 05/13/2012 08:25 AM, Jacob Metcalf wrote:
> > > > >
> > > > > No, thanks for looking at it. My next step was to attempt to get my
> > > > > example running on a Pseudo-distributed cluster. This took me
> a while
> > > > > as I am only a Hadoop beginner and had problems with my
> > > > > HADOOP_CLASSPATH but it now all works. This proved to me that
> Hadoop
> > > > > does not use AvroSerialization in the Reducer Output stage.
> > > > >
> > > > > I understand why MRUnit needs to make copies but:
> > > > >
> > > > > * It appears AvroSerialization can only be configured to serialize
> > > > > one key class and one value schema.
> > > > > * It appears it is only expecting to be used in the mapper phase.
> > > > > * I configure it to serialize Room (output of mapper stage)
> > > > > * So it gets a shock when MRUnit sends it a House (output of
> reducer
> > > > > stage)
> > > > >
> > > > >
> > > > > I have thought of a number of ways round this both on the
> MRUnit side
> > > > > and my side:
> > > > >
> > > > > 1. MRUnit could check to see if objects support
> > > > > Serializable/Cloneable and utilise these in preference.
> > > > > Unfortunately I don't think Avro generated classes do implement
> > > > > these, but Protobuf does.
> > > > >
> > > > > 2. withOutputFormat() could take an optional object with interface
> > > > > e.g. "Cloner" which users pass in. You may not want Avro
> > > > > dependencies in MRUnit but it is fairly easy for people to write a
> > > > > concrete Cloner for Avro see:
> > > > > https://issues.apache.org/jira/browse/AVRO-964
> > > > >
> > > > > 3. I think I should be able to use an Avro union
> > > > > http://avro.apache.org/docs/1.6.3/spec.html#Unions of Room and
> > > > > House to make AvroSerialization able to handle both classes. This
> > > > > however is complicating my message format just to support MRUnit
> > > > > so probably not a good long term solution.
> > > > >
> > > > > 4. It may be possible to write an AvroSerialization class
> capable of
> > > > > handling any Avro generated class. The problem is Avro wraps
> > > > > everything in AvroKey and AvroValue so the problem is that when
> > > > > Serialization.accept is called you have lost the specific class
> > > > > information through erasure. So if I went down this path I could
> > > > > end up having to write my own version of Avro MR
> > > > >
> > > > >
> > > > > Let me know if you are interested in option (2) in which case
> I will
> > > > > help test. If not I will play around with (3) and (4).
> > > > >
> > > > > Thanks
> > > > >
> > > > > Jacob
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > Date: Sat, 12 May 2012 11:09:07 -0400
> > > > > > From: donofrio111@gmail.com
> > > > > > To: mrunit-user@incubator.apache.org
> > > > > > Subject: Re: Deserializer used for both Map and Reducer
> > > context.write()
> > > > > >
> > > > > > Sorry for the delay I havent had a chance to look at this
> too much.
> > > > > >
> > > > > > Yes you are correct that I need to use mrunit's Serialization
> > > class to
> > > > > > copy the objects because the RecordReader's will reuse objects.
> > > The old
> > > > > > mapred RecordReader interface has createKey and createValue
> methods
> > > > > > which create a new instance for me but the mapreduce api
> removed
> > > these
> > > > > > methods so I am forced to copy them.
> > > > > >
> > > > > > The configuration gets passed down to AvroSerialization so the
> > > schema
> > > > > > should be available for reducer output.
> > > > > >
> > > > > > On 05/10/2012 07:13 PM, Jacob Metcalf wrote:
> > > > > > > Jim
> > > > > > >
> > > > > > > Unfortunately this did not fix my issue but at least I can
> now
> > > attach
> > > > > > > a unit test. The test is made up as below:
> > > > > > >
> > > > > > > - I used Avro 1.6.3 so you did not have to build 1.7. The
> > > > > > > AvroSerialization class is slightly different but still has
> > > the same
> > > > > > > problem.
> > > > > > >
> > > > > > > - I managed to get MRUNIT-1.0.0, thanks for putting that on
> > > the repo.
> > > > > > >
> > > > > > > - I could not use the new MR2 AvroKeyFileOutput from Avro 1.7
> > > as it
> > > > > > > tries to use HDFS (which is what I am trying to avoid
> through the
> > > > > > > excellent MRUNIT). Instead I Mocked out my own
> > > > > > > in MockAvroFormats.java. This could do with some improvement
> > > but it
> > > > > > > demonstrates the problem.
> > > > > > >
> > > > > > > - I have a Room and House class which you will see get code
> > > generated
> > > > > > > from the Avro schema file.
> > > > > > >
> > > > > > > - I have a mapper which takes text and outputs Room and a
> reducer
> > > > > > > which takes <Long,List<Room>> and outputs a House.
> > > > > > >
> > > > > > >
> > > > > > > The first test noOutputFormatTest() demonstrates my original
> > > problem.
> > > > > > > Trying to re-use the serializer for the output of the
> reducer at
> > > > > > > MockOutputCollector:49 causes the exception:
> > > > > > >
> > > > > > > java.lang.ClassCastException: net.jacobmetcalf.avro.House
> cannot
> > > > > > > be cast to java.lang.Long
> > > > > > >
> > > > > > > Because the AvroSerialization is configured for the output
> of the
> > > > > > > Mapper so is expecting to be sent a Long in the key but here
> > > is being
> > > > > > > sent a House.
> > > > > > >
> > > > > > > The second test withOutputFormatTest() results in the same
> > > exception.
> > > > > > > But this time from MockMapreduceOutputFormat.java:162. I
> > > assume you
> > > > > > > are forced to clone here because the InputFormat may be
> > > re-using its
> > > > > > > objects?
> > > > > > >
> > > > > > > The heart of the problem is AvroSerialization retrieves
> its schema
> > > > > > > through the configuration. So my guess is that it can only
> ever be
> > > > > > > used for the shuffle. But I am happy to cross post this on
> the
> > > Avro
> > > > > > > board to see if I am doing something wrong.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > Jacob
> > > > > > >
> > > > > > >
> > > > > > > > Date: Thu, 10 May 2012 08:57:36 -0400
> > > > > > > > From: donofrio111@gmail.com
> > > > > > > > To: mrunit-user@incubator.apache.org
> > > > > > > > Subject: Re: Deserializer used for both Map and Reducer
> > > > > context.write()
> > > > > > > >
> > > > > > > > In revision 1336519 I checked in my initial work for MRUNIT-101. I
> > > > > > > > still need to do some cleaning up and adding the javadoc but the
> > > > > > > > feature is there and tested. I reconfigured our jenkins setup to
> > > > > > > > publish snapshots to Nexus so you should see a
> > > > > > > > 1.0.0-incubating-SNAPSHOT mrunit jar in apache's Nexus repository.
> > > > > > > > I dont think this gets replicated so you will have to add apache's
> > > > > > > > repository to your settings.xml if you are using maven.
> > > > > > > >
> > > > > > > > @Test
> > > > > > > > public void testOutputFormatWithMismatchInOutputClasses() {
> > > > > > > >   final MapReduceDriver driver = this.driver;
> > > > > > > >   driver.withOutputFormat(TextOutputFormat.class, TextInputFormat.class);
> > > > > > > >   driver.withInput(new Text("a"), new LongWritable(1));
> > > > > > > >   driver.withInput(new Text("a"), new LongWritable(2));
> > > > > > > >   driver.withOutput(new LongWritable(), new Text("a\t3"));
> > > > > > > >   driver.runTest();
> > > > > > > > }
> > > > > > > >
> > > > > > > > You can look at
> > > > > > > > org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java [1] to see
> > > > > > > > how to use the outputformat. Just call withOutputFormat on the driver
> > > > > > > > with the outputformat you want to use and the inputformat you want to
> > > > > > > > read that output back into the output list. The Serialization class is
> > > > > > > > used after the inputformat to copy the inputs into a list so make sure
> > > > > > > > to set io.serializations because the mapreduce api RecordReader does
> > > > > > > > not have createKey and createValue methods. Let me know if that does
> > > > > > > > not work for Avro.
> > > > > > > >
> > > > > > > > When I get to MultipleOutputs MRUNIT-13 in the next few days it
> > > > > > > > will be implemented with a similar api except you will also need
> > > > > > > > to specify the name of the output collector.
> > > > > > > >
> > > > > > > > [1]:
> > > > > > > >
> > > > > > >
> > > > >
> > >
> http://svn.apache.org/viewvc/incubator/mrunit/trunk/src/test/java/org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java?view=markup
> > > > > > > >
> > > > > > > > On 05/09/2012 03:07 PM, Jacob Metcalf wrote:
> > > > > > > > > Jim, Brock
> > > > > > > > >
> > > > > > > > > Thanks for getting back to me so quickly, and yes I
> suspect
> > > > > MR-101 is
> > > > > > > > > the answer.
> > > > > > > > >
> > > > > > > > > The key thing I wanted to establish is whether:
> > > > > > > > >
> > > > > > > > > 1) The "contract" is that the Serialization concrete
> > > > > implementations
> > > > > > > > > listed in "io.serializations" should only ever be used for
> > > > > > > serializing
> > > > > > > > > mapper output in the shuffle stage.
> > > > > > > > >
> > > > > > > > > 2) OR I am doing something very wrong with Avro - for
> > > example I
> > > > > > > > > should only be using the same schema for map and reduce
> > > output.
> > > > > > > > >
> > > > > > > > > Assuming (1) is correct then MR-101 would make a big
> > > > > difference, as
> > > > > > > > > long as you could avoid using the serializer to clone the
> > > > > output of
> > > > > > > > > the reducer. I am guessing you would use the concrete
> > > > > OutputFormat to
> > > > > > > > > serialize the reducer output to a stream and then the
> unit
> > > tester
> > > > > > > > > would need to deserialize themselves to assert the
> output?
> > > But
> > > > > what
> > > > > > > > > would people who just want to stick to asserting based
> on the
> > > > > reducer
> > > > > > > > > output do?
> > > > > > > > >
> > > > > > > > > I will try and boil my issue down to a canned example
> over
> > > the
> > > > > next
> > > > > > > > > few days. If you are interested in Avro they are
> working on
> > > > > > > > > integrating Garret Wu's MR2 extensions in 1.7 and
> there is
> > > a test
> > > > > > > case
> > > > > > > > > here:
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestWordCount.java?view=markup
>
> > >
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I am happy to test MR-101 for you if you let me know
> when its
> > > > > > > available.
> > > > > > > > >
> > > > > > > > > Regards
> > > > > > > > >
> > > > > > > > > Jacob
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: brock@cloudera.com
> > > > > > > > > > Date: Wed, 9 May 2012 09:17:42 -0500
> > > > > > > > > > Subject: Re: Deserializer used for both Map and Reducer
> > > > > > > context.write()
> > > > > > > > > > To: mrunit-user@incubator.apache.org
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > As Jim says, I wonder if MRUNIT-101 will help. Would it be
> > > > > > > > > > possible to share the exception/error you saw? If you have
> > > > > > > > > > time, I'd enjoy seeing a small example of the code in question
> > > > > > > > > > so we can add that to our test suite.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Brock
> > > > > > > > > >
> > > > > > > > > > On Wed, May 9, 2012 at 8:02 AM, Jim Donofrio
> > > > > > > <do...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > I am not too familar with Avro, maybe someone else can
> > > > > respond
> > > > > > > but
> > > > > > > > > if the
> > > > > > > > > > > AvroKeyOutputFormat does the serialization then
> > > MRUNIT-101 [1]
> > > > > > > > > should fix
> > > > > > > > > > > your problem. I am just finishing this JIRA up, it
> > > works under
> > > > > > > > > Hadoop 1+, I
> > > > > > > > > > > am having issues with TaskAttemptContext and
> JobContext
> > > > > > > changing from
> > > > > > > > > > > classes to interfaces in the mapreduce api in
> Hadoop 0.23.
> > > > > > > > > > >
> > > > > > > > > > > I should resolve this over the next few days. In the
> > > > > meantime if
> > > > > > > > > you can
> > > > > > > > > > > post your code I can test against it. It may also be
> > > worth
> > > > > the
> > > > > > > MRUnit
> > > > > > > > > > > project exploring having Jenkins deploy a snapshot to
> > > > > Nexus so
> > > > > > > you can
> > > > > > > > > > > easily test against the trunk without having to build
> > > it or
> > > > > > > > > download the jar
> > > > > > > > > > > from Jenkins.
> > > > > > > > > > >
> > > > > > > > > > > [1]: https://issues.apache.org/jira/browse/MRUNIT-101
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On 05/09/2012 03:15 AM, Jacob Metcalf wrote:
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> I am trying to integrate Avro-1.7 (specifically the
> > > new MR2
> > > > > > > > > extensions),
> > > > > > > > > > >> MRUnit-0.9.0 and Hadoop-0.23. Assuming I have not
> > > made any
> > > > > > > > > mistakes my
> > > > > > > > > > >> question is should MRUnit be using the Serialization
> > > factory
> > > > > > > when
> > > > > > > > > I call
> > > > > > > > > > >> context.write() in a reducer.
> > > > > > > > > > >>
> > > > > > > > > > >> I am using MapReduceDriver and my mapper has output
> > > > > signature:
> > > > > > > > > > >>
> > > > > > > > > > >> <AvroKey<SpecificKey1>,AvroValue<SpecificValue1>>
> > > > > > > > > > >>
> > > > > > > > > > >> My reducer has a different outputt signature:
> > > > > > > > > > >>
> > > > > > > > > > >> <AvroKey<SpecificValue2>, Null>.
> > > > > > > > > > >>
> > > > > > > > > > >> I am using Avro specific serialization so I set
> my Avro
> > > > > schemas
> > > > > > > > > like this:
> > > > > > > > > > >>
> > > > > > > > > > >> AvroSerialization.addToConfiguration(
> configuration );
> > > > > > > > > > >> AvroSerialization.setKeyReaderSchema(configuration,
> > > > > > > > > SpecificKey1.SCHEMA$
> > > > > > > > > > >> );
> > > > > > > > > > >> AvroSerialization.setKeyWriterSchema(configuration,
> > > > > > > > > SpecificKey1.SCHEMA$
> > > > > > > > > > >> );
> > > > > > > > > > >> AvroSerialization.setValueReaderSchema(configuration,
> > > > > > > > > > >> SpecificValue1.SCHEMA$ );
> > > > > > > > > > >> AvroSerialization.setValueWriterSchema(configuration,
> > > > > > > > > > >> SpecificValue1.SCHEMA$ );
> > > > > > > > > > >>
> > > > > > > > > > >> My understanding of Avro MR is that the Serialization
> > > > > class is
> > > > > > > > > intended to
> > > > > > > > > > >> be invoked between the map and reduce phase.
> > > > > > > > > > >>
> > > > > > > > > > >> However my test fails at reduce stage. Debugging I
> > > realised
> > > > > > > the mock
> > > > > > > > > > >> reducer context is using the serializer to copy
> objects:
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockContextWrapper.java
> > > > > > > > > > >>
> > > > > > > > > > >> Looking at the AvroSerialization object it only
> > > expects one
> > > > > > > set of
> > > > > > > > > > >> schemas:
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java?view=markup
> > > > > > > > > > >>
> > > > > > > > > > >> So when my reducer tries to write SpecificValue2
> to the
> > > > > context,
> > > > > > > > > MRUnit's
> > > > > > > > > > >> mock then tries to serialise SpecificValue2 with
> > > > > Value1.SCHEMA$
> > > > > > > > > and as a
> > > > > > > > > > >> result fails.
> > > > > > > > > > >>
> > > > > > > > > > >> I have yet debugged Hadoop itself but I did read some
> > > > > comments
> > > > > > > > > (which I
> > > > > > > > > > >> since cannot locate) which says that the
> Serialization
> > > > > class is
> > > > > > > > > typically
> > > > > > > > > > >> not used for the output of the reduce stage. My
> limited
> > > > > > > > > understanding is
> > > > > > > > > > >> that the OutputFormat (e.g. AvroKeyOutputFormat)
> will
> > > act
> > > > > as the
> > > > > > > > > > >> deserializer when you are running in Hadoop.
> > > > > > > > > > >>
> > > > > > > > > > >> I can spend some time distilling my code into a
> simple
> > > > > > > example but
> > > > > > > > > > >> wondered if anyone had any pointers - or an Avro
> + MR2 +
> > > > > MRUnit
> > > > > > > > > example.
> > > > > > > > > > >>
> > > > > > > > > > >> Jacob
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Apache MRUnit - Unit testing MapReduce -
> > > > > > > > > http://incubator.apache.org/mrunit/

Re: Deserializer used for both Map and Reducer context.write()

Posted by Jim Donofrio <do...@gmail.com>.
Ok, I understand now. The outputformat could work for you, but it doesn't,
because I call Serialization.copy in order to use runTest or to have run()
return all the outputs in a list.

How about I provide both solutions as overloaded methods, since sometimes an
alternative JobConf will be easier while other times a cloner object will be
easier? I think maybe we should use a different term such as "copier" to
avoid confusion with Java's Cloneable.

So for MapDriver, for example, there would be:

public MapDriver<K1, V1, K2, V2> withOutputFormat(
    final Class<? extends OutputFormat> outputFormatClass,
    final Class<? extends InputFormat> inputFormatClass)

public MapDriver<K1, V1, K2, V2> withOutputFormat(
    final Class<? extends OutputFormat> outputFormatClass,
    final Class<? extends InputFormat> inputFormatClass,
    JobConf inputFormatOnlyJobConf)

public MapDriver<K1, V1, K2, V2> withOutputFormat(
    final Class<? extends OutputFormat> outputFormatClass,
    final Class<? extends InputFormat> inputFormatClass,
    Copier copier)

Copier would be an interface:

public interface Copier<K, V> {

  public Pair<K, V> copy(K key, V value);

}
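
For Avro, a copier could then be a one-liner over Avro's deepCopy (AVRO-964).
Just as a sketch against the proposed interface, not a real API: the
AvroDeepCopier name is made up, and the House record and NullWritable value
type are borrowed from Jacob's example, so treat them as illustrative.

import org.apache.avro.mapred.AvroKey;
import org.apache.avro.specific.SpecificData;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mrunit.types.Pair;

// Sketch only: clones the reducer output with Avro's deepCopy instead of
// round-tripping it through AvroSerialization and the mapper schemas.
public class AvroDeepCopier implements Copier<AvroKey<House>, NullWritable> {

  public Pair<AvroKey<House>, NullWritable> copy(final AvroKey<House> key,
      final NullWritable value) {
    final House copied = (House) SpecificData.get()
        .deepCopy(House.SCHEMA$, key.datum());
    return new Pair<AvroKey<House>, NullWritable>(
        new AvroKey<House>(copied), value);
  }
}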

What do you think?

On 05/22/2012 04:34 PM, Jacob Metcalf wrote:
> Jim
>
> My last example is just to show other people who are stuck with this 
> how to use unions to solve it.
>
> I had not used an output format in it because it made no difference to 
> my issue. Let me try to explain with the code:
>
> Line 162: 
> https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockMapreduceOutputFormat.java 
>
>
> while (recordReader.nextKeyValue()) {
>   outputs.add(new Pair<K, V>(serialization.copy(recordReader
>       .getCurrentKey()), serialization.copy(recordReader
>       .getCurrentValue())));
> }
>
> Line 48 
> https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/output/MockOutputCollector.java
>
> public void collect(final K key, final V value) throws IOException {
>   collectedOutputs.add(new Pair<K, V>(serialization.copy(key), serialization
>       .copy(value)));
> }
>
> Both of these use MRUnit's Serialization class to clone the output
> objects, so it does not matter whether I configure an output format.
> These in turn use AvroSerialization to attempt to clone the object.
> In my Hadoop job my map output schema Room is different from my
> reducer output schema House; AvroSerialization is only configured to
> serialize the mapper output Room, and it all works. However, if I
> attempt the same with the unit test, MRUnit's Serialization class
> attempts to clone a House using the Room schema and it blows up.
>
> You suggested adding a JobConf; that would work but would require
> people to write code to clone the conf and change the Schema.
> Alternatively, I suggested that since the user has to write code anyway,
> you could allow them to pass in a Cloner object. For Avro this would be
> a simple one-liner calling Avro's deepCopy:
> http://avro.apache.org/docs/1.6.0/api/java/org/apache/avro/generic/GenericData.html#deepCopy(org.apache.avro.Schema, java.lang.Object)
>
> Hope all that makes sense!
>
> Jacob

RE: Deserializer used for both Map and Reducer context.write()

Posted by Jacob Metcalf <ja...@hotmail.com>.

Jim
My last example is just to show other people who are stuck with this how to use unions to solve it. 
I had not used an output format in it because it made no difference to my issue. Let me try to explain with the code:
Line 162: https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockMapreduceOutputFormat.java 
while (recordReader.nextKeyValue()) {
  outputs.add(new Pair<K, V>(serialization.copy(recordReader
      .getCurrentKey()), serialization.copy(recordReader
      .getCurrentValue())));
}

Line 48: https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/output/MockOutputCollector.java

public void collect(final K key, final V value) throws IOException {
  collectedOutputs.add(new Pair<K, V>(serialization.copy(key), serialization
      .copy(value)));
}
Both of these use MRUnit's Serialization class to clone the output objects, so it does not matter whether I configure an output format. These in turn use AvroSerialization to attempt to clone the object. In my Hadoop job my map output schema Room is different from my reducer output schema House; AvroSerialization is only configured to serialize the mapper output Room, and it all works. However, if I attempt the same with the unit test, MRUnit's Serialization class attempts to clone a House using the Room schema and it blows up.
You suggested adding a JobConf; that would work but would require people to write code to clone the conf and change the Schema. Alternatively, I suggested that since the user has to write code anyway, you could allow them to pass in a Cloner object. For Avro this would be a simple one-liner calling Avro's deepCopy: http://avro.apache.org/docs/1.6.0/api/java/org/apache/avro/generic/GenericData.html#deepCopy(org.apache.avro.Schema, java.lang.Object)
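
To show what that conf-cloning code would look like, here is a sketch; the
helper class and method names are made up, and House is the reducer output
record from my example:

import org.apache.avro.hadoop.io.AvroSerialization;
import org.apache.hadoop.conf.Configuration;

public class ReducerOutputConf {

  // Hypothetical helper: clones the job conf and repoints the
  // AvroSerialization schemas at the reducer output type (House).
  public static Configuration forReducerOutput(final Configuration base) {
    final Configuration copy = new Configuration(base);
    AvroSerialization.setKeyWriterSchema(copy, House.SCHEMA$);
    AvroSerialization.setKeyReaderSchema(copy, House.SCHEMA$);
    return copy;
  }
}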
Hope all that makes sense!
Jacob

> Date: Sun, 20 May 2012 21:39:12 -0400
> From: donofrio111@gmail.com
> To: mrunit-user@incubator.apache.org
> Subject: Re: Deserializer used for both Map and Reducer context.write()
> 
> Sorry for the delay. So you are suggesting I provide an option for an
> alternative conf that only the inputformat uses? So we could
> have withOutputFormat(outputformat, inputformat) and
> withOutputFormat(outputformat, inputformat, jobconf)?
> 
> I am confused why your example doesn't use withOutputFormat. Is that
> because you are doing your own verification with run() instead of
> calling runTest?

Re: Deserializer used for both Map and Reducer context.write()

Posted by Jim Donofrio <do...@gmail.com>.
Sorry for the delay. So you are suggesting I provide an option for an
alternative conf that only the inputformat uses? So we could
have withOutputFormat(outputformat, inputformat) and
withOutputFormat(outputformat, inputformat, jobconf)?
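
Usage would then look something like this; the overload itself is still
hypothetical, driver here is your MapReduceDriver, and the Avro format
classes and the House schema are borrowed from your example:

@Test
public void testReadOutputWithSeparateConf() {
  // Conf used only by the inputformat that reads the reducer output back in.
  final JobConf inputFormatOnlyConf = new JobConf(driver.getConfiguration());
  AvroSerialization.setKeyReaderSchema(inputFormatOnlyConf, House.SCHEMA$);
  driver.withOutputFormat(AvroKeyValueOutputFormat.class,
      AvroKeyValueInputFormat.class, inputFormatOnlyConf);
  driver.runTest();
}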

I am confused why your example doesn't use withOutputFormat. Is that
because you are doing your own verification with run() instead of
calling runTest?
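
I.e. my guess is your test does something along these lines, asserting on the
list that run() returns rather than using withOutput plus runTest; the types
are just my guess from your example:

@Test
public void testVerifyByHand() throws IOException {
  final List<Pair<AvroKey<House>, NullWritable>> outputs = driver.run();
  assertEquals(1, outputs.size());
}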

On 05/13/2012 03:43 PM, Jacob Metcalf wrote:
>
> The InputFormat works fine - but it is configured separately to 
> AvroSerialization which MRUnit's MockMapreduceOutputFormat.java is 
> effectively using to clone. Garret Wu's new MR2 
> AvroKeyValueInputFormat and AvroKeyValueOutputFormat pick up their 
> configuration from "avro.schema.[input|output].[key|value]". Whereas 
> AvroSerialization, which is typically only used on the shuffle, picks 
> up its configuration from 
> "avro.serialization.[key|value].[writer|reader].schema".
>
> In the case of MRUnit I see 
> org.apache.hadoop.mrunit.internal.io.Serialization already has a 
> copyWithConf(). So you could have users provide a separate optional 
> config to withOutputFormat(). It would take a few comments to explain 
> and users would have to be careful to keep the configs separate !
>
> ---
>
> For anyone who has trouble with this in future (3) worked and was 
> pretty easy. I found that you can get Avro to support multiple schemas 
> through unions: https://issues.apache.org/jira/browse/AVRO-127. In my 
> case it was a matter of doing this:
>
> AvroJob.setMapOutputValueSchema( job, Schema.createUnion( 
> Lists.newArrayList( Room.SCHEMA$, House.SCHEMA$ )));
>
> Then breaking with convention and storing the Avro output of the 
> reducer in the value. For completeness I have attached an example 
> which works on both MRUnit and Hadoop 0.23 but you will need to obtain 
> and build: com.odiago.avro:odiago-avro:1.0.7-SNAPSHOT
>
> Jacob
>

RE: Deserializer used for both Map and Reducer context.write()

Posted by Jacob Metcalf <ja...@hotmail.com>.




The InputFormat works fine - but it is configured separately from AvroSerialization, which MRUnit's MockMapreduceOutputFormat.java is effectively using to clone. Garret Wu's new MR2 AvroKeyValueInputFormat and AvroKeyValueOutputFormat pick up their configuration from "avro.schema.[input|output].[key|value]", whereas AvroSerialization, which is typically only used on the shuffle, picks up its configuration from "avro.serialization.[key|value].[writer|reader].schema".
In the case of MRUnit I see org.apache.hadoop.mrunit.internal.io.Serialization already has a copyWithConf(), so you could have users provide a separate optional config to withOutputFormat(). It would take a few comments to explain, and users would have to be careful to keep the configs separate!
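As a rough sketch only (the three-argument withOutputFormat overload below is hypothetical, it is the proposal above and not an existing MRUnit method; the Room/House schemas and the two config-key families are the ones from this thread):

    // Shuffle-side config: the keys AvroSerialization reads
    // ("avro.serialization.[key|value].[writer|reader].schema").
    Configuration shuffleConf = driver.getConfiguration();
    AvroSerialization.addToConfiguration(shuffleConf);
    AvroSerialization.setValueWriterSchema(shuffleConf, Room.SCHEMA$);
    AvroSerialization.setValueReaderSchema(shuffleConf, Room.SCHEMA$);

    // Separate format-side config: the keys the MR2 Avro formats read
    // ("avro.schema.[input|output].[key|value]").
    Configuration formatConf = new Configuration();
    formatConf.set("avro.schema.output.key", House.SCHEMA$.toString());

    // Hypothetical overload taking a config used only by the two formats:
    driver.withOutputFormat(AvroKeyOutputFormat.class,
                            AvroKeyInputFormat.class, formatConf);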
---
For anyone who has trouble with this in future, (3) worked and was pretty easy. I found that you can get Avro to support multiple schemas through unions: https://issues.apache.org/jira/browse/AVRO-127. In my case it was a matter of doing this:
    AvroJob.setMapOutputValueSchema( job, Schema.createUnion( Lists.newArrayList( Room.SCHEMA$, House.SCHEMA$ )));
Then breaking with convention and storing the Avro output of the reducer in the value. For completeness I have attached an example which works on both MRUnit and Hadoop 0.23 but you will need to obtain and build: com.odiago.avro:odiago-avro:1.0.7-SNAPSHOT
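A slightly fuller sketch of this wiring, for anyone reading along (the Long key and the Job setup are illustrative assumptions, not taken from the attached example; Lists is Guava's helper):

    Job job = new Job(new Configuration());
    // One union schema covers both the mapper's Room and the reducer's House,
    // so the single value schema that AvroSerialization is configured with
    // can serialise the output of both stages.
    AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.LONG));
    AvroJob.setMapOutputValueSchema(job,
        Schema.createUnion(Lists.newArrayList(Room.SCHEMA$, House.SCHEMA$)));
    // The reducer then writes its House in the *value* position:
    //   context.write(key, new AvroValue<Object>(house));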
Jacob


Re: Deserializer used for both Map and Reducer context.write()

Posted by Jim Donofrio <do...@gmail.com>.
Yes, I agree 3 is a bad idea: you shouldn't have to change your code to 
work with a unit test.

Ideally AvroSerialization would already support this and you wouldn't 
have to do 4.

I am not sure I want to do 2 either; it is just more code users have to 
write to use MRUnit.


MRUnit doesn't really use serialization to clone in the reducer. After I 
write the output out with the outputformat I need some way to bring the 
objects back in so that I can use our existing validation methods. The 
simplest way to do this that I thought of that used existing Hadoop 
concepts was to have the user set an inputformat, as if they were using 
the mapper in another map reduce job to read the output of the 
mapreduce job that you are testing. How do you usually read the output 
of an Avro job? Maybe I just need to allow you to set an alternative 
JobConf that only gets used by the InputFormat, since you say that 
AvroSerialization only supports one key and value?
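To make that round trip concrete, a minimal sketch (the TextOutputFormat/TextInputFormat pairing is from the test quoted later in this thread; the explicit io.serializations line is illustrative):

    // Assumes a MapReduceDriver "driver" as used elsewhere in this thread.
    Configuration conf = driver.getConfiguration();
    // The mapreduce-api RecordReader has no createKey()/createValue(), so
    // each record read back is copied through whatever io.serializations
    // has registered.
    conf.setStrings("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization");
    // Write the reduce output with the OutputFormat, then read it back in
    // with the InputFormat so the usual withOutput() assertions still apply:
    driver.withOutputFormat(TextOutputFormat.class, TextInputFormat.class);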


RE: Deserializer used for both Map and Reducer context.write()

Posted by Jacob Metcalf <ja...@hotmail.com>.

No, thanks for looking at it. My next step was to attempt to get my example running on a pseudo-distributed cluster. This took me a while as I am only a Hadoop beginner and had problems with my HADOOP_CLASSPATH, but it now all works. This proved to me that Hadoop does not use AvroSerialization in the reducer output stage.
I understand why MRUnit needs to make copies but:
  * It appears AvroSerialization can only be configured to serialize one key class and one value schema.
  * It appears it is only expecting to be used in the mapper phase.
  * I configure it to serialize Room (output of the mapper stage).
  * So it gets a shock when MRUnit sends it a House (output of the reducer stage).
I have thought of a number of ways round this both on the MRUnit side and my side:
 1. MRUnit could check to see if objects support Serializable/Cloneable and utilise these in preference. Unfortunately I don't think Avro generated classes do implement these, but Protobuf does.

 2. withOutputFormat() could take an optional object with an interface, e.g. "Cloner", which users pass in (a sketch follows this list). You may not want Avro dependencies in MRUnit but it is fairly easy for people to write a concrete Cloner for Avro, see: https://issues.apache.org/jira/browse/AVRO-964

 3. I think I should be able to use an Avro union (http://avro.apache.org/docs/1.6.3/spec.html#Unions) of Room and House to make AvroSerialization able to handle both classes. This however is complicating my message format just to support MRUnit, so probably not a good long-term solution.

 4. It may be possible to write an AvroSerialization class capable of handling any Avro generated class. The problem is Avro wraps everything in AvroKey and AvroValue, so by the time Serialization.accept is called you have lost the specific class information through erasure. If I went down this path I could end up having to write my own version of Avro MR.
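For illustration, option (2) might look something like this sketch (the Cloner interface and class names are hypothetical, not existing MRUnit types; the deep copy is the AVRO-964 mechanism linked above):

    import org.apache.avro.specific.SpecificData;
    import org.apache.avro.specific.SpecificRecord;

    // Hypothetical interface that users could pass to withOutputFormat().
    public interface Cloner<T> {
        T copy(T value);
    }

    // Avro-specific implementation: SpecificData.deepCopy (added by AVRO-964)
    // walks the record's own schema, so one cloner handles Room, House, or
    // any other generated class.
    public class AvroSpecificCloner implements Cloner<SpecificRecord> {
        @Override
        public SpecificRecord copy(SpecificRecord value) {
            return SpecificData.get().deepCopy(value.getSchema(), value);
        }
    }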
Let me know if you are interested in option (2) in which case I will help test. If not I will play around with (3) and (4).
Thanks
Jacob 



> Date: Sat, 12 May 2012 11:09:07 -0400
> From: donofrio111@gmail.com
> To: mrunit-user@incubator.apache.org
> Subject: Re: Deserializer used for both Map and Reducer context.write()
> 
> Sorry for the delay, I haven't had a chance to look at this too much.
> 
> Yes you are correct that I need to use mrunit's Serialization class to 
> copy the objects because the RecordReaders will reuse objects. The old 
> mapred RecordReader interface has createKey and createValue methods 
> which create a new instance for me but the mapreduce api removed these 
> methods so I am forced to copy them.
> 
> The configuration gets passed down to AvroSerialization so the schema 
> should be available for reducer output.
> 
> On 05/10/2012 07:13 PM, Jacob Metcalf wrote:
> > Jim
> >
> > Unfortunately this did not fix my issue but at least I can now attach 
> > a unit test. The test is made up as below:
> >
> > - I used Avro 1.6.3 so you did not have to build 1.7. The 
> > AvroSerialization class is slightly different but still has the same 
> > problem.
> >
> > - I managed to get MRUNIT-1.0.0, thanks for putting that on the repo.
> >
> > - I could not use the new MR2 AvroKeyFileOutput from Avro 1.7 as it 
> > tries to use HDFS (which is what I am trying to avoid through the 
> > excellent MRUnit). Instead I mocked out my own 
> > in MockAvroFormats.java. This could do with some improvement but it 
> > demonstrates the problem.
> >
> > - I have a Room and House class which you will see get code generated 
> > from the Avro schema file.
> >
> > - I have a mapper which takes text and outputs Room and a reducer 
> > which takes <Long,List<Room>> and outputs a House.
> >
> >
> > The first test noOutputFormatTest() demonstrates my original problem. 
> > Trying to re-use the serializer for the output of the reducer at 
> > MockOutputCollector:49 causes the exception:
> >
> >     java.lang.ClassCastException: net.jacobmetcalf.avro.House cannot 
> > be cast to java.lang.Long
> >
> > Because the AvroSerialization is configured for the output of the 
> > Mapper so is expecting to be sent a Long in the key but here is being 
> > sent a House.
> >
> > The second test withOutputFormatTest() results in the same exception. 
> > But this time from MockMapreduceOutputFormat.java:162. I assume you 
> > are forced to clone here because the InputFormat may be re-using its 
> > objects?
> >
> > The heart of the problem is AvroSerialization retrieves its schema 
> > through the configuration. So my guess is that it can only ever be 
> > used for the shuffle. But I am happy to cross post this on the Avro 
> > board to see if I am doing something wrong.
> >
> > Thanks
> >
> > Jacob
> >
> >
> > > Date: Thu, 10 May 2012 08:57:36 -0400
> > > From: donofrio111@gmail.com
> > > To: mrunit-user@incubator.apache.org
> > > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > >
> > > In revision 1336519 I checked in my initial work for MRUNIT-101. I still
> > > need to do some cleaning up and adding the javadoc but the feature is
> > > there and tested. I reconfigured our jenkins setup to publish snapshots
> > > to Nexus so you should see a 1.0.0-incubating-SNAPSHOT mrunit jar in
> > > apache's Nexus repository. I don't think this gets replicated so you will
> > > have to add apache's repository to your settings.xml if you are using maven.
> > >
> > > @Test
> > > public void testOutputFormatWithMismatchInOutputClasses() {
> > >   final MapReduceDriver driver = this.driver;
> > >   driver.withOutputFormat(TextOutputFormat.class, TextInputFormat.class);
> > >   driver.withInput(new Text("a"), new LongWritable(1));
> > >   driver.withInput(new Text("a"), new LongWritable(2));
> > >   driver.withOutput(new LongWritable(), new Text("a\t3"));
> > >   driver.runTest();
> > > }
> > >
> > > You can look at
> > > org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java [1] to see
> > > how to use the outputformat. Just call withOutputFormat on the driver
> > > with the outputformat you want to use and the inputformat you want to
> > > read that output back into the output list. The Serialization class is
> > > used after the inputformat to copy the inputs into a list so make sure
> > > to set io.serializations because the mapreduce api RecordReader does not
> > > have createKey and createValue methods. Let me know if that does not
> > > work for Avro.
> > >
> > > When I get to MultipleOutputs MRUNIT-13 in the next few days it will be
> > > implemented with a similar api except you will also need to specify the
> > > name of the output collector.
> > >
> > > [1]:
> > > http://svn.apache.org/viewvc/incubator/mrunit/trunk/src/test/java/org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java?view=markup
> > >
> > > On 05/09/2012 03:07 PM, Jacob Metcalf wrote:
> > > > Jim, Brock
> > > >
> > > > Thanks for getting back to me so quickly, and yes I suspect MR-101 is
> > > > the answer.
> > > >
> > > > The key thing I wanted to establish is whether:
> > > >
> > > > 1) The "contract" is that the Serialization concrete implementations
> > > > listed in "io.serializations" should only ever be used for serializing
> > > > mapper output in the shuffle stage.
> > > >
> > > > 2) OR I am doing something very wrong with Avro - for example I
> > > > should only be using the same schema for map and reduce output.
> > > >
> > > > Assuming (1) is correct then MR-101 would make a big difference, as
> > > > long as you could avoid using the serializer to clone the output of
> > > > the reducer. I am guessing you would use the concrete OutputFormat to
> > > > serialize the reducer output to a stream and then the unit tester
> > > > would need to deserialize themselves to assert the output? But what
> > > > would people who just want to stick to asserting based on the reducer
> > > > output do?
> > > >
> > > > I will try and boil my issue down to a canned example over the next
> > > > few days. If you are interested in Avro they are working on
> > > > integrating Garret Wu's MR2 extensions in 1.7 and there is a test case
> > > > here:
> > > >
> > > > http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestWordCount.java?view=markup
> > > >
> > > > I am happy to test MR-101 for you if you let me know when it's available.
> > > >
> > > > Regards
> > > >
> > > > Jacob
> > > >
> > > >
> > > > > From: brock@cloudera.com
> > > > > Date: Wed, 9 May 2012 09:17:42 -0500
> > > > > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > > > > To: mrunit-user@incubator.apache.org
> > > > >
> > > > > Hi,
> > > > >
> > > > > As Jim says, I wonder if MRUNIT-101 will help. Would it be possible to
> > > > > share the exception/error you saw? If you have time, I'd enjoy seeing
> > > > > a small example of the code in question so we can add that to our test
> > > > > suite.
> > > > >
> > > > > Cheers,
> > > > > Brock
> > > > >
> > > > > On Wed, May 9, 2012 at 8:02 AM, Jim Donofrio <do...@gmail.com> wrote:
> > > > > > I am not too familiar with Avro, maybe someone else can respond but if the
> > > > > > AvroKeyOutputFormat does the serialization then MRUNIT-101 [1] should fix
> > > > > > your problem. I am just finishing this JIRA up, it works under Hadoop 1+, I
> > > > > > am having issues with TaskAttemptContext and JobContext changing from
> > > > > > classes to interfaces in the mapreduce api in Hadoop 0.23.
> > > > > >
> > > > > > I should resolve this over the next few days. In the meantime if you can
> > > > > > post your code I can test against it. It may also be worth the MRUnit
> > > > > > project exploring having Jenkins deploy a snapshot to Nexus so you can
> > > > > > easily test against the trunk without having to build it or download the jar
> > > > > > from Jenkins.
> > > > > >
> > > > > > [1]: https://issues.apache.org/jira/browse/MRUNIT-101
> > > > > >
> > > > > >
> > > > > > On 05/09/2012 03:15 AM, Jacob Metcalf wrote:
> > > > > >>
> > > > > >>
> > > > > >> I am trying to integrate Avro-1.7 (specifically the new MR2
> > > > extensions),
> > > > > >> MRUnit-0.9.0 and Hadoop-0.23. Assuming I have not made any
> > > > mistakes my
> > > > > >> question is should MRUnit be using the Serialization factory 
> > when
> > > > I call
> > > > > >> context.write() in a reducer.
> > > > > >>
> > > > > >> I am using MapReduceDriver and my mapper has output signature:
> > > > > >>
> > > > > >> <AvroKey<SpecificKey1>,AvroValue<SpecificValue1>>
> > > > > >>
> > > > > >> My reducer has a different outputt signature:
> > > > > >>
> > > > > >> <AvroKey<SpecificValue2>, Null>.
> > > > > >>
> > > > > >> I am using Avro specific serialization so I set my Avro schemas
> > > > like this:
> > > > > >>
> > > > > >> AvroSerialization.addToConfiguration( configuration );
> > > > > >> AvroSerialization.setKeyReaderSchema(configuration,
> > > > SpecificKey1.SCHEMA$
> > > > > >> );
> > > > > >> AvroSerialization.setKeyWriterSchema(configuration,
> > > > SpecificKey1.SCHEMA$
> > > > > >> );
> > > > > >> AvroSerialization.setValueReaderSchema(configuration,
> > > > > >> SpecificValue1.SCHEMA$ );
> > > > > >> AvroSerialization.setValueWriterSchema(configuration,
> > > > > >> SpecificValue1.SCHEMA$ );
> > > > > >>
> > > > > >> My understanding of Avro MR is that the Serialization class is
> > > > intended to
> > > > > >> be invoked between the map and reduce phase.
> > > > > >>
> > > > > >> However my test fails at reduce stage. Debugging I realised 
> > the mock
> > > > > >> reducer context is using the serializer to copy objects:
> > > > > >>
> > > > > >>
> > > > > >>
> > > > 
> > https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockContextWrapper.java
> > > > > >>
> > > > > >> Looking at the AvroSerialization object it only expects one 
> > set of
> > > > > >> schemas:
> > > > > >>
> > > > > >>
> > > > > >>
> > > > 
> > http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java?view=markup
> > > > > >>
> > > > > >> So when my reducer tries to write SpecificValue2 to the context,
> > > > MRUnit's
> > > > > >> mock then tries to serialise SpecificValue2 with Value1.SCHEMA$
> > > > and as a
> > > > > >> result fails.
> > > > > >>
> > > > > >> I have yet debugged Hadoop itself but I did read some comments
> > > > (which I
> > > > > >> since cannot locate) which says that the Serialization class is
> > > > typically
> > > > > >> not used for the output of the reduce stage. My limited
> > > > understanding is
> > > > > >> that the OutputFormat (e.g. AvroKeyOutputFormat) will act as the
> > > > > >> deserializer when you are running in Hadoop.
> > > > > >>
> > > > > >> I can spend some time distilling my code into a simple 
> > example but
> > > > > >> wondered if anyone had any pointers - or an Avro + MR2 + MRUnit
> > > > example.
> > > > > >>
> > > > > >> Jacob
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Apache MRUnit - Unit testing MapReduce -
> > > > http://incubator.apache.org/mrunit/
 		 	   		  

Re: Deserializer used for both Map and Reducer context.write()

Posted by Jim Donofrio <do...@gmail.com>.
Sorry for the delay, I haven't had a chance to look at this too much.

Yes, you are correct that I need to use mrunit's Serialization class to 
copy the objects, because the RecordReaders will reuse objects. The old 
mapred RecordReader interface has createKey and createValue methods, 
which create a new instance for me, but the mapreduce api removed these 
methods, so I am forced to copy them.
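
For illustration, here is a minimal sketch (not MRUnit's exact code; the 
class and method names are my own) of what copying through Hadoop's 
Serialization machinery looks like. Note that the factory resolves the 
serializer from the Configuration, not from the call site:

   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.io.serializer.Deserializer;
   import org.apache.hadoop.io.serializer.SerializationFactory;
   import org.apache.hadoop.io.serializer.Serializer;

   public class SerializationCopy {
     // Round-trips obj through whichever Serialization in io.serializations
     // accepts its class; the result is a fresh instance that is safe to
     // keep in a list after the RecordReader reuses the original.
     @SuppressWarnings("unchecked")
     public static <T> T copy(T obj, Configuration conf) throws IOException {
       SerializationFactory factory = new SerializationFactory(conf);
       Class<T> clazz = (Class<T>) obj.getClass();

       ByteArrayOutputStream bytes = new ByteArrayOutputStream();
       Serializer<T> serializer = factory.getSerializer(clazz);
       serializer.open(bytes);
       serializer.serialize(obj);
       serializer.close();

       Deserializer<T> deserializer = factory.getDeserializer(clazz);
       deserializer.open(new ByteArrayInputStream(bytes.toByteArray()));
       T copy = deserializer.deserialize(null);
       deserializer.close();
       return copy;
     }
   }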

The configuration gets passed down to AvroSerialization so the schema 
should be available for reducer output.


RE: Deserializer used for both Map and Reducer context.write()

Posted by Jacob Metcalf <ja...@hotmail.com>.
Jim
Unfortunately this did not fix my issue, but at least I can now attach a unit test. The test is made up as follows:
- I used Avro 1.6.3 so you did not have to build 1.7. The AvroSerialization class is slightly different but still has the same problem.
- I managed to get MRUNIT-1.0.0, thanks for putting that on the repo.
- I could not use the new MR2 AvroKeyFileOutput from Avro 1.7 as it tries to use HDFS (which is what I am trying to avoid through the excellent MRUNIT). Instead I mocked out my own in MockAvroFormats.java. This could do with some improvement but it demonstrates the problem.
- I have a Room and House class which you will see get code generated from the Avro schema file.
- I have a mapper which takes text and outputs Room and a reducer which takes <Long,List<Room>> and outputs a House. The wiring is sketched below.
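
For reference, here is roughly how the driver is wired up (a sketch using the 1.7-style AvroSerialization API from my original post; HouseMapper and HouseReducer are made-up names standing in for the classes in the attachment):

   import org.apache.avro.Schema;
   import org.apache.avro.hadoop.io.AvroSerialization;
   import org.apache.avro.mapred.AvroKey;
   import org.apache.avro.mapred.AvroValue;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.NullWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;

   public class HouseTestWiring {
     public static void wire() {
       MapReduceDriver<LongWritable, Text,
           AvroKey<Long>, AvroValue<Room>,
           AvroKey<House>, NullWritable> driver =
           MapReduceDriver.newMapReduceDriver(new HouseMapper(), new HouseReducer());

       Configuration conf = driver.getConfiguration();
       AvroSerialization.addToConfiguration(conf);
       // Only one writer schema pair fits in the Configuration, and it has
       // to describe the mapper output; there is nowhere to say that the
       // reducer writes a House.
       AvroSerialization.setKeyWriterSchema(conf, Schema.create(Schema.Type.LONG));
       AvroSerialization.setValueWriterSchema(conf, Room.SCHEMA$);
     }
   }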

The first test noOutputFormatTest() demonstrates my original problem. Trying to re-use the serializer for the output of the reducer at MockOutputCollector:49 causes the exception:

    java.lang.ClassCastException: net.jacobmetcalf.avro.House cannot be cast to java.lang.Long

This is because the AvroSerialization is configured for the output of the Mapper, so it is expecting to be sent a Long in the key but here is being sent a House.

The second test withOutputFormatTest() results in the same exception. But this time from MockMapreduceOutputFormat.java:162. I assume you are forced to clone here because the InputFormat may be re-using its objects?
The heart of the problem is that AvroSerialization retrieves its schema through the configuration. So my guess is that it can only ever be used for the shuffle. But I am happy to cross-post this on the Avro board to see if I am doing something wrong.
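
To show what I mean, here is a tiny demo (again assuming the 1.7-style API): whichever phase asks for the schema gets the same answer, because it lives in the Configuration, not at the call site:

   import org.apache.avro.Schema;
   import org.apache.avro.hadoop.io.AvroSerialization;
   import org.apache.hadoop.conf.Configuration;

   public class SchemaLookupDemo {
     public static void main(String[] args) {
       Configuration conf = new Configuration();
       AvroSerialization.addToConfiguration(conf);
       AvroSerialization.setKeyWriterSchema(conf, Schema.create(Schema.Type.LONG));

       // Prints the long schema no matter which phase is asking: the lookup
       // is keyed only on the Configuration.
       System.out.println(AvroSerialization.getKeyWriterSchema(conf));
     }
   }
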
Thanks
Jacob


Re: Deserializer used for both Map and Reducer context.write()

Posted by Jim Donofrio <do...@gmail.com>.
In revision 1336519 I checked in my initial work for MRUNIT-101. I still 
need to do some cleaning up and adding the javadoc, but the feature is 
there and tested. I reconfigured our jenkins setup to publish snapshots 
to Nexus, so you should see a 1.0.0-incubating-SNAPSHOT mrunit jar in 
apache's Nexus repository. I don't think this gets replicated, so you will 
have to add apache's repository to your settings.xml if you are using maven.

   @Test
   public void testOutputFormatWithMismatchInOutputClasses() {
     final MapReduceDriver driver = this.driver;
     driver.withOutputFormat(TextOutputFormat.class, TextInputFormat.class);
     driver.withInput(new Text("a"), new LongWritable(1));
     driver.withInput(new Text("a"), new LongWritable(2));
     driver.withOutput(new LongWritable(), new Text("a\t3"));
     driver.runTest();
   }

You can look at 
org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java [1] to see 
how to use the outputformat. Just call withOutputFormat on the driver 
with the outputformat you want to use and the inputformat you want to 
read that output back into the output list. The Serialization class is 
used after the inputformat to copy the inputs into a list, so make sure 
to set io.serializations, because the mapreduce api RecordReader does not 
have createKey and createValue methods. Let me know if that does not 
work for Avro.
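
As a minimal sketch of setting that property (Avro's 
AvroSerialization.addToConfiguration(conf) registers Avro's own entry in 
io.serializations for you, so the explicit form below is just for 
illustration):

   import org.apache.hadoop.conf.Configuration;

   public class SerializationConfig {
     // Registers the serializations the copy step can pick from;
     // WritableSerialization covers the usual Writable types.
     public static void register(Configuration conf) {
       conf.setStrings("io.serializations",
           "org.apache.hadoop.io.serializer.WritableSerialization",
           "org.apache.avro.hadoop.io.AvroSerialization");
     }
   }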

When I get to MultipleOutputs (MRUNIT-13) in the next few days, it will be 
implemented with a similar api, except you will also need to specify the 
name of the output collector.

[1]: 
http://svn.apache.org/viewvc/incubator/mrunit/trunk/src/test/java/org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java?view=markup


Re: Deserializer used for both Map and Reducer context.write()

Posted by Jim Donofrio <do...@gmail.com>.
Yes, MRUNIT-101 will solely use the output format for serialization if 
specified.

No, the user does not need to deserialize; you must also specify an 
inputformat so that I can deserialize the output back into the usual 
output lists that work with runTest().


RE: Deserializer used for both Map and Reducer context.write()

Posted by Jacob Metcalf <ja...@hotmail.com>.
Jim, Brock
Thanks for getting back to me so quickly, and yes I suspect MR-101 is the answer. 
The key thing I wanted to establish is whether:
 1) The "contract" is that the Serialization concrete implementations listed in "io.serializations" should only ever be used for serializing mapper output in the shuffle stage.
 2) OR I am doing something very wrong with Avro - for example I should only be using the same schema for map and reduce output.
Assuming (1) is correct, MR-101 would make a big difference, as long as you could avoid using the serializer to clone the output of the reducer. I am guessing you would use the concrete OutputFormat to serialize the reducer output to a stream, and then the unit tester would need to deserialize it themselves to assert the output? But what would people who just want to stick to asserting based on the reducer output do?
I will try and boil my issue down to a canned example over the next few days. If you are interested in Avro they are working on integrating Garret Wu's MR2 extensions in 1.7 and there is a test case here:
http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestWordCount.java?view=markup

I am happy to test MR-101 for you if you let me know when it's available.
Regards 
Jacob


Re: Deserializer used for both Map and Reducer context.write()

Posted by Brock Noland <br...@cloudera.com>.
Hi,

As Jim says, I wonder if MRUNIT-101 will help.  Would it be possible to
share the exception/error you saw?  If you have time, I'd enjoy seeing
a small example of the code in question so we can add that to our test
suite.

Cheers,
Brock



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

Re: Deserializer used for both Map and Reducer context.write()

Posted by Jim Donofrio <do...@gmail.com>.
I am not too familiar with Avro, maybe someone else can respond, but if 
the AvroKeyOutputFormat does the serialization then MRUNIT-101 [1] 
should fix your problem. I am just finishing this JIRA up; it works 
under Hadoop 1+, but I am having issues with TaskAttemptContext and 
JobContext changing from classes to interfaces in the mapreduce api in 
Hadoop 0.23.
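
For what it's worth, one way to paper over that change is pure 
reflection; a minimal sketch (my own illustration, not necessarily what 
will land in MRUNIT-101):

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.mapreduce.TaskAttemptID;

   public class TaskContextFactory {
     // Builds a TaskAttemptContext without referencing the type directly,
     // so the same jar runs against Hadoop 1.x (where it is a class) and
     // Hadoop 0.23 (where it is an interface backed by TaskAttemptContextImpl).
     public static Object newTaskAttemptContext(Configuration conf)
         throws Exception {
       Class<?> contextClass =
           Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext");
       Class<?> implClass = contextClass.isInterface()
           ? Class.forName("org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl")
           : contextClass;
       return implClass
           .getConstructor(Configuration.class, TaskAttemptID.class)
           .newInstance(conf, new TaskAttemptID());
     }
   }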

I should resolve this over the next few days. In the meantime if you can 
post your code I can test against it. It may also be worth the MRUnit 
project exploring having Jenkins deploy a snapshot to Nexus so you can 
easily test against the trunk without having to build it or download the 
jar from Jenkins.

[1]: https://issues.apache.org/jira/browse/MRUNIT-101
