Posted to user@avro.apache.org by Jacob Metcalf <ja...@hotmail.com> on 2012/06/28 00:09:59 UTC

Suggestions when using Pair.getPairSchema for Reduce-Side Joins in MR2


I spent an hour or so today debugging some MapReduce jobs I had developed with Avro 1.7 and MapReduce 2 and thought it might be constructive to share. I needed to do a reduce-side join, which requires a composite key. The key consists of the key you are actually grouping by and an integer that is used only for sorting (the technique is described in many places, but there is a nice picture on page 24 of http://www.inf.ed.ac.uk/publications/thesis/online/IM100859.pdf).
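
For readers unfamiliar with the technique, here is a minimal plain-Java sketch of the composite-key idea, independent of Avro and Hadoop. The names (CompositeKeySketch, JoinKey, SORT, GROUP, shuffle) are made up for illustration. The point is only that sorting uses both parts of the key while grouping uses the first part alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CompositeKeySketch {
    /** Hypothetical composite key: the join key plus an integer tag used only for ordering. */
    static final class JoinKey {
        final String key;
        final int tag;

        JoinKey(String key, int tag) {
            this.key = key;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return key + "#" + tag;
        }
    }

    // Sort comparator: orders by the join key first, then by the tag.
    static final Comparator<JoinKey> SORT =
        Comparator.comparing((JoinKey k) -> k.key).thenComparingInt(k -> k.tag);

    // Grouping comparator: ignores the tag, so both sides of the join land in one group.
    static final Comparator<JoinKey> GROUP = Comparator.comparing((JoinKey k) -> k.key);

    /** Sorts with SORT, then starts a new group wherever GROUP sees a different key. */
    static List<List<JoinKey>> shuffle(List<JoinKey> mapOutput) {
        List<JoinKey> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        List<List<JoinKey>> groups = new ArrayList<>();
        JoinKey previous = null;
        for (JoinKey k : sorted) {
            if (previous == null || GROUP.compare(previous, k) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(k);
            previous = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<JoinKey> mapOutput = Arrays.asList(
            new JoinKey("b", 1), new JoinKey("a", 1), new JoinKey("a", 0), new JoinKey("b", 0));
        System.out.println(shuffle(mapOutput)); // prints [[a#0, a#1], [b#0, b#1]]
    }
}
```

Within each group the record tagged 0 arrives before the record tagged 1, which is what the join in the reducer relies on.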
For the composite key I thought it would be ideal to use Avro's Pair class, which has a handy function for creating its own schema, so I could configure the shuffle something like this:

    Schema joinKeySchema = Pair.getPairSchema(
        Schema.create( Schema.Type.STRING ), Schema.create( Schema.Type.INT ));
    // job is the Hadoop Job being configured
    AvroJob.setMapOutputKeySchema( job, joinKeySchema );

I then planned to use the standard AvroKeyComparator for sorting and a specialised comparator for grouping/partitioning which would ignore the integer part. However, it did not work: the sort on the integer did not appear to take place, and my map output would arrive at the reducer in the wrong order. I finally tracked the issue down to the fact that the pair schema by default ignores the second part of the pair when ordering:

    private static Schema makePairSchema(Schema key, Schema value) {
        Schema pair = Schema.createRecord(PAIR, null, null, false);
        List<Field> fields = new ArrayList<Field>();
        fields.add(new Field(KEY, key, "", null));
        fields.add(new Field(VALUE, value, "", null, Field.Order.IGNORE));
        pair.setFields(fields);
        return pair;
    }

In the end it was easy enough to work around by creating my own pair schema. I am not an expert, but I suspect there is a very valid application for this ignore in MR1. As a suggestion, it might help going forward if a second version of getPairSchema, with a boolean to toggle the ignore, were introduced to make the semantics clearer.
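
To illustrate why the integer was never sorted, here is a small plain-Java model of record comparison with per-field sort orders. It is a sketch under assumptions, not Avro's implementation: Order, Field, pairFields and compare are hypothetical names, and the boolean parameter mirrors the toggle suggested above.

```java
import java.util.Arrays;
import java.util.List;

public class FieldOrderSketch {
    /** Hypothetical stand-in for a per-field sort order; not Avro's Field.Order. */
    enum Order { ASCENDING, IGNORE }

    /** Hypothetical field description: a name and how the field takes part in sorting. */
    static final class Field {
        final String name;
        final Order order;

        Field(String name, Order order) {
            this.name = name;
            this.order = order;
        }
    }

    /** Builds the two fields of a pair; ignoreValue toggles whether the value is sorted on. */
    static List<Field> pairFields(boolean ignoreValue) {
        return Arrays.asList(
            new Field("key", Order.ASCENDING),
            new Field("value", ignoreValue ? Order.IGNORE : Order.ASCENDING));
    }

    /** Compares two records field by field, skipping every field marked IGNORE. */
    @SuppressWarnings({"unchecked", "rawtypes"})
    static int compare(List<Field> fields, Object[] a, Object[] b) {
        for (int i = 0; i < fields.size(); i++) {
            if (fields.get(i).order == Order.IGNORE) {
                continue; // an ignored field never influences the ordering
            }
            int c = ((Comparable) a[i]).compareTo(b[i]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        Object[] left = {"a", 0};
        Object[] right = {"a", 1};
        // Value ignored: the two records tie, so their relative order is not guaranteed.
        System.out.println(compare(pairFields(true), left, right));  // prints 0
        // Value sorted ascending: the record tagged 0 comes first.
        System.out.println(compare(pairFields(false), left, right)); // prints -1
    }
}
```

With Avro itself, the equivalent workaround is a two-field record schema whose second field is left at the default ascending order.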
Jacob

Re: Suggestions when using Pair.getPairSchema for Reduce-Side Joins in MR2

Posted by Scott Carey <sc...@apache.org>.
It sounds like we need to be extra clear in the documentation on Pair, and
perhaps have a different class or flavor that serves the purpose you needed.
("KeyPair"?)

In Avro's MRV1 API, there is no key schema or value schema for map output,
but only one map output schema that must be a Pair: a pair of key and
value, where only the key is used for the sort.

-Scott
