You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Qiong Zhang <ja...@yahoo-inc.com> on 2008/02/06 20:24:50 UTC

sort by value

Hi, All,

Is there a better way to sort by value in the same key before reaching
reducers?

I know it can be achieved by using
setOutputValueGroupingComparator/setOutputKeyComparatorClass.

But it actually adds duplicate data (i.e., the value column which needs
sorting) to the key.

Also, I wonder what is the benefit to sort values before reaching
reducers.
It can be achieved in the reduce phase anyway.

Thanks,
James

RE: sort by value

Posted by Qiong Zhang <ja...@yahoo-inc.com>.
Thank you all for the reply. 

Looks like the class KeyFieldBasedPartitioner in
org.apache.hadoop.mapred.lib can be used in Hadoop streaming to sort
both key (like primary key) and value (like secondary key) without data
duplication.

It is useful if we have same functionality in the native Java API.

James
-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Wednesday, February 06, 2008 1:53 PM
To: core-user@hadoop.apache.org
Subject: Re: sort by value




On 2/6/08 11:58 AM, "Joydeep Sen Sarma" <js...@facebook.com> wrote:

> 
>> But it actually adds duplicate data (i.e., the value column which
> needs 
>> sorting) to the key.
> 
> Why? U can always take it out of the value to remove the redundancy.
> 

Actually, you can't in most cases.

Suppose you have input data like this:

   a, b_1
   a, b_2
   a, b_1

And then the mapper produces data like this for each input record:

   a, b, 1
   a, *, 1
   a, b_2, 1
   a, *, 1
   a, b_1, 1
   a, *, 1

If you use the first two fields as the key so that you can sort the
records
nicely, you get the following inputs to the reducer

   <a, *>, [3, 2, 1]

You now don't know what the counts go to except for the first one.  If
you
replicate the second field in the value output of the map, then you get
this

   <a, *>, [[*,3], [b_1, 2], [b_2, 1]]

And you can produce the desired output:

   a, b_1, 2/3
   a, b_2, 1/3


Re: sort by value

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Feb 6, 2008, at 2:52 PM, Joydeep Sen Sarma wrote:

> ok - got it. This seems to be a subtle drawback in the reduce api.  
> Keys
> in the same reduce group may differ - but the reduce api does not make
> it possible for the reducer function to get access to each key. It  
> only
> gets access to the starting key value for that group.

With a context object api, it would be:

public interface Reducer<KeyIn,ValueIn,KeyOut,ValueOut> extends  
Closeable {
   void reduce(ReducerContext<KeyIn,ValueIn,KeyOut,ValueOut> context
               ) throws IOException, InterruptedException;
}

and the ReducerContext would have:

KeyIn getKey();

to get the key. Then your assumption would work because it would be  
the current key. Now if only I had time to work on HADOOP-1230. *smile*

-- Owen

RE: sort by value

Posted by Joydeep Sen Sarma <js...@facebook.com>.
ok - got it. This seems to be a subtle drawback in the reduce api. Keys
in the same reduce group may differ - but the reduce api does not make
it possible for the reducer function to get access to each key. It only
gets access to the starting key value for that group.

If the api was instead:

class KVpair { WritableComparable key, Writable value }
reduce(WritableComparable groupKey, Iterator<KVpair> keyvalues)

then we would be in good shape (since we can see the key and the value
and don't have to duplicate any data across them).

The underlying iterator has access to this data - it's just not
available through the api.

I suspect that these kinds of small optimizations are too complex to
make for a one-time job - but for any query language on top of hadoop -
it's a one time effort and probably worth it.

joydeep

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Wednesday, February 06, 2008 1:53 PM
To: core-user@hadoop.apache.org
Subject: Re: sort by value




On 2/6/08 11:58 AM, "Joydeep Sen Sarma" <js...@facebook.com> wrote:

> 
>> But it actually adds duplicate data (i.e., the value column which
> needs 
>> sorting) to the key.
> 
> Why? U can always take it out of the value to remove the redundancy.
> 

Actually, you can't in most cases.

Suppose you have input data like this:

   a, b_1
   a, b_2
   a, b_1

And then the mapper produces data like this for each input record:

   a, b, 1
   a, *, 1
   a, b_2, 1
   a, *, 1
   a, b_1, 1
   a, *, 1

If you use the first two fields as the key so that you can sort the
records
nicely, you get the following inputs to the reducer

   <a, *>, [3, 2, 1]

You now don't know what the counts go to except for the first one.  If
you
replicate the second field in the value output of the map, then you get
this

   <a, *>, [[*,3], [b_1, 2], [b_2, 1]]

And you can produce the desired output:

   a, b_1, 2/3
   a, b_2, 1/3


Re: sort by value

Posted by Ted Dunning <td...@veoh.com>.


On 2/6/08 11:58 AM, "Joydeep Sen Sarma" <js...@facebook.com> wrote:

> 
>> But it actually adds duplicate data (i.e., the value column which
> needs 
>> sorting) to the key.
> 
> Why? U can always take it out of the value to remove the redundancy.
> 

Actually, you can't in most cases.

Suppose you have input data like this:

   a, b_1
   a, b_2
   a, b_1

And then the mapper produces data like this for each input record:

   a, b, 1
   a, *, 1
   a, b_2, 1
   a, *, 1
   a, b_1, 1
   a, *, 1

If you use the first two fields as the key so that you can sort the records
nicely, you get the following inputs to the reducer

   <a, *>, [3, 2, 1]

You now don't know what the counts go to except for the first one.  If you
replicate the second field in the value output of the map, then you get this

   <a, *>, [[*,3], [b_1, 2], [b_2, 1]]

And you can produce the desired output:

   a, b_1, 2/3
   a, b_2, 1/3


RE: sort by value

Posted by Joydeep Sen Sarma <js...@facebook.com>.
> But it actually adds duplicate data (i.e., the value column which
needs 
> sorting) to the key.

Why? U can always take it out of the value to remove the redundancy.

> Also, I wonder what is the benefit to sort values before reaching
> reducers. It can be achieved in the reduce phase anyway.

The reduce only does a merge of sorted segments. The segments have to be
sorted using all the sort fields before the merge itself. Otherwise u
can't do a merge. (hope I understood the question right)


-----Original Message-----
From: Qiong Zhang [mailto:jamesz@yahoo-inc.com] 
Sent: Wednesday, February 06, 2008 11:25 AM
To: core-user@hadoop.apache.org
Subject: sort by value


Hi, All,

Is there a better way to sort by value in the same key before reaching
reducers?

I know it can be achieved by using
setOutputValueGroupingComparator/setOutputKeyComparatorClass.

But it actually adds duplicate data (i.e., the value column which needs
sorting) to the key.

Also, I wonder what is the benefit to sort values before reaching
reducers.
It can be achieved in the reduce phase anyway.

Thanks,
James

Re: sort by value

Posted by Ted Dunning <td...@veoh.com>.

The method to describe is the standard approach.

The benefit is that the data that arrives at the reducer might be larger
than you want to store in memory (for sorting by the reduce).  Also, reading
the entire set of reduce values would increase the amount of data allocated
and would mean that you would need to make two passes over each reduce set
(at least).  Sorting in the shuffle phase is essentially free.

One conventional use of this sorting is to ensure that summary data is
processed before other data.  For instance, if you are estimating
conditional probabilities p(b | a) and you have counts k(a, b) and k(a, *)
then it is nice to reduce on a so that you get k(a, *), k(a, b_1), k(a,
b_2)... as the input to the reducer.  With a simple sort, you can guarantee
that the k(a,*) value comes first which makes it easier to computer
k(a,b)/k(a,*) since you would already have the value of k(a,*) handy.

Another, much more obscure, use is in co-grouping where sorting in random
order can help minimize memory use for temporary buffering as you split the
reduce values into two or more lists or if you implement iterators over
virtual lists.


On 2/6/08 11:24 AM, "Qiong Zhang" <ja...@yahoo-inc.com> wrote:

> 
> Hi, All,
> 
> Is there a better way to sort by value in the same key before reaching
> reducers?
> 
> I know it can be achieved by using
> setOutputValueGroupingComparator/setOutputKeyComparatorClass.
> 
> But it actually adds duplicate data (i.e., the value column which needs
> sorting) to the key.
> 
> Also, I wonder what is the benefit to sort values before reaching
> reducers.
> It can be achieved in the reduce phase anyway.
> 
> Thanks,
> James