Posted to common-user@hadoop.apache.org by Gyanit <gy...@gmail.com> on 2009/03/11 03:44:17 UTC

Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

I have a large number of key/value pairs, and I don't actually care whether
the data goes in the value or the key. Let me be more exact:
the number of (k,v) pairs after the combiner is about 1 million, and I have
approximately 1 KB of data for each pair. I can put that data in the keys
or in the values.
I have experimented with both options, (heavy key, light value) vs. (light
key, heavy value). It turns out that the (hk,lv) option is much, much
better than (lk,hv).
Has someone else also noticed this?
Is there a way to make things faster in the (light key, heavy value)
option? Some applications will need that as well.
Remember that in both cases we are talking about at least a dozen or so
million pairs.
The difference shows up in the shuffle phase, which is weird, as the
amount of data transferred is the same.

-gyanit
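
To make the two layouts concrete, here is roughly what the two map outputs
being compared could look like against the old mapred API. This is only an
illustration, not the poster's actual code: the tab-separated
"<id>\t<payload>" input records, the class names, and the key/value types
are all assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LayoutExamples {

  // "Heavy key, light value": the ~1 KB payload becomes the map output key.
  public static class HeavyKeyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, NullWritable> out, Reporter reporter)
        throws IOException {
      String[] parts = line.toString().split("\t", 2);  // "<id>\t<payload>"
      out.collect(new Text(parts[1]), NullWritable.get());
    }
  }

  // "Light key, heavy value": a small integer key, the payload in the value.
  public static class LightKeyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, IntWritable, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<IntWritable, Text> out, Reporter reporter)
        throws IOException {
      String[] parts = line.toString().split("\t", 2);
      out.collect(new IntWritable(Integer.parseInt(parts[0])), new Text(parts[1]));
    }
  }
}

Both mappers emit the same bytes overall; only where the 1 KB payload sits
(key vs. value) differs, which is exactly the comparison discussed below.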


Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

Posted by Tim Wintle <ti...@teamrubber.com>.
On Tue, 2009-03-10 at 19:44 -0700, Gyanit wrote:
> I have a large number of key/value pairs, and I don't actually care
> whether the data goes in the value or the key. Let me be more exact:
> the number of (k,v) pairs after the combiner is about 1 million, and I
> have approximately 1 KB of data for each pair. I can put that data in
> the keys or in the values.
> I have experimented with both options, (heavy key, light value) vs.
> (light key, heavy value). It turns out that the (hk,lv) option is much,
> much better than (lk,hv).
<snip>
> The difference shows up in the shuffle phase, which is weird, as the
> amount of data transferred is the same.

Just an idea, but is this related to the hash function? Are there the
same number of reducers no matter which layout you use?

As I understand it, the reducers merge-sort the data while the shuffle is
happening, so if each reducer has to sort less data, that could be part
of it.
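
For reference, the default partitioning Tim is asking about comes down to
hashing the map output key modulo the number of reduce tasks. Hadoop's
stock HashPartitioner does essentially the following (a sketch of the
standard behaviour, not copied from any particular release):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Essentially what the default HashPartitioner does: which reducer a pair
// is sent to depends only on the key's hashCode(), so the number of
// reducers and the spread of key hash codes decide how evenly the shuffle
// output is split.
public class HashStylePartitioner<K, V> implements Partitioner<K, V> {
  public void configure(JobConf job) { }

  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

With only a handful of distinct keys, an unlucky hashCode() distribution
can leave some reducers with far more than their share, so the reducer
count and the key distribution are worth checking in both runs.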




Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

Posted by Gyanit <gy...@gmail.com>.
I noticed one more thing: lighter keys tend to produce a smaller number of
unique keys.
For example, there may be 10 million (key, value) pairs, but if the key is
light there might be only 1,000 unique keys; in the other case, with
heavier keys, there might be 5 million unique keys.
I think this might have something to do with it.
Bottom line: if your reduce is a simple dump with no combining, then put
the data in the keys rather than the values.

I need to put the data in the values. Any suggestions on how to make that
faster?

-Gyanit.
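
One possible direction, not suggested in the thread itself and only valid
because this reduce is a simple dump with no combining (so the values for a
given key never need to meet on one reducer): partition on the value as
well as the key, so the heavy values spread over all reducers even when
there are very few distinct keys. A minimal sketch against the old mapred
API; the class name and the key/value types are assumptions.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Spreads (light key, heavy value) pairs by hashing the value as well.
// Only safe when the reducer never needs all values of a key together,
// e.g. when it just writes each pair straight back out.
public class ValueSpreadingPartitioner implements Partitioner<IntWritable, Text> {
  public void configure(JobConf job) { }

  public int getPartition(IntWritable key, Text value, int numReduceTasks) {
    int hash = key.hashCode() * 31 + value.hashCode();
    return (hash & Integer.MAX_VALUE) % numReduceTasks;
  }
}

It would be wired in with conf.setPartitionerClass(ValueSpreadingPartitioner.class).
Whether it helps depends on whether the slowdown really is reducer skew,
which Scott raises as a possibility later in the thread.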


Scott Carey wrote:
> 
> That is a fascinating question.  I would also love to know the reason
> behind this.
> 
> <snip>



Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

Posted by Richa Khandelwal <ri...@gmail.com>.
I am running the same test; a job that completes in 10 minutes in the
(hk,lv) case is still running after 30 minutes have passed in the (lk,hv)
case. It would be interesting to pinpoint the reason behind it.
On Wed, Mar 11, 2009 at 1:27 PM, Gyanit <gy...@gmail.com> wrote:

>
> Here are the exact numbers:
> # of (k,v) pairs = 1.2 million (the same in both cases).
> # of unique k = 1,000; k is an integer.
> # of unique v = 1 million; v is a very large string.
> For a given k, the cumulative size of all v associated with it is about
> 30 MB (that is, each v is about 25-30 KB).
> # of Mappers = 30
> # of Reducers = 10
>
> (v,k) is at least 4-5 times faster than (k,v).
>
> -Gyanit
>
> <snip>
>


-- 
Richa Khandelwal


University Of California,
Santa Cruz.
Ph:425-241-7763

Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

Posted by Gyanit <gy...@gmail.com>.
Here are the exact numbers:
# of (k,v) pairs = 1.2 million (the same in both cases).
# of unique k = 1,000; k is an integer.
# of unique v = 1 million; v is a very large string.
For a given k, the cumulative size of all v associated with it is about
30 MB (that is, each v is about 25-30 KB).
# of Mappers = 30
# of Reducers = 10

(v,k) is at least 4-5 times faster than (k,v).

-Gyanit
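
A quick back-of-the-envelope check of these numbers, assuming the default
hash partitioner spreads the keys evenly over the 10 reducers (an
illustration of the arithmetic only, not something measured from the job):

// Rough arithmetic from the numbers above; assumes an even hash spread.
public class BackOfEnvelope {
  public static void main(String[] args) {
    long pairs = 1200000L;            // (k,v) pairs in both layouts
    long bytesPerValue = 27L * 1024;  // each heavy payload is roughly 25-30 KB
    int reducers = 10;

    long totalBytes = pairs * bytesPerValue;       // on the order of 30 GB either way
    long bytesPerReducer = totalBytes / reducers;  // roughly 3 GB per reducer

    long lightKeyGroups = 1000L / reducers;        // ~100 sorted groups per reducer
    long heavyKeyGroups = 1000000L / reducers;     // ~100,000 groups per reducer

    System.out.println("bytes per reducer  ~ " + bytesPerReducer);
    System.out.println("groups per reducer ~ " + lightKeyGroups
        + " (light key) vs ~ " + heavyKeyGroups + " (heavy key)");
  }
}

Under that even-spread assumption the bytes per reducer come out about the
same in both layouts; what differs is how many sorted groups each reducer
handles and how large each group is.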


Scott Carey wrote:
> 
> Well, if the smaller keys are producing fewer unique keys, then there
> should be some more significant differences.
> 
> <snip>



Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

Posted by Scott Carey <sc...@richrelevance.com>.
Well, if the smaller keys are producing fewer unique keys, then there should be some more significant differences.

I had assumed that your test produced the same number of unique keys in both cases.

I'm still not sure why there would be that significant a difference, as long as the total number of unique keys in the small-key test is a good deal larger than the number of reducers and there is not too much skew in the bucket sizes.  If a small subset of keys in the small-key test contains a large subset of the values, then the reducers will have very skewed work sizes, and that could explain your observation.
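
One way to test that skew hypothesis without changing the job's output is
to instrument the identity-style reduce with counters and compare the
per-task numbers in the JobTracker UI. A sketch against the old mapred API;
the class name, the counter names, and the key/value types are assumptions,
and it presumes a Hadoop version whose Reporter accepts string-named
counters.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Identity-style reducer that records how many values and bytes it sees,
// so skewed reduce tasks stand out in the per-task counters.
public class SkewCheckingReducer extends MapReduceBase
    implements Reducer<IntWritable, Text, IntWritable, Text> {
  public void reduce(IntWritable key, Iterator<Text> values,
                     OutputCollector<IntWritable, Text> out, Reporter reporter)
      throws IOException {
    long count = 0;
    long bytes = 0;
    while (values.hasNext()) {
      Text v = values.next();
      count++;
      bytes += v.getLength();
      out.collect(key, v);  // plain dump, as in the job being discussed
    }
    reporter.incrCounter("skew-check", "values", count);
    reporter.incrCounter("skew-check", "value-bytes", bytes);
  }
}

If most of the bytes land on one or two reduce tasks in the light-key run,
that would support the skew explanation; if every task sees roughly the
same totals, the cause lies elsewhere.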


On 3/11/09 11:50 AM, "Gyanit" <gy...@gmail.com> wrote:



I noticed one more thing: lighter keys tend to produce a smaller number of
unique keys.

<snip>




Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

Posted by Scott Carey <sc...@richrelevance.com>.
That is a fascinating question.  I would also love to know the reason behind this.

If I had to guess, I would have thought that smaller keys and heavier values would slightly outperform, rather than significantly underperform (assuming the total pair count at each phase is the same).  Perhaps there is room for optimization here?



On 3/10/09 6:44 PM, "Gyanit" <gy...@gmail.com> wrote:



I have a large number of key/value pairs, and I don't actually care whether
the data goes in the value or the key.

<snip>