You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-user@hadoop.apache.org by Jeff Zhang <zj...@gmail.com> on 2009/11/15 05:03:14 UTC

How to handle imbalanced data in hadoop ?

Hi all,

Today there's a problem about imbalanced data come out of mind .

I'd like to know how hadoop handle this kind of data.  e.g. one key
dominates the map output, say 99%. So 99% data set will go to one reducer,
and this reducer will become the bottleneck.

Does hadoop have any other better ways to handle such imbalanced data set ?


Jeff Zhang

Re: How to handle imbalanced data in hadoop ?

Posted by Amogh Vasekar <am...@yahoo-inc.com>.

Hi,
This is the time for all three phases of reducer right?
I think its due to the constant spilling for a single key to disk since the map partitions couldn't be held in-mem due to buffer limit. Did the other reducer have numerous keys with low number of values ( ie smaller partitions? )

Thanks,
Amogh


On 11/18/09 3:37 AM, "Todd Lipcon" <to...@cloudera.com> wrote:

On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <fo...@gmail.com> wrote:

> With respect to Imbalanced data, Can anyone guide me how sorting takes
> place
> in Hadoop after Map phase.
>
> I did some experiments and found that if there are two reducers which have
> same number of keys to sort and one reducer has all the keys same and other
> have different keys then time taken by by the reducer having all keys same
> is terribly large then other one.
>
>
Hi Pankil,

This is an interesting experiment you've done with results that I wouldn't
quite expect. Do you have the java source available that you used to run
this experiment?



> Also I found that length on my Key doesnt matter in the time taken to sort
> it.
>
>
With small keys on CPU-bound workload this is probably the case since the
sort would be dominated by comparison. If you were to benchmark keys that
are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a difference.


> I wanted some hints how sorting is done ..
>

MapTask.java, ReduceTask.java, and Merger.java are the key places to look.
The actual sort is a relatively basic quicksort, but there is plenty of
complexity in the spill/shuffle/merge logic.

-Todd



>
> Pankil
>
> On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <hammer@cloudera.com
> >wrote:
>
> > Hey Jeff,
> >
> > You may be interested in the Skewed Design specification from the Pig
> team:
> > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> >
> > Regards,
> > Jeff
> >
> > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xc...@gmail.com>
> wrote:
> >
> > > My first thought is that it depends on the reduce logic. If you could
> do
> > > the
> > > reduction in two passes then you could do an initial arbitrary
> partition
> > > for
> > > the majority key and bring the partitions together in a second
> reduction
> > > (or
> > > a map-side join). I would use a round robin strategy to assign the
> > > arbitrary
> > > partitions.
> > >
> > >
> > >
> > >
> > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Today there's a problem about imbalanced data come out of mind .
> > > >
> > > > I'd like to know how hadoop handle this kind of data.  e.g. one key
> > > > dominates the map output, say 99%. So 99% data set will go to one
> > > reducer,
> > > > and this reducer will become the bottleneck.
> > > >
> > > > Does hadoop have any other better ways to handle such imbalanced data
> > set
> > > ?
> > > >
> > > >
> > > > Jeff Zhang
> > > >
> > >
> >
>

Re: How to handle imbalanced data in hadoop ?

Posted by Todd Lipcon <to...@cloudera.com>.

Hi Pankil,

Thanks for sending these along. I'll try to block out some time this week to
take a look.

-Todd

On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi <fo...@gmail.com> wrote:

> Hey  Todd,
>
> I will attach dataset and java source used by me. Make sure you use with 10
> reducers and also use partitioner class as I have provided.
>
> Dataset-1 has smaller key length
> Dataset-2 has larger key length
>
> When I experiment with both dataset , According my partitioner class
> Reducer 9 (i.e 10 if start with 1) gets all 100000 keys same and so it take
> maximum amount of time in all reducers.( like 17 mins) where as remaining
> all reducers also get 100000 keys but they all are not same , and these
> reducers get over in (1 min 30 sec on avg).
>
> Pankil
>
>
> On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <fo...@gmail.com>
>> wrote:
>>
>> > With respect to Imbalanced data, Can anyone guide me how sorting takes
>> > place
>> > in Hadoop after Map phase.
>> >
>> > I did some experiments and found that if there are two reducers which
>> have
>> > same number of keys to sort and one reducer has all the keys same and
>> other
>> > have different keys then time taken by by the reducer having all keys
>> same
>> > is terribly large then other one.
>> >
>> >
>> Hi Pankil,
>>
>> This is an interesting experiment you've done with results that I wouldn't
>> quite expect. Do you have the java source available that you used to run
>> this experiment?
>>
>>
>>
>> > Also I found that length on my Key doesnt matter in the time taken to
>> sort
>> > it.
>> >
>> >
>> With small keys on CPU-bound workload this is probably the case since the
>> sort would be dominated by comparison. If you were to benchmark keys that
>> are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a difference.
>>
>>
>> > I wanted some hints how sorting is done ..
>> >
>>
>> MapTask.java, ReduceTask.java, and Merger.java are the key places to look.
>> The actual sort is a relatively basic quicksort, but there is plenty of
>> complexity in the spill/shuffle/merge logic.
>>
>> -Todd
>>
>>
>>
>> >
>> > Pankil
>> >
>> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <hammer@cloudera.com
>> > >wrote:
>> >
>> > > Hey Jeff,
>> > >
>> > > You may be interested in the Skewed Design specification from the Pig
>> > team:
>> > > http://wiki.apache.org/pig/PigSkewedJoinSpec.
>> > >
>> > > Regards,
>> > > Jeff
>> > >
>> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xc...@gmail.com>
>> > wrote:
>> > >
>> > > > My first thought is that it depends on the reduce logic. If you
>> could
>> > do
>> > > > the
>> > > > reduction in two passes then you could do an initial arbitrary
>> > partition
>> > > > for
>> > > > the majority key and bring the partitions together in a second
>> > reduction
>> > > > (or
>> > > > a map-side join). I would use a round robin strategy to assign the
>> > > > arbitrary
>> > > > partitions.
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com>
>> wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > Today there's a problem about imbalanced data come out of mind .
>> > > > >
>> > > > > I'd like to know how hadoop handle this kind of data.  e.g. one
>> key
>> > > > > dominates the map output, say 99%. So 99% data set will go to one
>> > > > reducer,
>> > > > > and this reducer will become the bottleneck.
>> > > > >
>> > > > > Does hadoop have any other better ways to handle such imbalanced
>> data
>> > > set
>> > > > ?
>> > > > >
>> > > > >
>> > > > > Jeff Zhang
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: How to handle imbalanced data in hadoop ?

Posted by Todd Lipcon <to...@cloudera.com>.

Oops - I mistakenly assumed the test Reducer was just some kind of
wordcount-esque summer.

In fact, it has an O(n^2) operation, essentially:

                       sValue += values.next().toString() + '\t';

Appending to a string like this is very slow, and explains why the reducers
that get a lot of the same key are way slower. It's allocating lots of
excess objects, and doing a ton of memory copies.

Switching to a stringbuffer fixed the whole thing:

-               String sValue="";
+        StringBuffer sb = new StringBuffer();

                while (values.hasNext()) {
                        iCount++;
-                       sValue += values.next().toString() + '\t';
+                       sb.append(values.next().toString()).append('\t');

                }

Hope that helps, Pankil.

-Todd


On Mon, Nov 23, 2009 at 9:32 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Interesting. I don't have the 17 minutes issue, but the reducer with the
> identical keys is taking about twice as long as the others.
>
> Looking at counters, most of the tasks have "Reduce shuffle bytes" 0,
> whereas the slow one has reduce shuffle bytes 1,100,006 as expected.
>
> Logs on the slow one:
>
> 2009-11-23 21:20:48,524 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1100002 bytes
> 2009-11-23 21:22:53,763 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200911232110_0007_r_000009_0 is done. And is in the process of commiting
>
>
> On the fast one:
>
> 2009-11-23 21:20:45,370 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1100002 bytes
> 2009-11-23 21:21:27,535 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200911232110_0007_r_000008_0 is done. And is in the process of commiting
>
>
> So something about the merge is slow when all keys are identical, perhaps.
> Which is strange, because this isn't even much of a "merge" - there was only
> one mapper.
>
> I have a plane flight tonight, will try to take a deeper look.
>
> -Todd
>
>
>
> On Wed, Nov 18, 2009 at 2:53 PM, Pankil Doshi <fo...@gmail.com> wrote:
>
>> Ya thats true. Though it depends on my cluster configuration but still
>> other
>> reducers (0 to 8) also have 100000 keys to handle in which keys are
>> different and they get done in (1 min 30 sec on avg).
>> But 9th reducer getting all 100000 keys same takes 17 mins.
>>
>> Pankil
>>
>> On Wed, Nov 18, 2009 at 3:34 PM, Runping Qi <ru...@gmail.com> wrote:
>>
>> > Is it true that most of the 17 minutes for the reducer with the 100000
>> same
>> > keys was taken by the sort phase? If so, that means that the sorting
>> > algorithm does not handle the special case well.
>> >
>> >
>> > On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi <fo...@gmail.com>
>> > wrote:
>> >
>> > > Hey  Todd,
>> > >
>> > > I will attach dataset and java source used by me. Make sure you use
>> with
>> > 10
>> > > reducers and also use partitioner class as I have provided.
>> > >
>> > > Dataset-1 has smaller key length
>> > > Dataset-2 has larger key length
>> > >
>> > > When I experiment with both dataset , According my partitioner class
>> > > Reducer 9 (i.e 10 if start with 1) gets all 100000 keys same and so it
>> > take
>> > > maximum amount of time in all reducers.( like 17 mins) where as
>> remaining
>> > > all reducers also get 100000 keys but they all are not same , and
>> these
>> > > reducers get over in (1 min 30 sec on avg).
>> > >
>> > > Pankil
>> > >
>> > >
>> > > On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <to...@cloudera.com>
>> wrote:
>> > >
>> > >> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <fo...@gmail.com>
>> > >> wrote:
>> > >>
>> > >> > With respect to Imbalanced data, Can anyone guide me how sorting
>> takes
>> > >> > place
>> > >> > in Hadoop after Map phase.
>> > >> >
>> > >> > I did some experiments and found that if there are two reducers
>> which
>> > >> have
>> > >> > same number of keys to sort and one reducer has all the keys same
>> and
>> > >> other
>> > >> > have different keys then time taken by by the reducer having all
>> keys
>> > >> same
>> > >> > is terribly large then other one.
>> > >> >
>> > >> >
>> > >> Hi Pankil,
>> > >>
>> > >> This is an interesting experiment you've done with results that I
>> > wouldn't
>> > >> quite expect. Do you have the java source available that you used to
>> run
>> > >> this experiment?
>> > >>
>> > >>
>> > >>
>> > >> > Also I found that length on my Key doesnt matter in the time taken
>> to
>> > >> sort
>> > >> > it.
>> > >> >
>> > >> >
>> > >> With small keys on CPU-bound workload this is probably the case since
>> > the
>> > >> sort would be dominated by comparison. If you were to benchmark keys
>> > that
>> > >> are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a
>> > difference.
>> > >>
>> > >>
>> > >> > I wanted some hints how sorting is done ..
>> > >> >
>> > >>
>> > >> MapTask.java, ReduceTask.java, and Merger.java are the key places to
>> > look.
>> > >> The actual sort is a relatively basic quicksort, but there is plenty
>> of
>> > >> complexity in the spill/shuffle/merge logic.
>> > >>
>> > >> -Todd
>> > >>
>> > >>
>> > >>
>> > >> >
>> > >> > Pankil
>> > >> >
>> > >> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <
>> > hammer@cloudera.com
>> > >> > >wrote:
>> > >> >
>> > >> > > Hey Jeff,
>> > >> > >
>> > >> > > You may be interested in the Skewed Design specification from the
>> > Pig
>> > >> > team:
>> > >> > > http://wiki.apache.org/pig/PigSkewedJoinSpec.
>> > >> > >
>> > >> > > Regards,
>> > >> > > Jeff
>> > >> > >
>> > >> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <
>> xcolwell@gmail.com>
>> > >> > wrote:
>> > >> > >
>> > >> > > > My first thought is that it depends on the reduce logic. If you
>> > >> could
>> > >> > do
>> > >> > > > the
>> > >> > > > reduction in two passes then you could do an initial arbitrary
>> > >> > partition
>> > >> > > > for
>> > >> > > > the majority key and bring the partitions together in a second
>> > >> > reduction
>> > >> > > > (or
>> > >> > > > a map-side join). I would use a round robin strategy to assign
>> the
>> > >> > > > arbitrary
>> > >> > > > partitions.
>> > >> > > >
>> > >> > > >
>> > >> > > >
>> > >> > > >
>> > >> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zjffdu@gmail.com
>> >
>> > >> wrote:
>> > >> > > >
>> > >> > > > > Hi all,
>> > >> > > > >
>> > >> > > > > Today there's a problem about imbalanced data come out of
>> mind .
>> > >> > > > >
>> > >> > > > > I'd like to know how hadoop handle this kind of data.  e.g.
>> one
>> > >> key
>> > >> > > > > dominates the map output, say 99%. So 99% data set will go to
>> > one
>> > >> > > > reducer,
>> > >> > > > > and this reducer will become the bottleneck.
>> > >> > > > >
>> > >> > > > > Does hadoop have any other better ways to handle such
>> imbalanced
>> > >> data
>> > >> > > set
>> > >> > > > ?
>> > >> > > > >
>> > >> > > > >
>> > >> > > > > Jeff Zhang
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Re: How to handle imbalanced data in hadoop ?

Posted by Todd Lipcon <to...@cloudera.com>.

Interesting. I don't have the 17 minutes issue, but the reducer with the
identical keys is taking about twice as long as the others.

Looking at counters, most of the tasks have "Reduce shuffle bytes" 0,
whereas the slow one has reduce shuffle bytes 1,100,006 as expected.

Logs on the slow one:

2009-11-23 21:20:48,524 INFO org.apache.hadoop.mapred.Merger: Down to
the last merge-pass, with 1 segments left of total size: 1100002 bytes
2009-11-23 21:22:53,763 INFO org.apache.hadoop.mapred.TaskRunner:
Task:attempt_200911232110_0007_r_000009_0 is done. And is in the
process of commiting


On the fast one:

2009-11-23 21:20:45,370 INFO org.apache.hadoop.mapred.Merger: Down to
the last merge-pass, with 1 segments left of total size: 1100002 bytes
2009-11-23 21:21:27,535 INFO org.apache.hadoop.mapred.TaskRunner:
Task:attempt_200911232110_0007_r_000008_0 is done. And is in the
process of commiting


So something about the merge is slow when all keys are identical, perhaps.
Which is strange, because this isn't even much of a "merge" - there was only
one mapper.

I have a plane flight tonight, will try to take a deeper look.

-Todd


On Wed, Nov 18, 2009 at 2:53 PM, Pankil Doshi <fo...@gmail.com> wrote:

> Ya thats true. Though it depends on my cluster configuration but still
> other
> reducers (0 to 8) also have 100000 keys to handle in which keys are
> different and they get done in (1 min 30 sec on avg).
> But 9th reducer getting all 100000 keys same takes 17 mins.
>
> Pankil
>
> On Wed, Nov 18, 2009 at 3:34 PM, Runping Qi <ru...@gmail.com> wrote:
>
> > Is it true that most of the 17 minutes for the reducer with the 100000
> same
> > keys was taken by the sort phase? If so, that means that the sorting
> > algorithm does not handle the special case well.
> >
> >
> > On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi <fo...@gmail.com>
> > wrote:
> >
> > > Hey  Todd,
> > >
> > > I will attach dataset and java source used by me. Make sure you use
> with
> > 10
> > > reducers and also use partitioner class as I have provided.
> > >
> > > Dataset-1 has smaller key length
> > > Dataset-2 has larger key length
> > >
> > > When I experiment with both dataset , According my partitioner class
> > > Reducer 9 (i.e 10 if start with 1) gets all 100000 keys same and so it
> > take
> > > maximum amount of time in all reducers.( like 17 mins) where as
> remaining
> > > all reducers also get 100000 keys but they all are not same , and these
> > > reducers get over in (1 min 30 sec on avg).
> > >
> > > Pankil
> > >
> > >
> > > On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <to...@cloudera.com>
> wrote:
> > >
> > >> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <fo...@gmail.com>
> > >> wrote:
> > >>
> > >> > With respect to Imbalanced data, Can anyone guide me how sorting
> takes
> > >> > place
> > >> > in Hadoop after Map phase.
> > >> >
> > >> > I did some experiments and found that if there are two reducers
> which
> > >> have
> > >> > same number of keys to sort and one reducer has all the keys same
> and
> > >> other
> > >> > have different keys then time taken by by the reducer having all
> keys
> > >> same
> > >> > is terribly large then other one.
> > >> >
> > >> >
> > >> Hi Pankil,
> > >>
> > >> This is an interesting experiment you've done with results that I
> > wouldn't
> > >> quite expect. Do you have the java source available that you used to
> run
> > >> this experiment?
> > >>
> > >>
> > >>
> > >> > Also I found that length on my Key doesnt matter in the time taken
> to
> > >> sort
> > >> > it.
> > >> >
> > >> >
> > >> With small keys on CPU-bound workload this is probably the case since
> > the
> > >> sort would be dominated by comparison. If you were to benchmark keys
> > that
> > >> are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a
> > difference.
> > >>
> > >>
> > >> > I wanted some hints how sorting is done ..
> > >> >
> > >>
> > >> MapTask.java, ReduceTask.java, and Merger.java are the key places to
> > look.
> > >> The actual sort is a relatively basic quicksort, but there is plenty
> of
> > >> complexity in the spill/shuffle/merge logic.
> > >>
> > >> -Todd
> > >>
> > >>
> > >>
> > >> >
> > >> > Pankil
> > >> >
> > >> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <
> > hammer@cloudera.com
> > >> > >wrote:
> > >> >
> > >> > > Hey Jeff,
> > >> > >
> > >> > > You may be interested in the Skewed Design specification from the
> > Pig
> > >> > team:
> > >> > > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> > >> > >
> > >> > > Regards,
> > >> > > Jeff
> > >> > >
> > >> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <
> xcolwell@gmail.com>
> > >> > wrote:
> > >> > >
> > >> > > > My first thought is that it depends on the reduce logic. If you
> > >> could
> > >> > do
> > >> > > > the
> > >> > > > reduction in two passes then you could do an initial arbitrary
> > >> > partition
> > >> > > > for
> > >> > > > the majority key and bring the partitions together in a second
> > >> > reduction
> > >> > > > (or
> > >> > > > a map-side join). I would use a round robin strategy to assign
> the
> > >> > > > arbitrary
> > >> > > > partitions.
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com>
> > >> wrote:
> > >> > > >
> > >> > > > > Hi all,
> > >> > > > >
> > >> > > > > Today there's a problem about imbalanced data come out of mind
> .
> > >> > > > >
> > >> > > > > I'd like to know how hadoop handle this kind of data.  e.g.
> one
> > >> key
> > >> > > > > dominates the map output, say 99%. So 99% data set will go to
> > one
> > >> > > > reducer,
> > >> > > > > and this reducer will become the bottleneck.
> > >> > > > >
> > >> > > > > Does hadoop have any other better ways to handle such
> imbalanced
> > >> data
> > >> > > set
> > >> > > > ?
> > >> > > > >
> > >> > > > >
> > >> > > > > Jeff Zhang
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: How to handle imbalanced data in hadoop ?

Posted by Pankil Doshi <fo...@gmail.com>.

Ya thats true. Though it depends on my cluster configuration but still other
reducers (0 to 8) also have 100000 keys to handle in which keys are
different and they get done in (1 min 30 sec on avg).
But 9th reducer getting all 100000 keys same takes 17 mins.

Pankil

On Wed, Nov 18, 2009 at 3:34 PM, Runping Qi <ru...@gmail.com> wrote:

> Is it true that most of the 17 minutes for the reducer with the 100000 same
> keys was taken by the sort phase? If so, that means that the sorting
> algorithm does not handle the special case well.
>
>
> On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi <fo...@gmail.com>
> wrote:
>
> > Hey  Todd,
> >
> > I will attach dataset and java source used by me. Make sure you use with
> 10
> > reducers and also use partitioner class as I have provided.
> >
> > Dataset-1 has smaller key length
> > Dataset-2 has larger key length
> >
> > When I experiment with both dataset , According my partitioner class
> > Reducer 9 (i.e 10 if start with 1) gets all 100000 keys same and so it
> take
> > maximum amount of time in all reducers.( like 17 mins) where as remaining
> > all reducers also get 100000 keys but they all are not same , and these
> > reducers get over in (1 min 30 sec on avg).
> >
> > Pankil
> >
> >
> > On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <to...@cloudera.com> wrote:
> >
> >> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <fo...@gmail.com>
> >> wrote:
> >>
> >> > With respect to Imbalanced data, Can anyone guide me how sorting takes
> >> > place
> >> > in Hadoop after Map phase.
> >> >
> >> > I did some experiments and found that if there are two reducers which
> >> have
> >> > same number of keys to sort and one reducer has all the keys same and
> >> other
> >> > have different keys then time taken by by the reducer having all keys
> >> same
> >> > is terribly large then other one.
> >> >
> >> >
> >> Hi Pankil,
> >>
> >> This is an interesting experiment you've done with results that I
> wouldn't
> >> quite expect. Do you have the java source available that you used to run
> >> this experiment?
> >>
> >>
> >>
> >> > Also I found that length on my Key doesnt matter in the time taken to
> >> sort
> >> > it.
> >> >
> >> >
> >> With small keys on CPU-bound workload this is probably the case since
> the
> >> sort would be dominated by comparison. If you were to benchmark keys
> that
> >> are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a
> difference.
> >>
> >>
> >> > I wanted some hints how sorting is done ..
> >> >
> >>
> >> MapTask.java, ReduceTask.java, and Merger.java are the key places to
> look.
> >> The actual sort is a relatively basic quicksort, but there is plenty of
> >> complexity in the spill/shuffle/merge logic.
> >>
> >> -Todd
> >>
> >>
> >>
> >> >
> >> > Pankil
> >> >
> >> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <
> hammer@cloudera.com
> >> > >wrote:
> >> >
> >> > > Hey Jeff,
> >> > >
> >> > > You may be interested in the Skewed Design specification from the
> Pig
> >> > team:
> >> > > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> >> > >
> >> > > Regards,
> >> > > Jeff
> >> > >
> >> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xc...@gmail.com>
> >> > wrote:
> >> > >
> >> > > > My first thought is that it depends on the reduce logic. If you
> >> could
> >> > do
> >> > > > the
> >> > > > reduction in two passes then you could do an initial arbitrary
> >> > partition
> >> > > > for
> >> > > > the majority key and bring the partitions together in a second
> >> > reduction
> >> > > > (or
> >> > > > a map-side join). I would use a round robin strategy to assign the
> >> > > > arbitrary
> >> > > > partitions.
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com>
> >> wrote:
> >> > > >
> >> > > > > Hi all,
> >> > > > >
> >> > > > > Today there's a problem about imbalanced data come out of mind .
> >> > > > >
> >> > > > > I'd like to know how hadoop handle this kind of data.  e.g. one
> >> key
> >> > > > > dominates the map output, say 99%. So 99% data set will go to
> one
> >> > > > reducer,
> >> > > > > and this reducer will become the bottleneck.
> >> > > > >
> >> > > > > Does hadoop have any other better ways to handle such imbalanced
> >> data
> >> > > set
> >> > > > ?
> >> > > > >
> >> > > > >
> >> > > > > Jeff Zhang
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: How to handle imbalanced data in hadoop ?

Posted by Runping Qi <ru...@gmail.com>.

Is it true that most of the 17 minutes for the reducer with the 100000 same
keys was taken by the sort phase? If so, that means that the sorting
algorithm does not handle the special case well.


On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi <fo...@gmail.com> wrote:

> Hey  Todd,
>
> I will attach dataset and java source used by me. Make sure you use with 10
> reducers and also use partitioner class as I have provided.
>
> Dataset-1 has smaller key length
> Dataset-2 has larger key length
>
> When I experiment with both dataset , According my partitioner class
> Reducer 9 (i.e 10 if start with 1) gets all 100000 keys same and so it take
> maximum amount of time in all reducers.( like 17 mins) where as remaining
> all reducers also get 100000 keys but they all are not same , and these
> reducers get over in (1 min 30 sec on avg).
>
> Pankil
>
>
> On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <fo...@gmail.com>
>> wrote:
>>
>> > With respect to Imbalanced data, Can anyone guide me how sorting takes
>> > place
>> > in Hadoop after Map phase.
>> >
>> > I did some experiments and found that if there are two reducers which
>> have
>> > same number of keys to sort and one reducer has all the keys same and
>> other
>> > have different keys then time taken by by the reducer having all keys
>> same
>> > is terribly large then other one.
>> >
>> >
>> Hi Pankil,
>>
>> This is an interesting experiment you've done with results that I wouldn't
>> quite expect. Do you have the java source available that you used to run
>> this experiment?
>>
>>
>>
>> > Also I found that length on my Key doesnt matter in the time taken to
>> sort
>> > it.
>> >
>> >
>> With small keys on CPU-bound workload this is probably the case since the
>> sort would be dominated by comparison. If you were to benchmark keys that
>> are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a difference.
>>
>>
>> > I wanted some hints how sorting is done ..
>> >
>>
>> MapTask.java, ReduceTask.java, and Merger.java are the key places to look.
>> The actual sort is a relatively basic quicksort, but there is plenty of
>> complexity in the spill/shuffle/merge logic.
>>
>> -Todd
>>
>>
>>
>> >
>> > Pankil
>> >
>> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <hammer@cloudera.com
>> > >wrote:
>> >
>> > > Hey Jeff,
>> > >
>> > > You may be interested in the Skewed Design specification from the Pig
>> > team:
>> > > http://wiki.apache.org/pig/PigSkewedJoinSpec.
>> > >
>> > > Regards,
>> > > Jeff
>> > >
>> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xc...@gmail.com>
>> > wrote:
>> > >
>> > > > My first thought is that it depends on the reduce logic. If you
>> could
>> > do
>> > > > the
>> > > > reduction in two passes then you could do an initial arbitrary
>> > partition
>> > > > for
>> > > > the majority key and bring the partitions together in a second
>> > reduction
>> > > > (or
>> > > > a map-side join). I would use a round robin strategy to assign the
>> > > > arbitrary
>> > > > partitions.
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com>
>> wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > Today there's a problem about imbalanced data come out of mind .
>> > > > >
>> > > > > I'd like to know how hadoop handle this kind of data.  e.g. one
>> key
>> > > > > dominates the map output, say 99%. So 99% data set will go to one
>> > > > reducer,
>> > > > > and this reducer will become the bottleneck.
>> > > > >
>> > > > > Does hadoop have any other better ways to handle such imbalanced
>> data
>> > > set
>> > > > ?
>> > > > >
>> > > > >
>> > > > > Jeff Zhang
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: How to handle imbalanced data in hadoop ?

Posted by Pankil Doshi <fo...@gmail.com>.

Hey  Todd,

I will attach dataset and java source used by me. Make sure you use with 10
reducers and also use partitioner class as I have provided.

Dataset-1 has smaller key length
Dataset-2 has larger key length

When I experiment with both dataset , According my partitioner class Reducer
9 (i.e 10 if start with 1) gets all 100000 keys same and so it take maximum
amount of time in all reducers.( like 17 mins) where as remaining all
reducers also get 100000 keys but they all are not same , and these reducers
get over in (1 min 30 sec on avg).

Pankil

On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <to...@cloudera.com> wrote:

> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <fo...@gmail.com> wrote:
>
> > With respect to Imbalanced data, Can anyone guide me how sorting takes
> > place
> > in Hadoop after Map phase.
> >
> > I did some experiments and found that if there are two reducers which
> have
> > same number of keys to sort and one reducer has all the keys same and
> other
> > have different keys then time taken by by the reducer having all keys
> same
> > is terribly large then other one.
> >
> >
> Hi Pankil,
>
> This is an interesting experiment you've done with results that I wouldn't
> quite expect. Do you have the java source available that you used to run
> this experiment?
>
>
>
> > Also I found that length on my Key doesnt matter in the time taken to
> sort
> > it.
> >
> >
> With small keys on CPU-bound workload this is probably the case since the
> sort would be dominated by comparison. If you were to benchmark keys that
> are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a difference.
>
>
> > I wanted some hints how sorting is done ..
> >
>
> MapTask.java, ReduceTask.java, and Merger.java are the key places to look.
> The actual sort is a relatively basic quicksort, but there is plenty of
> complexity in the spill/shuffle/merge logic.
>
> -Todd
>
>
>
> >
> > Pankil
> >
> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <hammer@cloudera.com
> > >wrote:
> >
> > > Hey Jeff,
> > >
> > > You may be interested in the Skewed Design specification from the Pig
> > team:
> > > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> > >
> > > Regards,
> > > Jeff
> > >
> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xc...@gmail.com>
> > wrote:
> > >
> > > > My first thought is that it depends on the reduce logic. If you could
> > do
> > > > the
> > > > reduction in two passes then you could do an initial arbitrary
> > partition
> > > > for
> > > > the majority key and bring the partitions together in a second
> > reduction
> > > > (or
> > > > a map-side join). I would use a round robin strategy to assign the
> > > > arbitrary
> > > > partitions.
> > > >
> > > >
> > > >
> > > >
> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com>
> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Today there's a problem about imbalanced data come out of mind .
> > > > >
> > > > > I'd like to know how hadoop handle this kind of data.  e.g. one key
> > > > > dominates the map output, say 99%. So 99% data set will go to one
> > > > reducer,
> > > > > and this reducer will become the bottleneck.
> > > > >
> > > > > Does hadoop have any other better ways to handle such imbalanced
> data
> > > set
> > > > ?
> > > > >
> > > > >
> > > > > Jeff Zhang
> > > > >
> > > >
> > >
> >
>

Re: How to handle imbalanced data in hadoop ?

Posted by Todd Lipcon <to...@cloudera.com>.

On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <fo...@gmail.com> wrote:

> With respect to Imbalanced data, Can anyone guide me how sorting takes
> place
> in Hadoop after Map phase.
>
> I did some experiments and found that if there are two reducers which have
> same number of keys to sort and one reducer has all the keys same and other
> have different keys then time taken by by the reducer having all keys same
> is terribly large then other one.
>
>
Hi Pankil,

This is an interesting experiment you've done with results that I wouldn't
quite expect. Do you have the java source available that you used to run
this experiment?



> Also I found that length on my Key doesnt matter in the time taken to sort
> it.
>
>
With small keys on CPU-bound workload this is probably the case since the
sort would be dominated by comparison. If you were to benchmark keys that
are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a difference.


> I wanted some hints how sorting is done ..
>

MapTask.java, ReduceTask.java, and Merger.java are the key places to look.
The actual sort is a relatively basic quicksort, but there is plenty of
complexity in the spill/shuffle/merge logic.

-Todd



>
> Pankil
>
> On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <hammer@cloudera.com
> >wrote:
>
> > Hey Jeff,
> >
> > You may be interested in the Skewed Design specification from the Pig
> team:
> > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> >
> > Regards,
> > Jeff
> >
> > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xc...@gmail.com>
> wrote:
> >
> > > My first thought is that it depends on the reduce logic. If you could
> do
> > > the
> > > reduction in two passes then you could do an initial arbitrary
> partition
> > > for
> > > the majority key and bring the partitions together in a second
> reduction
> > > (or
> > > a map-side join). I would use a round robin strategy to assign the
> > > arbitrary
> > > partitions.
> > >
> > >
> > >
> > >
> > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Today there's a problem about imbalanced data come out of mind .
> > > >
> > > > I'd like to know how hadoop handle this kind of data.  e.g. one key
> > > > dominates the map output, say 99%. So 99% data set will go to one
> > > reducer,
> > > > and this reducer will become the bottleneck.
> > > >
> > > > Does hadoop have any other better ways to handle such imbalanced data
> > set
> > > ?
> > > >
> > > >
> > > > Jeff Zhang
> > > >
> > >
> >
>

Re: How to handle imbalanced data in hadoop ?

Posted by Pankil Doshi <fo...@gmail.com>.

Thanks a lot Todd.

Its very good suggestion. Will change my reducer code and see how does it
goes.

Pankil


On Tue, Nov 24, 2009 at 11:18 AM, Todd Lipcon <to...@cloudera.com> wrote:

> I don't disagree, but as I noted above, the majority of the issue Pankil
> saw
> was an inefficient reducer that was extremely slow for large value sets for
> a given key. When I fixed that issue, the sort time was within the noise
> margin between the "all the same" and "9 different values" reducers.
>
> -Todd
>
> On Tue, Nov 24, 2009 at 8:09 AM, Jeff Zhang <zj...@gmail.com> wrote:
>
> > agree, the worst case for quick sort is O(n2) if all the data is the
> same.
> >
> > and in this case the running for heapsort is O(n*log n)
> >
> > Jeff Zhang
> >
> >
> > On Tue, Nov 24, 2009 at 7:35 AM, Ted Xu <te...@gmail.com> wrote:
> >
> > > 2009/11/18 Pankil Doshi <fo...@gmail.com>
> > >
> > > > With respect to Imbalanced data, Can anyone guide me how sorting
> takes
> > > > place
> > > > in Hadoop after Map phase.
> > > >
> > > > I did some experiments and found that if there are two reducers which
> > > have
> > > > same number of keys to sort and one reducer has all the keys same and
> > > other
> > > > have different keys then time taken by by the reducer having all keys
> > > same
> > > > is terribly large then other one.
> > > >
> > >
> > > Maybe that is caused by the sorting algorithm.
> > >
> > > It is make sence if we consider the default sort algorithm is quick
> sort.
> > > In quick sort if
> > > all the sorting data is quite the same, sort iteration will be very
> deep.
> > >
> > > Would you please try changing the job configuration "map.sort.class" to
> > be
> > > org.apache.hadoop.util.HeapSort?
> > >
> > >
> > > > Also I found that length on my Key doesnt matter in the time taken to
> > > sort
> > > > it.
> > > >
> > > > I wanted some hints how sorting is done ..
> > > >
> > > > Pankil
> > > >
> > > > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <
> > hammer@cloudera.com
> > > > >wrote:
> > > >
> > > > > Hey Jeff,
> > > > >
> > > > > You may be interested in the Skewed Design specification from the
> Pig
> > > > team:
> > > > > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> > > > >
> > > > > Regards,
> > > > > Jeff
> > > > >
> > > > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xcolwell@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > > My first thought is that it depends on the reduce logic. If you
> > could
> > > > do
> > > > > > the
> > > > > > reduction in two passes then you could do an initial arbitrary
> > > > partition
> > > > > > for
> > > > > > the majority key and bring the partitions together in a second
> > > > reduction
> > > > > > (or
> > > > > > a map-side join). I would use a round robin strategy to assign
> the
> > > > > > arbitrary
> > > > > > partitions.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Today there's a problem about imbalanced data come out of mind
> .
> > > > > > >
> > > > > > > I'd like to know how hadoop handle this kind of data.  e.g. one
> > key
> > > > > > > dominates the map output, say 99%. So 99% data set will go to
> one
> > > > > > reducer,
> > > > > > > and this reducer will become the bottleneck.
> > > > > > >
> > > > > > > Does hadoop have any other better ways to handle such
> imbalanced
> > > data
> > > > > set
> > > > > > ?
> > > > > > >
> > > > > > >
> > > > > > > Jeff Zhang
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: How to handle imbalanced data in hadoop ?

Posted by Todd Lipcon <to...@cloudera.com>.

I don't disagree, but as I noted above, the majority of the issue Pankil saw
was an inefficient reducer that was extremely slow for large value sets for
a given key. When I fixed that issue, the sort time was within the noise
margin between the "all the same" and "9 different values" reducers.

-Todd

On Tue, Nov 24, 2009 at 8:09 AM, Jeff Zhang <zj...@gmail.com> wrote:

> agree, the worst case for quick sort is O(n2) if all the data is the same.
>
> and in this case the running for heapsort is O(n*log n)
>
> Jeff Zhang
>
>
> On Tue, Nov 24, 2009 at 7:35 AM, Ted Xu <te...@gmail.com> wrote:
>
> > 2009/11/18 Pankil Doshi <fo...@gmail.com>
> >
> > > With respect to Imbalanced data, Can anyone guide me how sorting takes
> > > place
> > > in Hadoop after Map phase.
> > >
> > > I did some experiments and found that if there are two reducers which
> > have
> > > same number of keys to sort and one reducer has all the keys same and
> > other
> > > have different keys then time taken by by the reducer having all keys
> > same
> > > is terribly large then other one.
> > >
> >
> > Maybe that is caused by the sorting algorithm.
> >
> > It is make sence if we consider the default sort algorithm is quick sort.
> > In quick sort if
> > all the sorting data is quite the same, sort iteration will be very deep.
> >
> > Would you please try changing the job configuration "map.sort.class" to
> be
> > org.apache.hadoop.util.HeapSort?
> >
> >
> > > Also I found that length on my Key doesnt matter in the time taken to
> > sort
> > > it.
> > >
> > > I wanted some hints how sorting is done ..
> > >
> > > Pankil
> > >
> > > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <
> hammer@cloudera.com
> > > >wrote:
> > >
> > > > Hey Jeff,
> > > >
> > > > You may be interested in the Skewed Design specification from the Pig
> > > team:
> > > > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> > > >
> > > > Regards,
> > > > Jeff
> > > >
> > > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xc...@gmail.com>
> > > wrote:
> > > >
> > > > > My first thought is that it depends on the reduce logic. If you
> could
> > > do
> > > > > the
> > > > > reduction in two passes then you could do an initial arbitrary
> > > partition
> > > > > for
> > > > > the majority key and bring the partitions together in a second
> > > reduction
> > > > > (or
> > > > > a map-side join). I would use a round robin strategy to assign the
> > > > > arbitrary
> > > > > partitions.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Today there's a problem about imbalanced data come out of mind .
> > > > > >
> > > > > > I'd like to know how hadoop handle this kind of data.  e.g. one
> key
> > > > > > dominates the map output, say 99%. So 99% data set will go to one
> > > > > reducer,
> > > > > > and this reducer will become the bottleneck.
> > > > > >
> > > > > > Does hadoop have any other better ways to handle such imbalanced
> > data
> > > > set
> > > > > ?
> > > > > >
> > > > > >
> > > > > > Jeff Zhang
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: How to handle imbalanced data in hadoop ?

Posted by Jeff Zhang <zj...@gmail.com>.

agree, the worst case for quick sort is O(n2) if all the data is the same.

and in this case the running for heapsort is O(n*log n)

Jeff Zhang


On Tue, Nov 24, 2009 at 7:35 AM, Ted Xu <te...@gmail.com> wrote:

> 2009/11/18 Pankil Doshi <fo...@gmail.com>
>
> > With respect to Imbalanced data, Can anyone guide me how sorting takes
> > place
> > in Hadoop after Map phase.
> >
> > I did some experiments and found that if there are two reducers which
> have
> > same number of keys to sort and one reducer has all the keys same and
> other
> > have different keys then time taken by by the reducer having all keys
> same
> > is terribly large then other one.
> >
>
> Maybe that is caused by the sorting algorithm.
>
> It is make sence if we consider the default sort algorithm is quick sort.
> In quick sort if
> all the sorting data is quite the same, sort iteration will be very deep.
>
> Would you please try changing the job configuration "map.sort.class" to be
> org.apache.hadoop.util.HeapSort?
>
>
> > Also I found that length on my Key doesnt matter in the time taken to
> sort
> > it.
> >
> > I wanted some hints how sorting is done ..
> >
> > Pankil
> >
> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <hammer@cloudera.com
> > >wrote:
> >
> > > Hey Jeff,
> > >
> > > You may be interested in the Skewed Design specification from the Pig
> > team:
> > > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> > >
> > > Regards,
> > > Jeff
> > >
> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xc...@gmail.com>
> > wrote:
> > >
> > > > My first thought is that it depends on the reduce logic. If you could
> > do
> > > > the
> > > > reduction in two passes then you could do an initial arbitrary
> > partition
> > > > for
> > > > the majority key and bring the partitions together in a second
> > reduction
> > > > (or
> > > > a map-side join). I would use a round robin strategy to assign the
> > > > arbitrary
> > > > partitions.
> > > >
> > > >
> > > >
> > > >
> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com>
> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Today there's a problem about imbalanced data come out of mind .
> > > > >
> > > > > I'd like to know how hadoop handle this kind of data.  e.g. one key
> > > > > dominates the map output, say 99%. So 99% data set will go to one
> > > > reducer,
> > > > > and this reducer will become the bottleneck.
> > > > >
> > > > > Does hadoop have any other better ways to handle such imbalanced
> data
> > > set
> > > > ?
> > > > >
> > > > >
> > > > > Jeff Zhang
> > > > >
> > > >
> > >
> >
>

Re: How to handle imbalanced data in hadoop ?

Posted by Ted Xu <te...@gmail.com>.

2009/11/18 Pankil Doshi <fo...@gmail.com>

> With respect to Imbalanced data, Can anyone guide me how sorting takes
> place
> in Hadoop after Map phase.
>
> I did some experiments and found that if there are two reducers which have
> same number of keys to sort and one reducer has all the keys same and other
> have different keys then time taken by by the reducer having all keys same
> is terribly large then other one.
>

Maybe that is caused by the sorting algorithm.

It is make sence if we consider the default sort algorithm is quick sort.
In quick sort if
all the sorting data is quite the same, sort iteration will be very deep.

Would you please try changing the job configuration "map.sort.class" to be
org.apache.hadoop.util.HeapSort?


> Also I found that length on my Key doesnt matter in the time taken to sort
> it.
>
> I wanted some hints how sorting is done ..
>
> Pankil
>
> On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <hammer@cloudera.com
> >wrote:
>
> > Hey Jeff,
> >
> > You may be interested in the Skewed Design specification from the Pig
> team:
> > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> >
> > Regards,
> > Jeff
> >
> > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xc...@gmail.com>
> wrote:
> >
> > > My first thought is that it depends on the reduce logic. If you could
> do
> > > the
> > > reduction in two passes then you could do an initial arbitrary
> partition
> > > for
> > > the majority key and bring the partitions together in a second
> reduction
> > > (or
> > > a map-side join). I would use a round robin strategy to assign the
> > > arbitrary
> > > partitions.
> > >
> > >
> > >
> > >
> > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Today there's a problem about imbalanced data come out of mind .
> > > >
> > > > I'd like to know how hadoop handle this kind of data.  e.g. one key
> > > > dominates the map output, say 99%. So 99% data set will go to one
> > > reducer,
> > > > and this reducer will become the bottleneck.
> > > >
> > > > Does hadoop have any other better ways to handle such imbalanced data
> > set
> > > ?
> > > >
> > > >
> > > > Jeff Zhang
> > > >
> > >
> >
>

Re: How to handle imbalanced data in hadoop ?

Posted by Ted Yu <yu...@gmail.com>.

Can someone fix the typo on http://wiki.apache.org/pig/PigSkewedJoinSpec in
the first bullet ?
tow-table inner join

Thanks

On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <fo...@gmail.com> wrote:

> With respect to Imbalanced data, Can anyone guide me how sorting takes
> place
> in Hadoop after Map phase.
>
> I did some experiments and found that if there are two reducers which have
> same number of keys to sort and one reducer has all the keys same and other
> have different keys then time taken by by the reducer having all keys same
> is terribly large then other one.
>
> Also I found that length on my Key doesnt matter in the time taken to sort
> it.
>
> I wanted some hints how sorting is done ..
>
> Pankil
>
> On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <hammer@cloudera.com
> >wrote:
>
> > Hey Jeff,
> >
> > You may be interested in the Skewed Design specification from the Pig
> team:
> > http://wiki.apache.org/pig/PigSkewedJoinSpec.
> >
> > Regards,
> > Jeff
> >
> > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xc...@gmail.com>
> wrote:
> >
> > > My first thought is that it depends on the reduce logic. If you could
> do
> > > the
> > > reduction in two passes then you could do an initial arbitrary
> partition
> > > for
> > > the majority key and bring the partitions together in a second
> reduction
> > > (or
> > > a map-side join). I would use a round robin strategy to assign the
> > > arbitrary
> > > partitions.
> > >
> > >
> > >
> > >
> > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Today there's a problem about imbalanced data come out of mind .
> > > >
> > > > I'd like to know how hadoop handle this kind of data.  e.g. one key
> > > > dominates the map output, say 99%. So 99% data set will go to one
> > > reducer,
> > > > and this reducer will become the bottleneck.
> > > >
> > > > Does hadoop have any other better ways to handle such imbalanced data
> > set
> > > ?
> > > >
> > > >
> > > > Jeff Zhang
> > > >
> > >
> >
>

Re: How to handle imbalanced data in hadoop ?

Posted by Pankil Doshi <fo...@gmail.com>.

With respect to Imbalanced data, Can anyone guide me how sorting takes place
in Hadoop after Map phase.

I did some experiments and found that if there are two reducers which have
same number of keys to sort and one reducer has all the keys same and other
have different keys then time taken by by the reducer having all keys same
is terribly large then other one.

Also I found that length on my Key doesnt matter in the time taken to sort
it.

I wanted some hints how sorting is done ..

Pankil

On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <ha...@cloudera.com>wrote:

> Hey Jeff,
>
> You may be interested in the Skewed Design specification from the Pig team:
> http://wiki.apache.org/pig/PigSkewedJoinSpec.
>
> Regards,
> Jeff
>
> On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xc...@gmail.com> wrote:
>
> > My first thought is that it depends on the reduce logic. If you could do
> > the
> > reduction in two passes then you could do an initial arbitrary partition
> > for
> > the majority key and bring the partitions together in a second reduction
> > (or
> > a map-side join). I would use a round robin strategy to assign the
> > arbitrary
> > partitions.
> >
> >
> >
> >
> > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Today there's a problem about imbalanced data come out of mind .
> > >
> > > I'd like to know how hadoop handle this kind of data.  e.g. one key
> > > dominates the map output, say 99%. So 99% data set will go to one
> > reducer,
> > > and this reducer will become the bottleneck.
> > >
> > > Does hadoop have any other better ways to handle such imbalanced data
> set
> > ?
> > >
> > >
> > > Jeff Zhang
> > >
> >
>

Re: How to handle imbalanced data in hadoop ?

Posted by Jeff Hammerbacher <ha...@cloudera.com>.

Hey Jeff,

You may be interested in the Skewed Design specification from the Pig team:
http://wiki.apache.org/pig/PigSkewedJoinSpec.

Regards,
Jeff

On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <xc...@gmail.com> wrote:

> My first thought is that it depends on the reduce logic. If you could do
> the
> reduction in two passes then you could do an initial arbitrary partition
> for
> the majority key and bring the partitions together in a second reduction
> (or
> a map-side join). I would use a round robin strategy to assign the
> arbitrary
> partitions.
>
>
>
>
> On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com> wrote:
>
> > Hi all,
> >
> > Today there's a problem about imbalanced data come out of mind .
> >
> > I'd like to know how hadoop handle this kind of data.  e.g. one key
> > dominates the map output, say 99%. So 99% data set will go to one
> reducer,
> > and this reducer will become the bottleneck.
> >
> > Does hadoop have any other better ways to handle such imbalanced data set
> ?
> >
> >
> > Jeff Zhang
> >
>

Re: How to handle imbalanced data in hadoop ?

Posted by brien colwell <xc...@gmail.com>.

My first thought is that it depends on the reduce logic. If you could do the
reduction in two passes then you could do an initial arbitrary partition for
the majority key and bring the partitions together in a second reduction (or
a map-side join). I would use a round robin strategy to assign the arbitrary
partitions.

On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <zj...@gmail.com> wrote:

> Hi all,
>
> Today there's a problem about imbalanced data come out of mind .
>
> I'd like to know how hadoop handle this kind of data.  e.g. one key
> dominates the map output, say 99%. So 99% data set will go to one reducer,
> and this reducer will become the bottleneck.
>
> Does hadoop have any other better ways to handle such imbalanced data set ?
>
>
> Jeff Zhang
>