Posted to mapreduce-user@hadoop.apache.org by Dave Shine <Da...@channelintelligence.com> on 2012/07/20 15:20:19 UTC

Distributing Keys across Reducers

I have a job that is emitting over 3 billion rows from the map to the reduce.  The job is configured with 43 reduce tasks.  A perfectly even distribution would amount to about 70 million rows per reduce task.  However I actually got around 60 million for most of the tasks, one task got over 100 million, and one task got almost 350 million.  This uneven distribution caused the job to run exceedingly long.

I believe this is referred to as a "key skew problem", which I know is heavily dependent on the actual data being processed.  Can anyone point me to any blog posts, white papers, etc. that might give me some options on how to deal with this issue?

Thanks,
Dave Shine
Sr. Software Engineer
321.939.5093 direct |  407.314.0122 mobile

CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com<http://www.ciboost.com/>
facebook platform | where-to-buy | product search engines | shopping engines



________________________________
The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.

RE: Distributing Keys across Reducers

Posted by Dave Shine <Da...@channelintelligence.com>.
These numbers are with everything I laid out below.  The job was running acceptably until a couple of days ago when a change increased the output of the Map phase by about 30%.  I don't think there is anything special about those additional keys that would force them all into the same reducer.

Dave Shine
Sr. Software Engineer
321.939.5093 direct |  407.314.0122 mobile
CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com


-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Friday, July 20, 2012 11:56 AM
To: mapreduce-user@hadoop.apache.org
Cc: john.armstrong@ccri.com
Subject: Re: Distributing Keys across Reducers

Does applying a combiner make any difference? Or are these numbers with the combiner included?

On Fri, Jul 20, 2012 at 8:46 PM, Dave Shine <Da...@channelintelligence.com> wrote:
> Thanks John.
>
> The key is my own WritableComparable object, and I have created custom Comparator, Partitioner, and KeyValueGroupingComparator.  However, they are all pretty generic.  The Key class has two properties, a boolean and a string.  I'm grouping on just the string, but comparing on both properties to ensure that the reducer receives the "true" values before the "false" values.
>
> My partitioner does the basic hash of just the string portion of the key class.  I'm hoping to find some guidance on how to make that partitioner smarter and avoid this problem.
>
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct |  407.314.0122 mobile CI Boost(tm) Clients  
> Outperform Online(tm)  www.ciboost.com
>
>
> -----Original Message-----
> From: John Armstrong [mailto:jrja@ccri.com]
> Sent: Friday, July 20, 2012 10:20 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Distributing Keys across Reducers
>
> On 07/20/2012 09:20 AM, Dave Shine wrote:
>> I believe this is referred to as a "key skew problem", which I know 
>> is heavily dependent on the actual data being processed.  Can anyone 
>> point me to any blog posts, white papers, etc. that might give me 
>> some options on how to deal with this issue?
>
> I don't know about blog posts or white papers, but the canonical answer here is usually using a different Partitioner.
>
> The default one takes the hashCode() of each Mapper output key and reduces it modulo the number of Reducers you've specified (43, here).  So the first place I'd look is to see if there's some reason you're getting so many more outputs with one key-hash-mod-43 than the others.
>
> A common answer here is that one key alone has a huge number of outputs, in which case it's hard to do anything better with it.  Another case is that your key class' hash function is bad at telling apart a certain class of keys that occur with some regularity.  Since 43 is an odd prime, I would not expect a moderately evenly distributed hash to suddenly get spikes at certain values mod-43.
>
> So if you want to (and can) rejigger your hashes to spread things more evenly, great.  If not, you're down to writing your own partitioner.
> It's slightly different depending on which API you're using, but either way you basically have to write a function called getPartition that takes a mapper output record (key and value) and the number of reducers and returns the index (from 0 to numReducers-1) of the reducer that should handle that record.  And unless you REALLY know what you're doing, the function should probably only depend on the key.
>
> Good luck.
>
> The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.



--
Harsh J

Re: Distributing Keys across Reducers

Posted by Harsh J <ha...@cloudera.com>.
Does applying a combiner make any difference? Or are these numbers
with the combiner included?

On Fri, Jul 20, 2012 at 8:46 PM, Dave Shine
<Da...@channelintelligence.com> wrote:
> Thanks John.
>
> The key is my own WritableComparable object, and I have created custom Comparator, Partitioner, and KeyValueGroupingComparator.  However, they are all pretty generic.  The Key class has two properties, a boolean and a string.  I'm grouping on just the string, but comparing on both properties to ensure that the reducer receives the "true" values before the "false" values.
>
> My partitioner does the basic hash of just the string portion of the key class.  I'm hoping to find some guidance on how to make that partitioner smarter and avoid this problem.
>
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct |  407.314.0122 mobile
> CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com
>
>
> -----Original Message-----
> From: John Armstrong [mailto:jrja@ccri.com]
> Sent: Friday, July 20, 2012 10:20 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Distributing Keys across Reducers
>
> On 07/20/2012 09:20 AM, Dave Shine wrote:
>> I believe this is referred to as a "key skew problem", which I know is
>> heavily dependent on the actual data being processed.  Can anyone
>> point me to any blog posts, white papers, etc. that might give me some
>> options on how to deal with this issue?
>
> I don't know about blog posts or white papers, but the canonical answer here is usually using a different Partitioner.
>
> The default one takes the hashCode() of each Mapper output key and reduces it modulo the number of Reducers you've specified (43, here).  So the first place I'd look is to see if there's some reason you're getting so many more outputs with one key-hash-mod-43 than the others.
>
> A common answer here is that one key alone has a huge number of outputs, in which case it's hard to do anything better with it.  Another case is that your key class' hash function is bad at telling apart a certain class of keys that occur with some regularity.  Since 43 is an odd prime, I would not expect a moderately evenly distributed hash to suddenly get spikes at certain values mod-43.
>
> So if you want to (and can) rejigger your hashes to spread things more evenly, great.  If not, you're down to writing your own partitioner.
> It's slightly different depending on which API you're using, but either way you basically have to write a function called getPartition that takes a mapper output record (key and value) and the number of reducers and returns the index (from 0 to numReducers-1) of the reducer that should handle that record.  And unless you REALLY know what you're doing, the function should probably only depend on the key.
>
> Good luck.
>
> The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.



-- 
Harsh J

RE: Distributing Keys across Reducers

Posted by Dave Shine <Da...@channelintelligence.com>.
Thanks John.

The key is my own WritableComparable object, and I have created custom Comparator, Partitioner, and KeyValueGroupingComparator.  However, they are all pretty generic.  The Key class has two properties, a boolean and a string.  I'm grouping on just the string, but comparing on both properties to ensure that the reducer receives the "true" values before the "false" values.

My partitioner does the basic hash of just the string portion of the key class.  I'm hoping to find some guidance on how to make that partitioner smarter and avoid this problem.
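
For illustration only, a minimal sketch of the kind of composite key and grouping comparator described above. The names (FlagStringKey, StringOnlyGroupingComparator) and the field layout are invented for the example, not the actual job's classes; the point is that hashCode() and the grouping comparator look only at the string, while compareTo() also orders on the boolean so the "true" records reach the reducer first.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.BooleanWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Sketch only: a composite key with a boolean flag and a string payload.
    public class FlagStringKey implements WritableComparable<FlagStringKey> {
        private final BooleanWritable flag = new BooleanWritable();
        private final Text value = new Text();

        public void set(boolean f, String v) { flag.set(f); value.set(v); }
        public Text getValue() { return value; }

        @Override
        public void write(DataOutput out) throws IOException {
            flag.write(out);
            value.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            flag.readFields(in);
            value.readFields(in);
        }

        // Order by string first; within one string, "true" sorts before "false"
        // so the reducer sees the true records first.
        @Override
        public int compareTo(FlagStringKey o) {
            int c = value.compareTo(o.value);
            if (c != 0) return c;
            if (flag.get() == o.flag.get()) return 0;
            return flag.get() ? -1 : 1;
        }

        // Partitioning and grouping must ignore the flag, so hashCode/equals
        // are defined on the string alone.
        @Override
        public int hashCode() { return value.hashCode(); }

        @Override
        public boolean equals(Object o) {
            return o instanceof FlagStringKey && value.equals(((FlagStringKey) o).value);
        }
    }

    // Grouping comparator (own file): one reduce() call per distinct string,
    // regardless of the flag value.
    class StringOnlyGroupingComparator extends WritableComparator {
        protected StringOnlyGroupingComparator() { super(FlagStringKey.class, true); }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return ((FlagStringKey) a).getValue().compareTo(((FlagStringKey) b).getValue());
        }
    }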

Dave Shine
Sr. Software Engineer
321.939.5093 direct |  407.314.0122 mobile
CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com


-----Original Message-----
From: John Armstrong [mailto:jrja@ccri.com]
Sent: Friday, July 20, 2012 10:20 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Distributing Keys across Reducers

On 07/20/2012 09:20 AM, Dave Shine wrote:
> I believe this is referred to as a "key skew problem", which I know is
> heavily dependent on the actual data being processed.  Can anyone
> point me to any blog posts, white papers, etc. that might give me some
> options on how to deal with this issue?

I don't know about blog posts or white papers, but the canonical answer here is usually using a different Partitioner.

The default one takes the hashCode() of each Mapper output key and reduces it modulo the number of Reducers you've specified (43, here).  So the first place I'd look is to see if there's some reason you're getting so many more outputs with one key-hash-mod-43 than the others.

A common answer here is that one key alone has a huge number of outputs, in which case it's hard to do anything better with it.  Another case is that your key class' hash function is bad at telling apart a certain class of keys that occur with some regularity.  Since 43 is an odd prime, I would not expect a moderately evenly distributed hash to suddenly get spikes at certain values mod-43.

So if you want to (and can) rejigger your hashes to spread things more evenly, great.  If not, you're down to writing your own partitioner.
It's slightly different depending on which API you're using, but either way you basically have to write a function called getPartition that takes a mapper output record (key and value) and the number of reducers and returns the index (from 0 to numReducers-1) of the reducer that should handle that record.  And unless you REALLY know what you're doing, the function should probably only depend on the key.

Good luck.

The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.

Re: Distributing Keys across Reducers

Posted by John Armstrong <jr...@ccri.com>.
On 07/20/2012 09:20 AM, Dave Shine wrote:
> I believe this is referred to as a “key skew problem”, which I know is
> heavily dependent on the actual data being processed.  Can anyone point
> me to any blog posts, white papers, etc. that might give me some options
> on how to deal with this issue?

I don't know about blog posts or white papers, but the canonical answer 
here is usually using a different Partitioner.

The default one takes the hashCode() of each Mapper output key and reduces 
it modulo the number of Reducers you've specified (43, here).  So the 
first place I'd look is to see if there's some reason you're getting so 
many more outputs with one key-hash-mod-43 than the others.

A common answer here is that one key alone has a huge number of outputs, 
in which case it's hard to do anything better with it.  Another case is 
that your key class' hash function is bad at telling apart a certain 
class of keys that occur with some regularity.  Since 43 is an odd 
prime, I would not expect a moderately evenly distributed hash to 
suddenly get spikes at certain values mod-43.

So if you want to (and can) rejigger your hashes to spread things more 
evenly, great.  If not, you're down to writing your own partitioner. 
It's slightly different depending on which API you're using, but either 
way you basically have to write a function called getPartition that 
takes a mapper output record (key and value) and the number of reducers 
and returns the index (from 0 to numReducers-1) of the reducer that 
should handle that record.  And unless you REALLY know what you're 
doing, the function should probably only depend on the key.
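
A bare-bones sketch of that, using the newer org.apache.hadoop.mapreduce API; the Text/NullWritable types are stand-ins for whatever the job's real key and value classes are, and this particular body is essentially what the default HashPartitioner already does:

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Sketch only: substitute the job's actual key/value types.
    public class ExamplePartitioner extends Partitioner<Text, NullWritable> {
        @Override
        public int getPartition(Text key, NullWritable value, int numReduceTasks) {
            // Depend only on the key; mask the sign bit so the result is never
            // negative, then fold it into the range 0 .. numReduceTasks-1.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

A custom version keeps the same skeleton and just replaces the hashCode() call with whatever spreads the troublesome keys more evenly.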

Good luck.

Re: Distributing Keys across Reducers

Posted by Tim Broberg <Ti...@exar.com>.
Good to know. Thanks for the update.

    - Tim.

On Jul 25, 2012, at 5:21 AM, "Dave Shine" <Da...@channelintelligence.com> wrote:

> Just wanted to follow up on this issue.  It turned out that I was overlooking the obvious: over 8% of the mapper output had exactly the same key, which was actually an invalid value.  By changing the mapper to not emit records with an invalid key, the problem went away.
> 
> Moral of the story, verify the data before you blame the software.
> 
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct |  407.314.0122 mobile
> CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com
> 
> 
> -----Original Message-----
> From: Dave Shine [mailto:Dave.Shine@channelintelligence.com] 
> Sent: Friday, July 20, 2012 1:13 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: RE: Distributing Keys across Reducers
> 
> Yes, that is a possibility, but it will take some significant rearchitecture.  I was assuming that was what I was going to have to do until I saw the key distribution problem and thought I might be able to buy some relief by addressing that.
> 
> The job runs once per day, starting at 1:00AM EDT.  I have changed it to use fewer reducers just to see how that affects the distribution.
> 
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct |  407.314.0122 mobile CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com
> 
> 
> -----Original Message-----
> From: Tim Broberg [mailto:Tim.Broberg@exar.com]
> Sent: Friday, July 20, 2012 1:03 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: RE: Distributing Keys across Reducers
> 
> Just a thought, but can you deal with the problem with increased granularity by simply making the jobs smaller?
> 
> If you have enough jobs, when one takes twice as long there will be plenty of other small jobs to employ the other nodes, right?
> 
>    - Tim.
> 
> ________________________________________
> From: David Rosenstrauch [darose@darose.net]
> Sent: Friday, July 20, 2012 7:45 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Distributing Keys across Reducers
> 
> On 07/20/2012 09:20 AM, Dave Shine wrote:
>> I have a job that is emitting over 3 billion rows from the map to the reduce.  The job is configured with 43 reduce tasks.  A perfectly even distribution would amount to about 70 million rows per reduce task.  However I actually got around 60 million for most of the tasks, one task got over 100 million, and one task got almost 350 million.  This uneven distribution caused the job to run exceedingly long.
>> 
>> I believe this is referred to as a "key skew problem", which I know is heavily dependent on the actual data being processed.  Can anyone point me to any blog posts, white papers, etc. that might give me some options on how to deal with this issue?
> 
> Hadoop lets you override the default partitioner and replace it with your own.  This lets you write a custom partitioning scheme which distributes your data more evenly.
> 
> HTH,
> 
> DR
> 
> The information contained in this email is intended only for the personal and confidential use of the recipient(s) named above.  The information and any attached documents contained in this message may be Exar confidential and/or legally privileged.  If you are not the intended recipient, you are hereby notified that any review, use, dissemination or reproduction of this message is strictly prohibited and may be unlawful.  If you have received this communication in error, please notify us immediately by return email and delete the original message.
> 
> The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.

RE: Distributing Keys across Reducers

Posted by Dave Shine <Da...@channelintelligence.com>.
Just wanted to follow up on this issue.  It turned out that I was overlooking the obvious: over 8% of the mapper output had exactly the same key, which was actually an invalid value.  By changing the mapper to not emit records with an invalid key, the problem went away.

Moral of the story, verify the data before you blame the software.
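
In code, the fix amounts to a guard in the mapper before the emit. The sketch below is illustrative only: the tab-delimited record layout, the emptiness check, and the counter name are assumptions standing in for the job's real parsing and validity rules.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch only: records whose key fails validation are dropped (and counted)
    // instead of being emitted, so they never pile up on a single reducer.
    public class FilteringMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String key = extractKey(line.toString());
            if (key == null || key.isEmpty()) {
                context.getCounter("skew", "invalid_keys_dropped").increment(1);
                return;
            }
            context.write(new Text(key), line);
        }

        // Hypothetical record layout: the key is everything before the first tab.
        private String extractKey(String line) {
            int tab = line.indexOf('\t');
            return tab < 0 ? null : line.substring(0, tab);
        }
    }

The counter also makes the scale of the problem visible on the job page, which is handy for confirming figures like the 8% above.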

Dave Shine
Sr. Software Engineer
321.939.5093 direct |  407.314.0122 mobile
CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com


-----Original Message-----
From: Dave Shine [mailto:Dave.Shine@channelintelligence.com] 
Sent: Friday, July 20, 2012 1:13 PM
To: mapreduce-user@hadoop.apache.org
Subject: RE: Distributing Keys across Reducers

> Yes, that is a possibility, but it will take some significant rearchitecture.  I was assuming that was what I was going to have to do until I saw the key distribution problem and thought I might be able to buy some relief by addressing that.

> The job runs once per day, starting at 1:00AM EDT.  I have changed it to use fewer reducers just to see how that affects the distribution.

Dave Shine
Sr. Software Engineer
321.939.5093 direct |  407.314.0122 mobile CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com


-----Original Message-----
From: Tim Broberg [mailto:Tim.Broberg@exar.com]
Sent: Friday, July 20, 2012 1:03 PM
To: mapreduce-user@hadoop.apache.org
Subject: RE: Distributing Keys across Reducers

Just a thought, but can you deal with the problem with increased granularity by simply making the jobs smaller?

If you have enough jobs, when one takes twice as long there will be plenty of other small jobs to employ the other nodes, right?

    - Tim.

________________________________________
From: David Rosenstrauch [darose@darose.net]
Sent: Friday, July 20, 2012 7:45 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Distributing Keys across Reducers

On 07/20/2012 09:20 AM, Dave Shine wrote:
> I have a job that is emitting over 3 billion rows from the map to the reduce.  The job is configured with 43 reduce tasks.  A perfectly even distribution would amount to about 70 million rows per reduce task.  However I actually got around 60 million for most of the tasks, one task got over 100 million, and one task got almost 350 million.  This uneven distribution caused the job to run exceedingly long.
>
> I believe this is referred to as a "key skew problem", which I know is heavily dependent on the actual data being processed.  Can anyone point me to any blog posts, white papers, etc. that might give me some options on how to deal with this issue?

Hadoop lets you override the default partitioner and replace it with your own.  This lets you write a custom partitioning scheme which distributes your data more evenly.

HTH,

DR

The information contained in this email is intended only for the personal and confidential use of the recipient(s) named above.  The information and any attached documents contained in this message may be Exar confidential and/or legally privileged.  If you are not the intended recipient, you are hereby notified that any review, use, dissemination or reproduction of this message is strictly prohibited and may be unlawful.  If you have received this communication in error, please notify us immediately by return email and delete the original message.

The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.

RE: Distributing Keys across Reducers

Posted by Dave Shine <Da...@channelintelligence.com>.
Yes, that is a possibility, but it will take some significant rearchitecture.  I was assuming that was what I was going to have to do until I saw the key distribution problem and thought I might be able to buy some relief by addressing that.

The job runs once per day, starting at 1:00AM EDT.  I have changed it to use fewer reducers just to see how that affects the distribution.

Dave Shine
Sr. Software Engineer
321.939.5093 direct |  407.314.0122 mobile
CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com


-----Original Message-----
From: Tim Broberg [mailto:Tim.Broberg@exar.com]
Sent: Friday, July 20, 2012 1:03 PM
To: mapreduce-user@hadoop.apache.org
Subject: RE: Distributing Keys across Reducers

Just a thought, but can you deal with the problem with increased granularity by simply making the jobs smaller?

If you have enough jobs, when one takes twice as long there will be plenty of other small jobs to employ the other nodes, right?

    - Tim.

________________________________________
From: David Rosenstrauch [darose@darose.net]
Sent: Friday, July 20, 2012 7:45 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Distributing Keys across Reducers

On 07/20/2012 09:20 AM, Dave Shine wrote:
> I have a job that is emitting over 3 billion rows from the map to the reduce.  The job is configured with 43 reduce tasks.  A perfectly even distribution would amount to about 70 million rows per reduce task.  However I actually got around 60 million for most of the tasks, one task got over 100 million, and one task got almost 350 million.  This uneven distribution caused the job to run exceedingly long.
>
> I believe this is referred to as a "key skew problem", which I know is heavily dependent on the actual data being processed.  Can anyone point me to any blog posts, white papers, etc. that might give me some options on how to deal with this issue?

Hadoop lets you override the default partitioner and replace it with your own.  This lets you write a custom partitioning scheme which distributes your data more evenly.

HTH,

DR

The information contained in this email is intended only for the personal and confidential use of the recipient(s) named above.  The information and any attached documents contained in this message may be Exar confidential and/or legally privileged.  If you are not the intended recipient, you are hereby notified that any review, use, dissemination or reproduction of this message is strictly prohibited and may be unlawful.  If you have received this communication in error, please notify us immediately by return email and delete the original message.

The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.

RE: Distributing Keys across Reducers

Posted by Tim Broberg <Ti...@exar.com>.
Just a thought, but can you deal with the problem with increased granularity by simply making the jobs smaller?

If you have enough jobs, when one takes twice as long there will be plenty of other small jobs to employ the other nodes, right?

    - Tim.

________________________________________
From: David Rosenstrauch [darose@darose.net]
Sent: Friday, July 20, 2012 7:45 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Distributing Keys across Reducers

On 07/20/2012 09:20 AM, Dave Shine wrote:
> I have a job that is emitting over 3 billion rows from the map to the reduce.  The job is configured with 43 reduce tasks.  A perfectly even distribution would amount to about 70 million rows per reduce task.  However I actually got around 60 million for most of the tasks, one task got over 100 million, and one task got almost 350 million.  This uneven distribution caused the job to run exceedingly long.
>
> I believe this is referred to as a "key skew problem", which I know is heavily dependent on the actual data being processed.  Can anyone point me to any blog posts, white papers, etc. that might give me some options on how to deal with this issue?

Hadoop lets you override the default partitioner and replace it with
your own.  This lets you write a custom partitioning scheme which
distributes your data more evenly.

HTH,

DR

The information contained in this email is intended only for the personal and confidential use of the recipient(s) named above.  The information and any attached documents contained in this message may be Exar confidential and/or legally privileged.  If you are not the intended recipient, you are hereby notified that any review, use, dissemination or reproduction of this message is strictly prohibited and may be unlawful.  If you have received this communication in error, please notify us immediately by return email and delete the original message.

Re: Distributing Keys across Reducers

Posted by David Rosenstrauch <da...@darose.net>.
On 07/20/2012 09:20 AM, Dave Shine wrote:
> I have a job that is emitting over 3 billion rows from the map to the reduce.  The job is configured with 43 reduce tasks.  A perfectly even distribution would amount to about 70 million rows per reduce task.  However I actually got around 60 million for most of the tasks, one task got over 100 million, and one task got almost 350 million.  This uneven distribution caused the job to run exceedingly long.
>
> I believe this is referred to as a "key skew problem", which I know is heavily dependent on the actual data being processed.  Can anyone point me to any blog posts, white papers, etc. that might give me some options on how to deal with this issue?

Hadoop lets you override the default partitioner and replace it with 
your own.  This lets you write a custom partitioning scheme which 
distributes your data more evenly.
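
For completeness, wiring a custom partitioner into a job is a one-line change in the driver. A hedged sketch using the newer mapreduce API (with the old mapred API it is JobConf.setPartitionerClass instead); ExamplePartitioner is a placeholder name, and the rest of the job setup is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SkewedJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "distribute-keys");
            job.setJarByClass(SkewedJobDriver.class);
            job.setPartitionerClass(ExamplePartitioner.class); // replaces the default HashPartitioner
            job.setNumReduceTasks(43);                          // the count passed to getPartition()
            // ... mapper, reducer, combiner, and input/output setup elided ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }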

HTH,

DR


RE: Distributing Keys across Reducers

Posted by Dave Shine <Da...@channelintelligence.com>.
Thanks Syed.  I'm not using HBase, so I don't think this is related to my problem.

Dave Shine
Sr. Software Engineer
321.939.5093 direct |  407.314.0122 mobile
CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com<http://www.ciboost.com/>

From: syed kather [mailto:in.abdul@gmail.com]
Sent: Friday, July 20, 2012 9:58 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Distributing Keys across Reducers

Dave Shine ,
    Can you share how much data is being taken by each map task? If the map tasks are uneven, it might be a hotspotting problem.
Have a look at http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
  I have also faced the same problem and am trying to implement HBaseWD.

            Thanks and Regards,
        S SYED ABDUL KATHER




On Fri, Jul 20, 2012 at 6:50 PM, Dave Shine <Da...@channelintelligence.com> wrote:
I have a job that is emitting over 3 billion rows from the map to the reduce.  The job is configured with 43 reduce tasks.  A perfectly even distribution would amount to about 70 million rows per reduce task.  However I actually got around 60 million for most of the tasks, one task got over 100 million, and one task got almost 350 million.  This uneven distribution caused the job to run exceedingly long.

I believe this is referred to as a "key skew problem", which I know is heavily dependent on the actual data being processed.  Can anyone point me to any blog posts, white papers, etc. that might give me some options on how to deal with this issue?

Thanks,
Dave Shine
Sr. Software Engineer
321.939.5093 direct |  407.314.0122 mobile

CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com<http://www.ciboost.com/>
facebook platform | where-to-buy | product search engines | shopping engines



________________________________
The information contained in this email message is considered confidential and proprietary to the sender and is intended solely for review and use by the named recipient. Any unauthorized review, use or distribution is strictly prohibited. If you have received this message in error, please advise the sender by reply email and delete the message.


Re: Distributing Keys across Reducers

Posted by syed kather <in...@gmail.com>.
Dave Shine ,
    Can you share how much data is being taken by each map task? If the map
tasks are uneven, it might be a hotspotting problem.
Have a look at
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
  I have also faced the same problem and am trying to implement HBaseWD.

            Thanks and Regards,
        S SYED ABDUL KATHER



On Fri, Jul 20, 2012 at 6:50 PM, Dave Shine <
Dave.Shine@channelintelligence.com> wrote:

>  I have a job that is emitting over 3 billion rows from the map to the
> reduce.  The job is configured with 43 reduce tasks.  A perfectly even
> distribution would amount to about 70 million rows per reduce task.
> However I actually got around 60 million for most of the tasks, one task
> got over 100 million, and one task got almost 350 million.  This uneven
> distribution caused the job to run exceedingly long.
>
> I believe this is referred to as a “key skew problem”, which I know is
> heavily dependent on the actual data being processed.  Can anyone point me
> to any blog posts, white papers, etc. that might give me some options on
> how to deal with this issue?
>
> Thanks,
>
> Dave Shine
>
> Sr. Software Engineer
>
> 321.939.5093 direct |  407.314.0122 mobile
>
> CI Boost™ Clients  Outperform Online™  www.ciboost.com
>
> facebook platform | where-to-buy | product search engines | shopping
> engines
>
> ------------------------------
> The information contained in this email message is considered confidential
> and proprietary to the sender and is intended solely for review and use by
> the named recipient. Any unauthorized review, use or distribution is
> strictly prohibited. If you have received this message in error, please
> advise the sender by reply email and delete the message.
>

Re: Distributing Keys across Reducers

Posted by Christoph Schmitz <ch...@1und1.de>.
Hi Dave,

I haven't actually done this in practice, so take this with a grain of 
salt ;-)

One way to circumvent your problem might be to add entropy to the keys, 
i.e., if your keys are "a", "b" etc. and you got too many "a"s and too 
many "b"s, you could inflate your keys randomly to be (a, 1), ..., (a, 
100), (b, 1), ..., (b, 100) etc. and partition over those.

If you know the distribution of the key space beforehand, you could 
inflate each key in such as way as to make the resulting distribution 
uniform.

The downside of this approach is that you need to collect the reducer 
outputs for (a, 1) through (a, 100) and compute the value for "a" (same 
for "b", etc. of course). Depending on what you do, this might be a 
simple operation or a second MapReduce job.

There's a blog post explaining this idea:

http://blog.rapleaf.com/dev/2010/03/08/dealing-with-skewed-key-sizes-in-cascading/
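
A very rough sketch of the salting idea on the map side. Everything here is an assumption for illustration (the fan-out of 100, encoding the salt into a tab-joined Text key, the record parsing), and as described above a second job or post-processing step still has to merge the partial results for each original key:

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch only: emits (key + "\t" + salt) so one hot key is spread over
    // SALTS reducers instead of landing on a single one.
    public class SaltingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final int SALTS = 100;       // fan-out per key; tune to the skew
        private final Random random = new Random();
        private final Text saltedKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String key = extractKey(line.toString());
            saltedKey.set(key + "\t" + random.nextInt(SALTS));
            context.write(saltedKey, line);
        }

        // Hypothetical record layout: the key is everything before the first tab.
        private String extractKey(String line) {
            int tab = line.indexOf('\t');
            return tab < 0 ? line : line.substring(0, tab);
        }
    }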

Regards,
Christoph

On 20.07.2012 15:20, Dave Shine wrote:
> I have a job that is emitting over 3 billion rows from the map to the
> reduce.  The job is configured with 43 reduce tasks.  A perfectly even
> distribution would amount to about 70 million rows per reduce task.
> However I actually got around 60 million for most of the tasks, one task
> got over 100 million, and one task got almost 350 million.  This uneven
> distribution caused the job to run exceedingly long.
>
> I believe this is referred to as a “key skew problem”, which I know is
> heavily dependent on the actual data being processed.  Can anyone point
> me to any blog posts, white papers, etc. that might give me some options
> on how to deal with this issue?