Posted to mapreduce-user@hadoop.apache.org by Steve Lewis <lo...@gmail.com> on 2013/09/01 19:49:02 UTC

Gathering Statistics for use by Mappers?

I am solving a problem which will involve several chained Map-Reduce tasks.
On the second task I want the mapper to send every value in a specific
'area' of the data space (imagine customers in a zip code) to a reducer.
These values need to be compared with each other and, as was suggested
earlier, when the number of customers in a 'zipcode' gets large the
reducer performs poorly or not at all.
  As long as every mapper knows the total number of customers in every zip
code, each can make the same decision as to whether and how to break up a
specific zip code.
  At the end of the first reduce pass I will have processed every customer
and should have been able to gather statistics before launching the next
job.
  I see two options:
1) Use counters, so in the mapper I might say something like
  context.getCounter("Zipcodes", MyZipcode.toString()).increment(1);
The number of zipcodes might be 4,000 or as high as 50,000, in which
case I would need to raise the maximum number of counters.
2) In addition to emitting my normal set of key-value pairs, emit special
keys for zipcodes, with the key being the zipcode (say with "_" prepended
to distinguish it from other Text keys) and the value being a count, as
WordCount does (see the sketch below).
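
As a rough sketch of option 2 (assuming Text keys and values; the class
and field names here are just illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Alongside each normal record, emit a WordCount-style count pair keyed
// by "_" + zipcode so the statistics keys are distinguishable from the
// ordinary Text keys.
public class ZipcodeCountingMapper extends Mapper<Text, Text, Text, Text> {
  private final Text countKey = new Text();
  private static final Text ONE = new Text("1");

  @Override
  protected void map(Text zipcode, Text customer, Context context)
      throws IOException, InterruptedException {
    context.write(zipcode, customer);        // normal business output
    countKey.set("_" + zipcode.toString());  // statistics output
    context.write(countKey, ONE);
  }
}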

Approach 1 maintains a large number of counters, but at the end of my first
job I will have the statistics and can pass them to the second job.
Approach 2 doubles the number of key-value pairs to sort (we are in the
tens to hundreds of millions in the larger jobs) and adds processing code,
as well as the requirement to read and process result files.
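
For approach 1, the hand-off between jobs might look roughly like this
(a sketch; the "zipcount." configuration prefix is an invented convention,
not anything Hadoop defines):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;

public class StatsHandoff {
  // Copy every "Zipcodes" counter from the finished first job into the
  // configuration of the second job, where each mapper can read
  // conf.getLong("zipcount." + zipcode, 0).
  public static void copyCounts(Job finishedJob, Configuration nextJobConf)
      throws Exception {
    for (Counter c : finishedJob.getCounters().getGroup("Zipcodes")) {
      nextJobConf.setLong("zipcount." + c.getName(), c.getValue());
    }
  }
}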

Any comments on the two approaches, or ideas that I might not have thought
of?

Re: Gathering Statistics for use by Mappers?

Posted by Steve Lewis <lo...@gmail.com>.
There is a maximum number of counters - 6000, I think - but it can be
raised by setting "mapreduce.job.counters.max". I have done this but don't
know the performance implications.
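
For reference, raising the limit looks something like the sketch below
(20000 is an arbitrary example value; on older Hadoop versions the
property may be "mapreduce.job.counters.limit" instead):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CounterLimitSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Must be set before the job is submitted.
    conf.setInt("mapreduce.job.counters.max", 20000);
    Job job = Job.getInstance(conf, "zipcode statistics");
    // ... mapper, reducer, input/output paths as usual ...
  }
}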

What I am doing is comparing all items that share a specific key with each
other. This requires all of the elements to be held in memory. If the
number of items gets too large, the task may be split among multiple
reducers, but the mapper needs this information as it starts assigning
keys - hence I need the statistics early. Keys with fewer values can be
handled by a single reducer.
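
A minimal sketch of the key assignment I have in mind (the "zipcount."
configuration keys, the threshold, and the round-robin suffix are all just
illustrative conventions):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Second-job mapper: keys whose total count (gathered in job 1) exceeds
// a threshold are split into several sub-keys so that no single reducer
// has to hold too many values in memory.
public class SplittingMapper extends Mapper<Text, Text, Text, Text> {
  private static final long MAX_PER_REDUCER = 100000; // placeholder
  private final Text outKey = new Text();
  private int spread = 0; // round-robin counter for salting hot keys

  @Override
  protected void map(Text zipcode, Text customer, Context context)
      throws IOException, InterruptedException {
    long total = context.getConfiguration()
        .getLong("zipcount." + zipcode.toString(), 0);
    long shards = Math.max(1, (total + MAX_PER_REDUCER - 1) / MAX_PER_REDUCER);
    outKey.set(shards == 1 ? zipcode.toString()
        : zipcode.toString() + "#" + (spread++ % shards));
    context.write(outKey, customer);
  }
}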
  For the business logic, combiners would probably be of little use. Even
for two records having the same key, say zipcode, the odds that they are
'next door neighbors' and could be combined are pretty low.
   Yes, the number of keys is not high but every value is a pretty complex
serialized object with a lot of processing to handle it.



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: Gathering Statistics for use by Mappers?

Posted by Tim Robertson <ti...@gmail.com>.
Hey Steve,

If I recall correctly the total number of counters you have is limited.
It's been a while since I looked at that code, but I seem to recall the
counters get pushed to the JobTracker (JT) in heartbeat messages and are
held in JT memory. Anyway, 1) sounds like you'll hit limits, so I'd
suggest starting with 2).

Can you elaborate on what will happen in the subsequent jobs?  Is this
something you can take multiple passes at?

E.g. in Job 1, emit the keys from mappers as a compound of (ZipCode :
modulo100(zipcode)), so that in the worst case you are dealing with a
100th of a zipcode at a time in a reducer.  You'd then pass the first
job's output into the second to combine again and group the 100ths of the
zipcode data together.  This would probably only work for some operations
though.
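
Something like the sketch below; note the modulus has to be taken over
something that varies within a zipcode (here a hash of the customer
record, which is my assumption), otherwise every record in a zipcode
would land in the same bucket:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Job 1 mapper: spread each zipcode over 100 buckets so a reducer sees
// at most ~1/100th of a zipcode; job 2 merges the partial results.
public class BucketingMapper extends Mapper<Text, Text, Text, Text> {
  private final Text compoundKey = new Text();

  @Override
  protected void map(Text zipcode, Text customer, Context context)
      throws IOException, InterruptedException {
    int bucket = (customer.hashCode() & Integer.MAX_VALUE) % 100;
    compoundKey.set(zipcode.toString() + ":" + bucket);
    context.write(compoundKey, customer);
  }
}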

Are you making use of Combiners?  They effectively do what I describe
above, but as a mini-reduce on the output of each map.
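
Wiring one in is a single line at job setup; it only pays off when the
reduce operation is associative and commutative, e.g. summing counts
(a sketch using the stock IntSumReducer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombinerWiring {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "counts");
    // Runs the reduce logic on each mapper's local output before the
    // shuffle, shrinking what crosses the network.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // ... mapper and input/output paths as usual ...
  }
}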

Are you using Hive or vanilla MR?  The Hive folks are making a lot of
advances in this area (e.g. ORC files).

10s to 100s of millions of keys does not sound like that many though.

Cheers,
Tim