Posted to user@hadoop.apache.org by jamal sasha <ja...@gmail.com> on 2012/11/21 17:38:38 UTC

guessing number of reducers.

By default the number of reducers is set to 1.
Is there a good way to guess the optimal number of reducers?
Say I have terabytes of data, and the mappers are on the order of 5,000 or so,
but ultimately I am calculating some average over the whole data set,
say the average transaction value.
Now the output will be just one line in one "part" file; the rest of them will
be empty. So I am guessing I need loads of reducers, but then most of them
will be empty, while at the same time one reducer won't suffice.
What's the best way to solve this?
How do I guess the optimal number of reducers?
Thanks
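For the specific case described above (one global average over the whole data set), a single reducer is usually enough if the map side pre-aggregates: each map task's combiner emits one small (sum, count) partial instead of every record, so the lone reducer only merges a handful of pairs. A minimal sketch of that idea in plain Java follows; the class and method names are illustrative and not from this thread, and in a real job this logic would live in a Combiner and Reducer:

```java
// Sketch (assumption, not from the thread): a global average does not
// need many reducers if each map task pre-aggregates its records into
// a (sum, count) partial. The single reducer then merges one small
// partial per map task instead of billions of records.
public class AverageSketch {

    /** Partial aggregate a combiner would emit for one map task: {sum, count}. */
    static double[] partial(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return new double[] {sum, values.length};
    }

    /** The single reducer merges the per-map-task partials into the average. */
    static double merge(double[][] partials) {
        double sum = 0, count = 0;
        for (double[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return sum / count;
    }

    public static void main(String[] args) {
        // Two "map tasks" worth of transaction values (hypothetical data).
        double[][] partials = {
            partial(new double[] {10, 20, 30}),
            partial(new double[] {40})
        };
        System.out.println(merge(partials)); // prints 25.0
    }
}
```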

RE: guessing number of reducers.

Posted by "Kartashov, Andy" <An...@mpac.ca>.
Jamal,

This is what I am using:

After you start your job, visit the JobTracker's web UI at <ip-address>:50030
and look for the Cluster Summary. The Reduce Task Capacity figure should hint at what to set your number to. I could be wrong, but it works for me. :)
Cluster Summary (Heap Size is *** MB/966.69 MB)
Running Map Tasks, Running Reduce Tasks, Total Submissions, Nodes,
Occupied Map Slots, Occupied Reduce Slots, Reserved Map Slots,
Reserved Reduce Slots, Map Task Capacity, Reduce Task Capacity,
Avg. Tasks/Node, Blacklisted Nodes, Excluded Nodes



Rgds,
AK47

NOTICE: This e-mail message and any attachments are confidential, subject to copyright and may be privileged. Any unauthorized use, copying or disclosure is prohibited. If you are not the intended recipient, please delete and contact the sender immediately. Please consider the environment before printing this e-mail. AVIS : le présent courriel et toute pièce jointe qui l'accompagne sont confidentiels, protégés par le droit d'auteur et peuvent être couverts par le secret professionnel. Toute utilisation, copie ou divulgation non autorisée est interdite. Si vous n'êtes pas le destinataire prévu de ce courriel, supprimez-le et contactez immédiatement l'expéditeur. Veuillez penser à l'environnement avant d'imprimer le présent courriel

Re: guessing number of reducers.

Posted by Manoj Babu <ma...@gmail.com>.
Thank you for the info Bejoy.

Cheers!
Manoj.



On Thu, Nov 22, 2012 at 12:04 AM, Bejoy KS <be...@gmail.com> wrote:

> **
> Hi Manoj
>
> If you intend to calculate the number of reducers based on the input size,
> then in your driver class get the size of the input dir in HDFS; if you
> intend to give n bytes to each reducer, the number of reducers can be
> computed as total input size / bytes per reducer.
>
> You can round this value and use it to set the number of reducers in the
> conf programmatically.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * Manoj Babu <ma...@gmail.com>
> *Date: *Wed, 21 Nov 2012 23:28:00 +0530
> *To: *<us...@hadoop.apache.org>
> *Cc: *bejoy.hadoop@gmail.com<be...@gmail.com>
> *Subject: *Re: guessing number of reducers.
>
> Hi,
>
> How do I set the number of reducers in the job conf dynamically?
> For example, some days I get 500GB of data on heavy traffic and some
> days only 100GB.
>
> Thanks in advance!
>
> Cheers!
> Manoj.
>
>
>
> On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy <An...@mpac.ca>wrote:
>
>>  Bejoy,
>>
>> I’ve read somewhere about keeping the number of mapred.reduce.tasks below
>> the reduce task capacity. Here is what I just tested:
>>
>> Output 25GB. 8-DN cluster with 16 Map and Reduce Task Capacity:
>>
>> 1 Reducer   – 22 mins
>> 4 Reducers  – 11.5 mins
>> 8 Reducers  – 5 mins
>> 10 Reducers – 7 mins
>> 12 Reducers – 6.5 mins
>> 16 Reducers – 5.5 mins
>>
>> 8 Reducers won the race, but reducers at the max capacity were very
>> close. :)
>>
>> AK47
>>
>>
>>
>>
>>
>> *From:* Bejoy KS [mailto:bejoy.hadoop@gmail.com]
>> *Sent:* Wednesday, November 21, 2012 11:51 AM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: guessing number of reducers.
>>
>>
>>
>> Hi Sasha
>>
>> In general the number of reduce tasks is chosen mainly based on the data
>> volume reaching the reduce phase. In tools like Hive and Pig, by default
>> there is one reducer for every 1GB of map output, so 100 gigs of map
>> output means 100 reducers.
>> If your tasks are more CPU intensive, then you need a smaller volume of
>> data per reducer for better performance.
>>
>> In general it is better to have the number of reduce tasks slightly less
>> than the number of available reduce slots in the cluster.
>>
>> Regards
>> Bejoy KS
>>
>> Sent from handheld, please excuse typos.

Re: guessing number of reducers.

Posted by Bejoy KS <be...@gmail.com>.
Hi Manoj

If you intend to calculate the number of reducers based on the input size, then in your driver class get the size of the input dir in HDFS; if you intend to give n bytes to each reducer, the number of reducers can be computed as total input size / bytes per reducer.

You can round this value and use it to set the number of reducers in the conf programmatically.

Regards
Bejoy KS

Sent from handheld, please excuse typos.
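
The suggestion above can be sketched roughly as follows. This is a sketch under stated assumptions: the 1 GB-per-reducer figure and the class/constant names are illustrative, not from the thread; the arithmetic is plain Java, and the standard Hadoop calls a real driver would use (FileSystem#getContentSummary, Job#setNumReduceTasks) appear only in comments so the snippet stays self-contained:

```java
// Hypothetical sketch of sizing reducers from the input size in the driver.
public class ReducerEstimate {

    // Illustrative assumption: give each reducer roughly 1 GB of input.
    static final long BYTES_PER_REDUCER = 1L << 30;

    /** total input size / bytes per reducer, rounded up, and at least 1. */
    static int computeReducers(long totalInputBytes, long bytesPerReducer) {
        return (int) Math.max(1, (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer);
    }

    public static void main(String[] args) {
        // In a real driver you would obtain the input size from HDFS, e.g.:
        //   long total = FileSystem.get(conf)
        //       .getContentSummary(new Path(inputDir)).getLength();
        // and then apply the estimate:
        //   job.setNumReduceTasks(computeReducers(total, BYTES_PER_REDUCER));
        long total = 500L << 30; // pretend today's input is 500 GB
        System.out.println(computeReducers(total, BYTES_PER_REDUCER)); // prints 500
    }
}
```

Because the count is recomputed from the actual input size on every run, a 500GB day and a 100GB day automatically get different reducer counts, which answers the "how to set it dynamically" question above.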



Re: guessing number of reducers.

Posted by Bejoy KS <be...@gmail.com>.
Hi Manoj

If you intend to calculate the number of reducers based on the input size, then in your driver class you should get the size of the input dir in hdfs and  say you intended to give n bytes to a reducer then the number of reducers can be computed as
Total input size/ bytes per reducer.

You can round this value and use it to set the number of reducers in conf programatically.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Manoj Babu <ma...@gmail.com>
Date: Wed, 21 Nov 2012 23:28:00 
To: <us...@hadoop.apache.org>
Cc: bejoy.hadoop@gmail.com<be...@gmail.com>
Subject: Re: guessing number of reducers.

Hi,

How to set no of reducers in job conf dynamically?
For example some days i am getting 500GB of data on heavy traffic and some
days 100GB only.

Thanks in advance!

Cheers!
Manoj.



On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy <An...@mpac.ca>wrote:

>  Bejoy,
>
>
>
> I’ve read somethere about keeping number of mapred.reduce.tasks below the
> reduce task capcity. Here is what I just tested:
>
>
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
>
>
> 1 Reducer   – 22mins
>
> 4 Reducers – 11.5mins
>
> 8 Reducers – 5mins
>
> 10 Reducers – 7mins
>
> 12 Reducers – 6:5mins
>
> 16 Reducers – 5.5mins
>
>
>
> 8 Reducers have won the race. But Reducers at the max capacity was very
> clos. J
>
>
>
> AK47
>
>
>
>
>
> *From:* Bejoy KS [mailto:bejoy.hadoop@gmail.com]
> *Sent:* Wednesday, November 21, 2012 11:51 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: guessing number of reducers.
>
>
>
> Hi Sasha
>
> In general the number of reduce tasks is chosen mainly based on the data
> volume to reduce phase. In tools like hive and pig by default for every 1GB
> of map output there will be a reducer. So if you have 100 gigs of map
> output then 100 reducers.
> If your tasks are more CPU intensive then you need lesser volume of data
> per reducer for better performance results.
>
> In general it is better to have the number of reduce tasks slightly less
> than the number of available reduce slots in the cluster.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>  ------------------------------
>
> *From: *jamal sasha <ja...@gmail.com>
>
> *Date: *Wed, 21 Nov 2012 11:38:38 -0500
>
> *To: *user@hadoop.apache.org<us...@hadoop.apache.org>
>
> *ReplyTo: *user@hadoop.apache.org
>
> *Subject: *guessing number of reducers.
>
>
>
> By default the number of reducers is set to 1..
> Is there a good way to guess optimal number of reducers....
> Or let's say i have tbs worth of data... mappers are of order 5000 or so...
> But ultimately i am calculating , let's say, some average of whole data...
> say average transaction occurring...
> Now the output will be just one line in one "part"... rest of them will be
> empty.So i am guessing i need loads of reducers but then most of them will
> be empty but at the same time one reducer won't suffice..
> What's the best way to solve this..
> How to guess optimal number of reducers..
> Thanks
>  NOTICE: This e-mail message and any attachments are confidential, subject
> to copyright and may be privileged. Any unauthorized use, copying or
> disclosure is prohibited. If you are not the intended recipient, please
> delete and contact the sender immediately. Please consider the environment
> before printing this e-mail. AVIS : le présent courriel et toute pièce
> jointe qui l'accompagne sont confidentiels, protégés par le droit d'auteur
> et peuvent être couverts par le secret professionnel. Toute utilisation,
> copie ou divulgation non autorisée est interdite. Si vous n'êtes pas le
> destinataire prévu de ce courriel, supprimez-le et contactez immédiatement
> l'expéditeur. Veuillez penser à l'environnement avant d'imprimer le présent
> courriel
>


Re: guessing number of reducers.

Posted by Bejoy KS <be...@gmail.com>.
Hi Manoj

If you intend to calculate the number of reducers based on the input size, then in your driver class you should get the size of the input dir in hdfs and  say you intended to give n bytes to a reducer then the number of reducers can be computed as
Total input size/ bytes per reducer.

You can round this value and use it to set the number of reducers in conf programatically.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Manoj Babu <ma...@gmail.com>
Date: Wed, 21 Nov 2012 23:28:00 
To: <us...@hadoop.apache.org>
Cc: bejoy.hadoop@gmail.com<be...@gmail.com>
Subject: Re: guessing number of reducers.

Hi,

How to set no of reducers in job conf dynamically?
For example some days i am getting 500GB of data on heavy traffic and some
days 100GB only.

Thanks in advance!

Cheers!
Manoj.



On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy <An...@mpac.ca>wrote:

>  Bejoy,
>
>
>
> I’ve read somethere about keeping number of mapred.reduce.tasks below the
> reduce task capcity. Here is what I just tested:
>
>
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
>
>
> 1 Reducer   – 22mins
>
> 4 Reducers – 11.5mins
>
> 8 Reducers – 5mins
>
> 10 Reducers – 7mins
>
> 12 Reducers – 6:5mins
>
> 16 Reducers – 5.5mins
>
>
>
> 8 Reducers have won the race. But Reducers at the max capacity was very
> clos. J
>
>
>
> AK47
>
>
>
>
>
> *From:* Bejoy KS [mailto:bejoy.hadoop@gmail.com]
> *Sent:* Wednesday, November 21, 2012 11:51 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: guessing number of reducers.
>
>
>
> Hi Sasha
>
> In general the number of reduce tasks is chosen mainly based on the data
> volume to reduce phase. In tools like hive and pig by default for every 1GB
> of map output there will be a reducer. So if you have 100 gigs of map
> output then 100 reducers.
> If your tasks are more CPU intensive then you need lesser volume of data
> per reducer for better performance results.
>
> In general it is better to have the number of reduce tasks slightly less
> than the number of available reduce slots in the cluster.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>  ------------------------------
>
> *From: *jamal sasha <ja...@gmail.com>
>
> *Date: *Wed, 21 Nov 2012 11:38:38 -0500
>
> *To: *user@hadoop.apache.org<us...@hadoop.apache.org>
>
> *ReplyTo: *user@hadoop.apache.org
>
> *Subject: *guessing number of reducers.
>
>
>
> By default the number of reducers is set to 1..
> Is there a good way to guess optimal number of reducers....
> Or let's say i have tbs worth of data... mappers are of order 5000 or so...
> But ultimately i am calculating , let's say, some average of whole data...
> say average transaction occurring...
> Now the output will be just one line in one "part"... rest of them will be
> empty.So i am guessing i need loads of reducers but then most of them will
> be empty but at the same time one reducer won't suffice..
> What's the best way to solve this..
> How to guess optimal number of reducers..
> Thanks
>  NOTICE: This e-mail message and any attachments are confidential, subject
> to copyright and may be privileged. Any unauthorized use, copying or
> disclosure is prohibited. If you are not the intended recipient, please
> delete and contact the sender immediately. Please consider the environment
> before printing this e-mail. AVIS : le présent courriel et toute pièce
> jointe qui l'accompagne sont confidentiels, protégés par le droit d'auteur
> et peuvent être couverts par le secret professionnel. Toute utilisation,
> copie ou divulgation non autorisée est interdite. Si vous n'êtes pas le
> destinataire prévu de ce courriel, supprimez-le et contactez immédiatement
> l'expéditeur. Veuillez penser à l'environnement avant d'imprimer le présent
> courriel
>


Re: guessing number of reducers.

Posted by Bejoy KS <be...@gmail.com>.
Hi Manoj

If you intend to calculate the number of reducers based on the input size, then in your driver class you should get the size of the input dir in hdfs and  say you intended to give n bytes to a reducer then the number of reducers can be computed as
Total input size/ bytes per reducer.

You can round this value and use it to set the number of reducers in conf programatically.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Manoj Babu <ma...@gmail.com>
Date: Wed, 21 Nov 2012 23:28:00 
To: <us...@hadoop.apache.org>
Cc: bejoy.hadoop@gmail.com<be...@gmail.com>
Subject: Re: guessing number of reducers.

Hi,

How to set no of reducers in job conf dynamically?
For example some days i am getting 500GB of data on heavy traffic and some
days 100GB only.

Thanks in advance!

Cheers!
Manoj.



On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy <An...@mpac.ca>wrote:

>  Bejoy,
>
>
>
> I’ve read somewhere about keeping the number of mapred.reduce.tasks below the
> reduce task capacity. Here is what I just tested:
>
>
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
>
>
> 1 Reducer   – 22mins
>
> 4 Reducers – 11.5mins
>
> 8 Reducers – 5mins
>
> 10 Reducers – 7mins
>
> 12 Reducers – 6.5mins
>
> 16 Reducers – 5.5mins
>
>
>
> 8 Reducers have won the race. But reducers at the max capacity were very
> close. :)
>
>
>
> AK47
>
>
>
>
>
> *From:* Bejoy KS [mailto:bejoy.hadoop@gmail.com]
> *Sent:* Wednesday, November 21, 2012 11:51 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: guessing number of reducers.
>
>
>
> Hi Sasha
>
> In general the number of reduce tasks is chosen mainly based on the data
> volume to reduce phase. In tools like hive and pig by default for every 1GB
> of map output there will be a reducer. So if you have 100 gigs of map
> output then 100 reducers.
> If your tasks are more CPU intensive then you need lesser volume of data
> per reducer for better performance results.
>
> In general it is better to have the number of reduce tasks slightly less
> than the number of available reduce slots in the cluster.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>  ------------------------------
>
> *From: *jamal sasha <ja...@gmail.com>
>
> *Date: *Wed, 21 Nov 2012 11:38:38 -0500
>
> *To: *user@hadoop.apache.org<us...@hadoop.apache.org>
>
> *ReplyTo: *user@hadoop.apache.org
>
> *Subject: *guessing number of reducers.
>
>
>
> By default the number of reducers is set to 1..
> Is there a good way to guess optimal number of reducers....
> Or let's say i have tbs worth of data... mappers are of order 5000 or so...
> But ultimately i am calculating , let's say, some average of whole data...
> say average transaction occurring...
> Now the output will be just one line in one "part"... rest of them will be
> empty.So i am guessing i need loads of reducers but then most of them will
> be empty but at the same time one reducer won't suffice..
> What's the best way to solve this..
> How to guess optimal number of reducers..
> Thanks


Re: guessing number of reducers.

Posted by Manoj Babu <ma...@gmail.com>.
Hi,

How to set no of reducers in job conf dynamically?
For example some days i am getting 500GB of data on heavy traffic and some
days 100GB only.

Thanks in advance!

Cheers!
Manoj.



On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy <An...@mpac.ca>wrote:

>  Bejoy,
>
>
>
> I’ve read somewhere about keeping the number of mapred.reduce.tasks below the
> reduce task capacity. Here is what I just tested:
>
>
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
>
>
> 1 Reducer   – 22mins
>
> 4 Reducers – 11.5mins
>
> 8 Reducers – 5mins
>
> 10 Reducers – 7mins
>
> 12 Reducers – 6.5mins
>
> 16 Reducers – 5.5mins
>
>
>
> 8 Reducers have won the race. But reducers at the max capacity were very
> close. :)
>
>
>
> AK47
>
>
>
>
>
> *From:* Bejoy KS [mailto:bejoy.hadoop@gmail.com]
> *Sent:* Wednesday, November 21, 2012 11:51 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: guessing number of reducers.
>
>
>
> Hi Sasha
>
> In general the number of reduce tasks is chosen mainly based on the data
> volume to reduce phase. In tools like hive and pig by default for every 1GB
> of map output there will be a reducer. So if you have 100 gigs of map
> output then 100 reducers.
> If your tasks are more CPU intensive then you need lesser volume of data
> per reducer for better performance results.
>
> In general it is better to have the number of reduce tasks slightly less
> than the number of available reduce slots in the cluster.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>  ------------------------------
>
> *From: *jamal sasha <ja...@gmail.com>
>
> *Date: *Wed, 21 Nov 2012 11:38:38 -0500
>
> *To: *user@hadoop.apache.org<us...@hadoop.apache.org>
>
> *ReplyTo: *user@hadoop.apache.org
>
> *Subject: *guessing number of reducers.
>
>
>
> By default the number of reducers is set to 1..
> Is there a good way to guess optimal number of reducers....
> Or let's say i have tbs worth of data... mappers are of order 5000 or so...
> But ultimately i am calculating , let's say, some average of whole data...
> say average transaction occurring...
> Now the output will be just one line in one "part"... rest of them will be
> empty.So i am guessing i need loads of reducers but then most of them will
> be empty but at the same time one reducer won't suffice..
> What's the best way to solve this..
> How to guess optimal number of reducers..
> Thanks

Re: guessing number of reducers.

Posted by jamal sasha <ja...@gmail.com>.
Thanks for the input, guys. This helps a lot
:)

On Wednesday, November 21, 2012, Bejoy KS <be...@gmail.com> wrote:
> Hi Andy
>
> It is usually so because if you have more reduce tasks than the reduce
slots in your cluster then a few of the reduce tasks will be in queue
waiting for its turn. So it is better to keep the num of reduce tasks
slightly less than the reduce task capacity so that all reduce tasks run at
once in parallel.
>
> But in some cases each reducer can process only certain volume of data
due to some constraints, like data beyond a certain limit may lead to OOMs.
In such cases you may need to configure the number of reducers totally
based on your data and not based on slots.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ________________________________
> From: "Kartashov, Andy" <An...@mpac.ca>
> Date: Wed, 21 Nov 2012 17:49:50 +0000
> To: user@hadoop.apache.org<us...@hadoop.apache.org>; bejoy.hadoop@gmail.com
<be...@gmail.com>
> Subject: RE: guessing number of reducers.
>
> Bejoy,
>
>
>
> I’ve read somewhere about keeping the number of mapred.reduce.tasks below the
reduce task capacity. Here is what I just tested:
>
>
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
>
>
> 1 Reducer   – 22mins
>
> 4 Reducers – 11.5mins
>
> 8 Reducers – 5mins
>
> 10 Reducers – 7mins
>
> 12 Reducers – 6.5mins
>
> 16 Reducers – 5.5mins
>
>
>
> 8 Reducers have won the race. But reducers at the max capacity were very
close. :)
>
>
>
> AK47
>
>
>
>
>
> From: Bejoy KS [mailto:bejoy.hadoop@gmail.com]
> Sent: Wednesday, November 21, 2012 11:51 AM
> To: user@hadoop.apache.org
> Subject: Re: guessing number of reducers.
>
>
>
> Hi Sasha
>
> In general the number of reduce tasks is chosen mainly based on the data
volume to reduce phase. In tools like hive and pig by default for every 1GB
of map output there will be a reducer. So if you have 100 gigs of map
output then 100 reducers.
> If your tasks are more CPU intensive then you need lesser volume of data
per reducer for better performance results.
>
> In general it is better to have the number of reduce tasks slightly less
than the number of available reduce slots in the cluster.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> ________________________________
>
> From: jamal sasha <ja...@gmail.com>
>
> Date: Wed, 21 Nov 2012 11:38:38 -0500
>
> To: user@hadoop.apache.org<us...@hadoop.apache.org>


Re: guessing number of reducers.

Posted by Bejoy KS <be...@gmail.com>.
Hi Andy

It is usually so because if you have more reduce tasks than the reduce slots in your cluster, a few of the reduce tasks will sit in a queue waiting for their turn. So it is better to keep the number of reduce tasks slightly less than the reduce task capacity, so that all reduce tasks run at once in parallel.

But in some cases each reducer can process only a certain volume of data due to constraints; for example, data beyond a certain limit may lead to OOMs. In such cases you may need to configure the number of reducers based entirely on your data and not on slots.
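A minimal sketch of how those two rules combine (Python for illustration; the slot count and per-reducer limit are made-up numbers): the memory constraint sets a floor on the reducer count, and otherwise you stay slightly under slot capacity so everything runs in one wave.

```python
import math

def choose_reducers(map_output_bytes, max_bytes_per_reducer, reduce_slots):
    # Floor imposed by memory: more reducers means less data per reducer.
    min_for_memory = max(1, math.ceil(map_output_bytes / max_bytes_per_reducer))
    # Preferred for parallelism: slightly under capacity, a single wave.
    one_wave = max(1, reduce_slots - 1)
    return max(min_for_memory, one_wave)

# 100 GB of map output, 1 GB cap per reducer, 16 reduce slots -> 100 reducers.
print(choose_reducers(100 * 1024**3, 1024**3, 16))  # -> 100
# 5 GB of map output with the same cap -> one wave of 15 reducers.
print(choose_reducers(5 * 1024**3, 1024**3, 16))   # -> 15
```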


Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: "Kartashov, Andy" <An...@mpac.ca>
Date: Wed, 21 Nov 2012 17:49:50 
To: user@hadoop.apache.org<us...@hadoop.apache.org>; bejoy.hadoop@gmail.com<be...@gmail.com>
Subject: RE: guessing number of reducers.

Bejoy,

I've read somewhere about keeping the number of mapred.reduce.tasks below the reduce task capacity. Here is what I just tested:

Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:

1 Reducer   - 22mins
4 Reducers - 11.5mins
8 Reducers - 5mins
10 Reducers - 7mins
12 Reducers - 6.5mins
16 Reducers - 5.5mins

8 Reducers have won the race. But reducers at the max capacity were very close. :)

AK47


From: Bejoy KS [mailto:bejoy.hadoop@gmail.com]
Sent: Wednesday, November 21, 2012 11:51 AM
To: user@hadoop.apache.org
Subject: Re: guessing number of reducers.

Hi Sasha

In general the number of reduce tasks is chosen mainly based on the data volume to reduce phase. In tools like hive and pig by default for every 1GB of map output there will be a reducer. So if you have 100 gigs of map output then 100 reducers.
If your tasks are more CPU intensive then you need lesser volume of data per reducer for better performance results.

In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: jamal sasha <ja...@gmail.com>
Date: Wed, 21 Nov 2012 11:38:38 -0500
To: user@hadoop.apache.org<us...@hadoop.apache.org>
ReplyTo: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1..
Is there a good way to guess optimal number of reducers....
Or let's say i have tbs worth of data... mappers are of order 5000 or so...
But ultimately i am calculating , let's say, some average of whole data... say average transaction occurring...
Now the output will be just one line in one "part"... rest of them will be empty.So i am guessing i need loads of reducers but then most of them will be empty but at the same time one reducer won't suffice..
What's the best way to solve this..
How to guess optimal number of reducers..
Thanks


Re: guessing number of reducers.

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Jamal,

   I use a different approach, based on the number of cores. If you have, say, a
4-core machine then you can have (0.75 * no. of cores) MR slots.
For example, if you have 4 physical cores (8 virtual cores with
hyper-threading) then you can have 0.75*8 = 6 MR slots. You can then set
3M+3R or 4M+2R and so on, as per your requirement.
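That rule of thumb is easy to sketch (Python for illustration; the 0.75 factor and core counts come from the example above):

```python
def mr_slots(virtual_cores, factor=0.75):
    # ~75% of the (virtual) cores become combined map+reduce slots.
    return int(virtual_cores * factor)

slots = mr_slots(8)  # 8 virtual cores -> 6 slots
# Split between map and reduce as the workload demands,
# e.g. 3 map + 3 reduce, or 4 map + 2 reduce.
print(slots)  # -> 6
```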

Regards,
    Mohammad Tariq



On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy <An...@mpac.ca>wrote:

>  Bejoy,
>
>
>
> I’ve read somewhere about keeping the number of mapred.reduce.tasks below the
> reduce task capacity. Here is what I just tested:
>
>
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
>
>
> 1 Reducer   – 22mins
>
> 4 Reducers – 11.5mins
>
> 8 Reducers – 5mins
>
> 10 Reducers – 7mins
>
> 12 Reducers – 6.5mins
>
> 16 Reducers – 5.5mins
>
>
>
> 8 Reducers have won the race. But reducers at the max capacity were very
> close. :)
>
>
>
> AK47
>
>
>
>
>
> *From:* Bejoy KS [mailto:bejoy.hadoop@gmail.com]
> *Sent:* Wednesday, November 21, 2012 11:51 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: guessing number of reducers.
>
>
>
> Hi Sasha
>
> In general the number of reduce tasks is chosen mainly based on the data
> volume to reduce phase. In tools like hive and pig by default for every 1GB
> of map output there will be a reducer. So if you have 100 gigs of map
> output then 100 reducers.
> If your tasks are more CPU intensive then you need lesser volume of data
> per reducer for better performance results.
>
> In general it is better to have the number of reduce tasks slightly less
> than the number of available reduce slots in the cluster.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>  ------------------------------
>
> *From: *jamal sasha <ja...@gmail.com>
>
> *Date: *Wed, 21 Nov 2012 11:38:38 -0500
>
> *To: *user@hadoop.apache.org<us...@hadoop.apache.org>
>
> *ReplyTo: *user@hadoop.apache.org
>
> *Subject: *guessing number of reducers.
>
>
>
> By default the number of reducers is set to 1..
> Is there a good way to guess optimal number of reducers....
> Or let's say i have tbs worth of data... mappers are of order 5000 or so...
> But ultimately i am calculating , let's say, some average of whole data...
> say average transaction occurring...
> Now the output will be just one line in one "part"... rest of them will be
> empty.So i am guessing i need loads of reducers but then most of them will
> be empty but at the same time one reducer won't suffice..
> What's the best way to solve this..
> How to guess optimal number of reducers..
> Thanks

Re: guessing number of reducers.

Posted by Bejoy KS <be...@gmail.com>.
Hi Andy

It is usually so because if you have more reduce tasks than the reduce slots in your cluster then a few of the reduce tasks will be in queue waiting for its turn. So it is better to keep the num of reduce tasks slightly less than the reduce task capacity so that all reduce tasks run at once in parallel.

But in some cases each reducer can process only certain volume of data due to some constraints, like data beyond a certain limit may lead to OOMs. In such cases you may need to configure the number of reducers totally based on your data and not based on slots.


Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: "Kartashov, Andy" <An...@mpac.ca>
Date: Wed, 21 Nov 2012 17:49:50 
To: user@hadoop.apache.org<us...@hadoop.apache.org>; bejoy.hadoop@gmail.com<be...@gmail.com>
Subject: RE: guessing number of reducers.

Bejoy,

I've read somethere about keeping number of mapred.reduce.tasks below the reduce task capcity. Here is what I just tested:

Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:

1 Reducer   - 22mins
4 Reducers - 11.5mins
8 Reducers - 5mins
10 Reducers - 7mins
12 Reducers - 6:5mins
16 Reducers - 5.5mins

8 Reducers have won the race. But Reducers at the max capacity was very clos. :)

AK47


From: Bejoy KS [mailto:bejoy.hadoop@gmail.com]
Sent: Wednesday, November 21, 2012 11:51 AM
To: user@hadoop.apache.org
Subject: Re: guessing number of reducers.

Hi Sasha

In general the number of reduce tasks is chosen mainly based on the data volume to reduce phase. In tools like hive and pig by default for every 1GB of map output there will be a reducer. So if you have 100 gigs of map output then 100 reducers.
If your tasks are more CPU intensive then you need lesser volume of data per reducer for better performance results.

In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: jamal sasha <ja...@gmail.com>
Date: Wed, 21 Nov 2012 11:38:38 -0500
To: user@hadoop.apache.org<us...@hadoop.apache.org>
ReplyTo: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1..
Is there a good way to guess optimal number of reducers....
Or let's say i have tbs worth of data... mappers are of order 5000 or so...
But ultimately i am calculating , let's say, some average of whole data... say average transaction occurring...
Now the output will be just one line in one "part"... rest of them will be empty.So i am guessing i need loads of reducers but then most of them will be empty but at the same time one reducer won't suffice..
What's the best way to solve this..
How to guess optimal number of reducers..
Thanks
NOTICE: This e-mail message and any attachments are confidential, subject to copyright and may be privileged. Any unauthorized use, copying or disclosure is prohibited. If you are not the intended recipient, please delete and contact the sender immediately. Please consider the environment before printing this e-mail. AVIS : le pr?sent courriel et toute pi?ce jointe qui l'accompagne sont confidentiels, prot?g?s par le droit d'auteur et peuvent ?tre couverts par le secret professionnel. Toute utilisation, copie ou divulgation non autoris?e est interdite. Si vous n'?tes pas le destinataire pr?vu de ce courriel, supprimez-le et contactez imm?diatement l'exp?diteur. Veuillez penser ? l'environnement avant d'imprimer le pr?sent courriel


Re: guessing number of reducers.

Posted by Bejoy KS <be...@gmail.com>.
Hi Andy

It is usually so because if you have more reduce tasks than the reduce slots in your cluster then a few of the reduce tasks will be in queue waiting for its turn. So it is better to keep the num of reduce tasks slightly less than the reduce task capacity so that all reduce tasks run at once in parallel.

But in some cases each reducer can process only certain volume of data due to some constraints, like data beyond a certain limit may lead to OOMs. In such cases you may need to configure the number of reducers totally based on your data and not based on slots.


Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: "Kartashov, Andy" <An...@mpac.ca>
Date: Wed, 21 Nov 2012 17:49:50 
To: user@hadoop.apache.org<us...@hadoop.apache.org>; bejoy.hadoop@gmail.com<be...@gmail.com>
Subject: RE: guessing number of reducers.

Bejoy,

I've read somethere about keeping number of mapred.reduce.tasks below the reduce task capcity. Here is what I just tested:

Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:

1 Reducer   - 22mins
4 Reducers - 11.5mins
8 Reducers - 5mins
10 Reducers - 7mins
12 Reducers - 6:5mins
16 Reducers - 5.5mins

8 Reducers have won the race. But Reducers at the max capacity was very clos. :)

AK47


From: Bejoy KS [mailto:bejoy.hadoop@gmail.com]
Sent: Wednesday, November 21, 2012 11:51 AM
To: user@hadoop.apache.org
Subject: Re: guessing number of reducers.

Hi Sasha

In general the number of reduce tasks is chosen mainly based on the data volume to reduce phase. In tools like hive and pig by default for every 1GB of map output there will be a reducer. So if you have 100 gigs of map output then 100 reducers.
If your tasks are more CPU intensive then you need lesser volume of data per reducer for better performance results.

In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: jamal sasha <ja...@gmail.com>
Date: Wed, 21 Nov 2012 11:38:38 -0500
To: user@hadoop.apache.org<us...@hadoop.apache.org>
ReplyTo: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1..
Is there a good way to guess optimal number of reducers....
Or let's say i have tbs worth of data... mappers are of order 5000 or so...
But ultimately i am calculating , let's say, some average of whole data... say average transaction occurring...
Now the output will be just one line in one "part"... rest of them will be empty.So i am guessing i need loads of reducers but then most of them will be empty but at the same time one reducer won't suffice..
What's the best way to solve this..
How to guess optimal number of reducers..
Thanks


Re: guessing number of reducers.

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Jamal,

   I use a different approach based on the number of cores. As a rule of
thumb, a machine can host about (0.75 * number of cores) MR slots.
For example, if you have 4 physical cores presenting 8 virtual cores, you can
have 0.75*8 = 6 MR slots. You can then split them as 3M+3R or 4M+2R and so on
as per your requirement.

Regards,
    Mohammad Tariq
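
That rule of thumb is just arithmetic; here is a sketch of it. The 0.75 utilization factor and the map/reduce split ratio are assumptions from the rule above, not fixed Hadoop settings.

```python
def mr_slots(virtual_cores, utilization=0.75):
    """Rule-of-thumb total MR slots per node: ~75% of the cores."""
    return int(virtual_cores * utilization)

def split_slots(total, map_share=0.5):
    """Split total slots into (map, reduce) counts."""
    maps = max(1, round(total * map_share))
    return maps, max(1, total - maps)

total = mr_slots(8)                 # 8 virtual cores -> 6 slots
print(total, split_slots(total))    # even split: 3 map + 3 reduce
print(split_slots(total, 0.66))     # map-heavy split: 4 map + 2 reduce
```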



On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy <An...@mpac.ca>wrote:

>  Bejoy,
>
>
>
> I’ve read somewhere about keeping the number of mapred.reduce.tasks below
> the reduce task capacity. Here is what I just tested:
>
>
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
>
>
> 1 Reducer   – 22mins
>
> 4 Reducers – 11.5mins
>
> 8 Reducers – 5mins
>
> 10 Reducers – 7mins
>
> 12 Reducers – 6.5mins
>
> 16 Reducers – 5.5mins
>
>
>
> 8 Reducers have won the race. But reducers at the max capacity were very
> close. :)
>
>
>
> AK47
>
>
>
>
>
> *From:* Bejoy KS [mailto:bejoy.hadoop@gmail.com]
> *Sent:* Wednesday, November 21, 2012 11:51 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: guessing number of reducers.
>
>
>
> Hi Sasha
>
> In general the number of reduce tasks is chosen mainly based on the data
> volume to reduce phase. In tools like hive and pig by default for every 1GB
> of map output there will be a reducer. So if you have 100 gigs of map
> output then 100 reducers.
> If your tasks are more CPU intensive then you need lesser volume of data
> per reducer for better performance results.
>
> In general it is better to have the number of reduce tasks slightly less
> than the number of available reduce slots in the cluster.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>  ------------------------------
>
> *From: *jamal sasha <ja...@gmail.com>
>
> *Date: *Wed, 21 Nov 2012 11:38:38 -0500
>
> *To: *user@hadoop.apache.org<us...@hadoop.apache.org>
>
> *ReplyTo: *user@hadoop.apache.org
>
> *Subject: *guessing number of reducers.
>
>
>
> By default the number of reducers is set to 1..
> Is there a good way to guess optimal number of reducers....
> Or let's say i have tbs worth of data... mappers are of order 5000 or so...
> But ultimately i am calculating , let's say, some average of whole data...
> say average transaction occurring...
> Now the output will be just one line in one "part"... rest of them will be
> empty.So i am guessing i need loads of reducers but then most of them will
> be empty but at the same time one reducer won't suffice..
> What's the best way to solve this..
> How to guess optimal number of reducers..
> Thanks
>  NOTICE: This e-mail message and any attachments are confidential, subject
> to copyright and may be privileged. Any unauthorized use, copying or
> disclosure is prohibited. If you are not the intended recipient, please
> delete and contact the sender immediately. Please consider the environment
> before printing this e-mail. AVIS : le présent courriel et toute pièce
> jointe qui l'accompagne sont confidentiels, protégés par le droit d'auteur
> et peuvent être couverts par le secret professionnel. Toute utilisation,
> copie ou divulgation non autorisée est interdite. Si vous n'êtes pas le
> destinataire prévu de ce courriel, supprimez-le et contactez immédiatement
> l'expéditeur. Veuillez penser à l'environnement avant d'imprimer le présent
> courriel
>

Re: guessing number of reducers.

Posted by Manoj Babu <ma...@gmail.com>.
Hi,

How to set the number of reducers in the job conf dynamically?
For example, on some days I get 500GB of data on heavy traffic, and on other
days only 100GB.

Thanks in advance!

Cheers!
Manoj.
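
One common answer is to compute the count in the driver from the day's input size and pass it to job.setNumReduceTasks(n), or set it on the command line via -D mapred.reduce.tasks=n when the driver uses ToolRunner. Below is only a sketch of the size-based arithmetic; the 5GB-per-reducer target and the cap of 100 are made-up numbers for illustration.

```python
def reducers_for_input(input_bytes, target_bytes_per_reducer=1 << 30, cap=100):
    """Scale the reducer count with the input size.

    In a Java driver you would pass the result to
    job.setNumReduceTasks(n); the target size and cap here are
    hypothetical tuning knobs, not Hadoop defaults.
    """
    # One reducer per target-sized chunk of input, rounded up, at least 1.
    n = max(1, -(-input_bytes // target_bytes_per_reducer))
    # Never exceed a fixed ceiling, e.g. the cluster's reduce capacity.
    return min(n, cap)

print(reducers_for_input(500 << 30, 5 << 30))  # heavy day: capped at 100
print(reducers_for_input(100 << 30, 5 << 30))  # light day: 20 reducers
```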



On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy <An...@mpac.ca>wrote:

>  Bejoy,
>
>
>
> I’ve read somewhere about keeping the number of mapred.reduce.tasks below
> the reduce task capacity. Here is what I just tested:
>
>
>
> Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:
>
>
>
> 1 Reducer   – 22mins
>
> 4 Reducers – 11.5mins
>
> 8 Reducers – 5mins
>
> 10 Reducers – 7mins
>
> 12 Reducers – 6.5mins
>
> 16 Reducers – 5.5mins
>
>
>
> 8 Reducers have won the race. But reducers at the max capacity were very
> close. :)
>
>
>
> AK47
>
>
>
>
>
> *From:* Bejoy KS [mailto:bejoy.hadoop@gmail.com]
> *Sent:* Wednesday, November 21, 2012 11:51 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: guessing number of reducers.
>
>
>
> Hi Sasha
>
> In general the number of reduce tasks is chosen mainly based on the data
> volume to reduce phase. In tools like hive and pig by default for every 1GB
> of map output there will be a reducer. So if you have 100 gigs of map
> output then 100 reducers.
> If your tasks are more CPU intensive then you need lesser volume of data
> per reducer for better performance results.
>
> In general it is better to have the number of reduce tasks slightly less
> than the number of available reduce slots in the cluster.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>  ------------------------------
>
> *From: *jamal sasha <ja...@gmail.com>
>
> *Date: *Wed, 21 Nov 2012 11:38:38 -0500
>
> *To: *user@hadoop.apache.org<us...@hadoop.apache.org>
>
> *ReplyTo: *user@hadoop.apache.org
>
> *Subject: *guessing number of reducers.
>
>
>
> By default the number of reducers is set to 1..
> Is there a good way to guess optimal number of reducers....
> Or let's say i have tbs worth of data... mappers are of order 5000 or so...
> But ultimately i am calculating , let's say, some average of whole data...
> say average transaction occurring...
> Now the output will be just one line in one "part"... rest of them will be
> empty.So i am guessing i need loads of reducers but then most of them will
> be empty but at the same time one reducer won't suffice..
> What's the best way to solve this..
> How to guess optimal number of reducers..
> Thanks
>  NOTICE: This e-mail message and any attachments are confidential, subject
> to copyright and may be privileged. Any unauthorized use, copying or
> disclosure is prohibited. If you are not the intended recipient, please
> delete and contact the sender immediately. Please consider the environment
> before printing this e-mail. AVIS : le présent courriel et toute pièce
> jointe qui l'accompagne sont confidentiels, protégés par le droit d'auteur
> et peuvent être couverts par le secret professionnel. Toute utilisation,
> copie ou divulgation non autorisée est interdite. Si vous n'êtes pas le
> destinataire prévu de ce courriel, supprimez-le et contactez immédiatement
> l'expéditeur. Veuillez penser à l'environnement avant d'imprimer le présent
> courriel
>


RE: guessing number of reducers.

Posted by "Kartashov, Andy" <An...@mpac.ca>.
Bejoy,

I've read somewhere about keeping the number of mapred.reduce.tasks below the reduce task capacity. Here is what I just tested:

Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:

1 Reducer   - 22mins
4 Reducers - 11.5mins
8 Reducers - 5mins
10 Reducers - 7mins
12 Reducers - 6.5mins
16 Reducers - 5.5mins

8 Reducers have won the race. But reducers at the max capacity were very close. :)

AK47


From: Bejoy KS [mailto:bejoy.hadoop@gmail.com]
Sent: Wednesday, November 21, 2012 11:51 AM
To: user@hadoop.apache.org
Subject: Re: guessing number of reducers.

Hi Sasha

In general the number of reduce tasks is chosen mainly based on the data volume to reduce phase. In tools like hive and pig by default for every 1GB of map output there will be a reducer. So if you have 100 gigs of map output then 100 reducers.
If your tasks are more CPU intensive then you need lesser volume of data per reducer for better performance results.

In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: jamal sasha <ja...@gmail.com>
Date: Wed, 21 Nov 2012 11:38:38 -0500
To: user@hadoop.apache.org<us...@hadoop.apache.org>
ReplyTo: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1..
Is there a good way to guess optimal number of reducers....
Or let's say i have tbs worth of data... mappers are of order 5000 or so...
But ultimately i am calculating , let's say, some average of whole data... say average transaction occurring...
Now the output will be just one line in one "part"... rest of them will be empty.So i am guessing i need loads of reducers but then most of them will be empty but at the same time one reducer won't suffice..
What's the best way to solve this..
How to guess optimal number of reducers..
Thanks
NOTICE: This e-mail message and any attachments are confidential, subject to copyright and may be privileged. Any unauthorized use, copying or disclosure is prohibited. If you are not the intended recipient, please delete and contact the sender immediately. Please consider the environment before printing this e-mail. AVIS : le pr?sent courriel et toute pi?ce jointe qui l'accompagne sont confidentiels, prot?g?s par le droit d'auteur et peuvent ?tre couverts par le secret professionnel. Toute utilisation, copie ou divulgation non autoris?e est interdite. Si vous n'?tes pas le destinataire pr?vu de ce courriel, supprimez-le et contactez imm?diatement l'exp?diteur. Veuillez penser ? l'environnement avant d'imprimer le pr?sent courriel

RE: guessing number of reducers.

Posted by "Kartashov, Andy" <An...@mpac.ca>.
Bejoy,

I've read somethere about keeping number of mapred.reduce.tasks below the reduce task capcity. Here is what I just tested:

Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:

1 Reducer   - 22mins
4 Reducers - 11.5mins
8 Reducers - 5mins
10 Reducers - 7mins
12 Reducers - 6:5mins
16 Reducers - 5.5mins

8 Reducers have won the race. But Reducers at the max capacity was very clos. :)

AK47


From: Bejoy KS [mailto:bejoy.hadoop@gmail.com]
Sent: Wednesday, November 21, 2012 11:51 AM
To: user@hadoop.apache.org
Subject: Re: guessing number of reducers.

Hi Sasha

In general the number of reduce tasks is chosen mainly based on the data volume to reduce phase. In tools like hive and pig by default for every 1GB of map output there will be a reducer. So if you have 100 gigs of map output then 100 reducers.
If your tasks are more CPU intensive then you need lesser volume of data per reducer for better performance results.

In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: jamal sasha <ja...@gmail.com>
Date: Wed, 21 Nov 2012 11:38:38 -0500
To: user@hadoop.apache.org<us...@hadoop.apache.org>
ReplyTo: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1..
Is there a good way to guess optimal number of reducers....
Or let's say i have tbs worth of data... mappers are of order 5000 or so...
But ultimately i am calculating , let's say, some average of whole data... say average transaction occurring...
Now the output will be just one line in one "part"... rest of them will be empty.So i am guessing i need loads of reducers but then most of them will be empty but at the same time one reducer won't suffice..
What's the best way to solve this..
How to guess optimal number of reducers..
Thanks
NOTICE: This e-mail message and any attachments are confidential, subject to copyright and may be privileged. Any unauthorized use, copying or disclosure is prohibited. If you are not the intended recipient, please delete and contact the sender immediately. Please consider the environment before printing this e-mail. AVIS : le pr?sent courriel et toute pi?ce jointe qui l'accompagne sont confidentiels, prot?g?s par le droit d'auteur et peuvent ?tre couverts par le secret professionnel. Toute utilisation, copie ou divulgation non autoris?e est interdite. Si vous n'?tes pas le destinataire pr?vu de ce courriel, supprimez-le et contactez imm?diatement l'exp?diteur. Veuillez penser ? l'environnement avant d'imprimer le pr?sent courriel

RE: guessing number of reducers.

Posted by "Kartashov, Andy" <An...@mpac.ca>.
Bejoy,

I've read somethere about keeping number of mapred.reduce.tasks below the reduce task capcity. Here is what I just tested:

Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:

1 Reducer   - 22mins
4 Reducers - 11.5mins
8 Reducers - 5mins
10 Reducers - 7mins
12 Reducers - 6:5mins
16 Reducers - 5.5mins

8 Reducers have won the race. But Reducers at the max capacity was very clos. :)

AK47


From: Bejoy KS [mailto:bejoy.hadoop@gmail.com]
Sent: Wednesday, November 21, 2012 11:51 AM
To: user@hadoop.apache.org
Subject: Re: guessing number of reducers.

Hi Sasha

In general the number of reduce tasks is chosen mainly based on the data volume to reduce phase. In tools like hive and pig by default for every 1GB of map output there will be a reducer. So if you have 100 gigs of map output then 100 reducers.
If your tasks are more CPU intensive then you need lesser volume of data per reducer for better performance results.

In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: jamal sasha <ja...@gmail.com>
Date: Wed, 21 Nov 2012 11:38:38 -0500
To: user@hadoop.apache.org<us...@hadoop.apache.org>
ReplyTo: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1..
Is there a good way to guess optimal number of reducers....
Or let's say i have tbs worth of data... mappers are of order 5000 or so...
But ultimately i am calculating , let's say, some average of whole data... say average transaction occurring...
Now the output will be just one line in one "part"... rest of them will be empty.So i am guessing i need loads of reducers but then most of them will be empty but at the same time one reducer won't suffice..
What's the best way to solve this..
How to guess optimal number of reducers..
Thanks
NOTICE: This e-mail message and any attachments are confidential, subject to copyright and may be privileged. Any unauthorized use, copying or disclosure is prohibited. If you are not the intended recipient, please delete and contact the sender immediately. Please consider the environment before printing this e-mail. AVIS : le pr?sent courriel et toute pi?ce jointe qui l'accompagne sont confidentiels, prot?g?s par le droit d'auteur et peuvent ?tre couverts par le secret professionnel. Toute utilisation, copie ou divulgation non autoris?e est interdite. Si vous n'?tes pas le destinataire pr?vu de ce courriel, supprimez-le et contactez imm?diatement l'exp?diteur. Veuillez penser ? l'environnement avant d'imprimer le pr?sent courriel

RE: guessing number of reducers.

Posted by "Kartashov, Andy" <An...@mpac.ca>.
Bejoy,

I've read somethere about keeping number of mapred.reduce.tasks below the reduce task capcity. Here is what I just tested:

Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity:

1 Reducer   - 22mins
4 Reducers - 11.5mins
8 Reducers - 5mins
10 Reducers - 7mins
12 Reducers - 6:5mins
16 Reducers - 5.5mins

8 Reducers have won the race. But Reducers at the max capacity was very clos. :)

AK47


From: Bejoy KS [mailto:bejoy.hadoop@gmail.com]
Sent: Wednesday, November 21, 2012 11:51 AM
To: user@hadoop.apache.org
Subject: Re: guessing number of reducers.

Hi Sasha

In general the number of reduce tasks is chosen mainly based on the data volume to reduce phase. In tools like hive and pig by default for every 1GB of map output there will be a reducer. So if you have 100 gigs of map output then 100 reducers.
If your tasks are more CPU intensive then you need lesser volume of data per reducer for better performance results.

In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: jamal sasha <ja...@gmail.com>
Date: Wed, 21 Nov 2012 11:38:38 -0500
To: user@hadoop.apache.org<us...@hadoop.apache.org>
ReplyTo: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1.
Is there a good way to guess the optimal number of reducers?
Let's say I have TBs worth of data, and mappers on the order of 5000 or so.
But ultimately I am calculating, let's say, some average of the whole data, say the average transaction occurring.
Now the output will be just one line in one "part" file; the rest of them will be empty. So I am guessing I need loads of reducers, but then most of them will be empty, while at the same time one reducer won't suffice.
What's the best way to solve this?
How to guess the optimal number of reducers?
Thanks

Re: guessing number of reducers.

Posted by Bejoy KS <be...@gmail.com>.
Hi Sasha

In general, the number of reduce tasks is chosen mainly based on the data volume flowing into the reduce phase. In tools like Hive and Pig there is, by default, one reducer for every 1 GB of map output; so 100 GB of map output means 100 reducers.
If your tasks are more CPU-intensive, you need a smaller volume of data per reducer for better performance.

In general it is better to keep the number of reduce tasks slightly below the number of available reduce slots in the cluster.
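The two rules of thumb above (roughly one reducer per GB of map output, capped slightly below the cluster's reduce slot count) can be sketched as a simple heuristic. This is an illustrative helper, not a Hadoop API; the function name, the 1 GB default, and the 0.95 cap factor are assumptions for the sketch:

```python
import math

def estimate_reducers(map_output_gb, reduce_slots,
                      gb_per_reducer=1.0, slot_fraction=0.95):
    """Heuristic reducer count: one reducer per gb_per_reducer of map
    output, capped slightly below the cluster's reduce slot capacity."""
    by_volume = math.ceil(map_output_gb / gb_per_reducer)
    cap = max(1, int(reduce_slots * slot_fraction))
    return max(1, min(by_volume, cap))

# 100 GB of map output on a cluster with 16 reduce slots:
print(estimate_reducers(100, 16))  # prints 15: capped just below capacity
# 5 GB of map output on the same cluster:
print(estimate_reducers(5, 16))    # prints 5: one reducer per GB
```

The chosen number would then be passed to the job, e.g. via mapred.reduce.tasks; the right volume per reducer still depends on how CPU-heavy the reduce logic is, as noted above.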


Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: jamal sasha <ja...@gmail.com>
Date: Wed, 21 Nov 2012 11:38:38 
To: user@hadoop.apache.org<us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1.
Is there a good way to guess the optimal number of reducers?
Let's say I have TBs worth of data, and mappers on the order of 5000 or so.
But ultimately I am calculating, let's say, some average of the whole data, say the average transaction occurring.
Now the output will be just one line in one "part" file; the rest of them will be empty. So I am guessing I need loads of reducers, but then most of them will be empty, while at the same time one reducer won't suffice.
What's the best way to solve this?
How to guess the optimal number of reducers?
Thanks


RE: guessing number of reducers.

Posted by "Kartashov, Andy" <An...@mpac.ca>.
Jamal,

This is what I am using...

After you start your job, visit the JobTracker's web UI at <ip-address>:50030
and look for the Cluster Summary. The Reduce Task Capacity value hints at what to set your number to. I could be wrong, but it works for me. :)
Cluster Summary (Heap Size is *** MB/966.69 MB) columns:
Running Map Tasks | Running Reduce Tasks | Total Submissions | Nodes | Occupied Map Slots | Occupied Reduce Slots | Reserved Map Slots | Reserved Reduce Slots | Map Task Capacity | Reduce Task Capacity | Avg. Tasks/Node | Blacklisted Nodes | Excluded Nodes

Rgds,
AK47

From: jamal sasha [mailto:jamalshasha@gmail.com]
Sent: Wednesday, November 21, 2012 11:39 AM
To: user@hadoop.apache.org
Subject: guessing number of reducers.

By default the number of reducers is set to 1.
Is there a good way to guess the optimal number of reducers?
Let's say I have TBs worth of data, and mappers on the order of 5000 or so.
But ultimately I am calculating, let's say, some average of the whole data, say the average transaction occurring.
Now the output will be just one line in one "part" file; the rest of them will be empty. So I am guessing I need loads of reducers, but then most of them will be empty, while at the same time one reducer won't suffice.
What's the best way to solve this?
How to guess the optimal number of reducers?
Thanks



