Posted to user@hadoop.apache.org by nutch buddy <nu...@gmail.com> on 2012/08/21 14:19:33 UTC

why is num of map tasks gets overridden?

I configure a job in Hadoop and set the number of map tasks in the code to 8.

Then I run the job and it gets 152 map tasks. I can't figure out why it's being
overridden and where it gets 152 from.

The mapred-site.xml has mapred.map.tasks set to 24.

Any ideas?
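
(For reference, since the driver code itself wasn't posted: a minimal sketch of
how that hint is usually set with the old mapred API. The class name and paths
below are made up for illustration; only the setNumMapTasks() call matters here.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MyJobDriver {                        // hypothetical driver class
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MyJobDriver.class);
            conf.setJobName("my-job");
            conf.setNumMapTasks(8);                   // only sets the mapred.map.tasks hint
            FileInputFormat.setInputPaths(conf, new Path("/input"));    // illustrative path
            FileOutputFormat.setOutputPath(conf, new Path("/output"));  // illustrative path
            JobClient.runJob(conf);
        }
    }

As the replies below explain, setNumMapTasks() is only a hint; the InputFormat
decides the actual number of splits.)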

Re: why is num of map tasks gets overridden?

Posted by peter <zh...@gmail.com>.
You could consider adding nodes.

--  
peter
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, August 22, 2012 at 1:57 PM, nutch buddy wrote:

> So what can I do if I have a given input and my job needs a lot of memory per map task?
> I can't control the number of map tasks, and my total memory per machine is limited - I'll eventually fill each machine's memory.
>  
> On Tue, Aug 21, 2012 at 3:52 PM, Bertrand Dechoux <dechouxb@gmail.com (mailto:dechouxb@gmail.com)> wrote:
> > > Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html) determines the number of maps.  
> >  
> > http://wiki.apache.org/hadoop/HowManyMapsAndReduces
> >  
> > Bertrand
> >  
> >  
> > On Tue, Aug 21, 2012 at 2:19 PM, nutch buddy <nutch.buddy@gmail.com (mailto:nutch.buddy@gmail.com)> wrote:
> > >  
> > > I configure a job in Hadoop and set the number of map tasks in the code to 8.
> > >  
> > >  
> > > Then I run the job and it gets 152 map tasks. I can't figure out why it's being overridden and where it gets 152 from.
> > >  
> > >  
> > > The mapred-site.xml has 24 as mapred.map.tasks.
> > >  
> > >  
> > > any idea?
> > >  
> > >  
> > >  
> >  
> >  
> >  
> >  
> >  
> > --  
> > Bertrand Dechoux
>  
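
(The 82k figure in the quoted paragraph can be sanity-checked with the split-size
rule it describes: splitSize = max(minSplitSize, min(goalSize, blockSize)), where
goalSize = totalSize / requested maps. A small standalone arithmetic sketch, using
the quoted example's numbers rather than this particular job's:

    public class SplitMath {
        public static void main(String[] args) {
            long totalSize   = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of input
            long blockSize   = 128L * 1024 * 1024;              // 128 MB DFS blocks
            long minSize     = 1L;                              // mapred.min.split.size default
            long numMapsHint = 8L;                              // the mapred.map.tasks hint

            long goalSize  = totalSize / numMapsHint;           // 1.25 TB, far above the block size
            long splitSize = Math.max(minSize, Math.min(goalSize, blockSize)); // capped at 128 MB
            long numSplits = totalSize / splitSize;             // ignores a possible final partial split

            System.out.println(numSplits);                      // 81920, i.e. the ~82k maps quoted
        }
    }

A small hint such as 8 leaves goalSize far above the block size, so the split size
stays at one block and you get one map per input split - presumably where the 152
comes from.)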


Re: why is num of map tasks gets overridden?

Posted by Bejoy Ks <be...@gmail.com>.
Hi

You can adjust the slots in a TaskTracker/Node using
map slots -> mapred.tasktracker.map.tasks.maximum
reduce slots -> mapred.tasktracker.reduce.tasks.maximum

These are TaskTracker-level properties, so you cannot override them per job.
You need to edit each TT's mapred-site.xml (and I believe you need
to restart the TT as well).

Regards
Bejoy KS
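
(A minimal sketch of what that looks like in each TaskTracker's mapred-site.xml;
the values 4 and 2 are placeholders to be sized against the node's memory and cores:

    <?xml version="1.0"?>
    <configuration>
      <!-- concurrent map tasks (map slots) this TaskTracker may run -->
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>4</value>
      </property>
      <!-- concurrent reduce tasks (reduce slots) this TaskTracker may run -->
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
      </property>
    </configuration>

Edit this on every TaskTracker and restart it, as noted above.)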

On Thu, Aug 23, 2012 at 4:42 PM, nutch buddy <nu...@gmail.com> wrote:

> How do I adjust the number of slots per node?
> And also, is the parameter mapred.tasktracker.map.tasks.maximum relevant
> here?
>
> thanks
>
>
> On Wed, Aug 22, 2012 at 9:23 AM, Bertrand Dechoux <de...@gmail.com>wrote:
>
>> 3) Similarly to 2, you could consider multithreading. Then on each physical
>> node you would only need to have the equivalent in memory of what is required
>> for one map while having the processing power of many. But it will depend on
>> your context, i.e. how you are using the memory.
>>
>> But 1) is really the key indeed: <number of slots per physical node> *
>> <maximum memory per slot> shouldn't exceed what is available on
>> your physical node.
>>
>>  Regards
>>
>> Bertrand
>>
>>
>> On Wed, Aug 22, 2012 at 8:03 AM, Bejoy KS <be...@gmail.com> wrote:
>>
>>> **
>>> Hi
>>>
>>> There are two options I can think of now
>>>
>>> 1) If all your jobs are memory intensive, I'd recommend adjusting
>>> your task slots per node accordingly.
>>> 2) If only a few jobs are memory intensive, you can have each map
>>> task process a smaller volume of data. For that, set mapred.max.split.size to
>>> the maximum data chunk a map task can process within your current memory
>>> constraint.
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from handheld, please excuse typos.
>>> ------------------------------
>>> *From: * nutch buddy <nu...@gmail.com>
>>> *Date: *Wed, 22 Aug 2012 08:57:31 +0300
>>> *To: *<us...@hadoop.apache.org>
>>> *ReplyTo: * user@hadoop.apache.org
>>> *Subject: *Re: why is num of map tasks gets overridden?
>>>
>>> So what can I do if I have a given input and my job needs a lot of
>>> memory per map task?
>>> I can't control the number of map tasks, and my total memory per machine
>>> is limited - I'll eventually fill each machine's memory.
>>>
>>> On Tue, Aug 21, 2012 at 3:52 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>>>
>>>> Actually controlling the number of maps is subtle. The mapred.map.tasks
>>>>> parameter is just a hint to the InputFormat for the number of maps. The
>>>>> default InputFormat behavior is to split the total number of bytes into the
>>>>> right number of fragments. However, in the default case the DFS block size
>>>>> of the input files is treated as an upper bound for input splits. A lower
>>>>> bound on the split size can be set via mapred.min.split.size. Thus, if you
>>>>> expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k
>>>>> maps, unless your mapred.map.tasks is even larger. Ultimately the
>>>>> InputFormat<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html>determines the number of maps.
>>>>>
>>>>
>>>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>>>>
>>>> Bertrand
>>>>
>>>>
>>>> On Tue, Aug 21, 2012 at 2:19 PM, nutch buddy <nu...@gmail.com>wrote:
>>>>
>>>>> I configure a job in Hadoop and set the number of map tasks in the code
>>>>> to 8.
>>>>>
>>>>> Then I run the job and it gets 152 map tasks. I can't figure out why it's
>>>>> being overridden and where it gets 152 from.
>>>>>
>>>>> The mapred-site.xml has 24 as mapred.map.tasks.
>>>>>
>>>>> any idea?
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Bertrand Dechoux
>>>>
>>>
>>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>

Re: why is num of map tasks gets overridden?

Posted by nutch buddy <nu...@gmail.com>.
How do I adjust the number of slots per node?
And also, is the parameter mapred.tasktracker.map.tasks.maximum relevant
here?

thanks

On Wed, Aug 22, 2012 at 9:23 AM, Bertrand Dechoux <de...@gmail.com>wrote:

> 3) Similarly to 2, you could consider multithreading. Then on each physical
> node you would only need to have the equivalent in memory of what is required
> for one map while having the processing power of many. But it will depend on
> your context, i.e. how you are using the memory.
>
> But 1) is really the key indeed: <number of slots per physical node> *
> <maximum memory per slot> shouldn't exceed what is available on
> your physical node.
>
> Regards
>
> Bertrand
>
>
> On Wed, Aug 22, 2012 at 8:03 AM, Bejoy KS <be...@gmail.com> wrote:
>
>> **
>> Hi
>>
>> There are two options I can think of now
>>
>> 1) If all your jobs are memory intensive, I'd recommend adjusting your
>> task slots per node accordingly.
>> 2) If only a few jobs are memory intensive, you can have each map
>> task process a smaller volume of data. For that, set mapred.max.split.size to
>> the maximum data chunk a map task can process within your current memory
>> constraint.
>> Regards
>> Bejoy KS
>>
>> Sent from handheld, please excuse typos.
>> ------------------------------
>> *From: * nutch buddy <nu...@gmail.com>
>> *Date: *Wed, 22 Aug 2012 08:57:31 +0300
>> *To: *<us...@hadoop.apache.org>
>> *ReplyTo: * user@hadoop.apache.org
>> *Subject: *Re: why is num of map tasks gets overridden?
>>
>> So what can I do if I have a given input and my job needs a lot of
>> memory per map task?
>> I can't control the number of map tasks, and my total memory per machine
>> is limited - I'll eventually fill each machine's memory.
>>
>> On Tue, Aug 21, 2012 at 3:52 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>>
>>> Actually controlling the number of maps is subtle. The mapred.map.tasks
>>>> parameter is just a hint to the InputFormat for the number of maps. The
>>>> default InputFormat behavior is to split the total number of bytes into the
>>>> right number of fragments. However, in the default case the DFS block size
>>>> of the input files is treated as an upper bound for input splits. A lower
>>>> bound on the split size can be set via mapred.min.split.size. Thus, if you
>>>> expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k
>>>> maps, unless your mapred.map.tasks is even larger. Ultimately the
>>>> InputFormat<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html>determines the number of maps.
>>>>
>>>
>>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>>>
>>> Bertrand
>>>
>>>
>>> On Tue, Aug 21, 2012 at 2:19 PM, nutch buddy <nu...@gmail.com>wrote:
>>>
>>>> I configure a job in Hadoop and set the number of map tasks in the code to
>>>> 8.
>>>>
>>>> Then I run the job and it gets 152 map tasks. I can't figure out why it's
>>>> being overridden and where it gets 152 from.
>>>>
>>>> The mapred-site.xml has 24 as mapred.map.tasks.
>>>>
>>>> any idea?
>>>>
>>>
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>>
>>
>>
>
>
> --
> Bertrand Dechoux
>

Re: why is num of map tasks gets overridden?

Posted by Bertrand Dechoux <de...@gmail.com>.
3) Similarly to 2, you could consider multithreading. Then on each physical
node you would only need to have the equivalent in memory of what is required
for one map while having the processing power of many. But it will depend on
your context, i.e. how you are using the memory.

But 1) is really the key indeed: <number of slots per physical node> *
<maximum memory per slot> shouldn't exceed what is available on
your physical node.

Regards

Bertrand
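
(For 3), one way this is commonly wired up with the old mapred API is
MultithreadedMapRunner, assuming the map function is thread-safe. A minimal
sketch for the driver; the thread-count property name is quoted from memory,
so treat it as an assumption to verify:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

    // in the job driver, alongside the rest of the JobConf setup:
    JobConf conf = new JobConf(MyJob.class);                   // MyJob is hypothetical
    conf.setMapRunnerClass(MultithreadedMapRunner.class);      // run map() in several threads per task
    conf.setInt("mapred.map.multithreadedrunner.threads", 4);  // assumed property name: threads per map task

This way one task JVM, and its one slot's worth of memory, can still use several
cores, which is the trade-off described in 3) above.)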

On Wed, Aug 22, 2012 at 8:03 AM, Bejoy KS <be...@gmail.com> wrote:

> **
> Hi
>
> There are two options I can think of now
>
> 1) If all your jobs are memory intensive, I'd recommend adjusting your
> task slots per node accordingly.
> 2) If only a few jobs are memory intensive, you can have each map task
> process a smaller volume of data. For that, set mapred.max.split.size to the
> maximum data chunk a map task can process within your current memory
> constraint.
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * nutch buddy <nu...@gmail.com>
> *Date: *Wed, 22 Aug 2012 08:57:31 +0300
> *To: *<us...@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *Re: why is num of map tasks gets overridden?
>
> So what can I do if I have a given input and my job needs a lot of memory
> per map task?
> I can't control the number of map tasks, and my total memory per machine
> is limited - I'll eventually fill each machine's memory.
>
> On Tue, Aug 21, 2012 at 3:52 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>
>> Actually controlling the number of maps is subtle. The mapred.map.tasks
>>> parameter is just a hint to the InputFormat for the number of maps. The
>>> default InputFormat behavior is to split the total number of bytes into the
>>> right number of fragments. However, in the default case the DFS block size
>>> of the input files is treated as an upper bound for input splits. A lower
>>> bound on the split size can be set via mapred.min.split.size. Thus, if you
>>> expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k
>>> maps, unless your mapred.map.tasks is even larger. Ultimately the
>>> InputFormat<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html>determines the number of maps.
>>>
>>
>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>>
>> Bertrand
>>
>>
>> On Tue, Aug 21, 2012 at 2:19 PM, nutch buddy <nu...@gmail.com>wrote:
>>
>>> I configure a job in Hadoop and set the number of map tasks in the code to
>>> 8.
>>>
>>> Then I run the job and it gets 152 map tasks. I can't figure out why it's
>>> being overridden and where it gets 152 from.
>>>
>>> The mapred-site.xml has 24 as mapred.map.tasks.
>>>
>>> any idea?
>>>
>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>


-- 
Bertrand Dechoux

Re: why is num of map tasks gets overridden?

Posted by Bejoy KS <be...@gmail.com>.
Hi

There are two options I can think of now

1) If all your jobs are memory intensive, I'd recommend adjusting your task slots per node accordingly.
2) If only a few jobs are memory intensive, you can have each map task process a smaller volume of data. For that, set mapred.max.split.size to the maximum data chunk a map task can process within your current memory constraint.
 
Regards
Bejoy KS

Sent from handheld, please excuse typos.
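
(For option 2, a minimal sketch with the newer mapreduce API, where the cap can
be set through a helper instead of the raw property; 64 MB is just a placeholder
ceiling to pick according to your memory constraint:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // in a hypothetical driver, before submitting the job:
    Job job = new Job(new Configuration(), "my-job");
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // cap each input split at 64 MB

Smaller splits mean more map tasks overall, but each one reads less data and so
should need less memory.)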

-----Original Message-----
From: nutch buddy <nu...@gmail.com>
Date: Wed, 22 Aug 2012 08:57:31 
To: <us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Re: why is num of map tasks gets overridden?

So what can I do if I have a given input and my job needs a lot of memory
per map task?
I can't control the number of map tasks, and my total memory per machine is
limited - I'll eventually fill each machine's memory.

On Tue, Aug 21, 2012 at 3:52 PM, Bertrand Dechoux <de...@gmail.com>wrote:

> Actually controlling the number of maps is subtle. The mapred.map.tasks
>> parameter is just a hint to the InputFormat for the number of maps. The
>> default InputFormat behavior is to split the total number of bytes into the
>> right number of fragments. However, in the default case the DFS block size
>> of the input files is treated as an upper bound for input splits. A lower
>> bound on the split size can be set via mapred.min.split.size. Thus, if you
>> expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k
>> maps, unless your mapred.map.tasks is even larger. Ultimately the
>> InputFormat<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html>determines the number of maps.
>>
>
> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>
> Bertrand
>
>
> On Tue, Aug 21, 2012 at 2:19 PM, nutch buddy <nu...@gmail.com>wrote:
>
>> I configure a job in Hadoop and set the number of map tasks in the code to 8.
>>
>> Then I run the job and it gets 152 map tasks. I can't figure out why it's
>> being overridden and where it gets 152 from.
>>
>> The mapred-site.xml has 24 as mapred.map.tasks.
>>
>> any idea?
>>
>
>
>
> --
> Bertrand Dechoux
>


Re: why is num of map tasks gets overridden?

Posted by nutch buddy <nu...@gmail.com>.
So what can I do if I have a given input and my job needs a lot of memory
per map task?
I can't control the number of map tasks, and my total memory per machine is
limited - I'll eventually fill each machine's memory.

On Tue, Aug 21, 2012 at 3:52 PM, Bertrand Dechoux <de...@gmail.com>wrote:

> Actually controlling the number of maps is subtle. The mapred.map.tasks
>> parameter is just a hint to the InputFormat for the number of maps. The
>> default InputFormat behavior is to split the total number of bytes into the
>> right number of fragments. However, in the default case the DFS block size
>> of the input files is treated as an upper bound for input splits. A lower
>> bound on the split size can be set via mapred.min.split.size. Thus, if you
>> expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k
>> maps, unless your mapred.map.tasks is even larger. Ultimately the
>> InputFormat <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html> determines the number of maps.
>>
>
> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>
> Bertrand
>
>
> On Tue, Aug 21, 2012 at 2:19 PM, nutch buddy <nu...@gmail.com> wrote:
>
>> I configure a job in hadoop and set the number of map tasks in the code to 8.
>>
>> Then I run the job and it gets 152 map tasks. I can't figure out why it is
>> being overridden or where the 152 comes from.
>>
>> The mapred-site.xml has 24 as mapred.map.tasks.
>>
>> any idea?
>>
>
>
>
> --
> Bertrand Dechoux
>
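If the cluster configuration cannot be changed, the heap given to each map task's child
JVM can at least be raised per job, since mapred.child.java.opts is a job-level property
in Hadoop 1.x. A minimal sketch; the class name and the -Xmx value are only assumptions,
and the value must still fit within (node RAM / task slots per node):

    import org.apache.hadoop.mapred.JobConf;

    public class BigHeapJobDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(BigHeapJobDriver.class);
        // Give each task JVM of this job 2 GB of heap instead of the cluster default.
        conf.set("mapred.child.java.opts", "-Xmx2048m");
        // ... input/output paths, mapper, reducer, then JobClient.runJob(conf);
      }
    }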


Re: why is num of map tasks gets overridden?

Posted by Bertrand Dechoux <de...@gmail.com>.
>
> Actually controlling the number of maps is subtle. The mapred.map.tasks
> parameter is just a hint to the InputFormat for the number of maps. The
> default InputFormat behavior is to split the total number of bytes into the
> right number of fragments. However, in the default case the DFS block size
> of the input files is treated as an upper bound for input splits. A lower
> bound on the split size can be set via mapred.min.split.size. Thus, if you
> expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k
> maps, unless your mapred.map.tasks is even larger. Ultimately the
> InputFormat <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html> determines the number of maps.
>

http://wiki.apache.org/hadoop/HowManyMapsAndReduces

Bertrand

On Tue, Aug 21, 2012 at 2:19 PM, nutch buddy <nu...@gmail.com> wrote:

> I configure a job in hadoop and set the number of map tasks in the code to 8.
>
> Then I run the job and it gets 152 map tasks. I can't figure out why it is
> being overridden or where the 152 comes from.
>
> The mapred-site.xml has 24 as mapred.map.tasks.
>
> any idea?
>



-- 
Bertrand Dechoux
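
Put differently: with the default FileInputFormat the requested map count only sets a
"goal" split size, and the DFS block size caps it, so the number of maps ends up being
roughly input size / block size. A simplified sketch of that calculation (it mirrors the
Hadoop 1.x old-API FileInputFormat behaviour for a single large file; the ~19 GB input
size is only an assumed value that happens to reproduce the 152 maps from the original
question, since the real input size was never given in the thread):

    // Simplified: how the old-API FileInputFormat picks a split size for one
    // large file (small-file and last-block effects ignored).
    public class SplitSizeSketch {
      static long splitSize(long totalBytes, int requestedMaps,
                            long minSplitBytes, long blockBytes) {
        long goalSize = totalBytes / Math.max(requestedMaps, 1); // mapred.map.tasks only sets the goal
        return Math.max(minSplitBytes, Math.min(goalSize, blockBytes)); // block size caps the split
      }

      public static void main(String[] args) {
        long mb = 1024L * 1024;
        // Assumed ~19 GB (19456 MB) of input, 128 MB blocks, 8 maps requested.
        long split = splitSize(19456 * mb, 8, 1, 128 * mb); // = 128 MB
        System.out.println((19456 * mb) / split);           // prints 152 map tasks
      }
    }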
