Posted to mapreduce-user@hadoop.apache.org by Shashidhar Rao <ra...@gmail.com> on 2013/05/11 16:32:30 UTC

Need help about task slots

Hi Users,

I am new to Hadoop and confused about task slots in a cluster. How would I
know how many task slots are required for a job? Is there any empirical
formula, or on what basis should I set the number of task slots?

Thanks in advance

Re: Need help about task slots

Posted by Mohammad Tariq <do...@gmail.com>.
Hahaha..I think we could continue this over there..

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 6:04 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Sorry for my blunder as well. My previous post was for Tariq but went to
> the wrong thread.
>
> Thanks.
> Rahul
>
>
> On Sun, May 12, 2013 at 6:03 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Oh! I thought distcp works on complete files rather than one mapper per
>> data block.
>> So I guess parallelism would still be there if there are multiple files..
>> please correct me if there is anything wrong.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>
>>> @Rahul : I'm sorry, I am not aware of any such document. But you could
>>> use distcp for a local-to-HDFS copy :
>>> bin/hadoop  distcp  file:///home/tariq/in.txt  hdfs://localhost:9000/
>>>
>>> And yes, when you use distcp from local to HDFS, you can't take
>>> advantage of parallelism, as the source data is stored in a
>>> non-distributed fashion.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
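A quick sketch of the command above in the multi-file case (the directory
path and map count here are illustrative, not from the thread): distcp's
-m flag caps the number of map tasks, and since distcp assigns whole files
to mappers, the parallelism only materializes when there are several
source files to spread across them:

    # copy a whole local directory into HDFS using at most 8 map tasks
    bin/hadoop distcp -m 8 file:///home/tariq/input-dir hdfs://localhost:9000/data/

    # a single file gives distcp nothing to parallelize; one map does the copy
    bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/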
>>>
>>>
>>> On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>>
>>>> Hello guys,
>>>>
>>>>              My 2 cents :
>>>>
>>>> Actually the no. of mappers is primarily governed by the no. of
>>>> InputSplits created by the InputFormat you are using, and the no. of
>>>> reducers by the no. of partitions you get after the map phase. Having
>>>> said that, you should also keep the number of slots available per
>>>> slave in mind, along with the available memory. But as a general rule
>>>> you could use this approach:
>>>>
>>>> Take the no. of virtual CPUs * 0.75 and that's the no. of slots you
>>>> can configure. For example, if you have 12 physical cores (or 24
>>>> virtual cores), you would have 24 * 0.75 = 18 slots. Now, based on
>>>> your requirements you could choose how many mappers and reducers you
>>>> want to use. With 18 MR slots, you could have 9 mappers and 9
>>>> reducers, or 12 mappers and 6 reducers, or whatever split works for
>>>> you.
>>>>
>>>> I don't know if it makes much sense, but it works pretty decently for
>>>> me.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am also new to the Hadoop world; here is my take on your question.
>>>>> If there is something missing, others will surely correct it.
>>>>>
>>>>> Pre-YARN (MRv1), the slots are fixed and computed based on the
>>>>> crunching capacity of the datanode hardware. Once the slots per
>>>>> datanode are ascertained, they are divided into map and reduce slots;
>>>>> that goes into the config files and remains fixed until changed. In
>>>>> YARN, it's decided at runtime based on the requirements of each
>>>>> particular task. It's very much possible that at a certain point in
>>>>> time one datanode is running 10 tasks while another similar datanode
>>>>> is only running 4 tasks.
>>>>>
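For reference, a minimal sketch of where those fixed per-node slot counts
live in MRv1 (the values are illustrative, roughly following the
0.75-per-virtual-CPU rule of thumb above; this goes in mapred-site.xml on
each tasktracker):

    <!-- mapred-site.xml: fixed task slots per tasktracker (MRv1 only) -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>12</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>6</value>
    </property>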
>>>>> Coming to your question: the number of map tasks is decided based on
>>>>> the data set size, the DFS block size, and the input format.
>>>>> Generally, for file-based input formats it's one mapper per data
>>>>> block; however, there are ways to change this using configuration
>>>>> settings. Reduce tasks are set using the job configuration.
>>>>>
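At the job level, a hedged sketch of those knobs using MRv1-era property
names (the values are illustrative; it assumes the job's driver uses
ToolRunner so that -D options are honored, and a file-based input format
that respects the split-size properties):

    # ask for 9 reducers, and raise the minimum split size to 256 MB so
    # that fewer, longer-running map tasks are created
    bin/hadoop jar myjob.jar MyDriver \
        -D mapred.reduce.tasks=9 \
        -D mapred.min.split.size=268435456 \
        /input /output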
>>>>> The general rule I have read in various documents is that mappers
>>>>> should run for at least a minute, so you can run a sample to find a
>>>>> data block size that makes your mappers run for more than a minute.
>>>>> It also depends on your SLA: if you are not chasing a very tight SLA,
>>>>> you can choose to run fewer mappers at the expense of a higher runtime.
>>>>>
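To make that concrete with illustrative numbers: a 10 GB input at a 64 MB
block size yields about 10240 / 64 = 160 map tasks, while a 256 MB block
size yields only about 40; each of those 40 mappers does four times the
work, making it far more likely to clear the one-minute mark.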
>>>>> But again, it's all theory; I'm not sure how these things are handled
>>>>> in actual prod clusters.
>>>>>
>>>>> HTH,
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao <
>>>>> raoshashidhar123@gmail.com> wrote:
>>>>>
>>>>>> Hi Users,
>>>>>>
>>>>>> I am new to Hadoop and confused about task slots in a cluster. How
>>>>>> would I know how many task slots are required for a job? Is there any
>>>>>> empirical formula, or on what basis should I set the number of task slots?
>>>>>>
>>>>>> Thanks in advance
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Need help about task slots

Posted by yypvsxf19870706 <yy...@gmail.com>.
Hi,

    The concept of task slots is used in MRv1.
    In the newer versions of Hadoop, MRv2 uses YARN instead of slots.
    You can read about it in Hadoop: The Definitive Guide, 3rd edition.

Sent from my iPhone

On 2013-5-12, 20:11, Mohammad Tariq <do...@gmail.com> wrote:

> Sorry for the blunder guys.
> 
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
> 
> 
> On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq <do...@gmail.com> wrote:
>> @Rahul : I'm sorry as I am not aware of any such document. But you could use distcp for local to HDFS copy :
>> bin/hadoop  distcp  file:///home/tariq/in.txt  hdfs://localhost:9000/
>> 
>> And yes. When you use distcp from local to HDFS, you can't take the pleasure of parallelism as the data is stored in a non distributed fashion.
>> 
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>> 
>> 
>> On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>> Hello guys, 
>>> 
>>>             My 2 cents : 
>>> 
>>> Actually no. of mappers is primarily governed by the no. of InputSplits created by the InputFormat you are using and the no. of reducers by the no. of partitions you get after the map phase. Having said that, you should also keep the no of slots, available per slave, in mind, along with the available memory. But as a general rule you could use this approach :
>>> Take the no. of virtual CPUs*.75 and that's the no. of slots you can configure. For example, if you have 12 physical cores (or 24 virtual cores), you would have (24*.75)=18 slots. Now, based on your requirement you could choose how many mappers and reducers you want to use. With 18 MR slots, you could have 9 mappers and 9 reducers or 12 mappers and 9 reducers or whatever you think is OK with you. 
>>> 
>>> I don't know if it ,makes much sense, but it helps me pretty decently.
>>> 
>>> 
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>> 
>>> 
>>> On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee <ra...@gmail.com> wrote:
>>>> Hi,
>>>> 
>>>> I am also new to Hadoop world , here is my take on your question , if there is something missing then others would surely correct that.
>>>> 
>>>> For per-YARN , the slots are fixed and computed based on the crunching capacity of the datanode hardware , once the slots per data node is ascertained , they are divided into Map and reducer slots and that goes into the config files and remain fixed , until changed.In YARN , its decided at runtime based on the kind of requirement of particular task.Its very much possible that a datanode at certain point of time running  10 tasks and another similar datanode is only running 4 tasks.
>>>> 
>>>> Coming to your question. Based of the data set size , block size of dfs and input formater , the number of map tasks are decided , generally for file based inputformats its one mapper per data block , however there are way to change this using configuration settings.Reduce tasks are set using job configuration.
>>>> 
>>>> General rule as I have read from various documents is that Mappers should run atleast a minute , so you can run a sample to find out a good size of data block which would make you mapper run more than a minute. Now it again depends on your SLA , in case you are not looking for a very small SLA you can choose to run less mappers at the expense of higher runtime.
>>>> 
>>>> But again its all theory , not sure how these things are handled in actual prod clusters.
>>>> 
>>>> HTH,
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> Rahul
>>>> 
>>>> 
>>>> On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao <ra...@gmail.com> wrote:
>>>>> Hi Users,
>>>>> 
>>>>> I am new to Hadoop and confused about task slots in a cluster. How would I know how many task slots would be required for a job. Is there any empirical formula or on what basis should I set the number of task slots.
>>>>> 
>>>>> Advanced Thanks
> 

Re: Need help about task slots

Posted by yypvsxf19870706 <yy...@gmail.com>.
Hi
    
    The concept of task slots is used in MRv1.
     In the new version of Hadoop ,MRv2 uses yarn instead of slots.
      You can read it from Hadoop definitive 3rd.




�����ҵ� iPhone

�� 2013-5-12��20:11��Mohammad Tariq <do...@gmail.com> ���

> Sorry for the blunder guys.
> 
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
> 
> 
> On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq <do...@gmail.com> wrote:
>> @Rahul : I'm sorry as I am not aware of any such document. But you could use distcp for local to HDFS copy :
>> bin/hadoop  distcp  file:///home/tariq/in.txt  hdfs://localhost:9000/
>> 
>> And yes. When you use distcp from local to HDFS, you can't take the pleasure of parallelism as the data is stored in a non distributed fashion.
>> 
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>> 
>> 
>> On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>> Hello guys, 
>>> 
>>>             My 2 cents : 
>>> 
>>> Actually no. of mappers is primarily governed by the no. of InputSplits created by the InputFormat you are using and the no. of reducers by the no. of partitions you get after the map phase. Having said that, you should also keep the no of slots, available per slave, in mind, along with the available memory. But as a general rule you could use this approach :
>>> Take the no. of virtual CPUs*.75 and that's the no. of slots you can configure. For example, if you have 12 physical cores (or 24 virtual cores), you would have (24*.75)=18 slots. Now, based on your requirement you could choose how many mappers and reducers you want to use. With 18 MR slots, you could have 9 mappers and 9 reducers or 12 mappers and 9 reducers or whatever you think is OK with you. 
>>> 
>>> I don't know if it ,makes much sense, but it helps me pretty decently.
>>> 
>>> 
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>> 
>>> 
>>> On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee <ra...@gmail.com> wrote:
>>>> Hi,
>>>> 
>>>> I am also new to Hadoop world , here is my take on your question , if there is something missing then others would surely correct that.
>>>> 
>>>> For per-YARN , the slots are fixed and computed based on the crunching capacity of the datanode hardware , once the slots per data node is ascertained , they are divided into Map and reducer slots and that goes into the config files and remain fixed , until changed.In YARN , its decided at runtime based on the kind of requirement of particular task.Its very much possible that a datanode at certain point of time running  10 tasks and another similar datanode is only running 4 tasks.
>>>> 
>>>> Coming to your question. Based of the data set size , block size of dfs and input formater , the number of map tasks are decided , generally for file based inputformats its one mapper per data block , however there are way to change this using configuration settings.Reduce tasks are set using job configuration.
>>>> 
>>>> General rule as I have read from various documents is that Mappers should run atleast a minute , so you can run a sample to find out a good size of data block which would make you mapper run more than a minute. Now it again depends on your SLA , in case you are not looking for a very small SLA you can choose to run less mappers at the expense of higher runtime.
>>>> 
>>>> But again its all theory , not sure how these things are handled in actual prod clusters.
>>>> 
>>>> HTH,
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> Rahul
>>>> 
>>>> 
>>>> On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao <ra...@gmail.com> wrote:
>>>>> Hi Users,
>>>>> 
>>>>> I am new to Hadoop and confused about task slots in a cluster. How would I know how many task slots would be required for a job. Is there any empirical formula or on what basis should I set the number of task slots.
>>>>> 
>>>>> Advanced Thanks
> 

Re: Need help about task slots

Posted by yypvsxf19870706 <yy...@gmail.com>.
Hi
    
    The concept of task slots is used in MRv1.
     In the new version of Hadoop ,MRv2 uses yarn instead of slots.
      You can read it from Hadoop definitive 3rd.




发自我的 iPhone

在 2013-5-12,20:11,Mohammad Tariq <do...@gmail.com> 写道:

> Sorry for the blunder guys.
> 
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
> 
> 
> On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq <do...@gmail.com> wrote:
>> @Rahul : I'm sorry as I am not aware of any such document. But you could use distcp for local to HDFS copy :
>> bin/hadoop  distcp  file:///home/tariq/in.txt  hdfs://localhost:9000/
>> 
>> And yes. When you use distcp from local to HDFS, you can't take the pleasure of parallelism as the data is stored in a non distributed fashion.
>> 
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>> 
>> 
>> On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>> Hello guys, 
>>> 
>>>             My 2 cents : 
>>> 
>>> Actually no. of mappers is primarily governed by the no. of InputSplits created by the InputFormat you are using and the no. of reducers by the no. of partitions you get after the map phase. Having said that, you should also keep the no of slots, available per slave, in mind, along with the available memory. But as a general rule you could use this approach :
>>> Take the no. of virtual CPUs*.75 and that's the no. of slots you can configure. For example, if you have 12 physical cores (or 24 virtual cores), you would have (24*.75)=18 slots. Now, based on your requirement you could choose how many mappers and reducers you want to use. With 18 MR slots, you could have 9 mappers and 9 reducers or 12 mappers and 9 reducers or whatever you think is OK with you. 
>>> 
>>> I don't know if it ,makes much sense, but it helps me pretty decently.
>>> 
>>> 
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>> 
>>> 
>>> On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee <ra...@gmail.com> wrote:
>>>> Hi,
>>>> 
>>>> I am also new to Hadoop world , here is my take on your question , if there is something missing then others would surely correct that.
>>>> 
>>>> For per-YARN , the slots are fixed and computed based on the crunching capacity of the datanode hardware , once the slots per data node is ascertained , they are divided into Map and reducer slots and that goes into the config files and remain fixed , until changed.In YARN , its decided at runtime based on the kind of requirement of particular task.Its very much possible that a datanode at certain point of time running  10 tasks and another similar datanode is only running 4 tasks.
>>>> 
>>>> Coming to your question. Based of the data set size , block size of dfs and input formater , the number of map tasks are decided , generally for file based inputformats its one mapper per data block , however there are way to change this using configuration settings.Reduce tasks are set using job configuration.
>>>> 
>>>> General rule as I have read from various documents is that Mappers should run atleast a minute , so you can run a sample to find out a good size of data block which would make you mapper run more than a minute. Now it again depends on your SLA , in case you are not looking for a very small SLA you can choose to run less mappers at the expense of higher runtime.
>>>> 
>>>> But again its all theory , not sure how these things are handled in actual prod clusters.
>>>> 
>>>> HTH,
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> Rahul
>>>> 
>>>> 
>>>> On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao <ra...@gmail.com> wrote:
>>>>> Hi Users,
>>>>> 
>>>>> I am new to Hadoop and confused about task slots in a cluster. How would I know how many task slots would be required for a job. Is there any empirical formula or on what basis should I set the number of task slots.
>>>>> 
>>>>> Advanced Thanks
> 

Re: Need help about task slots

Posted by yypvsxf19870706 <yy...@gmail.com>.
Hi
    
    The concept of task slots is used in MRv1.
     In the new version of Hadoop ,MRv2 uses yarn instead of slots.
      You can read it from Hadoop definitive 3rd.




�����ҵ� iPhone

Re: Need help about task slots

Posted by Mohammad Tariq <do...@gmail.com>.
Sorry for the blunder guys.

Warm Regards,
Tariq
cloudfront.blogspot.com


Re: Need help about task slots

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Oh! I thought distcp works on complete files rather than one mapper per data block. So I guess the parallelism would still be there if there are multiple files.. please correct me if anything is wrong.

Thanks,
Rahul


Re: Need help about task slots

Posted by Mohammad Tariq <do...@gmail.com>.
@Rahul : I'm sorry, I am not aware of any such document. But you could use distcp for a local-to-HDFS copy:

bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/

And yes, when you use distcp from local to HDFS, you can't enjoy the parallelism on the read side, as the source data is not stored in a distributed fashion.
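Since a local source cannot be read in parallel anyway, a plain FileSystem copy gets you the same result as a single-map distcp. A minimal sketch reusing the paths from the command above (the HDFS destination path is assumed for illustration):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalToHdfsCopy {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Connect to the NameNode from the distcp example above.
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
            // Single-process copy; HDFS still chunks the file into blocks
            // and replicates them as the bytes stream in.
            fs.copyFromLocalFile(new Path("/home/tariq/in.txt"), new Path("/in.txt"));
            fs.close();
        }
    }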

Warm Regards,
Tariq
cloudfront.blogspot.com


Re: Need help about task slots

Posted by Mohammad Tariq <do...@gmail.com>.
Hello guys,

My 2 cents:

Actually, the no. of mappers is primarily governed by the no. of InputSplits created by the InputFormat you are using, and the no. of reducers by the no. of partitions you get after the map phase. Having said that, you should also keep the no. of slots available per slave in mind, along with the available memory. But as a general rule you could use this approach:

Take the no. of virtual CPUs * 0.75, and that's the no. of slots you can configure. For example, if you have 12 physical cores (or 24 virtual cores), you would have 24 * 0.75 = 18 slots. Now, based on your requirements, you can choose how many mappers and reducers you want to use. With 18 MR slots you could have 9 mappers and 9 reducers, or 12 mappers and 6 reducers, or whatever split works for you.

I don't know if it makes much sense, but it works for me pretty decently.
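Spelling the rule of thumb out in code (the 0.75 factor and the even map/reduce split are heuristics from this post, not anything Hadoop enforces):

    public class SlotEstimate {
        public static void main(String[] args) {
            int virtualCores = 24;                        // 12 physical cores, hyper-threaded
            int totalSlots = (int) (virtualCores * 0.75); // rule of thumb: 24 * 0.75 = 18
            // One possible split; any ratio that suits the workload is fine.
            int mapSlots = totalSlots / 2;                // 9
            int reduceSlots = totalSlots - mapSlots;      // 9
            System.out.printf("%d total slots: %d map, %d reduce%n",
                    totalSlots, mapSlots, reduceSlots);
        }
    }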

Warm Regards,
Tariq
cloudfront.blogspot.com


Re: Need help about task slots

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Hi,

I am also new to the Hadoop world; here is my take on your question. If there is something missing, others will surely correct it.

Pre-YARN, the slots are fixed, computed from the crunching capacity of the datanode hardware. Once the slots per datanode are ascertained, they are divided into map and reduce slots, and that goes into the config files and remains fixed until changed. In YARN, it is decided at runtime based on the requirements of the particular task. It is very much possible that at a certain point in time one datanode is running 10 tasks while another, similar datanode is running only 4.

Coming to your question: the number of map tasks is decided based on the data set size, the DFS block size, and the input format. For file-based input formats it is generally one mapper per data block, though there are ways to change this using configuration settings. Reduce tasks are set through the job configuration.
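For the file-based input formats just mentioned, those knobs look roughly like this. This is a sketch against the newer mapreduce API; the split sizes and reducer count are examples, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitTuningSketch {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-tuning");
            // By default each HDFS block becomes one split, hence one mapper.
            // Raising the minimum split size yields fewer, longer-running mappers.
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
            FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // 512 MB
            // Reducers, by contrast, are simply whatever the job asks for.
            job.setNumReduceTasks(9);
        }
    }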

The general rule I have read in various documents is that mappers should run for at least a minute, so you can run a sample to find a data block size that makes your mapper run for more than a minute. It also depends on your SLA; if you are not chasing a very tight SLA, you can choose to run fewer mappers at the expense of a longer runtime.

But again, it's all theory; I am not sure how these things are handled in actual prod clusters.
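Putting the one-minute rule into numbers: once a sample run tells you how fast one mapper chews through its input, the target split size falls out directly (the throughput below is an assumed figure, not a measurement):

    public class BlockSizeEstimate {
        public static void main(String[] args) {
            double mapperMbPerSec = 2.0; // throughput from a sample run (assumed here)
            int targetSeconds = 60;      // "mappers should run for at least a minute"
            double targetMb = mapperMbPerSec * targetSeconds; // 120 MB of input per mapper
            // Round up to a convenient block size, e.g. 128 MB.
            System.out.printf("Aim for roughly %.0f MB per split%n", targetMb);
        }
    }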

HTH,



Thanks,
Rahul

