Posted to user@hive.apache.org by Rosanna Man <ro...@auditude.com> on 2011/04/28 20:18:08 UTC

Using capacity scheduler

Hi all,

We are using the capacity scheduler to schedule resources among different queues
for a single user (hadoop). We have set the queues to have equal shares of the
resources. However, when the first job starts in the first queue and consumes
all the resources, a second job started in the second queue is starved of
reducers until the first job finishes. A lot of processing gets stuck while a
large query is executing.

We are using Hive on Hadoop 0.20.2 in Amazon AWS. We tried the Fair Scheduler
before, but it gives an error when a mapper produces no output (which is fine
in our use cases).

Can anyone give us some advice?

Thanks,
Rosanna

Re: Using capacity scheduler

Posted by Sreekanth Ramakrishnan <sr...@yahoo-inc.com>.
The queue-specific configurations are not Hive-client-specific; they have to be configured on the JobTracker before the JT is started. All the Hive CLI should set is which queue the DAG from the Hive query will be submitted to.

So your capacity-scheduler.xml in $HADOOP_CONF_DIR should have:

<property>
  <name>mapred.capacity-scheduler.queue.myqueue.maximum-capacity</name>
  <value>50</value>
</property>

Also, sorry for confusing you: this feature is only available in 0.21 and the Yahoo! Distribution of Hadoop. You could try getting the capacity scheduler jar from the Yahoo! Hadoop distribution, replacing it in your cluster's normal Hadoop distribution, and restarting. AFAIK, the scheduler contract between the JT and the scheduler has not changed between Apache Hadoop 0.20 and Yahoo! Hadoop 0.20, but I would suggest you try it at your own risk :-)
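On the Hive side, picking the queue for a session would look something like the sketch below; mapred.job.queue.name is the generic Hadoop 0.20 job-queue property, and "myqueue" is a placeholder for a queue actually declared on your JobTracker:

```
-- Run from the Hive CLI before submitting the query; all MapReduce jobs
-- generated for this session are then submitted to that queue.
set mapred.job.queue.name=myqueue;
```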



Re: Using capacity scheduler

Posted by Rosanna Man <ro...@auditude.com>.
Hi Sreekanth,

When you mention setting the max task limit, do you mean executing

set mapred.capacity-scheduler.queue.<queue-name>.maximum-capacity = <a
percentage> ?

Is it only available on Hadoop 0.21?

Thanks,
Rosanna



Re: Using capacity scheduler

Posted by Sreekanth Ramakrishnan <sr...@yahoo-inc.com>.
The design goal of the CapacityScheduler is to maximize the utilization of cluster resources; it does not fairly allocate shares among all the users present in the system.

The user limit states the number of concurrent users who can use the slots in a queue. These limits are elastic in nature: since there is no preemption, as slots get freed up, new tasks are allotted those slots to meet the user limit.

For your requirement, you could submit the large jobs to a queue that has a max task limit set, so your long-running jobs don't take up the whole cluster capacity, and submit the shorter, smaller jobs to a fast-moving queue with something like a 10% user limit, which allows 10 concurrent users per queue.

The actual distribution of the capacity across longer/shorter jobs depends on your workload.
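A sketch of how that two-queue split could be expressed in capacity-scheduler.xml; the queue names and percentages below are illustrative assumptions, not values from this thread:

```
<!-- Sketch only: queue names "longrun" and "fast" and all percentages
     are hypothetical examples. -->

<!-- Long-running jobs: guaranteed 50% of the cluster, hard-capped there
     so they cannot grow elastically into the other queue's share. -->
<property>
  <name>mapred.capacity-scheduler.queue.longrun.capacity</name>
  <value>50</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.longrun.maximum-capacity</name>
  <value>50</value>
</property>

<!-- Short jobs: 50% capacity with a 10% per-user limit, i.e. roughly
     10 concurrent users can share the queue. -->
<property>
  <name>mapred.capacity-scheduler.queue.fast.capacity</name>
  <value>50</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.fast.minimum-user-limit-percent</name>
  <value>10</value>
</property>
```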




--
Sreekanth Ramakrishnan

Re: Using capacity scheduler

Posted by Rosanna Man <ro...@auditude.com>.
Hi Sreekanth,

Thank you very much for your clarification. Setting the max task limits on
queues will work, but can we do something with the max user limit? Is it
preemptible also? We are exploring the possibility of running the queries as
different users so the capacity scheduler can maximize the use of the
resources.

Basically, our goal is to maximize the resources (mappers and reducers)
while providing a fair share to the short jobs while a big job is running.
How do you normally achieve that?

Thanks,
Rosanna



Re: Using capacity scheduler

Posted by Sreekanth Ramakrishnan <sr...@yahoo-inc.com>.
Hi

Currently the CapacityScheduler does not have preemption. So basically, only as Job1 starts finishing and freeing up slots will Job2's tasks start getting scheduled. One way you can prevent queue capacities from being elastic is by setting max task limits on the queues. That way your Job1 will never exceed the first queue's capacity.
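For context, the queues themselves have to be declared to the JobTracker in mapred-site.xml before the scheduler can divide capacity among them; a minimal sketch, with hypothetical queue names:

```
<!-- mapred-site.xml: declare the queues the JobTracker knows about.
     "queue1" and "queue2" are example names, not from this thread. -->
<property>
  <name>mapred.queue.names</name>
  <value>queue1,queue2</value>
</property>
```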





--
Sreekanth Ramakrishnan