Posted to common-dev@hadoop.apache.org by Daniel Templeton <Da...@Sun.COM> on 2009/05/07 22:06:06 UTC

GridEngine module for Hadoop on Demand

Hi,

I have a functioning module for Grid Engine for HoD, but some parts of 
it are currently hard-coded to my workstation.  In cleaning up those 
elements, I need some advice.  Hopefully this is the right forum.

So, in the hodlib/NodePools/torque.py file, there's a runWorkers() 
method.  In that method, it makes a single call to pbsdsh to start the 
NameNode, DataNodes, JobTracker, and TaskTracker.  I know nada about 
Torque, so please tell me if I'm interpreting this correctly.  It would 
appear that the pbsdsh somehow reads out of the environment how many 
hodring processes it should start up and executes them remotely, and 
each hodring then figures out what service it should run.

In Grid Engine, the rough equivalent of pbsdsh is qrsh.  (I think.)  
With qrsh, the master assigns the HoD job a set of nodes, and I then 
have to step through that set of nodes and qrsh to each one to start the 
hodring services.  As far as I can tell, the total number of hodring 
services I need to start is 1 for the NameNode + 1 for the JobTracker + 
n for the DataNodes + m for the TaskTrackers.  The thing that I'm not 
grokking is how the hodrings know what services to start, and how I 
should be parceling them out across the nodes of the cluster.  Should I 
be making sure I have two hodrings per node, one for the DataNode and 
one for the TaskTracker?  If I were to go start a dozen hodrings, one on 
each of a dozen machines, would they work out among themselves how many 
should be DataNodes and how many should be TaskTrackers?
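
For concreteness, here is a minimal sketch of the loop I have in mind 
(hypothetical names; the hodring command line is a placeholder, and I'm 
assuming Grid Engine's pe_hostfile lists one allocated host per line):

    import os
    import subprocess

    def run_workers(hodring_cmd):
        # Grid Engine writes the allocated hosts to $PE_HOSTFILE; each
        # line is "<hostname> <slots> <queue> <processor-range>".
        with open(os.environ['PE_HOSTFILE']) as hostfile:
            hosts = [line.split()[0] for line in hostfile if line.strip()]
        # qrsh -inherit starts a command on a host that is already part
        # of this job's allocation, without going back to the scheduler.
        return [subprocess.Popen(['qrsh', '-inherit', host] + hodring_cmd)
                for host in hosts]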

One more thing.  If the above is on the mark, that means you're 
consuming a queue slot for each DataNode unless you use an external hdfs 
service.  That seems like a waste of cluster resources since slots tend 
to correspond more to compute resources than I/O.  I have to wonder if 
it wouldn't be more efficient from a cluster perspective to have each 
hodring start a DataNode and a TaskTracker.  It would slightly 
oversubscribe that job slot, but that may be better than grossly 
undersubscribing two.

Thanks,
Daniel

Re: GridEngine module for Hadoop on Demand

Posted by Hemanth Yamijala <yh...@yahoo-inc.com>.
Daniel Templeton wrote:
> Thanks for the reply.  Grid Engine works roughly the same way wrt 
> parallel jobs, except that we call the tasks the master and the 
> slaves.  Grid Engine does not have a pbsdsh equivalent, but it would 
> be a really trivial wrapper script to write for qrsh, which is pbsdsh 
> minus the automatic use of the nodes file (called the pe_hostfile in 
> Grid Engine).
>
> I assume from the 3-task minimum that the JobTracker gets a slot, the 
> NameNode gets a slot, and there has to be at least one slot running a 
> DataNode/TaskTracker.  Correct?  
Yes
> Should a single job be prevented from running more than one hodring on 
> a single host?
>
More than one hodring can be launched on a single host. However, this 
means more than one instance of a slave would get launched - like two 
tasktrackers and two datanodes. In practice, we've seen that while this 
works, running M/R tasks on such a system slows it down quite a bit. 
Hence I don't think this is really useful.
> How do I go about contributing this Grid Engine extension to the HoD 
> source base?
>
Please feel free to submit a patch if you've figured out all the 
details. It should be against the current code base. Please refer to 
http://wiki.apache.org/hadoop/HowToContribute for details on contributing.

Thanks!
Hemanth
> Hemanth Yamijala wrote:
>> Daniel Templeton wrote:
>>> Hi,
>>>
>>> I have a functioning module for Grid Engine for HoD, but some parts 
>>> of it are currently hard-coded to my workstation.  In cleaning up 
>>> those elements, I need some advice.  Hopefully this is the right forum.
>>>
>>> So, in the hodlib/NodePools/torque.py file, there's a runWorkers() 
>>> method.  In that method, it makes a single call to pbsdsh to start 
>>> the NameNode, DataNodes, JobTracker, and TaskTracker.  I know nada 
>>> about Torque, so please tell me if I'm interpreting this correctly.  
>>> It would appear that the pbsdsh somehow reads out of the environment 
>>> how many hodring processes it should start up and executes them 
>>> remotely, and each hodring then figures out what service it should run.
>>>
>> Roughly right. In Torque, when a set of nodes is assigned to a job, 
>> the first node in that list is special (it's called mother superior - 
>> MS), the other nodes are called sisters. The job that's submitted to 
>> torque is a HOD process called 'ringmaster'. The ringmaster starts on 
>> the MS and invokes runWorkers which executes pbsdsh. AFAIK, pbsdsh 
>> reads the environment and gets a 'nodes' file that Torque writes out. 
>> This file contains all the sisters allocated for the job (including 
>> the MS). It executes the command passed to pbsdsh - another HOD 
>> process, called hodring - on all of these nodes. The Hodring 
>> processes work with the ringmaster and decide which service to run. 
>> In a sense the ringmaster coordinates which service to start where, 
>> and informs the hodring to start that service.
>>> In Grid Engine, the rough equivalent of pbsdsh is qrsh.  (I think.)  
>>> With qrsh, the master assigns the HoD job a set of nodes, and I then 
>>> have to step through that set of nodes and qrsh to each one to start 
>>> the hodring services.  As far as I can tell, the total number of 
>>> hodring services I need to start is 1 for the NameNode + 1 for the 
>>> JobTracker + n for the DataNodes + m for the TaskTrackers.  
>> HOD has a facility to use an HDFS service that's started outside of 
>> HOD. In that mode, it does not start NameNode or DataNodes. Also, the 
>> number of DataNodes always equals the number of TaskTrackers (if HDFS 
>> services are started with HOD).
>>
>>> The thing that I'm not grokking is how the hodrings know what 
>>> services to start, and how I should be parceling them out across the 
>>> nodes of the cluster.  
>> This is decided by the ringmaster process. The logic is independent 
>> of the resource manager in use, and hence does not need to change 
>> when porting to a new resource manager.
>>
>>> Should I be making sure I have two hodrings per node, one for the 
>>> DataNode and one for the TaskTracker?  
>> No, a single hodring starts both daemons.
>>
>>> If I were to go start a dozen hodrings, one on each of a dozen 
>>> machines, would they work out among themselves how many should be 
>>> DataNodes and how many should be TaskTrackers? One more thing.  If 
>>> the above is on the mark, that means you're consuming a queue slot 
>>> for each DataNode unless you use an external hdfs service.  That 
>>> seems like a waste of cluster resources since slots tend to 
>>> correspond more to compute resources than I/O.  I have to wonder if 
>>> it wouldn't be more efficient from a cluster perspective to have 
>>> each hodring start a DataNode and a TaskTracker.  It would slightly 
>>> oversubscribe that job slot, but that may be better than grossly 
>>> undersubscribing two.
>>>
>> Explained above.
>>
>> Thanks
>> Hemanth


Re: GridEngine module for Hadoop on Demand

Posted by Daniel Templeton <Da...@Sun.COM>.
Thanks for the reply.  Grid Engine works roughly the same way wrt 
parallel jobs, except that we call the tasks the master and the slaves.  
Grid Engine does not have a pbsdsh equivalent, but it would be a really 
trivial wrapper script to write for qrsh, which is pbsdsh minus the 
automatic use of the nodes file (called the pe_hostfile in Grid Engine).
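
Something like this, say (a hypothetical sketch; real pbsdsh runs the 
command once per allocated slot rather than once per host, so this is 
only an approximation):

    #!/usr/bin/env python
    # Hypothetical pbsdsh-style wrapper for Grid Engine: reads the
    # pe_hostfile automatically and runs the given command on every
    # allocated host via qrsh -inherit.
    import os
    import subprocess
    import sys

    with open(os.environ['PE_HOSTFILE']) as hostfile:
        hosts = [line.split()[0] for line in hostfile if line.strip()]
    procs = [subprocess.Popen(['qrsh', '-inherit', host] + sys.argv[1:])
             for host in hosts]
    # Mirror pbsdsh by reporting failure if any remote command fails.
    sys.exit(max(proc.wait() for proc in procs))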

I assume from the 3-task minimum that the JobTracker gets a slot, the 
NameNode gets a slot, and there has to be at least one slot running a 
DataNode/TaskTracker.  Correct?  Should a single job be prevented from 
running more than one hodring on a single host?

How do I go about contributing this Grid Engine extension to the HoD 
source base?

Daniel

Hemanth Yamijala wrote:
> Daniel Templeton wrote:
>> Hi,
>>
>> I have a functioning module for Grid Engine for HoD, but some parts 
>> of it are currently hard-coded to my workstation.  In cleaning up 
>> those elements, I need some advice.  Hopefully this is the right forum.
>>
>> So, in the hodlib/NodePools/torque.py file, there's a runWorkers() 
>> method.  In that method, it makes a single call to pbsdsh to start 
>> the NameNode, DataNodes, JobTracker, and TaskTracker.  I know nada 
>> about Torque, so please tell me if I'm interpreting this correctly.  
>> It would appear that the pbsdsh somehow reads out of the environment 
>> how many hodring processes it should start up and executes them 
>> remotely, and each hodring then figures out what service it should run.
>>
> Roughly right. In Torque, when a set of nodes is assigned to a job, 
> the first node in that list is special (it's called mother superior - 
> MS), the other nodes are called sisters. The job that's submitted to 
> torque is a HOD process called 'ringmaster'. The ringmaster starts on 
> the MS and invokes runWorkers which executes pbsdsh. AFAIK, pbsdsh 
> reads the environment and gets a 'nodes' file that Torque writes out. 
> This file contains all the sisters allocated for the job (including 
> the MS). It executes the command passed to pbsdsh - another HOD 
> process, called hodring - on all of these nodes. The Hodring processes 
> work with the ringmaster and decide which service to run. In a sense 
> the ringmaster coordinates which service to start where, and informs 
> the hodring to start that service.
>> In Grid Engine, the rough equivalent of pbsdsh is qrsh.  (I think.)  
>> With qrsh, the master assigns the HoD job a set of nodes, and I then 
>> have to step through that set of nodes and qrsh to each one to start 
>> the hodring services.  As far as I can tell, the total number of 
>> hodring services I need to start is 1 for the NameNode + 1 for the 
>> JobTracker + n for the DataNodes + m for the TaskTrackers.  
> HOD has a facility to use an HDFS service that's started outside of 
> HOD. In that mode, it does not start NameNode or DataNodes. Also, the 
> number of DataNodes always equals the number of TaskTrackers (if HDFS 
> services are started with HOD).
>
>> The thing that I'm not grokking is how the hodrings know what 
>> services to start, and how I should be parceling them out across the 
>> nodes of the cluster.  
> This is decided by the ringmaster process. The logic is independent of 
> the resource manager in use, and hence does not need to change when 
> porting to a new resource manager.
>
>> Should I be making sure I have two hodrings per node, one for the 
>> DataNode and one for the TaskTracker?  
> No, a single hodring starts both daemons.
>
>> If I were to go start a dozen hodrings, one on each of a dozen 
>> machines, would they work out among themselves how many should be 
>> DataNodes and how many should be TaskTrackers? One more thing.  If 
>> the above is on the mark, that means you're consuming a queue slot 
>> for each DataNode unless you use an external hdfs service.  That 
>> seems like a waste of cluster resources since slots tend to 
>> correspond more to compute resources than I/O.  I have to wonder if 
>> it wouldn't be more efficient from a cluster perspective to have each 
>> hodring start a DataNode and a TaskTracker.  It would slightly 
>> oversubscribe that job slot, but that may be better than grossly 
>> undersubscribing two.
>>
> Explained above.
>
> Thanks
> Hemanth

Re: GridEngine module for Hadoop on Demand

Posted by Hemanth Yamijala <yh...@yahoo-inc.com>.
Daniel Templeton wrote:
> Hi,
>
> I have a functioning module for Grid Engine for HoD, but some parts of 
> it are currently hard-coded to my workstation.  In cleaning up those 
> elements, I need some advice.  Hopefully this is the right forum.
>
> So, in the hodlib/NodePools/torque.py file, there's a runWorkers() 
> method.  In that method, it makes a single call to pbsdsh to start the 
> NameNode, DataNodes, JobTracker, and TaskTracker.  I know nada about 
> Torque, so please tell me if I'm interpreting this correctly.  It 
> would appear that the pbsdsh somehow reads out of the environment how 
> many hodring processes it should start up and executes them remotely, 
> and each hodring then figures out what service it should run.
>
Roughly right. In Torque, when a set of nodes is assigned to a job, the 
first node in that list is special (it's called mother superior - MS), 
the other nodes are called sisters. The job that's submitted to torque 
is a HOD process called 'ringmaster'. The ringmaster starts on the MS 
and invokes runWorkers which executes pbsdsh. AFAIK, pbsdsh reads the 
environment and gets a 'nodes' file that Torque writes out. This file 
contains all the sisters allocated for the job (including the MS). It 
executes the command passed to pbsdsh - another HOD process, called 
hodring - on all of these nodes. The Hodring processes work with the 
ringmaster and decide which service to run. In a sense the ringmaster 
coordinates which service to start where, and informs the hodring to 
start that service.
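
As a rough sketch of that flow (not HoD's actual torque.py code; the 
hodring command line here is a placeholder), runWorkers boils down to a 
single fan-out call:

    import subprocess

    def runWorkers(hodring_cmd):
        # Runs on the mother superior. pbsdsh locates the job's nodes
        # file on its own (Torque exposes it via $PBS_NODEFILE) and
        # executes the command on every allocated node, MS included.
        return subprocess.call(['pbsdsh'] + hodring_cmd)
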
> In Grid Engine, the rough equivalent of pbsdsh is qrsh.  (I think.)  
> With qrsh, the master assigns the HoD job a set of nodes, and I then 
> have to step through that set of nodes and qrsh to each one to start 
> the hodring services.  As far as I can tell, the total number of 
> hodring services I need to start is 1 for the NameNode + 1 for the 
> JobTracker + n for the DataNodes + m for the TaskTrackers.  
HOD has a facility to use an HDFS service that's started outside of HOD. 
In that mode, it does not start NameNode or DataNodes. Also, the number 
of DataNodes always equals the number of TaskTrackers (if HDFS services 
are started with HOD).
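
If memory serves, the external mode is selected in the hodrc 
configuration, roughly as below; the option names and values here are 
illustrative and may differ across HoD versions:

    [gridservice-hdfs]
    # Hypothetical example: point HOD at an HDFS instance running
    # outside the allocation instead of provisioning one inside it.
    external = True
    host = namenode.example.com
    fs_port = 8020
    info_port = 50070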

> The thing that I'm not grokking is how the hodrings know what services 
> to start, and how I should be parceling them out across the nodes of 
> the cluster.  
This is decided by the ringmaster process. The logic is independent of 
the resource manager in use, and hence does not need to change when 
porting to a new resource manager.

> Should I be making sure I have two hodrings per node, one for the 
> DataNode and one for the TaskTracker?  
No, a single hodring starts both daemons.

> If I were to go start a dozen hodrings, one on each of a dozen 
> machines, would they work out among themselves how many should be 
> DataNodes and how many should be TaskTrackers? One more thing.  If the 
> above is on the mark, that means you're consuming a queue slot for 
> each DataNode unless you use an external hdfs service.  That seems 
> like a waste of cluster resources since slots tend to correspond more 
> to compute resources than I/O.  I have to wonder if it wouldn't be 
> more efficient from a cluster perspective to have each hodring start a 
> DataNode and a TaskTracker.  It would slightly oversubscribe that job 
> slot, but that may be better than grossly undersubscribing two.
>
Explained above.

Thanks
Hemanth