Posted to common-user@hadoop.apache.org by David Milne <d....@gmail.com> on 2010/06/10 04:56:57 UTC

Problems with HOD and HDFS

Hi there,

I am trying to get Hadoop on Demand up and running, but am having
problems with the ringmaster not being able to communicate with HDFS.

The output from the hod allocate command ends with this, with full verbosity:

[2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve
'hdfs' service address.
[2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id
34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
[2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
[2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
[2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate
cluster /home/dmilne/hadoop/cluster
[2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7


I've attached the hodrc file below, but briefly HOD is supposed to
provision an HDFS cluster as well as a Map/Reduce cluster, and seems
to be failing to do so. The ringmaster log looks like this:

[2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
[2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr
service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
[2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr
addr hdfs: not found
[2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
[2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr
service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
[2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr
addr hdfs: not found

... and so on, until it gives up

Any ideas why? One red flag is that when running the allocate command,
some of the variables echoed back look dodgy:

--gridservice-hdfs.fs_port 0
--gridservice-hdfs.host localhost
--gridservice-hdfs.info_port 0

These are not what I specified in the hodrc. Are the port numbers just
set to 0 because I am not using an external HDFS, or is this a
problem?


The software versions involved are:
 - Hadoop 0.20.2
 - Python 2.5.2 (no Twisted)
 - Java 1.6.0_20
 - Torque 2.4.5


The hodrc file looks like this:

[hod]
stream                          = True
java-home                       = /opt/jdk1.6.0_20
cluster                         = debian5
cluster-factor                  = 1.8
xrs-port-range                  = 32768-65536
debug                           = 3
allocate-wait-time              = 3600
temp-dir                        = /scratch/local/dmilne/hod

[ringmaster]
register                        = True
stream                          = False
temp-dir                        = /scratch/local/dmilne/hod
log-dir                         = /scratch/local/dmilne/hod/log
http-port-range                 = 8000-9000
idleness-limit                  = 864000
work-dirs                       =
/scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
xrs-port-range                  = 32768-65536
debug                           = 4

[hodring]
stream                          = False
temp-dir                        = /scratch/local/dmilne/hod
log-dir                         = /scratch/local/dmilne/hod/log
register                        = True
java-home                       = /opt/jdk1.6.0_20
http-port-range                 = 8000-9000
xrs-port-range                  = 32768-65536
debug                           = 4

[resource_manager]
queue                           = express
batch-home                      = /opt/torque-2.4.5
id                              = torque
options                         = l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
#env-vars                       =
HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python

[gridservice-mapred]
external                        = False
pkgs                            = /opt/hadoop-0.20.2
tracker_port                    = 8030
info_port                       = 50080

[gridservice-hdfs]
external                        = False
pkgs                            = /opt/hadoop-0.20.2
fs_port                         = 8020
info_port                       = 50070

Cheers,
Dave

Re: Problems with HOD and HDFS

Posted by Steve Loughran <st...@apache.org>.
David Milne wrote:
> Is there something else I could read about setting up short-lived
> Hadoop clusters on virtual machines? I have no experience with VMs at
> all. I see there is quite a bit of material about using them to get
> Hadoop up and running with a pseudo-cluster on a single machine, but I
> don't follow how this stretches out to using multiple machines
> allocated by Torque.

My slides are up here
http://www.slideshare.net/steve_l/farming-hadoop-inthecloud

We've been bringing up Hadoop in a virtual infrastructure: first you ask 
for the master node containing a NN, a JT and a DN with almost no 
storage (just enough for the filesystem to go live, to stop the JT 
blocking). If it comes up you then have a stable hostname for the 
filesystem, which you can use for all the real worker nodes (DN + TT) you 
want.

Some nearby physicists are trying to get Hadoop to co-exist with the 
grid schedulers. I've added a feature request to make the reporting of 
task tracker slots something plugins can handle, so that you'd have a 
set of hadoop workers which could be used by the grid apps or by hadoop 
-with physical hadoop storage. When they were doing work scheduled out 
of hadoop, they'd report less availability to the Job Tracker, so as 
not to overload the machines.

Dan Templeton of Sun/Oracle has been working on getting Hadoop to 
coexist with his resource manager -he's worth contacting. Maybe we could 
persuade him to give a public online talk on the topic.

-steve


Re: Problems with HOD and HDFS

Posted by David Milne <d....@gmail.com>.
Is there something else I could read about setting up short-lived
Hadoop clusters on virtual machines? I have no experience with VMs at
all. I see there is quite a bit of material about using them to get
Hadoop up and running with a pseudo-cluster on a single machine, but I
don't follow how this stretches out to using multiple machines
allocated by Torque.

Thanks,
Dave


Re: job execution

Posted by Akash Deep Shakya <ak...@gmail.com>.
Use the ControlledJob class from Hadoop trunk, and run it through JobControl.

Regards
Akash Deep Shakya "OpenAK"
FOSS Nepal Community
akashakya at gmail dot com

~ Failure to prepare is preparing to fail ~




Re: job execution

Posted by Akash Deep Shakya <ak...@gmail.com>.
@Jeff, I think JobConf is already deprecated.
org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob and
org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl can be used instead.
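For what it's worth, a minimal sketch of chaining two jobs with those
classes (the job names, configs and polling interval here are just
placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class ChainExample {
  public static void main(String[] args) throws Exception {
    Job first = new Job(new Configuration(), "first");
    Job second = new Job(new Configuration(), "second");
    // ... set mapper/reducer/input/output paths on both jobs as usual ...

    // Wrap each Job in a ControlledJob and declare the dependency:
    ControlledJob cFirst = new ControlledJob(first, null);
    ControlledJob cSecond = new ControlledJob(second, null);
    cSecond.addDependingJob(cFirst); // second starts only after first succeeds

    JobControl control = new JobControl("chain");
    control.addJob(cFirst);
    control.addJob(cSecond);

    // JobControl is a Runnable: run it in a thread and poll until done.
    new Thread(control).start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    control.stop();
  }
}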

Regards
Akash Deep Shakya "OpenAK"
FOSS Nepal Community
akashakya at gmail dot com

~ Failure to prepare is preparing to fail ~




Re: job execution

Posted by Jeff Zhang <zj...@gmail.com>.
There's a class org.apache.hadoop.mapred.jobcontrol.Job, which is a
wrapper of JobConf. You add dependent jobs to it, then put it into a
JobControl.
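As a sketch (the JobConfs here are placeholders, to be configured as
usual; this would sit in e.g. a main() that throws Exception):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

JobConf firstConf = new JobConf();    // configure input/output/mapper etc.
JobConf secondConf = new JobConf();

Job first = new Job(firstConf);
Job second = new Job(secondConf);
second.addDependingJob(first);        // second waits until first succeeds

JobControl control = new JobControl("chain");
control.addJob(first);
control.addJob(second);
new Thread(control).start();          // JobControl is a Runnable; poll
                                      // control.allFinished() to know when done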







-- 
Best Regards

Jeff Zhang

job execution

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi,
According to the doc, JobControl can maintain the dependency among different jobs and only jobs without dependency can execute. How does JobControl maintain the dependency and how can we indicate the dependency?

Thanks,
-Gang



      

Re: Problems with HOD and HDFS

Posted by Steve Loughran <st...@apache.org>.
Edward Capriolo wrote:

> 
> I have not used it much, but I think HOD is pretty cool. I guess most people
> who are looking to (spin up, run job, transfer off, spin down) are using
> EC2. HOD does something like make private hadoop clouds on your hardware and
> many probably do not have that use case. As schedulers advance and get
> better HOD becomes less attractive, but I can always see a place for it.

I don't know who is using it, or maintaining it; we've been bringing up 
short-lived Hadoop clusters differently.

I think I should write a little article on the topic; I presented about 
it at Berlin Buzzwords last week.

Short lived Hadoop clusters on VMs are fine if you don't have enough 
data or CPU load to justify a set of dedicated physical machines, and is 
a good way of experimenting with Hadoop at scale. You can maybe lock 
down the network better too, though that depends on your VM infrastructure.

Where VMs are weak is in disk IO performance, but there's no reason why 
the VM infrastructure can't take a list of filenames/directories as a 
hint for VM placement (placement is the new scheduling, incidentally), 
and virtualized IO can only improve. If you can run Hadoop MapReduce 
directly against SAN-mounted storage then you can stop worrying about 
locality of data and still gain from parallelisation of the operations.


-steve



Re: Problems with HOD and HDFS

Posted by Edward Capriolo <ed...@gmail.com>.

I have not used it much, but I think HOD is pretty cool. I guess most people
who are looking to (spin up, run job, transfer off, spin down) are using
EC2. HOD does something like make private hadoop clouds on your hardware and
many probably do not have that use case. As schedulers advance and get
better HOD becomes less attractive, but I can always see a place for it.

Re: Problems with HOD and HDFS

Posted by Edward Capriolo <ed...@gmail.com>.

>>but I don't follow how this stretches out to using multiple machines
allocated by Torque.

Hadoop does not have a concept of virtual hosting. The NameNode has a port,
the JobTracker has a port, the DataNode uses a port and has a port for the
web interface, and the TaskTracker is the same deal. Running multiple copies
of hadoop on the same machine is "easy". All you have to do is make sure
they do not step on each other: make sure they do not write to the same
folder locations, and make sure they do not use the same ports.

Single setup
NameNode: 9000 Web: 50070
JobTracker: 1000 Web: 50030
...

Multi Setup

Setup 1
NameNode: 9001 Web: 50071
JobTracker: 1001 Web: 50031
...

Setup2
NameNode: 9002 Web: 50072
JobTracker: 1002 Web: 50032
...
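To make that concrete, here is a sketch of the per-instance settings
(property names as in Hadoop 0.20's hadoop-site.xml; "master" and the
port values are just the example numbers above):

<!-- Setup 1 -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9001</value>
</property>
<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50071</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>master:1001</value>
</property>
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50031</value>
</property>

plus per-instance dfs.name.dir, dfs.data.dir and mapred.local.dir values
so the two setups never share folders.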

HOD is supposed to handle the "dirty" work for you: building the
configuration files, installing Hadoop on the nodes, and starting the
Hadoop components. You could theoretically accomplish similar things with
remote SSH keys and a boatload of scripting. HOD is a deployment and
management tool.

It sounds like it may not meet your need. Is your goal to deploy and
manage one instance of Hadoop, or multiple instances? HOD is designed to
install multiple instances of Hadoop on a single set of hardware. It
sounds like you want to deploy one cluster per group of VMs, which is not
really the same thing.

Re: Problems with HOD and HDFS

Posted by Jason Stowe <js...@cyclecomputing.com>.
Hi David,
The original HOD project was integrated with Condor (
http://bit.ly/CondorProject), which Yahoo! was using to schedule clusters.

A year or two ago the Condor project (which is open source, with no
licensing costs) created close integration with Hadoop (as has SGE), as
presented by me at a prior Hadoop World and by the Condor team at Condor
Week 2010:
http://bit.ly/Condor_Hadoop_CondorWeek2010

My company has solutions for deploying Hadoop Clusters on shared
infrastructure using CycleServer and schedulers like Condor/SGE/etc. The
general deployment strategy is to deploy head nodes (Name/Job Tracker), then
execute nodes, and to be careful about how you deal with
data/sizing/replication counts.

If you're interested in this, please feel free to drop us a line at my
e-mail or http://cyclecomputing.com/about/contact

Thanks,
Jason





-- 

==================================
Jason A. Stowe
cell: 607.227.9686
main: 888.292.5320

http://twitter.com/jasonastowe/
http://twitter.com/cyclecomputing/

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com

Re: Problems with HOD and HDFS

Posted by David Milne <d....@gmail.com>.
Unless I am missing something, the Fair Share and Capacity schedulers
sound like a solution to a different problem: aren't they for a
dedicated Hadoop cluster that needs to be shared by lots of people? I
have a general purpose cluster that needs to be shared by lots of
people. Only one of them (me) wants to run Hadoop, and only
intermittently. I'm not concerned with data locality, as my
workflow is:

1) upload data I need to process to cluster
2) run a chain of map-reduce tasks
3) grab processed data from cluster
4) clean up cluster
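In commands, each cycle is roughly the following (a sketch; the paths,
node count and jar/job names are made up):

hod allocate -d ~/hadoop/cluster -n 20        # get nodes from Torque
hadoop fs -put /local/input input             # 1) upload
hadoop jar myjob.jar MyJob input output       # 2) run the chain
hadoop fs -get output /local/output           # 3) grab results
hod deallocate -d ~/hadoop/cluster            # 4) clean up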

Mesos sounds good, but I am definitely NOT brave about this. As I
said, I am just one user of the cluster among many. I would want to
stick with Torque and Maui for resource management.

- Dave


Re: Problems with HOD and HDFS

Posted by Amr Awadallah <aa...@cloudera.com>.
Dave,

  Yes, many others have the same situation; the recommended solution is
either to use the Fair Share Scheduler or the Capacity Scheduler. These
schedulers are much better than HOD since they take data locality into
consideration (they don't just spin up 20 TT nodes on machines that have
nothing to do with your data). They also don't lock down the nodes just for
you, so as TTs are freed other jobs can use them immediately (as opposed to
nobody being able to use them till your entire job is done).

  Also, if you are brave and want to try something spanking new, then I
recommend you reach out to the Mesos guys; they have a scheduler layer under
Hadoop that is data locality aware:

http://mesos.berkeley.edu/
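For reference, pointing the JobTracker at one of these schedulers is a
one-property change (class names as shipped in the 0.20 contrib
schedulers; the scheduler jar also has to be on the JobTracker's
classpath):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
  <!-- or org.apache.hadoop.mapred.CapacityTaskScheduler -->
</property>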

-- amr


Re: Problems with HOD and HDFS

Posted by Vinod KV <vi...@yahoo-inc.com>.
On Monday 14 June 2010 09:51 AM, David Milne wrote:
> Ok, thanks Jeff.
>
> This is pretty surprising though. I would have thought many people
> would be in my position, where they have to use Hadoop on a general
> purpose cluster, and need it to play nice with a resource manager?
> What do other people do in this position, if they don't use HOD?
> Deprecated normally means there is a better alternative.
>
> - Dave
>    


It isn't formally deprecated though. Maybe we'll need to do that 
explicitly; that'll help with putting up proper documentation about what 
else to use instead.

A quick reply is that you start a static cluster on a set of nodes. 
Static cluster means bringing up Hadoop daemons on a set of nodes using 
the startup scripts distributed in the bin/ directory.
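i.e. something like this with the 0.20 layout, after listing the worker
nodes in conf/slaves (a sketch, run from the Hadoop install directory):

bin/hadoop namenode -format    # once, on the namenode
bin/start-dfs.sh               # NN plus the DNs listed in conf/slaves
bin/start-mapred.sh            # JT plus the TTs
# ... run jobs ...
bin/stop-mapred.sh; bin/stop-dfs.sh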

That said, there are no changes in HOD in 0.21 and beyond. Deploying 
0.21 clusters should mostly work out of the box. But beyond 0.21 it may 
not work, because HOD needs to be updated w.r.t. removed/updated 
Hadoop-specific configuration parameters and the environment variables 
it generates itself.

HTH,
+vinod

> On Mon, Jun 14, 2010 at 2:39 PM, Jeff Hammerbacher<ha...@cloudera.com>  wrote:
>    
>> Hey Dave,
>>
>> I can't speak for the folks at Yahoo!, but from watching the JIRA, I don't
>> think HOD is actively used or developed anywhere these days. You're
>> attempting to use a mostly deprecated project, and hence not receiving any
>> support on the mailing list.
>>
>> Thanks,
>> Jeff
>>
>> On Sun, Jun 13, 2010 at 7:33 PM, David Milne<d....@gmail.com>  wrote:
>>
>>      
>>> Anybody? I am completely stuck here. I have no idea who else I can ask
>>> or where I can go for more information. Is there somewhere specific
>>> where I should be asking about HOD?
>>>
>>> Thank you,
>>> Dave
>>>
>>> On Thu, Jun 10, 2010 at 2:56 PM, David Milne<d....@gmail.com>  wrote:
>>>        
>>>> Hi there,
>>>>
>>>> I am trying to get Hadoop on Demand up and running, but am having
>>>> problems with the ringmaster not being able to communicate with HDFS.
>>>>
>>>> The output from the hod allocate command ends with this, with full
>>>>          
>>> verbosity:
>>>        
>>>> [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve
>>>> 'hdfs' service address.
>>>> [2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id
>>>> 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
>>>> [2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
>>>> [2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
>>>> [2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate
>>>> cluster /home/dmilne/hadoop/cluster
>>>> [2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7
>>>>
>>>>
>>>> I've attached the hodrc file below, but briefly HOD is supposed to
>>>> provision an HDFS cluster as well as a Map/Reduce cluster, and seems
>>>> to be failing to do so. The ringmaster log looks like this:
>>>>
>>>> [2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name:
>>>>          
>>> hdfs
>>>        
>>>> [2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr
>>>> service:<hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
>>>> [2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr
>>>> addr hdfs: not found
>>>> [2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name:
>>>>          
>>> hdfs
>>>        
>>>> [2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr
>>>> service:<hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
>>>> [2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr
>>>> addr hdfs: not found
>>>>
>>>> ... and so on, until it gives up
>>>>
>>>> Any ideas why? One red flag is that when running the allocate command,
>>>> some of the variables echo-ed back look dodgy:
>>>>
>>>> --gridservice-hdfs.fs_port 0
>>>> --gridservice-hdfs.host localhost
>>>> --gridservice-hdfs.info_port 0
>>>>
>>>> These are not what I specified in the hodrc. Are the port numbers just
>>>> set to 0 because I am not using an external HDFS, or is this a
>>>> problem?
>>>>
>>>>
>>>> The software versions involved are:
>>>>   - Hadoop 0.20.2
>>>>   - Python 2.5.2 (no Twisted)
>>>>   - Java 1.6.0_20
>>>>   - Torque 2.4.5
>>>>
>>>>
>>>> The hodrc file looks like this:
>>>>
>>>> [hod]
>>>> stream                          = True
>>>> java-home                       = /opt/jdk1.6.0_20
>>>> cluster                         = debian5
>>>> cluster-factor                  = 1.8
>>>> xrs-port-range                  = 32768-65536
>>>> debug                           = 3
>>>> allocate-wait-time              = 3600
>>>> temp-dir                        = /scratch/local/dmilne/hod
>>>>
>>>> [ringmaster]
>>>> register                        = True
>>>> stream                          = False
>>>> temp-dir                        = /scratch/local/dmilne/hod
>>>> log-dir                         = /scratch/local/dmilne/hod/log
>>>> http-port-range                 = 8000-9000
>>>> idleness-limit                  = 864000
>>>> work-dirs                       =
>>>> /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
>>>> xrs-port-range                  = 32768-65536
>>>> debug                           = 4
>>>>
>>>> [hodring]
>>>> stream                          = False
>>>> temp-dir                        = /scratch/local/dmilne/hod
>>>> log-dir                         = /scratch/local/dmilne/hod/log
>>>> register                        = True
>>>> java-home                       = /opt/jdk1.6.0_20
>>>> http-port-range                 = 8000-9000
>>>> xrs-port-range                  = 32768-65536
>>>> debug                           = 4
>>>>
>>>> [resource_manager]
>>>> queue                           = express
>>>> batch-home                      = /opt/torque-2.4.5
>>>> id                              = torque
>>>> options                         =
>>>>          
>>> l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
>>>        
>>>> #env-vars                       =
>>>> HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python
>>>>
>>>> [gridservice-mapred]
>>>> external                        = False
>>>> pkgs                            = /opt/hadoop-0.20.2
>>>> tracker_port                    = 8030
>>>> info_port                       = 50080
>>>>
>>>> [gridservice-hdfs]
>>>> external                        = False
>>>> pkgs                            = /opt/hadoop-0.20.2
>>>> fs_port                         = 8020
>>>> info_port                       = 50070
>>>>
>>>> Cheers,
>>>> Dave
>>>>
>>>>          
>>>        
>>      
>    


Re: Problems with HOD and HDFS

Posted by David Milne <d....@gmail.com>.
Ok, thanks Jeff.

This is pretty surprising though. I would have thought many people
would be in my position, where they have to use Hadoop on a general
purpose cluster, and need it to play nice with a resource manager?
What do other people do in this position, if they don't use HOD?
Deprecated normally means there is a better alternative.

- Dave

On Mon, Jun 14, 2010 at 2:39 PM, Jeff Hammerbacher <ha...@cloudera.com> wrote:
> Hey Dave,
>
> I can't speak for the folks at Yahoo!, but from watching the JIRA, I don't
> think HOD is actively used or developed anywhere these days. You're
> attempting to use a mostly deprecated project, and hence not receiving any
> support on the mailing list.
>
> Thanks,
> Jeff
>
> On Sun, Jun 13, 2010 at 7:33 PM, David Milne <d....@gmail.com> wrote:
>
>> Anybody? I am completely stuck here. I have no idea who else I can ask
>> or where I can go for more information. Is there somewhere specific
>> where I should be asking about HOD?
>>
>> Thank you,
>> Dave
>>
>> On Thu, Jun 10, 2010 at 2:56 PM, David Milne <d....@gmail.com> wrote:
>> > Hi there,
>> >
>> > I am trying to get Hadoop on Demand up and running, but am having
>> > problems with the ringmaster not being able to communicate with HDFS.
>> >
>> > The output from the hod allocate command ends with this, with full
>> verbosity:
>> >
>> > [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve
>> > 'hdfs' service address.
>> > [2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id
>> > 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
>> > [2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
>> > [2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
>> > [2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate
>> > cluster /home/dmilne/hadoop/cluster
>> > [2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7
>> >
>> >
>> > I've attached the hodrc file below, but briefly HOD is supposed to
>> > provision an HDFS cluster as well as a Map/Reduce cluster, and seems
>> > to be failing to do so. The ringmaster log looks like this:
>> >
>> > [2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name:
>> hdfs
>> > [2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr
>> > service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
>> > [2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr
>> > addr hdfs: not found
>> > [2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name:
>> hdfs
>> > [2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr
>> > service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
>> > [2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr
>> > addr hdfs: not found
>> >
>> > ... and so on, until it gives up
>> >
>> > Any ideas why? One red flag is that when running the allocate command,
>> > some of the variables echo-ed back look dodgy:
>> >
>> > --gridservice-hdfs.fs_port 0
>> > --gridservice-hdfs.host localhost
>> > --gridservice-hdfs.info_port 0
>> >
>> > These are not what I specified in the hodrc. Are the port numbers just
>> > set to 0 because I am not using an external HDFS, or is this a
>> > problem?
>> >
>> >
>> > The software versions involved are:
>> >  - Hadoop 0.20.2
>> >  - Python 2.5.2 (no Twisted)
>> >  - Java 1.6.0_20
>> >  - Torque 2.4.5
>> >
>> >
>> > The hodrc file looks like this:
>> >
>> > [hod]
>> > stream                          = True
>> > java-home                       = /opt/jdk1.6.0_20
>> > cluster                         = debian5
>> > cluster-factor                  = 1.8
>> > xrs-port-range                  = 32768-65536
>> > debug                           = 3
>> > allocate-wait-time              = 3600
>> > temp-dir                        = /scratch/local/dmilne/hod
>> >
>> > [ringmaster]
>> > register                        = True
>> > stream                          = False
>> > temp-dir                        = /scratch/local/dmilne/hod
>> > log-dir                         = /scratch/local/dmilne/hod/log
>> > http-port-range                 = 8000-9000
>> > idleness-limit                  = 864000
>> > work-dirs                       =
>> > /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
>> > xrs-port-range                  = 32768-65536
>> > debug                           = 4
>> >
>> > [hodring]
>> > stream                          = False
>> > temp-dir                        = /scratch/local/dmilne/hod
>> > log-dir                         = /scratch/local/dmilne/hod/log
>> > register                        = True
>> > java-home                       = /opt/jdk1.6.0_20
>> > http-port-range                 = 8000-9000
>> > xrs-port-range                  = 32768-65536
>> > debug                           = 4
>> >
>> > [resource_manager]
>> > queue                           = express
>> > batch-home                      = /opt/torque-2.4.5
>> > id                              = torque
>> > options                         =
>> l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
>> > #env-vars                       =
>> > HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python
>> >
>> > [gridservice-mapred]
>> > external                        = False
>> > pkgs                            = /opt/hadoop-0.20.2
>> > tracker_port                    = 8030
>> > info_port                       = 50080
>> >
>> > [gridservice-hdfs]
>> > external                        = False
>> > pkgs                            = /opt/hadoop-0.20.2
>> > fs_port                         = 8020
>> > info_port                       = 50070
>> >
>> > Cheers,
>> > Dave
>> >
>>
>

Re: Problems with HOD and HDFS

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey Dave,

I can't speak for the folks at Yahoo!, but from watching the JIRA, I don't
think HOD is actively used or developed anywhere these days. You're
attempting to use a mostly deprecated project, and hence not receiving any
support on the mailing list.

Thanks,
Jeff

On Sun, Jun 13, 2010 at 7:33 PM, David Milne <d....@gmail.com> wrote:

> Anybody? I am completely stuck here. I have no idea who else I can ask
> or where I can go for more information. Is there somewhere specific
> where I should be asking about HOD?
>
> Thank you,
> Dave
>
> On Thu, Jun 10, 2010 at 2:56 PM, David Milne <d....@gmail.com> wrote:
> > Hi there,
> >
> > I am trying to get Hadoop on Demand up and running, but am having
> > problems with the ringmaster not being able to communicate with HDFS.
> >
> > The output from the hod allocate command ends with this, with full
> verbosity:
> >
> > [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve
> > 'hdfs' service address.
> > [2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id
> > 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
> > [2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
> > [2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
> > [2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate
> > cluster /home/dmilne/hadoop/cluster
> > [2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7
> >
> >
> > I've attached the hodrc file below, but briefly HOD is supposed to
> > provision an HDFS cluster as well as a Map/Reduce cluster, and seems
> > to be failing to do so. The ringmaster log looks like this:
> >
> > [2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name:
> hdfs
> > [2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr
> > service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
> > [2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr
> > addr hdfs: not found
> > [2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name:
> hdfs
> > [2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr
> > service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
> > [2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr
> > addr hdfs: not found
> >
> > ... and so on, until it gives up
> >
> > Any ideas why? One red flag is that when running the allocate command,
> > some of the variables echo-ed back look dodgy:
> >
> > --gridservice-hdfs.fs_port 0
> > --gridservice-hdfs.host localhost
> > --gridservice-hdfs.info_port 0
> >
> > These are not what I specified in the hodrc. Are the port numbers just
> > set to 0 because I am not using an external HDFS, or is this a
> > problem?
> >
> >
> > The software versions involved are:
> >  - Hadoop 0.20.2
> >  - Python 2.5.2 (no Twisted)
> >  - Java 1.6.0_20
> >  - Torque 2.4.5
> >
> >
> > The hodrc file looks like this:
> >
> > [hod]
> > stream                          = True
> > java-home                       = /opt/jdk1.6.0_20
> > cluster                         = debian5
> > cluster-factor                  = 1.8
> > xrs-port-range                  = 32768-65536
> > debug                           = 3
> > allocate-wait-time              = 3600
> > temp-dir                        = /scratch/local/dmilne/hod
> >
> > [ringmaster]
> > register                        = True
> > stream                          = False
> > temp-dir                        = /scratch/local/dmilne/hod
> > log-dir                         = /scratch/local/dmilne/hod/log
> > http-port-range                 = 8000-9000
> > idleness-limit                  = 864000
> > work-dirs                       =
> > /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
> > xrs-port-range                  = 32768-65536
> > debug                           = 4
> >
> > [hodring]
> > stream                          = False
> > temp-dir                        = /scratch/local/dmilne/hod
> > log-dir                         = /scratch/local/dmilne/hod/log
> > register                        = True
> > java-home                       = /opt/jdk1.6.0_20
> > http-port-range                 = 8000-9000
> > xrs-port-range                  = 32768-65536
> > debug                           = 4
> >
> > [resource_manager]
> > queue                           = express
> > batch-home                      = /opt/torque-2.4.5
> > id                              = torque
> > options                         =
> l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
> > #env-vars                       =
> > HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python
> >
> > [gridservice-mapred]
> > external                        = False
> > pkgs                            = /opt/hadoop-0.20.2
> > tracker_port                    = 8030
> > info_port                       = 50080
> >
> > [gridservice-hdfs]
> > external                        = False
> > pkgs                            = /opt/hadoop-0.20.2
> > fs_port                         = 8020
> > info_port                       = 50070
> >
> > Cheers,
> > Dave
> >
>

Re: Problems with HOD and HDFS

Posted by Vinod KV <vi...@yahoo-inc.com>.
On Tuesday 15 June 2010 04:19 AM, David Milne wrote:
> [2010-06-15 10:07:52,470] DEBUG/10 torque:147 - pbsdsh command:
> /opt/torque-2.4.5/bin/pbsdsh
> /home/dmilne/hadoop/hadoop-0.20.1/contrib/hod/bin/hodring
> --hodring.tarball-retry-initial-time 1.0
> --hodring.cmd-retry-initial-time 2.0 --hodring.cmd-retry-interval 2.0
> --hodring.service-id 34350.symphony.cs.waikato.ac.nz
> --hodring.temp-dir /scratch/local/dmilne/hod --hodring.http-port-range
> 8000-9000 --hodring.userid dmilne --hodring.java-home /opt/jdk1.6.0_20
> --hodring.svcrgy-addr symphony.cs.waikato.ac.nz:36372
> --hodring.download-addr h:t --hodring.tarball-retry-interval 3.0
> --hodring.log-dir /scratch/local/dmilne/hod/log
> --hodring.mapred-system-dir-root /mapredsystem
> --hodring.xrs-port-range 32768-65536 --hodring.debug 4
> --hodring.ringmaster-xrs-addr cn71:33771 --hodring.register
> [2010-06-15 10:07:52,475] DEBUG/10 ringMaster:929 - Returned from runWorkers.
>
> //chorus (many times)
>    

Did you mean the pbsdsh command itself was printed many times above? 
That should not happen.

I previously thought the hodrings could not start the NameNode, but it 
looks like the hodrings themselves failed to start up. You can do two 
things:
  - Check the qstat output, log into the slave nodes where your job was 
supposed to start, and look at the hodring logs there (see the sketch 
after this list).
  - Run the above hodring command yourself directly on those slave 
nodes and see if it fails with some error.
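
For example (a sketch; the job id and node name are taken from your 
logs, and the log path is the log-dir from your hodrc):

   qstat -f 34350.symphony.cs.waikato.ac.nz | grep exec_host  # nodes the job ran on
   ssh cn71
   ls /scratch/local/dmilne/hod/log                           # hodring logs should land here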

+Vinod

Re: Problems with HOD and HDFS

Posted by David Milne <d....@gmail.com>.
Thanks everyone for your replies.

Even though HOD looks like a dead end, I would prefer to use it. I am
just one user of the cluster among many, and currently the only one
using Hadoop. The jobs I need to run are pretty much one-off: they are
big jobs that I can't do without Hadoop, but I might need to run them
once a month or less. The ability to provision MapReduce and HDFS when
I need it sounds ideal.

Following Vinod's advice, I have rolled back to Hadoop 0.20.1 (the
last version that HOD kept up with) and taken a closer look at the
ringmaster logs. However, I am still getting the same problems as
before, and I can't find anything in the logs to help me identify the
NameNode.

The full ringmaster log is below. It's a pretty repetitive song, so
I've identified the chorus.

[2010-06-15 10:07:40,236] DEBUG/10 ringMaster:569 - Getting service ID.
[2010-06-15 10:07:40,237] DEBUG/10 ringMaster:573 - Got service ID:
34350.symphony.cs.waikato.ac.nz
[2010-06-15 10:07:40,239] DEBUG/10 ringMaster:756 - Command to
execute: /bin/cp /home/dmilne/hadoop/hadoop-0.20.1.tar.gz
/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster
[2010-06-15 10:07:42,314] DEBUG/10 ringMaster:762 - Completed command
execution. Exit Code: 0.
[2010-06-15 10:07:42,315] DEBUG/10 ringMaster:591 - Service registry @
http://symphony.cs.waikato.ac.nz:36372
[2010-06-15 10:07:47,503] DEBUG/10 ringMaster:726 - tarball name :
/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1.tar.gz
hadoop package name : hadoop-0.20.1/
[2010-06-15 10:07:47,505] DEBUG/10 ringMaster:716 - Returning Hadoop
directory as: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/
[2010-06-15 10:07:47,515] DEBUG/10 util:215 - Executing command
/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/bin/hadoop
version to find hadoop version
[2010-06-15 10:07:48,241] DEBUG/10 util:224 - Version from hadoop
command: Hadoop 0.20.1

[2010-06-15 10:07:48,244] DEBUG/10 ringMaster:117 - Using max-connect value 30
[2010-06-15 10:07:48,246] INFO/20 ringMaster:61 - Twisted interface
not found. Using hodXMLRPCServer.
[2010-06-15 10:07:48,257] DEBUG/10 ringMaster:73 - Ringmaster RPC
Server at 33771
[2010-06-15 10:07:48,265] DEBUG/10 ringMaster:121 - registering:
http://cn71:8030/hadoop-0.20.1.tar.gz
[2010-06-15 10:07:48,275] DEBUG/10 ringMaster:658 - dmilne
34350.symphony.cs.waikato.ac.nz cn71.symphony.cs.waikato.ac.nz
ringmaster hod
[2010-06-15 10:07:48,307] DEBUG/10 ringMaster:670 - Registered with
serivce registry: http://symphony.cs.waikato.ac.nz:36372.

//chorus start
[2010-06-15 10:07:48,393] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
[2010-06-15 10:07:48,394] DEBUG/10 ringMaster:487 - getServiceAddr
service: <hodlib.GridServices.hdfs.Hdfs instance at 0xc9e050>
[2010-06-15 10:07:48,395] DEBUG/10 ringMaster:504 - getServiceAddr
addr hdfs: not found
//chorus end

//chorus (3x)

[2010-06-15 10:07:51,461] DEBUG/10 ringMaster:726 - tarball name :
/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1.tar.gz
hadoop package name : hadoop-0.20.1/
[2010-06-15 10:07:51,463] DEBUG/10 ringMaster:716 - Returning Hadoop
directory as: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/
[2010-06-15 10:07:51,465] DEBUG/10 ringMaster:690 -
hadoopdir=/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/,
java-home=/opt/jdk1.6.0_20
[2010-06-15 10:07:51,470] DEBUG/10 util:215 - Executing command
/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/bin/hadoop
version to find hadoop version

//chorus (1x)

[2010-06-15 10:07:52,448] DEBUG/10 util:224 - Version from hadoop
command: Hadoop 0.20.1
[2010-06-15 10:07:52,450] DEBUG/10 ringMaster:697 - starting jt monitor
[2010-06-15 10:07:52,453] DEBUG/10 ringMaster:913 - Entered start method.
[2010-06-15 10:07:52,455] DEBUG/10 ringMaster:924 -
/home/dmilne/hadoop/hadoop-0.20.1/contrib/hod/bin/hodring
--hodring.tarball-retry-initial-time 1.0
--hodring.cmd-retry-initial-time 2.0 --hodring.cmd-retry-interval 2.0
--hodring.service-id 34350.symphony.cs.waikato.ac.nz
--hodring.temp-dir /scratch/local/dmilne/hod --hodring.http-port-range
8000-9000 --hodring.userid dmilne --hodring.java-home /opt/jdk1.6.0_20
--hodring.svcrgy-addr symphony.cs.waikato.ac.nz:36372
--hodring.download-addr h:t --hodring.tarball-retry-interval 3.0
--hodring.log-dir /scratch/local/dmilne/hod/log
--hodring.mapred-system-dir-root /mapredsystem
--hodring.xrs-port-range 32768-65536 --hodring.debug 4
--hodring.ringmaster-xrs-addr cn71:33771 --hodring.register
[2010-06-15 10:07:52,456] DEBUG/10 ringMaster:479 - getServiceAddr name: mapred
[2010-06-15 10:07:52,458] DEBUG/10 ringMaster:487 - getServiceAddr
service: <hodlib.GridServices.mapred.MapReduce instance at 0xc9e098>
[2010-06-15 10:07:52,460] DEBUG/10 ringMaster:504 - getServiceAddr
addr mapred: not found
[2010-06-15 10:07:52,470] DEBUG/10 torque:147 - pbsdsh command:
/opt/torque-2.4.5/bin/pbsdsh
/home/dmilne/hadoop/hadoop-0.20.1/contrib/hod/bin/hodring
--hodring.tarball-retry-initial-time 1.0
--hodring.cmd-retry-initial-time 2.0 --hodring.cmd-retry-interval 2.0
--hodring.service-id 34350.symphony.cs.waikato.ac.nz
--hodring.temp-dir /scratch/local/dmilne/hod --hodring.http-port-range
8000-9000 --hodring.userid dmilne --hodring.java-home /opt/jdk1.6.0_20
--hodring.svcrgy-addr symphony.cs.waikato.ac.nz:36372
--hodring.download-addr h:t --hodring.tarball-retry-interval 3.0
--hodring.log-dir /scratch/local/dmilne/hod/log
--hodring.mapred-system-dir-root /mapredsystem
--hodring.xrs-port-range 32768-65536 --hodring.debug 4
--hodring.ringmaster-xrs-addr cn71:33771 --hodring.register
[2010-06-15 10:07:52,475] DEBUG/10 ringMaster:929 - Returned from runWorkers.

//chorus (many times)

[2010-06-15 10:12:02,852] DEBUG/10 ringMaster:530 - inside xml-rpc
call to stop ringmaster
[2010-06-15 10:12:02,853] DEBUG/10 ringMaster:976 - RingMaster stop
method invoked.
[2010-06-15 10:12:02,854] DEBUG/10 ringMaster:981 - finding exit code

//chorus (1x)

[2010-06-15 10:12:02,858] DEBUG/10 ringMaster:533 - returning from
xml-rpc call to stop ringmaster
[2010-06-15 10:12:02,859] DEBUG/10 ringMaster:949 - exit code 7
[2010-06-15 10:12:02,859] DEBUG/10 ringMaster:983 - stopping ringmaster instance
[2010-06-15 10:12:03,420] DEBUG/10 ringMaster:479 - getServiceAddr name: mapred
[2010-06-15 10:12:03,421] DEBUG/10 ringMaster:487 - getServiceAddr
service: <hodlib.GridServices.mapred.MapReduce instance at 0xc9e098>
[2010-06-15 10:12:03,422] DEBUG/10 ringMaster:504 - getServiceAddr
addr mapred: not found
[2010-06-15 10:12:03,852] DEBUG/10 idleJobTracker:79 - Joining the
monitoring thread.
[2010-06-15 10:12:03,853] DEBUG/10 idleJobTracker:83 - Joined the
monitoring thread.
[2010-06-15 10:12:04,442] DEBUG/10 ringMaster:793 - Cleaned up
temporary dir: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster
[2010-06-15 10:12:04,477] DEBUG/10 ringMaster:976 - RingMaster stop
method invoked.
[2010-06-15 10:12:04,478] DEBUG/10 ringMaster:1014 - returning from main






On Mon, Jun 14, 2010 at 5:52 PM, Vinod KV <vi...@yahoo-inc.com> wrote:
> On Monday 14 June 2010 08:03 AM, David Milne wrote:
>>
>> Anybody? I am completely stuck here. I have no idea who else I can ask
>> or where I can go for more information. Is there somewhere specific
>> where I should be asking about HOD?
>>
>> Thank you,
>> Dave
>>
>
> In the ringmaster logs, you should see which node was supposed to run
> the NameNode. This can be found above the logs that you've printed; I can
> barely remember, but I think the line reads something like getCommand().
> Once you find the node, check the hodring logs there; something must have
> gone wrong on it.
>
> The return code was 7 - indicating HDFS failure. See
> http://hadoop.apache.org/common/docs/r0.20.0/hod_user_guide.html#The+Exit+Codes+For+HOD+Are+Not+Getting+Into+Torque,
> and check if you are hitting one of the problems listed there.
>
> HTH,
> +vinod
>
>
>> On Thu, Jun 10, 2010 at 2:56 PM, David Milne<d....@gmail.com>  wrote:
>>
>>>
>>> Hi there,
>>>
>>> I am trying to get Hadoop on Demand up and running, but am having
>>> problems with the ringmaster not being able to communicate with HDFS.
>>>
>>> The output from the hod allocate command ends with this, with full
>>> verbosity:
>>>
>>> [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve
>>> 'hdfs' service address.
>>> [2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id
>>> 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
>>> [2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
>>> [2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
>>> [2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate
>>> cluster /home/dmilne/hadoop/cluster
>>> [2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7
>>>
>>>
>>> I've attached the hodrc file below, but briefly HOD is supposed to
>>> provision an HDFS cluster as well as a Map/Reduce cluster, and seems
>>> to be failing to do so. The ringmaster log looks like this:
>>>
>>> [2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name:
>>> hdfs
>>> [2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr
>>> service:<hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
>>> [2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr
>>> addr hdfs: not found
>>> [2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name:
>>> hdfs
>>> [2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr
>>> service:<hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
>>> [2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr
>>> addr hdfs: not found
>>>
>>> ... and so on, until it gives up
>>>
>>> Any ideas why? One red flag is that when running the allocate command,
>>> some of the variables echo-ed back look dodgy:
>>>
>>> --gridservice-hdfs.fs_port 0
>>> --gridservice-hdfs.host localhost
>>> --gridservice-hdfs.info_port 0
>>>
>>> These are not what I specified in the hodrc. Are the port numbers just
>>> set to 0 because I am not using an external HDFS, or is this a
>>> problem?
>>>
>>>
>>> The software versions involved are:
>>>  - Hadoop 0.20.2
>>>  - Python 2.5.2 (no Twisted)
>>>  - Java 1.6.0_20
>>>  - Torque 2.4.5
>>>
>>>
>>> The hodrc file looks like this:
>>>
>>> [hod]
>>> stream                          = True
>>> java-home                       = /opt/jdk1.6.0_20
>>> cluster                         = debian5
>>> cluster-factor                  = 1.8
>>> xrs-port-range                  = 32768-65536
>>> debug                           = 3
>>> allocate-wait-time              = 3600
>>> temp-dir                        = /scratch/local/dmilne/hod
>>>
>>> [ringmaster]
>>> register                        = True
>>> stream                          = False
>>> temp-dir                        = /scratch/local/dmilne/hod
>>> log-dir                         = /scratch/local/dmilne/hod/log
>>> http-port-range                 = 8000-9000
>>> idleness-limit                  = 864000
>>> work-dirs                       =
>>> /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
>>> xrs-port-range                  = 32768-65536
>>> debug                           = 4
>>>
>>> [hodring]
>>> stream                          = False
>>> temp-dir                        = /scratch/local/dmilne/hod
>>> log-dir                         = /scratch/local/dmilne/hod/log
>>> register                        = True
>>> java-home                       = /opt/jdk1.6.0_20
>>> http-port-range                 = 8000-9000
>>> xrs-port-range                  = 32768-65536
>>> debug                           = 4
>>>
>>> [resource_manager]
>>> queue                           = express
>>> batch-home                      = /opt/torque-2.4.5
>>> id                              = torque
>>> options                         =
>>> l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
>>> #env-vars                       =
>>> HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python
>>>
>>> [gridservice-mapred]
>>> external                        = False
>>> pkgs                            = /opt/hadoop-0.20.2
>>> tracker_port                    = 8030
>>> info_port                       = 50080
>>>
>>> [gridservice-hdfs]
>>> external                        = False
>>> pkgs                            = /opt/hadoop-0.20.2
>>> fs_port                         = 8020
>>> info_port                       = 50070
>>>
>>> Cheers,
>>> Dave
>>>
>>>
>>
>>
>
>

Re: Problems with HOD and HDFS

Posted by Vinod KV <vi...@yahoo-inc.com>.
On Monday 14 June 2010 08:03 AM, David Milne wrote:
> Anybody? I am completely stuck here. I have no idea who else I can ask
> or where I can go for more information. Is there somewhere specific
> where I should be asking about HOD?
>
> Thank you,
> Dave
>    

In the ringmaster logs, you should see which node was supposed to run 
the NameNode. This can be found above the logs that you've printed; I 
can barely remember, but I think the line reads something like 
getCommand(). Once you find the node, check the hodring logs there (a 
quick grep like the one below may help); something must have gone wrong 
on it.
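
For example (a sketch; the file name pattern is a guess, and the 
directory is the ringmaster log-dir from your hodrc):

   grep -i -e getcommand -e namenode /scratch/local/dmilne/hod/log/ringmaster*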

The return code was 7 - indicating HDFS failure. See 
http://hadoop.apache.org/common/docs/r0.20.0/hod_user_guide.html#The+Exit+Codes+For+HOD+Are+Not+Getting+Into+Torque, 
and check if you are hitting one of the problems listed there.
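
For example, to confirm the exit code on your side (a sketch; -d is the 
cluster directory from your logs, -n is a placeholder node count):

   hod allocate -d /home/dmilne/hadoop/cluster -n 4
   echo $?   # 7 maps to an HDFS failure in the user guide's exit code table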

HTH,
+vinod


> On Thu, Jun 10, 2010 at 2:56 PM, David Milne<d....@gmail.com>  wrote:
>    
>> Hi there,
>>
>> I am trying to get Hadoop on Demand up and running, but am having
>> problems with the ringmaster not being able to communicate with HDFS.
>>
>> The output from the hod allocate command ends with this, with full verbosity:
>>
>> [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve
>> 'hdfs' service address.
>> [2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id
>> 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
>> [2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
>> [2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
>> [2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate
>> cluster /home/dmilne/hadoop/cluster
>> [2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7
>>
>>
>> I've attached the hodrc file below, but briefly HOD is supposed to
>> provision an HDFS cluster as well as a Map/Reduce cluster, and seems
>> to be failing to do so. The ringmaster log looks like this:
>>
>> [2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
>> [2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr
>> service:<hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
>> [2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr
>> addr hdfs: not found
>> [2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
>> [2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr
>> service:<hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
>> [2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr
>> addr hdfs: not found
>>
>> ... and so on, until it gives up
>>
>> Any ideas why? One red flag is that when running the allocate command,
>> some of the variables echo-ed back look dodgy:
>>
>> --gridservice-hdfs.fs_port 0
>> --gridservice-hdfs.host localhost
>> --gridservice-hdfs.info_port 0
>>
>> These are not what I specified in the hodrc. Are the port numbers just
>> set to 0 because I am not using an external HDFS, or is this a
>> problem?
>>
>>
>> The software versions involved are:
>>   - Hadoop 0.20.2
>>   - Python 2.5.2 (no Twisted)
>>   - Java 1.6.0_20
>>   - Torque 2.4.5
>>
>>
>> The hodrc file looks like this:
>>
>> [hod]
>> stream                          = True
>> java-home                       = /opt/jdk1.6.0_20
>> cluster                         = debian5
>> cluster-factor                  = 1.8
>> xrs-port-range                  = 32768-65536
>> debug                           = 3
>> allocate-wait-time              = 3600
>> temp-dir                        = /scratch/local/dmilne/hod
>>
>> [ringmaster]
>> register                        = True
>> stream                          = False
>> temp-dir                        = /scratch/local/dmilne/hod
>> log-dir                         = /scratch/local/dmilne/hod/log
>> http-port-range                 = 8000-9000
>> idleness-limit                  = 864000
>> work-dirs                       =
>> /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
>> xrs-port-range                  = 32768-65536
>> debug                           = 4
>>
>> [hodring]
>> stream                          = False
>> temp-dir                        = /scratch/local/dmilne/hod
>> log-dir                         = /scratch/local/dmilne/hod/log
>> register                        = True
>> java-home                       = /opt/jdk1.6.0_20
>> http-port-range                 = 8000-9000
>> xrs-port-range                  = 32768-65536
>> debug                           = 4
>>
>> [resource_manager]
>> queue                           = express
>> batch-home                      = /opt/torque-2.4.5
>> id                              = torque
>> options                         = l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
>> #env-vars                       =
>> HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python
>>
>> [gridservice-mapred]
>> external                        = False
>> pkgs                            = /opt/hadoop-0.20.2
>> tracker_port                    = 8030
>> info_port                       = 50080
>>
>> [gridservice-hdfs]
>> external                        = False
>> pkgs                            = /opt/hadoop-0.20.2
>> fs_port                         = 8020
>> info_port                       = 50070
>>
>> Cheers,
>> Dave
>>
>>      
>    


Re: Problems with HOD and HDFS

Posted by David Milne <d....@gmail.com>.
Anybody? I am completely stuck here. I have no idea who else I can ask
or where I can go for more information. Is there somewhere specific
where I should be asking about HOD?

Thank you,
Dave

On Thu, Jun 10, 2010 at 2:56 PM, David Milne <d....@gmail.com> wrote:
> Hi there,
>
> I am trying to get Hadoop on Demand up and running, but am having
> problems with the ringmaster not being able to communicate with HDFS.
>
> The output from the hod allocate command ends with this, with full verbosity:
>
> [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve
> 'hdfs' service address.
> [2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id
> 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
> [2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
> [2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
> [2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate
> cluster /home/dmilne/hadoop/cluster
> [2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7
>
>
> I've attached the hodrc file below, but briefly HOD is supposed to
> provision an HDFS cluster as well as a Map/Reduce cluster, and seems
> to be failing to do so. The ringmaster log looks like this:
>
> [2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
> [2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr
> service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
> [2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr
> addr hdfs: not found
> [2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
> [2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr
> service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
> [2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr
> addr hdfs: not found
>
> ... and so on, until it gives up
>
> Any ideas why? One red flag is that when running the allocate command,
> some of the variables echo-ed back look dodgy:
>
> --gridservice-hdfs.fs_port 0
> --gridservice-hdfs.host localhost
> --gridservice-hdfs.info_port 0
>
> These are not what I specified in the hodrc. Are the port numbers just
> set to 0 because I am not using an external HDFS, or is this a
> problem?
>
>
> The software versions involved are:
>  - Hadoop 0.20.2
>  - Python 2.5.2 (no Twisted)
>  - Java 1.6.0_20
>  - Torque 2.4.5
>
>
> The hodrc file looks like this:
>
> [hod]
> stream                          = True
> java-home                       = /opt/jdk1.6.0_20
> cluster                         = debian5
> cluster-factor                  = 1.8
> xrs-port-range                  = 32768-65536
> debug                           = 3
> allocate-wait-time              = 3600
> temp-dir                        = /scratch/local/dmilne/hod
>
> [ringmaster]
> register                        = True
> stream                          = False
> temp-dir                        = /scratch/local/dmilne/hod
> log-dir                         = /scratch/local/dmilne/hod/log
> http-port-range                 = 8000-9000
> idleness-limit                  = 864000
> work-dirs                       =
> /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
> xrs-port-range                  = 32768-65536
> debug                           = 4
>
> [hodring]
> stream                          = False
> temp-dir                        = /scratch/local/dmilne/hod
> log-dir                         = /scratch/local/dmilne/hod/log
> register                        = True
> java-home                       = /opt/jdk1.6.0_20
> http-port-range                 = 8000-9000
> xrs-port-range                  = 32768-65536
> debug                           = 4
>
> [resource_manager]
> queue                           = express
> batch-home                      = /opt/torque-2.4.5
> id                              = torque
> options                         = l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
> #env-vars                       =
> HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python
>
> [gridservice-mapred]
> external                        = False
> pkgs                            = /opt/hadoop-0.20.2
> tracker_port                    = 8030
> info_port                       = 50080
>
> [gridservice-hdfs]
> external                        = False
> pkgs                            = /opt/hadoop-0.20.2
> fs_port                         = 8020
> info_port                       = 50070
>
> Cheers,
> Dave
>