Problems running a HOD test cluster

Posted to common-user@hadoop.apache.org by Luca <ra...@yahoo.it> on 2008/02/21 19:52:30 UTC

Hello everyone,
	I've been trying to run HOD on a sample cluster of three nodes that
already have Torque installed and (hopefully?) working properly. I have also
prepared a configuration file for HOD, which I will paste at the end of this
email.

A few questions:
- is Java6 ok for HOD?
- I have an externally running HDFS cluster, as specified in 
[gridservice-hdfs]: how do I find out the fs_port of my cluster? Is it 
something specified in the hadoop-site.xml file?
- what should I expect at the end of an allocate command? Currently what 
I get is the output above, but should I in theory return to the shell 
prompt, to issue a hadoop command?


[2008-02-21 19:45:34,349] DEBUG/10 hod:144 - ('server.com', 10029)
[2008-02-21 19:45:34,350] INFO/20 hod:216 - Service Registry Started.
[2008-02-21 19:45:34,353] DEBUG/10 hadoop:425 - allocate 
/mnt/scratch/grid/test 3 3
[2008-02-21 19:45:34,357] DEBUG/10 torque:72 - ringmaster cmd: 
/mnt/scratch/grid/hod/bin/ringmaster 
--hodring.tarball-retry-initial-time 1.0 
--hodring.cmd-retry-initial-time 2.0 --hodring.http-port-range 
10000-11000 --hodring.log-dir /mnt/scratch/grid/hod/logs 
--hodring.temp-dir /tmp/hod --hodring.register --hodring.userid hadoop 
--hodring.java-home /usr/java/jdk1.6.0_04 
--hodring.tarball-retry-interval 3.0 --hodring.cmd-retry-interval 2.0 
--hodring.xrs-port-range 10000-11000 --hodring.debug 4 
--resource_manager.queue hadoop --resource_manager.env-vars 
"HOD_PYTHON_HOME=/usr/bin/python2.5" --resource_manager.id torque 
--resource_manager.batch-home /usr --gridservice-hdfs.fs_port 10007 
--gridservice-hdfs.host localhost --gridservice-hdfs.pkgs 
/mnt/scratch/grid/hadoop/current --gridservice-hdfs.info_port 10009 
--gridservice-hdfs.external --ringmaster.http-port-range 10000-11000 
--ringmaster.hadoop-tar-ball hadoop/hadoop-releases/hadoop-0.16.0.tar.gz 
--ringmaster.temp-dir /tmp/hod --ringmaster.register --ringmaster.userid 
hadoop --ringmaster.work-dirs /tmp/hod/1,/tmp/hod/2 
--ringmaster.svcrgy-addr server.com:10029 --ringmaster.log-dir 
/mnt/scratch/grid/hod/logs --ringmaster.max-connect 30 
--ringmaster.xrs-port-range 10000-11000 --ringmaster.jt-poll-interval 
120 --ringmaster.debug 4 --ringmaster.idleness-limit 3600 
--gridservice-mapred.tracker_port 10003 --gridservice-mapred.host 
localhost --gridservice-mapred.pkgs /mnt/scratch/grid/hadoop/current 
--gridservice-mapred.info_port 10008
[2008-02-21 19:45:34,361] DEBUG/10 torque:44 - qsub -> /usr/bin/qsub -l 
nodes=3 -W x= -l nodes=3 -W x= -N "HOD" -r n -d /tmp/ -q hadoop -v 
HOD_PYTHON_HOME=/usr/bin/python2.5
[2008-02-21 19:45:34,373] DEBUG/10 torque:54 - qsub stdin: #!/bin/sh
[2008-02-21 19:45:34,374] DEBUG/10 torque:54 - qsub stdin: 
/mnt/scratch/grid/hod/bin/ringmaster 
--hodring.tarball-retry-initial-time 1.0 
--hodring.cmd-retry-initial-time 2.0 --hodring.http-port-range 
10000-11000 --hodring.log-dir /mnt/scratch/grid/hod/logs 
--hodring.temp-dir /tmp/hod --hodring.register --hodring.userid hadoop 
--hodring.java-home /usr/java/jdk1.6.0_04 
--hodring.tarball-retry-interval 3.0 --hodring.cmd-retry-interval 2.0 
--hodring.xrs-port-range 10000-11000 --hodring.debug 4 
--resource_manager.queue hadoop --resource_manager.env-vars 
"HOD_PYTHON_HOME=/usr/bin/python2.5" --resource_manager.id torque 
--resource_manager.batch-home /usr --gridservice-hdfs.fs_port 10007 
--gridservice-hdfs.host localhost --gridservice-hdfs.pkgs 
/mnt/scratch/grid/hadoop/current --gridservice-hdfs.info_port 10009 
--gridservice-hdfs.external --ringmaster.http-port-range 10000-11000 
--ringmaster.hadoop-tar-ball hadoop/hadoop-releases/hadoop-0.16.0.tar.gz 
--ringmaster.temp-dir /tmp/hod --ringmaster.register --ringmaster.userid 
hadoop --ringmaster.work-dirs /tmp/hod/1,/tmp/hod/2 
--ringmaster.svcrgy-addr server.com:10029 --ringmaster.log-dir 
/mnt/scratch/grid/hod/logs --ringmaster.max-connect 30 
--ringmaster.xrs-port-range 10000-11000 --ringmaster.jt-poll-interval 
120 --ringmaster.debug 4 --ringmaster.idleness-limit 3600 
--gridservice-mapred.tracker_port 10003 --gridservice-mapred.host 
localhost --gridservice-mapred.pkgs /mnt/scratch/grid/hadoop/current 
--gridservice-mapred.info_port 10008
[2008-02-21 19:45:36,385] DEBUG/10 torque:76 - qsub jobid: 207.server.com
[2008-02-21 19:45:36,389] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:38,952] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:41,524] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:44,066] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:46,612] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:49,155] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:51,696] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:54,236] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:56,797] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:58,856] INFO/20 hadoop:447 - Hod Job successfully 
submitted. JobId : 207.server.com.
[2008-02-21 19:46:08,967] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:46:11,014] ERROR/40 torque:96 - qstat error: exit code: 
153 | signal: False | core False
[2008-02-21 19:46:11,017] INFO/20 hadoop:451 - Ringmaster at : None.
[2008-02-21 19:46:11,021] INFO/20 hadoop:530 - Cleaning up job id 
207.server.com, as cluster could not be allocated.
[2008-02-21 19:46:11,025] DEBUG/10 torque:131 - /usr/bin/qdel 207.server.com
[2008-02-21 19:46:13,079] CRITICAL/50 hod:253 - Cannot allocate cluster 
/mnt/scratch/grid/test
[2008-02-21 19:46:13,940] DEBUG/10 hod:391 - return code: 6


$ cat hod/conf/hodrc
[hod]
stream                          = True
java-home                       = /usr/java/jdk1.6.0_04
cluster                         = HOD
cluster-factor                  = 1.8
xrs-port-range                  = 10000-11000
debug                           = 4
allocate-wait-time              = 3600
temp-dir                        = /tmp/hod
log-dir                         = /mnt/scratch/grid/hod/logs

[ringmaster]
register                        = True
stream                          = False
temp-dir                        = /tmp/hod
http-port-range                 = 10000-11000
work-dirs                       = /tmp/hod/1,/tmp/hod/2
xrs-port-range                  = 10000-11000
debug                           = 4

[hodring]
stream                          = False
temp-dir                        = /tmp/hod
register                        = True
java-home                       = /usr/java/jdk1.6.0_04
http-port-range                 = 10000-11000
xrs-port-range                  = 10000-11000
debug                           = 4

[resource_manager]
queue                           = hadoop
batch-home                      = /usr
id                              = torque
env-vars                        = HOD_PYTHON_HOME=/usr/bin/python2.5

[gridservice-mapred]
external                        = False
pkgs                            = /mnt/scratch/grid/hadoop/current
tracker_port                    = 10003
info_port                       = 10008

[gridservice-hdfs]
external                        = True
pkgs                            = /mnt/scratch/grid/hadoop/current
fs_port                         = 10007
info_port                       = 10009

Cheers,
Luca


Re: Problems running a HOD test cluster

Posted by Luca <ra...@yahoo.it>.
Allen Wittenauer wrote:

>> [2008-02-21 19:46:11,014] ERROR/40 torque:96 - qstat error: exit code:
>> 153 | signal: False | core False
>> [2008-02-21 19:46:11,017] INFO/20 hadoop:451 - Ringmaster at : None.
> 
>     I bet your ringmaster didn't come up.  Check which nodes were allocated
> to your job via qstat -f.  Chances are good the first one is the ringmaster
> node.  Check the torque logs, syslogs, and the hod log dir for hints as to
> what happened.

So it was a ringmaster-related problem, and it's now solved. Now another 
problem: when I try to run a Hadoop job, I get a timeout error from Hadoop 
while it tries to access the input path. I guess this might be related to 
the fact that I'm using an external HDFS that is already running, and I'm 
not sure how to hook HOD up to it.

I configured

[gridservice-hdfs]
external                        = True
host                            = name-of-my-server.com
pkgs                            = /mnt/scratch/grid/hadoop/current
fs_port                         = 10010
info_port                       = 10007

but I'm not sure whether the actual fs_port is 10010. Does anybody know 
which attribute in the Hadoop configuration specifies this value? If it is 
not specified, is there a default or random value? Finally, is this a global 
attribute or a per-node attribute (i.e. could node A have one fs_port and 
node B a different one)?

Thanks in advance,
Luca

Re: Problems running a HOD test cluster

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.


On 2/22/08 3:58 PM, "Jason Venner" <ja...@attributor.com> wrote:

> We have been unable to get torque up and running. The magic value in the
> server_name file seems to elude us.

    The server_name should be the real hostname of the machine running
pbs_server.
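
    For example (the hostname here is just a placeholder), on each machine
that runs the Torque client commands, something like:

        echo torque-server.example.com > /var/torque/server_name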

> We have tried localhost, 127.0.0.1, machine name, machine ip, fq machine
> name. Depending on what we use, we either get
> Unauthorized request or invalid entry

    If you run qmgr -c 'list server', does the submit_hosts attribute list the
hostname of the machine you are submitting from?

> qmgr obj= svr=default: Bad ACL entry in host list MSG=First bad host:
> hdimg01

    Hmm. That error message doesn't ring any bells for me.

> qmgr obj= svr=default: Unauthorized Request

    Unauthorized request, IIRC, means that either the host or the user is not
allowed to do whatever it is they are trying to do.

> /var/torque/server_name
> torque-pam-2.1.6-1.fc5

    We compile our own.  Currently on 2.1.8 with 3 custom patches, mainly
related to issues with node selection.  I think our 3 patches have been sent
to Cluster Resources and are part of 2.2.0 or whatever the latest build is.

> Will someone with torque knowledge take pity on us and advise?

    :)  No guarantees, but this should give you some ideas about how we have
ours configured:

1. pbs_server -t create
2. Hop into qmgr, and change the following.  The rest we keep as default (I
think)
3. set server managers = root@localhost
4. set server managers += admin #1

(I generally recommend using user@*.domain and user@localhost.  So, for
example, set server operators += aw@*.example.com followed by set server
operators += aw@localhost.)

5. set server managers += admin #2  [repeat as necessary]
6. set server operators = root@localhost
7. set server operators += admin #1
8. set server operators += admin #2 [repeat as necessary]
9. set server submit_hosts = torque server hostname
10. set server submit_hosts += hostname where qsub will be executed [repeat]
11. set server allow_node_submit = true [enables all nodes registered to run
qsub]
12. set server acl_hosts = torque server hostname
13. create queue hadoop queue_type = execution
14. set queue hadoop resources_default.neednodes = hadoop
15. set queue hadoop resources_default.nodect = 1
16. set queue hadoop resources_default.nodes = 1
17. set server default_queue = hadoop
18. set server scheduling = true
19. Configure the nodes.... Hop out of qmgr and do something like the
following:

 # substitute your own compute node hostnames for node01 node02 node03
 for host in node01 node02 node03
 do
    qmgr -c "create node ${host}"
    qmgr -c "set node ${host} properties += hadoop"
    pbsnodes -r ${host}
 done

20.  Make sure pbs_mom is running on your remote nodes.
21. mom_priv/config should look something like this:

$pbsserver  torque server name
$usecp *:/home /home
$usecp *:/tmp /tmp
$fatal_job_poll_failure false

BTW,  We HIGHLY HIGHLY HIGHLY recommend using a node_check script.  For
large clusters, this will save you a LOT of headaches when drives fail, etc.
At some point, maybe we'll sanitize ours and share. ;)

23.  Restart pbs_server.  This is one of the bugs we've patched in ours. ;)
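
For reference, here is the qmgr portion of the steps above condensed into
one-shot shell commands; the torque server hostname, the submit host, and the
admin entry are placeholders borrowed from the examples above, so adjust them
for your site:

 pbs_server -t create
 qmgr -c "set server managers = root@localhost"
 qmgr -c "set server managers += aw@*.example.com"
 qmgr -c "set server operators = root@localhost"
 qmgr -c "set server operators += aw@*.example.com"
 qmgr -c "set server submit_hosts = torque-server.example.com"
 qmgr -c "set server submit_hosts += submit-host.example.com"
 qmgr -c "set server allow_node_submit = true"
 qmgr -c "set server acl_hosts = torque-server.example.com"
 qmgr -c "create queue hadoop"
 qmgr -c "set queue hadoop queue_type = execution"
 qmgr -c "set queue hadoop resources_default.neednodes = hadoop"
 qmgr -c "set queue hadoop resources_default.nodect = 1"
 qmgr -c "set queue hadoop resources_default.nodes = 1"
 qmgr -c "set server default_queue = hadoop"
 qmgr -c "set server scheduling = true"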


Re: Problems running a HOD test cluster

Posted by Jason Venner <ja...@attributor.com>.
We have been unable to get torque up and running. The magic value in the 
server_name file seems to elude us.
We have tried localhost, 127.0.0.1, the machine name, the machine IP, and the 
fully qualified machine name. Depending on what we use, we get either an 
unauthorized request or an invalid entry error:
qmgr obj= svr=default: Bad ACL entry in host list MSG=First bad host: 
hdimg01
qmgr obj= svr=default: Unauthorized Request
/var/torque/server_name
torque-pam-2.1.6-1.fc5
torque-mom-2.1.6-1.fc5
libtorque-2.1.6-1.fc5
torque-client-2.1.6-1.fc5
torque-gui-2.1.6-1.fc5
torque-server-2.1.6-1.fc5
torque-2.1.6-1.fc5
torque-docs-2.1.6-1.fc5
torque-scheduler-2.1.6-1.fc5


Will someone with torque knowledge take pity on us and advise?

Allen Wittenauer wrote:
> On 2/21/08 10:52 AM, "Luca" <ra...@yahoo.it> wrote:
>   
>> A few questions:
>> - is Java6 ok for HOD?
>>     
>
>     That's what we use.
>
>   
>> - I have an externally running HDFS cluster, as specified in
>> [gridservice-hdfs]: how do I find out the fs_port of my cluster? IS it
>> something specified in the hadoop-site.xml file?
>>     
>
>     Yup.
>
>   
>> - what should I expect at the end of an allocate command? Currently what
>> I get is the output above, but should I in theory return back to the
>> shell prompt, to issue an hadoop command?
>>     
>
>     With HOD 0.4, yes.
>
>
>   
>> [2008-02-21 19:46:11,014] ERROR/40 torque:96 - qstat error: exit code:
>> 153 | signal: False | core False
>> [2008-02-21 19:46:11,017] INFO/20 hadoop:451 - Ringmaster at : None.
>>     
>
>     I bet your ringmaster didn't come up.  Check which nodes were allocated
> to your job via qstat -f.  Chances are good the first one is the ringmaster
> node.  Check the torque logs, syslogs, and the hod log dir for hints as to
> what happened.
>
>
>   
>> [2008-02-21 19:46:11,021] INFO/20 hadoop:530 - Cleaning up job id
>> 207.server.com, as cluster could not be allocated.
>> [2008-02-21 19:46:11,025] DEBUG/10 torque:131 - /usr/bin/qdel 207.server.com
>> [2008-02-21 19:46:13,079] CRITICAL/50 hod:253 - Cannot allocate cluster
>> /mnt/scratch/grid/test
>> [2008-02-21 19:46:13,940] DEBUG/10 hod:391 - return code: 6
>>     
>
>   

-- 
Jason Venner
Attributor - Publish with Confidence <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers, contact if interested

Re: Problems running a HOD test cluster

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.
On 2/21/08 10:52 AM, "Luca" <ra...@yahoo.it> wrote:
> A few questions:
> - is Java6 ok for HOD?

    That's what we use.

> - I have an externally running HDFS cluster, as specified in
> [gridservice-hdfs]: how do I find out the fs_port of my cluster? IS it
> something specified in the hadoop-site.xml file?

    Yup.
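
    Specifically, it's the fs.default.name property; the port in that value is
what fs_port should match, and it is a single cluster-wide setting rather than
a per-node one.  A minimal illustrative snippet, with a placeholder host and
port:

  <property>
    <name>fs.default.name</name>
    <value>name-of-my-server.com:10010</value>
  </property>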

> - what should I expect at the end of an allocate command? Currently what
> I get is the output above, but should I in theory return back to the
> shell prompt, to issue an hadoop command?

    With HOD 0.4, yes.


> [2008-02-21 19:46:11,014] ERROR/40 torque:96 - qstat error: exit code:
> 153 | signal: False | core False
> [2008-02-21 19:46:11,017] INFO/20 hadoop:451 - Ringmaster at : None.

    I bet your ringmaster didn't come up.  Check which nodes were allocated
to your job via qstat -f.  Chances are good the first one is the ringmaster
node.  Check the torque logs, syslogs, and the hod log dir for hints as to
what happened.


> [2008-02-21 19:46:11,021] INFO/20 hadoop:530 - Cleaning up job id
> 207.server.com, as cluster could not be allocated.
> [2008-02-21 19:46:11,025] DEBUG/10 torque:131 - /usr/bin/qdel 207.server.com
> [2008-02-21 19:46:13,079] CRITICAL/50 hod:253 - Cannot allocate cluster
> /mnt/scratch/grid/test
> [2008-02-21 19:46:13,940] DEBUG/10 hod:391 - return code: 6