Posted to common-user@hadoop.apache.org by Jason Venner <ja...@attributor.com> on 2008/02/26 07:55:16 UTC

More HOD questions 0.16.0 - debug log enclosed - help with how to debug

My hadoop jobs don't start.
This is configured to use an existing DFS and to unpack a tarball with a
cut-down 0.16.0 config.
I have looked in the mom logs on the client machines and am not getting
anything meaningful.


The hadoop ports are biased by 1000 to allow another cluster, running an
older version of hadoop, to run on these machines.

Using Python: 2.5.1 (r251:54863, Feb 24 2008, 12:00:38)
[GCC 4.1.0 20060304 (Red Hat 4.1.0-3)]

[2008-02-25 21:56:38,611] DEBUG/10 hod:144 - ('hdimg01', 63059)
[2008-02-25 21:56:38,612] INFO/20 hod:216 - Service Registry Started.
[2008-02-25 21:56:38,615] DEBUG/10 hadoop:425 - allocate /tmp/hod 27 27
[2008-02-25 21:56:38,618] DEBUG/10 torque:72 - ringmaster cmd: 
/data1/hadoop-0.16.0-dfs/contrib/hod/bin/ringmaster 
--hodring.tarball-retry-initial-time 1.0 
--hodring.cmd-retry-initial-time 2.0 --$[2008-02-25 21:56:38,620] 
DEBUG/10 torque:44 - qsub -> /usr/bin/qsub -l nodes=27 -W x= -l nodes=27 
-W x= -N "HOD" -r n -d /tmp/ -q batch
[2008-02-25 21:56:38,822] DEBUG/10 torque:54 - qsub stdin: #!/bin/sh
[2008-02-25 21:56:38,823] DEBUG/10 torque:54 - qsub stdin: 
/data1/hadoop-0.16.0-dfs/contrib/hod/bin/ringmaster 
--hodring.tarball-retry-initial-time 1.0 
--hodring.cmd-retry-initial-time 2.0 --hodr$[2008-02-25 21:56:38,835] 
DEBUG/10 torque:76 - qsub jobid: 13.hdimg01
[2008-02-25 21:56:38,837] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
13.hdimg01
[2008-02-25 21:56:39,362] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
13.hdimg01
[2008-02-25 21:56:39,390] INFO/20 hadoop:447 - Hod Job successfully 
submitted. JobId : 13.hdimg01.
[2008-02-25 21:56:49,438] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
13.hdimg01
[2008-02-25 21:56:49,463] ERROR/40 torque:96 - qstat error: exit code: 
153 | signal: False | core False
[2008-02-25 21:56:49,464] INFO/20 hadoop:451 - Ringmaster at : None.
[2008-02-25 21:56:49,465] INFO/20 hadoop:530 - Cleaning up job id 
13.hdimg01, as cluster could not be allocated.
[2008-02-25 21:56:49,467] DEBUG/10 torque:131 - /usr/bin/qdel 13.hdimg01
[2008-02-25 21:56:49,490] CRITICAL/50 hod:253 - Cannot allocate cluster 
/tmp/hod
[2008-02-25 21:56:50,434] DEBUG/10 hod:391 - return code: 6
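
The qstat error above looks like the key symptom: as far as I can tell,
torque's qstat exits with the PBS error code modulo 256, and 153 is
"Unknown Job Id", i.e. the ringmaster job was already gone by the time
hod polled it. One way to dig further on the torque server (job id taken
from the log above):

  # re-run the poll by hand and check the exit code
  /usr/bin/qstat -f -1 13.hdimg01 ; echo "qstat exit: $?"

  # trace the job's lifecycle through the torque server/mom logs
  tracejob -n 2 13.hdimg01

  # look for any stdout/stderr the job left behind (it was submitted with -d /tmp/)
  ls -l /tmp/HOD.o* /tmp/HOD.e* 2>/dev/null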


-- 
Jason Venner
Attributor - Publish with Confidence <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers, contact if interested

Re: More HOD questions 0.16.0 - debug log enclosed - help with how to debug - solved

Posted by Jason Venner <ja...@attributor.com>.
Well, this finally started to work, after we learned how to debug.

There were 2 issues. First, the torque scp command was being passed 3
arguments instead of 2, and this was causing the error logs to get eaten.

Second, on our master node the dfs hod is installed in a different place
than on the child nodes, with a symlink placed at the 'standard
location'. HOD/torque was forwarding the real location instead of the
configured location.

To find out that the scp was failing, we had to raise the debug level on
the pbs_moms by sending SIGUSR1s to them (4 seemed sufficient), then look
in /var/log/messages to find the failure reports.
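
Roughly what that looked like (the host names below are placeholders for
our child nodes):

  for h in node01 node02 node03; do
      # each SIGUSR1 raises pbs_mom's log level by one; SIGUSR2 lowers it again
      for i in 1 2 3 4; do ssh $h 'pkill -USR1 pbs_mom'; done
  done

  # then re-run hod and look for pbs_mom's complaints on each node
  ssh node01 'grep pbs_mom /var/log/messages | tail -20'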

For the short term, we just made symlinks on the child nodes at the
location where the virtual cluster was expecting to find the dfs
configuration.
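
i.e. something along these lines on each child node (both paths and the
host names are placeholders):

  for h in node01 node02 node03; do
      # make the path the virtual cluster asks for resolve to the real install
      ssh $h 'ln -s /real/path/to/hadoop-dfs /path/the/cluster/expects'
  done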



Hemanth Yamijala wrote:
> Jason Venner wrote:
>> My hadoop jobs don't start
>> This is configured to use an existing DFS and to unpack a tarball 
>> with a cut down 0.16.0 config
>> I have looked in the mom logs on the client machines and am not 
>> getting anything meaningful.
>>
> What is your hod command line ? Specifically, how did you provide the 
> tarball option ?
> Can you attach the log of the hod command, like you did the hodrc. 
> There are some lines in the output that don't seem complete.
> Set your debug option in the [ringmaster] section to 4, and rerun hod. 
> Under the log-dir specified in the [ringmaster] section you will be 
> able to see a log file corresponding to your jobid. Can you attach 
> that too ? The ringmaster node is the first one allocated by torque 
> for the job, that is, the mother superior for the job.
> How is your tarball built ? Can you check that there's no 
> hadoop-env.sh with pre-filled values in them. Look at HADOOP-2860.
>
> Thanks
> Hemanth
>
-- 
Jason Venner
Attributor - Publish with Confidence <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers, contact if interested

Re: More HOD questions 0.16.0 - debug log enclosed - help with how to debug

Posted by Hemanth Yamijala <yh...@yahoo-inc.com>.
Jason Venner wrote:
> My hadoop jobs don't start
> This is configured to use an existing DFS and to unpack a tarball with 
> a cut down 0.16.0 config
> I have looked in the mom logs on the client machines and am not 
> getting anything meaningful.
>
What is your hod command line? Specifically, how did you provide the
tarball option?
Can you attach the log of the hod command, like you did the hodrc? There
are some lines in the output that don't seem complete.
Set your debug option in the [ringmaster] section to 4, and rerun hod.
Under the log-dir specified in the [ringmaster] section you will be able
to see a log file corresponding to your jobid. Can you attach that too?
The ringmaster node is the first one allocated by torque for the job,
that is, the mother superior for the job.
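
For reference, the relevant piece of the hodrc would look roughly like
this (the log-dir value is just an example, and the other options in the
section are omitted):

  [ringmaster]
  debug   = 4
  log-dir = /var/log/hod
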
How is your tarball built? Can you check that there's no hadoop-env.sh
with pre-filled values in it? Look at HADOOP-2860.
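
A quick way to check the packed config (the tarball name and internal
path below are assumed) is something like:

  # is a hadoop-env.sh packed into the tarball at all?
  tar tzf hadoop-0.16.0.tar.gz | grep hadoop-env.sh

  # print any uncommented settings it carries
  tar xzOf hadoop-0.16.0.tar.gz hadoop-0.16.0/conf/hadoop-env.sh \
      | grep -v '^#' | grep -v '^$'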

Thanks
Hemanth