Posted to common-user@hadoop.apache.org by John Armstrong <jo...@ccri.com> on 2011/08/26 16:50:34 UTC

Jobs failing on submit

One of my colleagues has noticed this problem for a while, and now it's
biting me.  Jobs seem to be failing before ever really starting.  It seems
to be limited (so far) to running in pseudo-distributed mode, since that's
where he saw the problem and where I'm now seeing it; it hasn't come up on
our cluster (yet).

So here's what happens:

$ java -classpath $MY_CLASSPATH MyLauncherClass -conf my-config.xml -D extra.properties=extravalues
...
launcher output
...
11/08/26 10:35:54 INFO input.FileInputFormat: Total input paths to process : 2
11/08/26 10:35:54 INFO mapred.JobClient: Running job: job_201108261034_0001
11/08/26 10:35:55 INFO mapred.JobClient:  map 0% reduce 0%

and it just sits there.  If I look at the jobtracker's web view the number
of submissions increments, but nothing shows up as a running, completed,
failed, or retired job.  If I use the command line probe I find

$ hadoop job -list
1 jobs currently running
JobId	State	StartTime	UserName	Priority	SchedulingInfo
job_201108261034_0001	4	1314369354247	hdfs	NORMAL	NA

If I try to kill this job, nothing happens; it remains in the list with
state 4 (failed?).  I've tried telling the mapper JVM to suspend so I can
find it in netstat and attach a debugger from IDEA, but it seems that the
job never gets to the point of even spinning up a JVM to run the mapper.

Any ideas what might be going wrong?  Thanks.
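[For readers hitting the same symptom: the numeric State column is a JobStatus constant. The values below are assumed from the 0.20-era org.apache.hadoop.mapred.JobStatus source and should be checked against your version; a small filter makes the listing readable.]

```shell
# Translate the numeric State column of `hadoop job -list` into a name.
# Codes assumed from org.apache.hadoop.mapred.JobStatus (0.20.x):
#   1=RUNNING  2=SUCCEEDED  3=FAILED  4=PREP  5=KILLED
decode_job_state() {
  awk 'BEGIN { split("RUNNING SUCCEEDED FAILED PREP KILLED", names, " ") }
       /^job_/ { if ($2 in names) $2 = names[$2] }
       { print }'
}
# Usage: hadoop job -list | decode_job_state
```

So a job shown with state 4 would print as PREP, not FAILED.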

Re: Jobs failing on submit

Posted by John Armstrong <jo...@ccri.com>.
On Fri, 26 Aug 2011 12:20:47 -0700, Ramya Sunil <ra...@hortonworks.com>
wrote:
> Can you also post the configuration of the scheduler you are using? You
> might also want to check the jobtracker logs. It would help in further
> debugging.

Where would I find the scheduler configuration?  I haven't changed it, so
I assume I'm using the default.
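
[A quick way to confirm which scheduler is in play, sketched here assuming a stock 0.20-style layout where the property lives in mapred-site.xml and the FIFO scheduler is the fallback when it is unset:]

```shell
# Report which task scheduler is configured in the given mapred-site.xml.
# If mapred.jobtracker.taskScheduler is absent, Hadoop 0.20.x falls back to
# the FIFO org.apache.hadoop.mapred.JobQueueTaskScheduler (assumed default).
scheduler_in_use() {
  conf="$1"   # e.g. $HADOOP_CONF_DIR/mapred-site.xml
  if grep -q 'mapred.jobtracker.taskScheduler' "$conf" 2>/dev/null; then
    grep -A1 'mapred.jobtracker.taskScheduler' "$conf"
  else
    echo "not set: default org.apache.hadoop.mapred.JobQueueTaskScheduler (FIFO)"
  fi
}
# Usage: scheduler_in_use "$HADOOP_CONF_DIR/mapred-site.xml"
```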

This is what I see in the jobtracker logs when I submit the job:

2011-08-26 16:11:19,164 INFO org.apache.hadoop.mapred.JobTracker: Job job_201108261610_0001 added successfully for user 'hdfs' to queue 'default'
2011-08-26 16:11:19,164 INFO org.apache.hadoop.mapred.JobTracker: Initializing job_201108261610_0001
2011-08-26 16:11:19,164 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_201108261610_0001
2011-08-26 16:11:19,165 INFO org.apache.hadoop.mapred.AuditLogger: USER=hdfs	IP=127.0.0.1	OPERATION=SUBMIT_JOB	TARGET=job_201108261610_0001	RESULT=SUCCESS

Nothing shows up in the tasktracker logs when I submit the job.

> State "4" indicates that the job is still in the PREP state and not a
> job failure. We have seen these kinds of errors when either the cluster
> does not have tasktrackers to run the tasks or when the queue to which
> the job is submitted does not have sufficient capacity.

So it's possible something has gone wrong with the job queue?  Is it
possible something's stuck in there?  How would I find it/clean it out?

> If you do not see this log message, that implies the cluster does not
> have enough resources due to which JT is unable to schedule the tasks.

I do see this line in the TaskTracker logs; it might have something to do
with the problem, but I have no idea how to fix it.

2011-08-26 16:14:41,966 WARN org.apache.hadoop.mapred.TaskTracker: TaskTracker's totalMemoryAllottedForTasks is -1. TaskMemoryManager is disabled.
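
[For what it's worth, that warning appears to mean only that task-memory monitoring is off because no memory limits are configured: totalMemoryAllottedForTasks defaults to -1 when the limits are unset, which is normal in pseudo-distributed mode and probably unrelated to the stuck job. If one did want the TaskMemoryManager enabled, the relevant mapred-site.xml properties look roughly like the following; the property names are assumed from the 0.20.x memory-monitoring feature and should be checked against your version.]

```xml
<!-- mapred-site.xml: per-slot memory limits (names assumed from 0.20.x;
     all default to -1, which simply disables the TaskMemoryManager). -->
<property>
  <name>mapred.cluster.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapred.cluster.reduce.memory.mb</name>
  <value>2048</value>
</property>
```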

Thanks for the pointers.

Re: Jobs failing on submit

Posted by Ramya Sunil <ra...@hortonworks.com>.
On Fri, Aug 26, 2011 at 11:50 AM, John Armstrong <jo...@ccri.com> wrote:

> On Fri, 26 Aug 2011 11:46:42 -0700, Ramya Sunil <ra...@hortonworks.com>
> wrote:
> > How many tasktrackers do you have? Can you check if your tasktrackers
> > are running and the total available map and reduce capacity in your
> > cluster?
>
> In pseudo-distributed there's one tasktracker, which is running, and the
> total map and reduce capacity is reported by the jobtracker at 6 slots
> each.
>
> > Can you also post the configuration of the scheduler you are using? You
> > might also want to check the jobtracker logs. It would help in further
> > debugging.
>
> Any ideas what I should be looking for that could cause a job to list as
> failed before launching any task JVMs and without reporting back to the
> launcher that it's failed?  Am I correct in interpreting "state 4" as
> "failure"?
>

State "4" indicates that the job is still in the PREP state and not a job
failure. We have seen these kinds of errors when either the cluster does not
have tasktrackers to run the tasks or when the queue to which the job is
submitted does not have sufficient capacity.
In the logs, if you are able to see "Adding task (MAP/REDUCE)
<attemptID>...for tracker 'tracker_<TT_hostname>'", that means the task was
scheduled to be run on the TT. One can then look at the TT logs to check why
the tasks did not begin execution.
If you do not see this log message, that implies the cluster does not have
enough resources due to which JT is unable to schedule the tasks.
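
[That check can be scripted against a jobtracker log; a sketch, with an assumed log location and the "Adding task" message format quoted above:]

```shell
# Scan a JobTracker log for evidence that attempts of a given job were ever
# handed to a tasktracker. Log path varies by install; /var/log/hadoop is
# only an example.
was_scheduled() {
  job="$1"; log="$2"
  id="${job#job_}"   # job_X_Y -> X_Y, as embedded in attempt_X_Y_m_NNNNNN_T
  if grep 'Adding task' "$log" | grep -q "attempt_${id}_"; then
    echo "$job: tasks were scheduled; check the tasktracker logs next"
  else
    echo "$job: never scheduled by the JT (no attempts in this log)"
  fi
}
# Usage: was_scheduled job_201108261034_0001 /var/log/hadoop/hadoop-jobtracker.log
```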

Thanks
Ramya

Re: Jobs failing on submit

Posted by John Armstrong <jo...@ccri.com>.
On Fri, 26 Aug 2011 11:46:42 -0700, Ramya Sunil <ra...@hortonworks.com>
wrote:
> How many tasktrackers do you have? Can you check if your tasktrackers
> are running and the total available map and reduce capacity in your
> cluster?

In pseudo-distributed there's one tasktracker, which is running, and the
total map and reduce capacity is reported by the jobtracker at 6 slots
each.

> Can you also post the configuration of the scheduler you are using? You
> might also want to check the jobtracker logs. It would help in further
> debugging.

Any ideas what I should be looking for that could cause a job to list as
failed before launching any task JVMs and without reporting back to the
launcher that it's failed?  Am I correct in interpreting "state 4" as
"failure"?

Re: Jobs failing on submit

Posted by Ramya Sunil <ra...@hortonworks.com>.
Hi John,

How many tasktrackers do you have? Can you check if your tasktrackers are
running and the total available map and reduce capacity in your cluster?
Can you also post the configuration of the scheduler you are using? You
might also want to check the jobtracker logs. It would help in further
debugging.

Thanks
Ramya

On Fri, Aug 26, 2011 at 7:50 AM, John Armstrong <jo...@ccri.com> wrote:

> One of my colleagues has noticed this problem for a while, and now it's
> biting me.  Jobs seem to be failing before ever really starting.  It seems
> to be limited (so far) to running in pseudo-distributed mode, since that's
> where he saw the problem and where I'm now seeing it; it hasn't come up on
> our cluster (yet).
>
> So here's what happens:
>
> $ java -classpath $MY_CLASSPATH MyLauncherClass -conf my-config.xml -D extra.properties=extravalues
> ...
> launcher output
> ...
> 11/08/26 10:35:54 INFO input.FileInputFormat: Total input paths to process : 2
> 11/08/26 10:35:54 INFO mapred.JobClient: Running job: job_201108261034_0001
> 11/08/26 10:35:55 INFO mapred.JobClient:  map 0% reduce 0%
>
> and it just sits there.  If I look at the jobtracker's web view the number
> of submissions increments, but nothing shows up as a running, completed,
> failed, or retired job.  If I use the command line probe I find
>
> $ hadoop job -list
> 1 jobs currently running
> JobId   State   StartTime       UserName        Priority        SchedulingInfo
> job_201108261034_0001   4       1314369354247   hdfs    NORMAL  NA
>
> If I try to kill this job, nothing happens; it remains in the list with
> state 4 (failed?).  I've tried telling the mapper JVM to suspend so I can
> find it in netstat and attach a debugger from IDEA, but it seems that the
> job never gets to the point of even spinning up a JVM to run the mapper.
>
> Any ideas what might be going wrong?  Thanks.
>