Posted to hdfs-user@hadoop.apache.org by Jan Lukavský <ja...@firma.seznam.cz> on 2012/11/20 14:44:02 UTC

Inconsistent state in JobTracker (cdh)

Hi all,

we are from time to time experiencing somewhat odd behavior of the
JobTracker (using a CDH release, currently CDH3u3, but I suppose this
affects at least all CDH3 releases so far). What we are seeing is an M/R
job being stuck between the map and reduce phases: 100% of maps are
reported as completed, but the web UI still shows 1 running map task.
Since we have mapred.reduce.slowstart.completed.maps set to 1.0 (for
better job throughput), the reduce phase never starts and the job has to
be killed. I have investigated this a bit and I think I have found the
cause.
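
For reference, here is a minimal sketch of how we apply that setting in
the job driver, using the old mapred JobConf API (the class and job names
are made up; only the property name and value are the real ones). The
JobTracker log excerpt that led me to the suspected cause follows.

import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver snippet, only to show the slowstart setting we run
// with; everything except the property name and value is illustrative.
public class SlowstartExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SlowstartExample.class);
        conf.setJobName("slowstart-example");
        // 1.0 means reducers are scheduled only after 100% of maps complete.
        conf.set("mapred.reduce.slowstart.completed.maps", "1.0");
    }
}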

12/11/20 01:05:10 INFO mapred.JobInProgress: Task 
'attempt_201211011002_1852_m_007638_0' has completed 
task_201211011002_1852_m_007638 successfully.
12/11/20 01:05:10 WARN hdfs.DFSClient: DataStreamer Exception: 
org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease 
on <some output path> File does not exist. [Lease. Holder: 
DFSClient_408514838, pendingcreates: 1]
....
12/11/20 01:05:10 WARN hdfs.DFSClient: Error Recovery for block 
blk_-1434919284750099885_670717751 bad datanode[0] nodes == null
12/11/20 01:05:10 WARN hdfs.DFSClient: Could not get block locations. 
Source file "<some output path>" - Aborting...
12/11/20 01:05:10 INFO mapred.JobHistory: Logging failed for job 
job_201211011002_1852removing PrintWriter from FileManager
12/11/20 01:05:10 ERROR security.UserGroupInformation: 
PriviledgedActionException as:mapred (auth:SIMPLE) 
cause:java.io.IOException: java.util.ConcurrentModificationException
12/11/20 01:05:10 INFO ipc.Server: IPC Server handler 7 on 9001, call 
heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1256e5f6, false, 
false, true, -17988) from 10.2.73.35:44969: error: java.io.IOException: 
java.util.ConcurrentModificationException


When I look at the source code of JobInProgress.completedTask(), I see
the log message about successful completion of the task, and after that
the history logging to HDFS (JobHistory.Task.logFinished()). I suppose
that if this call throws an exception (as in the case above),
completedTask() is aborted *before* the counters runningMapTasks and
finishedMapTasks are updated accordingly. I took a heap dump of the
JobTracker and indeed found runningMapTasks set to 1 and
finishedMapTasks equal to numMapTasks - 1.
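
To illustrate what I think is happening, here is a small self-contained
toy model of that ordering. This is not the actual CDH source, just a
sketch of the suspected control flow; only the counter names mirror the
real fields, the rest is made up.

import java.io.IOException;

// Toy model: the history write to HDFS happens *before* the counters
// are updated, so an IOException from the history logging leaves
// runningMapTasks/finishedMapTasks permanently inconsistent.
public class CompletedTaskOrdering {
    static final int numMapTasks = 10;
    static int runningMapTasks = 1;                    // last map still "running"
    static int finishedMapTasks = numMapTasks - 1;

    // Stand-in for JobHistory.Task.logFinished(), which writes to HDFS
    // and in our case failed with a LeaseExpiredException.
    static void logFinishedToHdfs() throws IOException {
        throw new IOException("LeaseExpiredException: No lease on <some output path>");
    }

    static void completedTask() throws IOException {
        System.out.println("Task completed successfully."); // the INFO line we do see
        logFinishedToHdfs();                                  // throws here...
        runningMapTasks -= 1;                                 // ...so these never run
        finishedMapTasks += 1;
    }

    public static void main(String[] args) {
        try {
            completedTask();
        } catch (IOException e) {
            System.out.println("heartbeat call fails with: " + e.getMessage());
        }
        // Matches what the heap dump showed: runningMapTasks == 1,
        // finishedMapTasks == numMapTasks - 1.
        System.out.printf("runningMapTasks=%d finishedMapTasks=%d (numMapTasks=%d)%n",
                runningMapTasks, finishedMapTasks, numMapTasks);
    }
}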

Now, the question is: should this be handled in the JobTracker (say, by
moving the logging code after the counter manipulation)? Or should the
TaskTracker re-report the completed task when the JobTracker returns an
error? What can cause the LeaseExpiredException? Should a JIRA be filed? :)

Thanks for comments,
  Jan



Re: Inconsistent state in JobTracker (cdh)

Posted by Harsh J <ha...@cloudera.com>.
Hey Jan,

Your problem may be CDH-specific, so am moving this to
cdh-user@cloudera.org
(https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!forum/cdh-user).

The specific issue you are running into (mostly environmentally
triggered), the ConcurrentModificationException failure at heartbeats,
was a known issue up to CDH3u3 and has been fixed in CDH3u4 onwards. If
you upgrade to the latest CDH3 release (CDH3u5), your problem should
simply go away.

On Tue, Nov 20, 2012 at 7:14 PM, Jan Lukavský
<ja...@firma.seznam.cz> wrote:
> Hi all,
>
> we are from time to time experiencing somewhat odd behavior of the JobTracker
> (using a CDH release, currently CDH3u3, but I suppose this affects at least
> all CDH3 releases so far). What we are seeing is an M/R job being stuck
> between the map and reduce phases: 100% of maps are reported as completed,
> but the web UI still shows 1 running map task. Since we have
> mapred.reduce.slowstart.completed.maps set to 1.0 (for better job throughput),
> the reduce phase never starts and the job has to be killed. I have
> investigated this a bit and I think I have found the cause.
>
> 12/11/20 01:05:10 INFO mapred.JobInProgress: Task
> 'attempt_201211011002_1852_m_007638_0' has completed
> task_201211011002_1852_m_007638 successfully.
> 12/11/20 01:05:10 WARN hdfs.DFSClient: DataStreamer Exception:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
> <some output path> File does not exist. [Lease.  Holder:
> DFSClient_408514838, pendingcreates: 1]
> ....
> 12/11/20 01:05:10 WARN hdfs.DFSClient: Error Recovery for block
> blk_-1434919284750099885_670717751 bad datanode[0] nodes == null
> 12/11/20 01:05:10 WARN hdfs.DFSClient: Could not get block locations. Source
> file "<some output path>" - Aborting...
> 12/11/20 01:05:10 INFO mapred.JobHistory: Logging failed for job
> job_201211011002_1852removing PrintWriter from FileManager
> 12/11/20 01:05:10 ERROR security.UserGroupInformation:
> PriviledgedActionException as:mapred (auth:SIMPLE)
> cause:java.io.IOException: java.util.ConcurrentModificationException
> 12/11/20 01:05:10 INFO ipc.Server: IPC Server handler 7 on 9001, call
> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1256e5f6, false, false,
> true, -17988) from 10.2.73.35:44969: error: java.io.IOException:
> java.util.ConcurrentModificationException
>
>
> When I look at the source code of JobInProgress.completedTask(), I see the
> log message about successful completion of the task, and after that the
> history logging to HDFS (JobHistory.Task.logFinished()). I suppose that if
> this call throws an exception (as in the case above), completedTask() is
> aborted *before* the counters runningMapTasks and finishedMapTasks are
> updated accordingly. I took a heap dump of the JobTracker and indeed found
> runningMapTasks set to 1 and finishedMapTasks equal to numMapTasks - 1.
>
> Now, the question is: should this be handled in the JobTracker (say, by
> moving the logging code after the counter manipulation)? Or should the
> TaskTracker re-report the completed task when the JobTracker returns an
> error? What can cause the LeaseExpiredException? Should a JIRA be filed? :)
>
> Thanks for comments,
>  Jan
>
>



-- 
Harsh J
