Posted to common-user@hadoop.apache.org by Иван <m1...@mail.ru> on 2008/09/16 17:35:33 UTC

Instant death of all TaskTrackers

Today I ran into a strange situation: while some MapReduce jobs were running, all the TaskTrackers in the cluster simply disappeared for no apparent reason, but the JobTracker stayed alive as if nothing had happened. Its web interface was still up, showing zero map and reduce capacity and the same jobs still in the running state (in fact, the TaskTracker$Child processes also remained in memory). Examining the TaskTrackers' logs turned up (almost) the same exception at the tail of each, like this one:

2008-09-16 06:27:11,244 WARN org.apache.hadoop.mapred.TaskTracker: Error initializing task_200809151253_1938_m_000003_0:
java.lang.InternalError: jzentry == 0,
 jzfile = 46912646564160,
 total = 148,
 name = /data/hadoop/root/mapred/local/taskTracker/jobcache/job_200809151253_1938/jars/job.jar,
 i = 3,
 message = invalid LOC header (bad signature)
        at java.util.zip.ZipFile$3.nextElement(ZipFile.java:429)
        at java.util.zip.ZipFile$3.nextElement(ZipFile.java:415)
        at java.util.jar.JarFile$1.nextElement(JarFile.java:221)
        at java.util.jar.JarFile$1.nextElement(JarFile.java:220)
        at org.apache.hadoop.util.RunJar.unJar(RunJar.java:40)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:708)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1274)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:915)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1310)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2251)
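For what it's worth, the "invalid LOC header (bad signature)" message comes from java.util.zip while RunJar.unJar enumerates the entries of the localized job.jar, so it looks as if the jar itself is unreadable on the local disk. A minimal check along these lines (the path is just the one from the trace above, and the class name is arbitrary) should tell whether a given node's copy of the jar is actually corrupt:

import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class JarCheck {
    public static void main(String[] args) throws Exception {
        // Path of the localized job.jar to verify; this default is the one from
        // the trace above, pass a different path as the first argument on other nodes.
        String path = args.length > 0 ? args[0]
                : "/data/hadoop/root/mapred/local/taskTracker/jobcache/job_200809151253_1938/jars/job.jar";
        ZipFile zip = new ZipFile(path);
        byte[] buf = new byte[8192];
        Enumeration<? extends ZipEntry> entries = zip.entries();
        while (entries.hasMoreElements()) {
            ZipEntry entry = entries.nextElement();
            InputStream in = zip.getInputStream(entry);
            try {
                // Read the entry end to end; a corrupt archive fails here
                // or already while enumerating the entries.
                while (in.read(buf) != -1) { }
            } finally {
                in.close();
            }
        }
        System.out.println("all entries readable");
        zip.close();
    }
}

If that throws the same error, the jar on the node's local disk really is damaged after localization.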

The exact exception strings were different, but the stack traces on all nodes were close to each other. At first it seemed I could just shrug it off and simply start the TaskTrackers again, but something went wrong. The fsck utility found some missing blocks, and the HBase instance running on the same cluster simply became unavailable and later failed to start up (these issues seem to be connected). HBase reported SocketTimeoutExceptions (in fact only about two servers at a time, but after a cluster restart the role of "victim" moved to other nodes), while the HDFS logs occasionally contained messages about being unable to find some old blocks or to create new ones. I've double-checked the likely suspects: DNS problems, network collisions, iptables, possible disk corruption, and so on, but even a complete cluster reboot hasn't changed the situation a bit.

P.S.: Hadoop 0.17.1, HBase 0.2.0, Debian Etch
P.P.S.: In case it matters: the MR jobs running at that moment were manipulating data in HBase, and all the blocks that report problems are located in the HBase root directory (at least it looks that way).
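
If it helps, something along these lines could confirm which files under the HBase root directory are actually unreadable from HDFS (the /hbase path is only a guess for hbase.rootdir on this cluster, and the class name is made up):

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadCheck {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // "/hbase" is only an assumption for hbase.rootdir; pass the real path as an argument.
        walk(fs, new Path(args.length > 0 ? args[0] : "/hbase"));
    }

    static void walk(FileSystem fs, Path dir) throws IOException {
        FileStatus[] items = fs.listStatus(dir);
        if (items == null) {
            return;
        }
        byte[] buf = new byte[64 * 1024];
        for (FileStatus item : items) {
            if (item.isDir()) {
                walk(fs, item.getPath());
                continue;
            }
            InputStream in = null;
            try {
                // Reading the whole file forces the client to fetch every block,
                // so files with missing or unreachable blocks show up here.
                in = fs.open(item.getPath());
                while (in.read(buf) != -1) { }
            } catch (IOException e) {
                System.out.println("unreadable: " + item.getPath() + " (" + e.getMessage() + ")");
            } finally {
                if (in != null) {
                    in.close();
                }
            }
        }
    }
}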

Thanks,
Ivan Blinkov