You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Marcus Herou <ma...@tailsweep.com> on 2009/01/25 17:42:26 UTC

Lingering TaskTracker$Child

Hi.

Today I noticed when I ran a Solr Indexing job through our Hadoop cluster
that the master MySQL database where screaming about "Too Many Connections".

I wondered how that could happen so I logged into my Hadoop machines and
searched through the logs. Nothing strange there. Then I just did a jps:

root@mapreduce1:~# jps
10701 TaskTracker$Child
9567 NameNode
5435 TaskTracker$Child
31801 Bootstrap
7349 TaskTracker$Child
6197 TaskTracker$Child
7761 TaskTracker$Child
10453 TaskTracker$Child
11232 TaskTracker$Child
11113 TaskTracker$Child
9688 DataNode
10877 TaskTracker$Child
6504 TaskTracker$Child
10236 TaskTracker$Child
9852 TaskTracker
6515 TaskTracker$Child
11396 TaskTracker$Child
11741 Jps
6191 TaskTracker$Child
10981 TaskTracker$Child
7742 TaskTracker$Child
5946 TaskTracker$Child
11315 TaskTracker$Child
8112 TaskTracker$Child
11580 TaskTracker$Child
11490 TaskTracker$Child
5687 TaskTracker$Child
5927 TaskTracker$Child
27144 WrapperSimpleApp
7368 TaskTracker$Child

Damn! Each Child have it's own DataSource (dbcp pool) tweaked down so it
only can have one active connection to any shard at any time.
Background: I ran out of connections during the Christmas holidays since I
have 60 shards (10 per MySQL machine) and each required a DB-Pool which
allowed too many active+idle connections.

Anyway I have no active jobs at the moment so the children should have died
by themselves.
Fortunately I have a little nice script which kills the bastards: jps |egrep
"TaskTracker.+" | awk '{print $1}'|xargs kill
I will probably put that in a cronjob which kills long running children...

Anyway, how can this happen ? Am I doing something really stupid along the
way ?
Hard facts:
Ubuntu Hardy-Heron, 2.6.24-19-server
java version "1.6.0_06"
Hadoop-0.18.2
It's my own classes which fires the jobs through JobClient
(JobClient.runJob(job))
I feed the jar to hadoop by issuing: job.setJar(jarFile); (comes from a bash
script)
I feed deps into hadoop by issuing: job.set("tmpjars", jarFiles); (comes by
parsing external CLASSPATH ENV in bash)

The client do not complain, se example output below (I write no data to HDFS
((HDFS bytes written=774)), since I mostly use it for crawling and all
crawlers/indexers access my sharding db structure directly without
intermediate storage):
2009-01-25 17:12:11.175 INFO main org.apache.hadoop.mapred.FileInputFormat -
Total input paths to process : 1
2009-01-25 17:12:11.176 INFO main org.apache.hadoop.mapred.FileInputFormat -
Total input paths to process : 1
2009-01-25 17:12:11.437 INFO main org.apache.hadoop.mapred.JobClient -
Running job: job_200901251629_0011
2009-01-25 17:12:12.439 INFO main org.apache.hadoop.mapred.JobClient -  map
0% reduce 0%
2009-01-25 17:12:35.481 INFO main org.apache.hadoop.mapred.JobClient -  map
6% reduce 0%
2009-01-25 17:12:40.493 INFO main org.apache.hadoop.mapred.JobClient -  map
21% reduce 0%
2009-01-25 17:12:45.502 INFO main org.apache.hadoop.mapred.JobClient -  map
31% reduce 0%
2009-01-25 17:12:50.511 INFO main org.apache.hadoop.mapred.JobClient -  map
51% reduce 0%
2009-01-25 17:12:55.520 INFO main org.apache.hadoop.mapred.JobClient -  map
67% reduce 0%
2009-01-25 17:13:00.533 INFO main org.apache.hadoop.mapred.JobClient -  map
72% reduce 0%
2009-01-25 17:13:05.543 INFO main org.apache.hadoop.mapred.JobClient -  map
84% reduce 0%
2009-01-25 17:13:10.552 INFO main org.apache.hadoop.mapred.JobClient -  map
95% reduce 0%
2009-01-25 17:13:15.560 INFO main org.apache.hadoop.mapred.JobClient -  map
98% reduce 0%
2009-01-25 17:13:20.568 INFO main org.apache.hadoop.mapred.JobClient - Job
complete: job_200901251629_0011
2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
Counters: 7
2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
File Systems
2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
HDFS bytes read=2741143
2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
HDFS bytes written=774
2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -   Job
Counters
2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
Rack-local map tasks=9
2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
Launched map tasks=9
2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
Map-Reduce Framework
2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
Map input records=48314
2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
Map input bytes=2732424
2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
Map output records=0

Any suggestions or pointers would be greatly appreciated. Hmm Coming to
think about something. I start X threads from inside Hadoop almost cut'n
pasted from Nutch.
If a thread somehow would linger, would Hadoop not be able to shutdown even
though there is nothing more to read from the RecordReader ?

Kindly

//Marcus

-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Lingering TaskTracker$Child

Posted by Marcus Herou <ma...@tailsweep.com>.
Thanks.

Before going down the Hadoop modding path we just run this script in cron
every 5 mins.

#!/bin/bash
# Only allows the two most recenet children to survive.
CMD=`jps |egrep 'TaskTracker.+' | awk '{print \$1}'|xargs --no-run-if-empty
ps -o etime,pid -p |egrep -v 'PID|ELAPSED' |sed 's/://g' | sort -nr | wc -l`
echo $CMD
if [ $CMD -gt 2 ]; then
    echo $CMD
    DIFF=$(($CMD - 2));
    KILL_RES=`jps | egrep 'TaskTracker.+'  | awk '{print \$1}'|xargs
--no-run-if-empty ps -o etime,pid -p |egrep -v 'PID|ELAPSED' |sed 's/://g' |
sort -nr | head -n$DIFF | awk '{print \$2}'|xargs kill`
    echo "$KILL_RES";
else
    echo "Nothing to kill";
fi



On Mon, Jan 26, 2009 at 3:03 AM, jason hadoop <ja...@gmail.com>wrote:

> Yes, I mean we mod'd the tasktracker code.
>
> It is most likely that you have some threads about that are holding the JVM
> from exiting.
> If you are mot doing JVM re-use as of the 19/20 feature set
> You could in the close method of your map/reduce task, Walk through the
> list of threads and prioritize them to daemon.
> The attached java should do the trick if called from the close method of
> your map or reduce.
>
>
> I have attached our modified version from 0.18.1
> Our line is: // Force the jvm to exit even if it has threads still running,
> this prevents memory expensive jvms being left around
>
>
>
> On Sun, Jan 25, 2009 at 9:05 AM, Marcus Herou <ma...@tailsweep.com>wrote:
>
>> Oh, fuck you mean modding the hadoop TaskTracker child code ?
>>
>> On Sun, Jan 25, 2009 at 6:04 PM, Marcus Herou <marcus.herou@tailsweep.com
>> >wrote:
>>
>> > Thanks!
>> >
>> > So by you experience would this be good enough ? (Notice the
>> > System.exit(0))
>> >
>> > I implement the MapRunnable interface.
>> >
>> >  public void run(RecordReader<LongWritable, Text> recordReader,
>> > OutputCollector<WritableComparable, WritableComparable> outputCollector,
>> > Reporter reporter) throws IOException
>> >     {
>> >         this.recordReader = recordReader;
>> >         this.outputCollector = outputCollector;
>> >         this.reporter = reporter;
>> >         int threads =
>> > Integer.valueOf(this.getConf().get(getClass().getName()+".threads",
>> "10"));
>> >         log.info("Starting with "+threads + " threads");
>> >         long timeout =
>> > Long.valueOf(this.getConf().get(getClass().getName()+".timeout",
>> "600000"));
>> >
>> >         for (int i = 0; i < threads; i++)
>> >         {
>> >             // spawn threads
>> >             new FetcherThread().start();
>> >         }
>> >         do
>> >         {                                          // wait for threads
>> to
>> > exit
>> >             try
>> >             {
>> >                 Thread.sleep(1000);
>> >             } catch (InterruptedException e) {}
>> >
>> >             reportStatus();
>> >
>> >             // some requests seem to hang, despite all intentions
>> >             synchronized (this)
>> >             {
>> >                 if ((System.currentTimeMillis() - lastRequestStart) >
>> > timeout)
>> >                 {
>> >                     if (log.isWarnEnabled())
>> >                     {
>> >                         log.warn("Aborting with "+activeThreads+" hung
>> > threads.");
>> >                     }
>> >                     return;
>> >                 }
>> >             }
>> >
>> >         } while (activeThreads > 0);
>> >         log.info("All threads seem to be done, exiting");
>> >         System.exit(0);
>> >
>> >     }
>> >
>> > On Sun, Jan 25, 2009 at 5:57 PM, jason hadoop <jason.hadoop@gmail.com
>> >wrote:
>> >
>> >> We had trouble like that with some jobs, when the child ran additional
>> >> threads that were not set at daemon priority. These hold the Child JVM
>> >> from
>> >> exiting.
>> >> JMX was the cause in our case, but we have seen our JNI jobs do it
>> also.
>> >> In the end we made a local mod that forced a System.exit in the finally
>> >> block of the Child main.
>> >>
>> >>
>> >> On Sun, Jan 25, 2009 at 8:53 AM, Marcus Herou <
>> marcus.herou@tailsweep.com
>> >> >wrote:
>> >>
>> >> > Some extra info, apparently the child exits with a status of 143.
>> >> >
>> >> > 2009-01-25 17:13:11.110 INFO IPC Server handler 0 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0011_m_000005_0
>> >> > 1.0% 5364 docs, 0 errors, 124.7 docs/s
>> >> > 2009-01-25 17:13:11.114 INFO IPC Server handler 1 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker - Task
>> >> > attempt_200901251629_0011_m_000005_0 is done.
>> >> > 2009-01-25 17:13:11.116 INFO main org.apache.hadoop.mapred.TaskRunner
>> -
>> >> > Task
>> >> > 'attempt_200901251629_0011_m_000005_0' done.
>> >> > 2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0011_m_000005_0
>> >> > 1.0% 5364 docs, 0 errors, 124.7 docs/s
>> >> > 2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0011_m_000005_0
>> >> > Ignoring status-update since task is 'done'
>> >> > 2009-01-25 17:13:24.996 INFO taskCleanup
>> >> > org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for
>> job:
>> >> > job_200901251629_0011
>> >> > 2009-01-25 17:13:24.996 INFO taskCleanup
>> >> > org.apache.hadoop.mapred.TaskRunner
>> >> > - attempt_200901251629_0011_m_000005_0 done; removing files.
>> >> > 2009-01-25 17:47:22.668 WARN Thread-23
>> >> org.apache.hadoop.mapred.TaskRunner
>> >> > -
>> >> > attempt_200901251629_0001_m_000006_0 Child Error
>> >> > java.io.IOException: Task process exit with nonzero status of 143.
>> >> >        at
>> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >> >        at
>> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> >> > 2009-01-25 17:47:22.669 WARN Thread-23
>> >> org.apache.hadoop.mapred.TaskTracker
>> >> > - Error from unknown child task:
>> attempt_200901251629_0001_m_000006_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:22.671 WARN Thread-23
>> >> org.apache.hadoop.mapred.TaskTracker
>> >> > - Unknown child task finshed: attempt_200901251629_0001_m_000006_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:22.713 WARN Thread-79
>> >> org.apache.hadoop.mapred.TaskRunner
>> >> > -
>> >> > attempt_200901251629_0002_m_000007_0 Child Error
>> >> > java.io.IOException: Task process exit with nonzero status of 143.
>> >> >        at
>> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >> >        at
>> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> >> > 2009-01-25 17:47:22.713 WARN Thread-159
>> >> org.apache.hadoop.mapred.TaskRunner
>> >> > - attempt_200901251629_0011_m_000005_0 Child Error
>> >> > java.io.IOException: Task process exit with nonzero status of 143.
>> >> >        at
>> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >> >        at
>> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> >> > 2009-01-25 17:47:22.713 WARN Thread-79
>> >> org.apache.hadoop.mapred.TaskTracker
>> >> > - Error from unknown child task:
>> attempt_200901251629_0002_m_000007_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:22.714 WARN Thread-159
>> >> > org.apache.hadoop.mapred.TaskTracker
>> >> > - Error from unknown child task:
>> attempt_200901251629_0011_m_000005_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:22.714 WARN Thread-79
>> >> org.apache.hadoop.mapred.TaskTracker
>> >> > - Unknown child task finshed: attempt_200901251629_0002_m_000007_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:22.715 WARN Thread-159
>> >> > org.apache.hadoop.mapred.TaskTracker
>> >> > - Unknown child task finshed: attempt_200901251629_0011_m_000005_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:22.811 WARN Thread-119
>> >> org.apache.hadoop.mapred.TaskRunner
>> >> > - attempt_200901251629_0006_m_000007_0 Child Error
>> >> > java.io.IOException: Task process exit with nonzero status of 143.
>> >> >        at
>> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >> >        at
>> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> >> > 2009-01-25 17:47:22.811 WARN Thread-119
>> >> > org.apache.hadoop.mapred.TaskTracker
>> >> > - Error from unknown child task:
>> attempt_200901251629_0006_m_000007_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:22.812 WARN Thread-119
>> >> > org.apache.hadoop.mapred.TaskTracker
>> >> > - Unknown child task finshed: attempt_200901251629_0006_m_000007_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:22.960 WARN Thread-89
>> >> org.apache.hadoop.mapred.TaskRunner
>> >> > -
>> >> > attempt_200901251629_0003_m_000007_0 Child Error
>> >> > java.io.IOException: Task process exit with nonzero status of 143.
>> >> >        at
>> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >> >        at
>> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> >> > 2009-01-25 17:47:22.961 WARN Thread-89
>> >> org.apache.hadoop.mapred.TaskTracker
>> >> > - Error from unknown child task:
>> attempt_200901251629_0003_m_000007_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:22.962 WARN Thread-89
>> >> org.apache.hadoop.mapred.TaskTracker
>> >> > - Unknown child task finshed: attempt_200901251629_0003_m_000007_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:23.060 WARN Thread-129
>> >> org.apache.hadoop.mapred.TaskRunner
>> >> > - attempt_200901251629_0008_m_000003_0 Child Error
>> >> > java.io.IOException: Task process exit with nonzero status of 143.
>> >> >        at
>> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >> >        at
>> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> >> > 2009-01-25 17:47:23.061 WARN Thread-129
>> >> > org.apache.hadoop.mapred.TaskTracker
>> >> > - Error from unknown child task:
>> attempt_200901251629_0008_m_000003_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:23.062 WARN Thread-129
>> >> > org.apache.hadoop.mapred.TaskTracker
>> >> > - Unknown child task finshed: attempt_200901251629_0008_m_000003_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:23.279 WARN Thread-99
>> >> org.apache.hadoop.mapred.TaskRunner
>> >> > -
>> >> > attempt_200901251629_0004_m_000007_0 Child Error
>> >> > java.io.IOException: Task process exit with nonzero status of 143.
>> >> >        at
>> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >> >        at
>> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> >> > 2009-01-25 17:47:23.279 WARN Thread-99
>> >> org.apache.hadoop.mapred.TaskTracker
>> >> > - Error from unknown child task:
>> attempt_200901251629_0004_m_000007_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:23.280 WARN Thread-99
>> >> org.apache.hadoop.mapred.TaskTracker
>> >> > - Unknown child task finshed: attempt_200901251629_0004_m_000007_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:23.360 WARN Thread-139
>> >> org.apache.hadoop.mapred.TaskRunner
>> >> > - attempt_200901251629_0009_m_000004_0 Child Error
>> >> > java.io.IOException: Task process exit with nonzero status of 143.
>> >> >        at
>> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >> >        at
>> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> >> > 2009-01-25 17:47:23.360 WARN Thread-139
>> >> > org.apache.hadoop.mapred.TaskTracker
>> >> > - Error from unknown child task:
>> attempt_200901251629_0009_m_000004_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:23.361 WARN Thread-139
>> >> > org.apache.hadoop.mapred.TaskTracker
>> >> > - Unknown child task finshed: attempt_200901251629_0009_m_000004_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:23.445 WARN Thread-149
>> >> org.apache.hadoop.mapred.TaskRunner
>> >> > - attempt_200901251629_0010_m_000004_0 Child Error
>> >> > java.io.IOException: Task process exit with nonzero status of 143.
>> >> >        at
>> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >> >        at
>> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> >> > 2009-01-25 17:47:23.445 WARN Thread-149
>> >> > org.apache.hadoop.mapred.TaskTracker
>> >> > - Error from unknown child task:
>> attempt_200901251629_0010_m_000004_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:23.446 WARN Thread-149
>> >> > org.apache.hadoop.mapred.TaskTracker
>> >> > - Unknown child task finshed: attempt_200901251629_0010_m_000004_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:23.463 WARN Thread-109
>> >> org.apache.hadoop.mapred.TaskRunner
>> >> > - attempt_200901251629_0005_m_000007_0 Child Error
>> >> > java.io.IOException: Task process exit with nonzero status of 143.
>> >> >        at
>> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >> >        at
>> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> >> > 2009-01-25 17:47:23.464 WARN Thread-109
>> >> > org.apache.hadoop.mapred.TaskTracker
>> >> > - Error from unknown child task:
>> attempt_200901251629_0005_m_000007_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:47:23.465 WARN Thread-109
>> >> > org.apache.hadoop.mapred.TaskTracker
>> >> > - Unknown child task finshed: attempt_200901251629_0005_m_000007_0.
>> >> > Ignored.
>> >> > 2009-01-25 17:48:15.315 INFO main
>> org.apache.hadoop.mapred.TaskTracker -
>> >> > LaunchTaskAction: attempt_200901251629_0012_m_000004_0
>> >> > 2009-01-25 17:48:28.088 INFO main org.apache.hadoop.mapred.MapTask -
>> >> > numReduceTasks: 0
>> >> > 2009-01-25 17:48:34.097 INFO IPC Server handler 0 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.07784176% 347 docs, 0 errors, 69.4 docs/s
>> >> > 2009-01-25 17:48:37.099 INFO IPC Server handler 1 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.13955688% 673 docs, 0 errors, 84.1 docs/s
>> >> > 2009-01-25 17:48:40.100 INFO IPC Server handler 0 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.20145722% 1003 docs, 0 errors, 91.2 docs/s
>> >> > 2009-01-25 17:48:43.102 INFO IPC Server handler 1 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.26416454% 1343 docs, 0 errors, 95.9 docs/s
>> >> > 2009-01-25 17:48:46.104 INFO IPC Server handler 0 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.32754657% 1673 docs, 0 errors, 98.4 docs/s
>> >> > 2009-01-25 17:48:49.105 INFO IPC Server handler 1 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.39229122% 2025 docs, 0 errors, 101.3 docs/s
>> >> > 2009-01-25 17:48:52.106 INFO IPC Server handler 0 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.44708738% 2315 docs, 0 errors, 100.7 docs/s
>> >> > 2009-01-25 17:48:55.108 INFO IPC Server handler 1 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.5117659% 2658 docs, 0 errors, 102.2 docs/s
>> >> > 2009-01-25 17:48:58.109 INFO IPC Server handler 0 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.5795401% 3019 docs, 0 errors, 104.1 docs/s
>> >> > 2009-01-25 17:49:01.111 INFO IPC Server handler 1 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.6468777% 3377 docs, 0 errors, 105.5 docs/s
>> >> > 2009-01-25 17:49:04.112 INFO IPC Server handler 0 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.71514136% 3741 docs, 0 errors, 106.9 docs/s
>> >> > 2009-01-25 17:49:07.114 INFO IPC Server handler 1 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.78037554% 4094 docs, 0 errors, 107.7 docs/s
>> >> > 2009-01-25 17:49:10.115 INFO IPC Server handler 0 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.84660184% 4448 docs, 0 errors, 108.5 docs/s
>> >> > 2009-01-25 17:49:13.117 INFO IPC Server handler 1 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.9146803% 4810 docs, 0 errors, 109.3 docs/s
>> >> > 2009-01-25 17:49:16.118 INFO IPC Server handler 0 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 0.9521195% 5044 docs, 0 errors, 107.3 docs/s
>> >> > 2009-01-25 17:49:18.626 INFO IPC Server handler 1 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 1.0% 5364 docs, 0 errors, 107.3 docs/s
>> >> > 2009-01-25 17:49:18.631 INFO IPC Server handler 0 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker - Task
>> >> > attempt_200901251629_0012_m_000004_0 is done.
>> >> > 2009-01-25 17:49:18.631 INFO main org.apache.hadoop.mapred.TaskRunner
>> -
>> >> > Task
>> >> > 'attempt_200901251629_0012_m_000004_0' done.
>> >> > 2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > 1.0% 5364 docs, 0 errors, 107.3 docs/s
>> >> > 2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
>> >> > org.apache.hadoop.mapred.TaskTracker -
>> >> attempt_200901251629_0012_m_000004_0
>> >> > Ignoring status-update since task is 'done'
>> >> > 2009-01-25 17:49:35.582 INFO taskCleanup
>> >> > org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for
>> job:
>> >> > job_200901251629_0012
>> >> > 2009-01-25 17:49:35.582 INFO taskCleanup
>> >> > org.apache.hadoop.mapred.TaskRunner
>> >> > - attempt_200901251629_0012_m_000004_0 done; removing files.
>> >> >
>> >> > # Still processes left even though the TaskTracker said: Received
>> >> > 'KillJobAction' for job: job_200901251629_0012
>> >> > root@mapreduce2:~# jps
>> >> > 10732 Jps
>> >> > 10634 TaskTracker$Child
>> >> > 8660 DataNode
>> >> > 8824 TaskTracker
>> >> > 8730 SecondaryNameNode
>> >> > 25060 Bootstrap
>> >> > root@mapreduce2:~# date
>> >> > Sun Jan 25 17:51:48 CET 2009
>> >> > root@mapreduce2:~#
>> >> >
>> >> > On Sun, Jan 25, 2009 at 5:42 PM, Marcus Herou <
>> >> marcus.herou@tailsweep.com
>> >> > >wrote:
>> >> >
>> >> > > Hi.
>> >> > >
>> >> > > Today I noticed when I ran a Solr Indexing job through our Hadoop
>> >> cluster
>> >> > > that the master MySQL database where screaming about "Too Many
>> >> > Connections".
>> >> > >
>> >> > > I wondered how that could happen so I logged into my Hadoop
>> machines
>> >> and
>> >> > > searched through the logs. Nothing strange there. Then I just did a
>> >> jps:
>> >> > >
>> >> > > root@mapreduce1:~# jps
>> >> > > 10701 TaskTracker$Child
>> >> > > 9567 NameNode
>> >> > > 5435 TaskTracker$Child
>> >> > > 31801 Bootstrap
>> >> > > 7349 TaskTracker$Child
>> >> > > 6197 TaskTracker$Child
>> >> > > 7761 TaskTracker$Child
>> >> > > 10453 TaskTracker$Child
>> >> > > 11232 TaskTracker$Child
>> >> > > 11113 TaskTracker$Child
>> >> > > 9688 DataNode
>> >> > > 10877 TaskTracker$Child
>> >> > > 6504 TaskTracker$Child
>> >> > > 10236 TaskTracker$Child
>> >> > > 9852 TaskTracker
>> >> > > 6515 TaskTracker$Child
>> >> > > 11396 TaskTracker$Child
>> >> > > 11741 Jps
>> >> > > 6191 TaskTracker$Child
>> >> > > 10981 TaskTracker$Child
>> >> > > 7742 TaskTracker$Child
>> >> > > 5946 TaskTracker$Child
>> >> > > 11315 TaskTracker$Child
>> >> > > 8112 TaskTracker$Child
>> >> > > 11580 TaskTracker$Child
>> >> > > 11490 TaskTracker$Child
>> >> > > 5687 TaskTracker$Child
>> >> > > 5927 TaskTracker$Child
>> >> > > 27144 WrapperSimpleApp
>> >> > > 7368 TaskTracker$Child
>> >> > >
>> >> > > Damn! Each Child have it's own DataSource (dbcp pool) tweaked down
>> so
>> >> it
>> >> > > only can have one active connection to any shard at any time.
>> >> > > Background: I ran out of connections during the Christmas holidays
>> >> since
>> >> > I
>> >> > > have 60 shards (10 per MySQL machine) and each required a DB-Pool
>> >> which
>> >> > > allowed too many active+idle connections.
>> >> > >
>> >> > > Anyway I have no active jobs at the moment so the children should
>> have
>> >> > died
>> >> > > by themselves.
>> >> > > Fortunately I have a little nice script which kills the bastards:
>> jps
>> >> > > |egrep "TaskTracker.+" | awk '{print $1}'|xargs kill
>> >> > > I will probably put that in a cronjob which kills long running
>> >> > children...
>> >> > >
>> >> > > Anyway, how can this happen ? Am I doing something really stupid
>> along
>> >> > the
>> >> > > way ?
>> >> > > Hard facts:
>> >> > > Ubuntu Hardy-Heron, 2.6.24-19-server
>> >> > > java version "1.6.0_06"
>> >> > > Hadoop-0.18.2
>> >> > > It's my own classes which fires the jobs through JobClient
>> >> > > (JobClient.runJob(job))
>> >> > > I feed the jar to hadoop by issuing: job.setJar(jarFile); (comes
>> from
>> >> a
>> >> > > bash script)
>> >> > > I feed deps into hadoop by issuing: job.set("tmpjars", jarFiles);
>> >> (comes
>> >> > by
>> >> > > parsing external CLASSPATH ENV in bash)
>> >> > >
>> >> > > The client do not complain, se example output below (I write no
>> data
>> >> to
>> >> > > HDFS ((HDFS bytes written=774)), since I mostly use it for crawling
>> >> and
>> >> > all
>> >> > > crawlers/indexers access my sharding db structure directly without
>> >> > > intermediate storage):
>> >> > > 2009-01-25 17:12:11.175 INFO main
>> >> > org.apache.hadoop.mapred.FileInputFormat
>> >> > > - Total input paths to process : 1
>> >> > > 2009-01-25 17:12:11.176 INFO main
>> >> > org.apache.hadoop.mapred.FileInputFormat
>> >> > > - Total input paths to process : 1
>> >> > > 2009-01-25 17:12:11.437 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> > > Running job: job_200901251629_0011
>> >> > > 2009-01-25 17:12:12.439 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> >  map
>> >> > > 0% reduce 0%
>> >> > > 2009-01-25 17:12:35.481 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> >  map
>> >> > > 6% reduce 0%
>> >> > > 2009-01-25 17:12:40.493 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> >  map
>> >> > > 21% reduce 0%
>> >> > > 2009-01-25 17:12:45.502 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> >  map
>> >> > > 31% reduce 0%
>> >> > > 2009-01-25 17:12:50.511 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> >  map
>> >> > > 51% reduce 0%
>> >> > > 2009-01-25 17:12:55.520 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> >  map
>> >> > > 67% reduce 0%
>> >> > > 2009-01-25 17:13:00.533 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> >  map
>> >> > > 72% reduce 0%
>> >> > > 2009-01-25 17:13:05.543 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> >  map
>> >> > > 84% reduce 0%
>> >> > > 2009-01-25 17:13:10.552 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> >  map
>> >> > > 95% reduce 0%
>> >> > > 2009-01-25 17:13:15.560 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> >  map
>> >> > > 98% reduce 0%
>> >> > > 2009-01-25 17:13:20.568 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> > Job
>> >> > > complete: job_200901251629_0011
>> >> > > 2009-01-25 17:13:20.570 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> > > Counters: 7
>> >> > > 2009-01-25 17:13:20.570 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> > > File Systems
>> >> > > 2009-01-25 17:13:20.570 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> > > HDFS bytes read=2741143
>> >> > > 2009-01-25 17:13:20.570 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> > > HDFS bytes written=774
>> >> > > 2009-01-25 17:13:20.570 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> > > Job Counters
>> >> > > 2009-01-25 17:13:20.570 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> > > Rack-local map tasks=9
>> >> > > 2009-01-25 17:13:20.571 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> > > Launched map tasks=9
>> >> > > 2009-01-25 17:13:20.571 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> > > Map-Reduce Framework
>> >> > > 2009-01-25 17:13:20.571 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> > > Map input records=48314
>> >> > > 2009-01-25 17:13:20.571 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> > > Map input bytes=2732424
>> >> > > 2009-01-25 17:13:20.571 INFO main
>> org.apache.hadoop.mapred.JobClient -
>> >> > > Map output records=0
>> >> > >
>> >> > > Any suggestions or pointers would be greatly appreciated. Hmm
>> Coming
>> >> to
>> >> > > think about something. I start X threads from inside Hadoop almost
>> >> cut'n
>> >> > > pasted from Nutch.
>> >> > > If a thread somehow would linger, would Hadoop not be able to
>> shutdown
>> >> > even
>> >> > > though there is nothing more to read from the RecordReader ?
>> >> > >
>> >> > > Kindly
>> >> > >
>> >> > > //Marcus
>> >> > >
>> >> > > --
>> >> > > Marcus Herou CTO and co-founder Tailsweep AB
>> >> > > +46702561312
>> >> > > marcus.herou@tailsweep.com
>> >> > > http://www.tailsweep.com/
>> >> > > http://blogg.tailsweep.com/
>> >> > >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Marcus Herou CTO and co-founder Tailsweep AB
>> >> > +46702561312
>> >> > marcus.herou@tailsweep.com
>> >> > http://www.tailsweep.com/
>> >> > http://blogg.tailsweep.com/
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Marcus Herou CTO and co-founder Tailsweep AB
>> > +46702561312
>> > marcus.herou@tailsweep.com
>> > http://www.tailsweep.com/
>> > http://blogg.tailsweep.com/
>> >
>>
>>
>>
>> --
>> Marcus Herou CTO and co-founder Tailsweep AB
>> +46702561312
>> marcus.herou@tailsweep.com
>> http://www.tailsweep.com/
>> http://blogg.tailsweep.com/
>>
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Lingering TaskTracker$Child

Posted by jason hadoop <ja...@gmail.com>.
Yes, I mean we mod'd the tasktracker code.

It is most likely that you have some threads about that are holding the JVM
from exiting.
If you are mot doing JVM re-use as of the 19/20 feature set
You could in the close method of your map/reduce task, Walk through the list
of threads and prioritize them to daemon.
The attached java should do the trick if called from the close method of
your map or reduce.


I have attached our modified version from 0.18.1
Our line is: // Force the jvm to exit even if it has threads still running,
this prevents memory expensive jvms being left around


On Sun, Jan 25, 2009 at 9:05 AM, Marcus Herou <ma...@tailsweep.com>wrote:

> Oh, fuck you mean modding the hadoop TaskTracker child code ?
>
> On Sun, Jan 25, 2009 at 6:04 PM, Marcus Herou <marcus.herou@tailsweep.com
> >wrote:
>
> > Thanks!
> >
> > So by you experience would this be good enough ? (Notice the
> > System.exit(0))
> >
> > I implement the MapRunnable interface.
> >
> >  public void run(RecordReader<LongWritable, Text> recordReader,
> > OutputCollector<WritableComparable, WritableComparable> outputCollector,
> > Reporter reporter) throws IOException
> >     {
> >         this.recordReader = recordReader;
> >         this.outputCollector = outputCollector;
> >         this.reporter = reporter;
> >         int threads =
> > Integer.valueOf(this.getConf().get(getClass().getName()+".threads",
> "10"));
> >         log.info("Starting with "+threads + " threads");
> >         long timeout =
> > Long.valueOf(this.getConf().get(getClass().getName()+".timeout",
> "600000"));
> >
> >         for (int i = 0; i < threads; i++)
> >         {
> >             // spawn threads
> >             new FetcherThread().start();
> >         }
> >         do
> >         {                                          // wait for threads to
> > exit
> >             try
> >             {
> >                 Thread.sleep(1000);
> >             } catch (InterruptedException e) {}
> >
> >             reportStatus();
> >
> >             // some requests seem to hang, despite all intentions
> >             synchronized (this)
> >             {
> >                 if ((System.currentTimeMillis() - lastRequestStart) >
> > timeout)
> >                 {
> >                     if (log.isWarnEnabled())
> >                     {
> >                         log.warn("Aborting with "+activeThreads+" hung
> > threads.");
> >                     }
> >                     return;
> >                 }
> >             }
> >
> >         } while (activeThreads > 0);
> >         log.info("All threads seem to be done, exiting");
> >         System.exit(0);
> >
> >     }
> >
> > On Sun, Jan 25, 2009 at 5:57 PM, jason hadoop <jason.hadoop@gmail.com
> >wrote:
> >
> >> We had trouble like that with some jobs, when the child ran additional
> >> threads that were not set at daemon priority. These hold the Child JVM
> >> from
> >> exiting.
> >> JMX was the cause in our case, but we have seen our JNI jobs do it also.
> >> In the end we made a local mod that forced a System.exit in the finally
> >> block of the Child main.
> >>
> >>
> >> On Sun, Jan 25, 2009 at 8:53 AM, Marcus Herou <
> marcus.herou@tailsweep.com
> >> >wrote:
> >>
> >> > Some extra info, apparently the child exits with a status of 143.
> >> >
> >> > 2009-01-25 17:13:11.110 INFO IPC Server handler 0 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0011_m_000005_0
> >> > 1.0% 5364 docs, 0 errors, 124.7 docs/s
> >> > 2009-01-25 17:13:11.114 INFO IPC Server handler 1 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker - Task
> >> > attempt_200901251629_0011_m_000005_0 is done.
> >> > 2009-01-25 17:13:11.116 INFO main org.apache.hadoop.mapred.TaskRunner
> -
> >> > Task
> >> > 'attempt_200901251629_0011_m_000005_0' done.
> >> > 2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0011_m_000005_0
> >> > 1.0% 5364 docs, 0 errors, 124.7 docs/s
> >> > 2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0011_m_000005_0
> >> > Ignoring status-update since task is 'done'
> >> > 2009-01-25 17:13:24.996 INFO taskCleanup
> >> > org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for
> job:
> >> > job_200901251629_0011
> >> > 2009-01-25 17:13:24.996 INFO taskCleanup
> >> > org.apache.hadoop.mapred.TaskRunner
> >> > - attempt_200901251629_0011_m_000005_0 done; removing files.
> >> > 2009-01-25 17:47:22.668 WARN Thread-23
> >> org.apache.hadoop.mapred.TaskRunner
> >> > -
> >> > attempt_200901251629_0001_m_000006_0 Child Error
> >> > java.io.IOException: Task process exit with nonzero status of 143.
> >> >        at
> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> >> > 2009-01-25 17:47:22.669 WARN Thread-23
> >> org.apache.hadoop.mapred.TaskTracker
> >> > - Error from unknown child task: attempt_200901251629_0001_m_000006_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:22.671 WARN Thread-23
> >> org.apache.hadoop.mapred.TaskTracker
> >> > - Unknown child task finshed: attempt_200901251629_0001_m_000006_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:22.713 WARN Thread-79
> >> org.apache.hadoop.mapred.TaskRunner
> >> > -
> >> > attempt_200901251629_0002_m_000007_0 Child Error
> >> > java.io.IOException: Task process exit with nonzero status of 143.
> >> >        at
> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> >> > 2009-01-25 17:47:22.713 WARN Thread-159
> >> org.apache.hadoop.mapred.TaskRunner
> >> > - attempt_200901251629_0011_m_000005_0 Child Error
> >> > java.io.IOException: Task process exit with nonzero status of 143.
> >> >        at
> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> >> > 2009-01-25 17:47:22.713 WARN Thread-79
> >> org.apache.hadoop.mapred.TaskTracker
> >> > - Error from unknown child task: attempt_200901251629_0002_m_000007_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:22.714 WARN Thread-159
> >> > org.apache.hadoop.mapred.TaskTracker
> >> > - Error from unknown child task: attempt_200901251629_0011_m_000005_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:22.714 WARN Thread-79
> >> org.apache.hadoop.mapred.TaskTracker
> >> > - Unknown child task finshed: attempt_200901251629_0002_m_000007_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:22.715 WARN Thread-159
> >> > org.apache.hadoop.mapred.TaskTracker
> >> > - Unknown child task finshed: attempt_200901251629_0011_m_000005_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:22.811 WARN Thread-119
> >> org.apache.hadoop.mapred.TaskRunner
> >> > - attempt_200901251629_0006_m_000007_0 Child Error
> >> > java.io.IOException: Task process exit with nonzero status of 143.
> >> >        at
> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> >> > 2009-01-25 17:47:22.811 WARN Thread-119
> >> > org.apache.hadoop.mapred.TaskTracker
> >> > - Error from unknown child task: attempt_200901251629_0006_m_000007_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:22.812 WARN Thread-119
> >> > org.apache.hadoop.mapred.TaskTracker
> >> > - Unknown child task finshed: attempt_200901251629_0006_m_000007_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:22.960 WARN Thread-89
> >> org.apache.hadoop.mapred.TaskRunner
> >> > -
> >> > attempt_200901251629_0003_m_000007_0 Child Error
> >> > java.io.IOException: Task process exit with nonzero status of 143.
> >> >        at
> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> >> > 2009-01-25 17:47:22.961 WARN Thread-89
> >> org.apache.hadoop.mapred.TaskTracker
> >> > - Error from unknown child task: attempt_200901251629_0003_m_000007_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:22.962 WARN Thread-89
> >> org.apache.hadoop.mapred.TaskTracker
> >> > - Unknown child task finshed: attempt_200901251629_0003_m_000007_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:23.060 WARN Thread-129
> >> org.apache.hadoop.mapred.TaskRunner
> >> > - attempt_200901251629_0008_m_000003_0 Child Error
> >> > java.io.IOException: Task process exit with nonzero status of 143.
> >> >        at
> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> >> > 2009-01-25 17:47:23.061 WARN Thread-129
> >> > org.apache.hadoop.mapred.TaskTracker
> >> > - Error from unknown child task: attempt_200901251629_0008_m_000003_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:23.062 WARN Thread-129
> >> > org.apache.hadoop.mapred.TaskTracker
> >> > - Unknown child task finshed: attempt_200901251629_0008_m_000003_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:23.279 WARN Thread-99
> >> org.apache.hadoop.mapred.TaskRunner
> >> > -
> >> > attempt_200901251629_0004_m_000007_0 Child Error
> >> > java.io.IOException: Task process exit with nonzero status of 143.
> >> >        at
> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> >> > 2009-01-25 17:47:23.279 WARN Thread-99
> >> org.apache.hadoop.mapred.TaskTracker
> >> > - Error from unknown child task: attempt_200901251629_0004_m_000007_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:23.280 WARN Thread-99
> >> org.apache.hadoop.mapred.TaskTracker
> >> > - Unknown child task finshed: attempt_200901251629_0004_m_000007_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:23.360 WARN Thread-139
> >> org.apache.hadoop.mapred.TaskRunner
> >> > - attempt_200901251629_0009_m_000004_0 Child Error
> >> > java.io.IOException: Task process exit with nonzero status of 143.
> >> >        at
> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> >> > 2009-01-25 17:47:23.360 WARN Thread-139
> >> > org.apache.hadoop.mapred.TaskTracker
> >> > - Error from unknown child task: attempt_200901251629_0009_m_000004_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:23.361 WARN Thread-139
> >> > org.apache.hadoop.mapred.TaskTracker
> >> > - Unknown child task finshed: attempt_200901251629_0009_m_000004_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:23.445 WARN Thread-149
> >> org.apache.hadoop.mapred.TaskRunner
> >> > - attempt_200901251629_0010_m_000004_0 Child Error
> >> > java.io.IOException: Task process exit with nonzero status of 143.
> >> >        at
> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> >> > 2009-01-25 17:47:23.445 WARN Thread-149
> >> > org.apache.hadoop.mapred.TaskTracker
> >> > - Error from unknown child task: attempt_200901251629_0010_m_000004_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:23.446 WARN Thread-149
> >> > org.apache.hadoop.mapred.TaskTracker
> >> > - Unknown child task finshed: attempt_200901251629_0010_m_000004_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:23.463 WARN Thread-109
> >> org.apache.hadoop.mapred.TaskRunner
> >> > - attempt_200901251629_0005_m_000007_0 Child Error
> >> > java.io.IOException: Task process exit with nonzero status of 143.
> >> >        at
> >> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> >> > 2009-01-25 17:47:23.464 WARN Thread-109
> >> > org.apache.hadoop.mapred.TaskTracker
> >> > - Error from unknown child task: attempt_200901251629_0005_m_000007_0.
> >> > Ignored.
> >> > 2009-01-25 17:47:23.465 WARN Thread-109
> >> > org.apache.hadoop.mapred.TaskTracker
> >> > - Unknown child task finshed: attempt_200901251629_0005_m_000007_0.
> >> > Ignored.
> >> > 2009-01-25 17:48:15.315 INFO main org.apache.hadoop.mapred.TaskTracker
> -
> >> > LaunchTaskAction: attempt_200901251629_0012_m_000004_0
> >> > 2009-01-25 17:48:28.088 INFO main org.apache.hadoop.mapred.MapTask -
> >> > numReduceTasks: 0
> >> > 2009-01-25 17:48:34.097 INFO IPC Server handler 0 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.07784176% 347 docs, 0 errors, 69.4 docs/s
> >> > 2009-01-25 17:48:37.099 INFO IPC Server handler 1 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.13955688% 673 docs, 0 errors, 84.1 docs/s
> >> > 2009-01-25 17:48:40.100 INFO IPC Server handler 0 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.20145722% 1003 docs, 0 errors, 91.2 docs/s
> >> > 2009-01-25 17:48:43.102 INFO IPC Server handler 1 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.26416454% 1343 docs, 0 errors, 95.9 docs/s
> >> > 2009-01-25 17:48:46.104 INFO IPC Server handler 0 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.32754657% 1673 docs, 0 errors, 98.4 docs/s
> >> > 2009-01-25 17:48:49.105 INFO IPC Server handler 1 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.39229122% 2025 docs, 0 errors, 101.3 docs/s
> >> > 2009-01-25 17:48:52.106 INFO IPC Server handler 0 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.44708738% 2315 docs, 0 errors, 100.7 docs/s
> >> > 2009-01-25 17:48:55.108 INFO IPC Server handler 1 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.5117659% 2658 docs, 0 errors, 102.2 docs/s
> >> > 2009-01-25 17:48:58.109 INFO IPC Server handler 0 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.5795401% 3019 docs, 0 errors, 104.1 docs/s
> >> > 2009-01-25 17:49:01.111 INFO IPC Server handler 1 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.6468777% 3377 docs, 0 errors, 105.5 docs/s
> >> > 2009-01-25 17:49:04.112 INFO IPC Server handler 0 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.71514136% 3741 docs, 0 errors, 106.9 docs/s
> >> > 2009-01-25 17:49:07.114 INFO IPC Server handler 1 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.78037554% 4094 docs, 0 errors, 107.7 docs/s
> >> > 2009-01-25 17:49:10.115 INFO IPC Server handler 0 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.84660184% 4448 docs, 0 errors, 108.5 docs/s
> >> > 2009-01-25 17:49:13.117 INFO IPC Server handler 1 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.9146803% 4810 docs, 0 errors, 109.3 docs/s
> >> > 2009-01-25 17:49:16.118 INFO IPC Server handler 0 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 0.9521195% 5044 docs, 0 errors, 107.3 docs/s
> >> > 2009-01-25 17:49:18.626 INFO IPC Server handler 1 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 1.0% 5364 docs, 0 errors, 107.3 docs/s
> >> > 2009-01-25 17:49:18.631 INFO IPC Server handler 0 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker - Task
> >> > attempt_200901251629_0012_m_000004_0 is done.
> >> > 2009-01-25 17:49:18.631 INFO main org.apache.hadoop.mapred.TaskRunner
> -
> >> > Task
> >> > 'attempt_200901251629_0012_m_000004_0' done.
> >> > 2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > 1.0% 5364 docs, 0 errors, 107.3 docs/s
> >> > 2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
> >> > org.apache.hadoop.mapred.TaskTracker -
> >> attempt_200901251629_0012_m_000004_0
> >> > Ignoring status-update since task is 'done'
> >> > 2009-01-25 17:49:35.582 INFO taskCleanup
> >> > org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for
> job:
> >> > job_200901251629_0012
> >> > 2009-01-25 17:49:35.582 INFO taskCleanup
> >> > org.apache.hadoop.mapred.TaskRunner
> >> > - attempt_200901251629_0012_m_000004_0 done; removing files.
> >> >
> >> > # Still processes left even though the TaskTracker said: Received
> >> > 'KillJobAction' for job: job_200901251629_0012
> >> > root@mapreduce2:~# jps
> >> > 10732 Jps
> >> > 10634 TaskTracker$Child
> >> > 8660 DataNode
> >> > 8824 TaskTracker
> >> > 8730 SecondaryNameNode
> >> > 25060 Bootstrap
> >> > root@mapreduce2:~# date
> >> > Sun Jan 25 17:51:48 CET 2009
> >> > root@mapreduce2:~#
> >> >
> >> > On Sun, Jan 25, 2009 at 5:42 PM, Marcus Herou <
> >> marcus.herou@tailsweep.com
> >> > >wrote:
> >> >
> >> > > Hi.
> >> > >
> >> > > Today I noticed when I ran a Solr Indexing job through our Hadoop
> >> cluster
> >> > > that the master MySQL database where screaming about "Too Many
> >> > Connections".
> >> > >
> >> > > I wondered how that could happen so I logged into my Hadoop machines
> >> and
> >> > > searched through the logs. Nothing strange there. Then I just did a
> >> jps:
> >> > >
> >> > > root@mapreduce1:~# jps
> >> > > 10701 TaskTracker$Child
> >> > > 9567 NameNode
> >> > > 5435 TaskTracker$Child
> >> > > 31801 Bootstrap
> >> > > 7349 TaskTracker$Child
> >> > > 6197 TaskTracker$Child
> >> > > 7761 TaskTracker$Child
> >> > > 10453 TaskTracker$Child
> >> > > 11232 TaskTracker$Child
> >> > > 11113 TaskTracker$Child
> >> > > 9688 DataNode
> >> > > 10877 TaskTracker$Child
> >> > > 6504 TaskTracker$Child
> >> > > 10236 TaskTracker$Child
> >> > > 9852 TaskTracker
> >> > > 6515 TaskTracker$Child
> >> > > 11396 TaskTracker$Child
> >> > > 11741 Jps
> >> > > 6191 TaskTracker$Child
> >> > > 10981 TaskTracker$Child
> >> > > 7742 TaskTracker$Child
> >> > > 5946 TaskTracker$Child
> >> > > 11315 TaskTracker$Child
> >> > > 8112 TaskTracker$Child
> >> > > 11580 TaskTracker$Child
> >> > > 11490 TaskTracker$Child
> >> > > 5687 TaskTracker$Child
> >> > > 5927 TaskTracker$Child
> >> > > 27144 WrapperSimpleApp
> >> > > 7368 TaskTracker$Child
> >> > >
> >> > > Damn! Each Child have it's own DataSource (dbcp pool) tweaked down
> so
> >> it
> >> > > only can have one active connection to any shard at any time.
> >> > > Background: I ran out of connections during the Christmas holidays
> >> since
> >> > I
> >> > > have 60 shards (10 per MySQL machine) and each required a DB-Pool
> >> which
> >> > > allowed too many active+idle connections.
> >> > >
> >> > > Anyway I have no active jobs at the moment so the children should
> have
> >> > died
> >> > > by themselves.
> >> > > Fortunately I have a little nice script which kills the bastards:
> jps
> >> > > |egrep "TaskTracker.+" | awk '{print $1}'|xargs kill
> >> > > I will probably put that in a cronjob which kills long running
> >> > children...
> >> > >
> >> > > Anyway, how can this happen ? Am I doing something really stupid
> along
> >> > the
> >> > > way ?
> >> > > Hard facts:
> >> > > Ubuntu Hardy-Heron, 2.6.24-19-server
> >> > > java version "1.6.0_06"
> >> > > Hadoop-0.18.2
> >> > > It's my own classes which fires the jobs through JobClient
> >> > > (JobClient.runJob(job))
> >> > > I feed the jar to hadoop by issuing: job.setJar(jarFile); (comes
> from
> >> a
> >> > > bash script)
> >> > > I feed deps into hadoop by issuing: job.set("tmpjars", jarFiles);
> >> (comes
> >> > by
> >> > > parsing external CLASSPATH ENV in bash)
> >> > >
> >> > > The client do not complain, se example output below (I write no data
> >> to
> >> > > HDFS ((HDFS bytes written=774)), since I mostly use it for crawling
> >> and
> >> > all
> >> > > crawlers/indexers access my sharding db structure directly without
> >> > > intermediate storage):
> >> > > 2009-01-25 17:12:11.175 INFO main
> >> > org.apache.hadoop.mapred.FileInputFormat
> >> > > - Total input paths to process : 1
> >> > > 2009-01-25 17:12:11.176 INFO main
> >> > org.apache.hadoop.mapred.FileInputFormat
> >> > > - Total input paths to process : 1
> >> > > 2009-01-25 17:12:11.437 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> > > Running job: job_200901251629_0011
> >> > > 2009-01-25 17:12:12.439 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> >  map
> >> > > 0% reduce 0%
> >> > > 2009-01-25 17:12:35.481 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> >  map
> >> > > 6% reduce 0%
> >> > > 2009-01-25 17:12:40.493 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> >  map
> >> > > 21% reduce 0%
> >> > > 2009-01-25 17:12:45.502 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> >  map
> >> > > 31% reduce 0%
> >> > > 2009-01-25 17:12:50.511 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> >  map
> >> > > 51% reduce 0%
> >> > > 2009-01-25 17:12:55.520 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> >  map
> >> > > 67% reduce 0%
> >> > > 2009-01-25 17:13:00.533 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> >  map
> >> > > 72% reduce 0%
> >> > > 2009-01-25 17:13:05.543 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> >  map
> >> > > 84% reduce 0%
> >> > > 2009-01-25 17:13:10.552 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> >  map
> >> > > 95% reduce 0%
> >> > > 2009-01-25 17:13:15.560 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> >  map
> >> > > 98% reduce 0%
> >> > > 2009-01-25 17:13:20.568 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> > Job
> >> > > complete: job_200901251629_0011
> >> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> > > Counters: 7
> >> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> > > File Systems
> >> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> > > HDFS bytes read=2741143
> >> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> > > HDFS bytes written=774
> >> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> > > Job Counters
> >> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> > > Rack-local map tasks=9
> >> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> > > Launched map tasks=9
> >> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> > > Map-Reduce Framework
> >> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> > > Map input records=48314
> >> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> > > Map input bytes=2732424
> >> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient
> -
> >> > > Map output records=0
> >> > >
> >> > > Any suggestions or pointers would be greatly appreciated. Hmm Coming
> >> to
> >> > > think about something. I start X threads from inside Hadoop almost
> >> cut'n
> >> > > pasted from Nutch.
> >> > > If a thread somehow would linger, would Hadoop not be able to
> shutdown
> >> > even
> >> > > though there is nothing more to read from the RecordReader ?
> >> > >
> >> > > Kindly
> >> > >
> >> > > //Marcus
> >> > >
> >> > > --
> >> > > Marcus Herou CTO and co-founder Tailsweep AB
> >> > > +46702561312
> >> > > marcus.herou@tailsweep.com
> >> > > http://www.tailsweep.com/
> >> > > http://blogg.tailsweep.com/
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Marcus Herou CTO and co-founder Tailsweep AB
> >> > +46702561312
> >> > marcus.herou@tailsweep.com
> >> > http://www.tailsweep.com/
> >> > http://blogg.tailsweep.com/
> >> >
> >>
> >
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
> >
>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>

Re: Lingering TaskTracker$Child

Posted by Marcus Herou <ma...@tailsweep.com>.
Oh, fuck you mean modding the hadoop TaskTracker child code ?

On Sun, Jan 25, 2009 at 6:04 PM, Marcus Herou <ma...@tailsweep.com>wrote:

> Thanks!
>
> So by you experience would this be good enough ? (Notice the
> System.exit(0))
>
> I implement the MapRunnable interface.
>
>  public void run(RecordReader<LongWritable, Text> recordReader,
> OutputCollector<WritableComparable, WritableComparable> outputCollector,
> Reporter reporter) throws IOException
>     {
>         this.recordReader = recordReader;
>         this.outputCollector = outputCollector;
>         this.reporter = reporter;
>         int threads =
> Integer.valueOf(this.getConf().get(getClass().getName()+".threads", "10"));
>         log.info("Starting with "+threads + " threads");
>         long timeout =
> Long.valueOf(this.getConf().get(getClass().getName()+".timeout", "600000"));
>
>         for (int i = 0; i < threads; i++)
>         {
>             // spawn threads
>             new FetcherThread().start();
>         }
>         do
>         {                                          // wait for threads to
> exit
>             try
>             {
>                 Thread.sleep(1000);
>             } catch (InterruptedException e) {}
>
>             reportStatus();
>
>             // some requests seem to hang, despite all intentions
>             synchronized (this)
>             {
>                 if ((System.currentTimeMillis() - lastRequestStart) >
> timeout)
>                 {
>                     if (log.isWarnEnabled())
>                     {
>                         log.warn("Aborting with "+activeThreads+" hung
> threads.");
>                     }
>                     return;
>                 }
>             }
>
>         } while (activeThreads > 0);
>         log.info("All threads seem to be done, exiting");
>         System.exit(0);
>
>     }
>
> On Sun, Jan 25, 2009 at 5:57 PM, jason hadoop <ja...@gmail.com>wrote:
>
>> We had trouble like that with some jobs, when the child ran additional
>> threads that were not set at daemon priority. These hold the Child JVM
>> from
>> exiting.
>> JMX was the cause in our case, but we have seen our JNI jobs do it also.
>> In the end we made a local mod that forced a System.exit in the finally
>> block of the Child main.
>>
>>
>> On Sun, Jan 25, 2009 at 8:53 AM, Marcus Herou <marcus.herou@tailsweep.com
>> >wrote:
>>
>> > Some extra info, apparently the child exits with a status of 143.
>> >
>> > 2009-01-25 17:13:11.110 INFO IPC Server handler 0 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0011_m_000005_0
>> > 1.0% 5364 docs, 0 errors, 124.7 docs/s
>> > 2009-01-25 17:13:11.114 INFO IPC Server handler 1 on 32274
>> > org.apache.hadoop.mapred.TaskTracker - Task
>> > attempt_200901251629_0011_m_000005_0 is done.
>> > 2009-01-25 17:13:11.116 INFO main org.apache.hadoop.mapred.TaskRunner -
>> > Task
>> > 'attempt_200901251629_0011_m_000005_0' done.
>> > 2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0011_m_000005_0
>> > 1.0% 5364 docs, 0 errors, 124.7 docs/s
>> > 2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0011_m_000005_0
>> > Ignoring status-update since task is 'done'
>> > 2009-01-25 17:13:24.996 INFO taskCleanup
>> > org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for job:
>> > job_200901251629_0011
>> > 2009-01-25 17:13:24.996 INFO taskCleanup
>> > org.apache.hadoop.mapred.TaskRunner
>> > - attempt_200901251629_0011_m_000005_0 done; removing files.
>> > 2009-01-25 17:47:22.668 WARN Thread-23
>> org.apache.hadoop.mapred.TaskRunner
>> > -
>> > attempt_200901251629_0001_m_000006_0 Child Error
>> > java.io.IOException: Task process exit with nonzero status of 143.
>> >        at
>> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> > 2009-01-25 17:47:22.669 WARN Thread-23
>> org.apache.hadoop.mapred.TaskTracker
>> > - Error from unknown child task: attempt_200901251629_0001_m_000006_0.
>> > Ignored.
>> > 2009-01-25 17:47:22.671 WARN Thread-23
>> org.apache.hadoop.mapred.TaskTracker
>> > - Unknown child task finshed: attempt_200901251629_0001_m_000006_0.
>> > Ignored.
>> > 2009-01-25 17:47:22.713 WARN Thread-79
>> org.apache.hadoop.mapred.TaskRunner
>> > -
>> > attempt_200901251629_0002_m_000007_0 Child Error
>> > java.io.IOException: Task process exit with nonzero status of 143.
>> >        at
>> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> > 2009-01-25 17:47:22.713 WARN Thread-159
>> org.apache.hadoop.mapred.TaskRunner
>> > - attempt_200901251629_0011_m_000005_0 Child Error
>> > java.io.IOException: Task process exit with nonzero status of 143.
>> >        at
>> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> > 2009-01-25 17:47:22.713 WARN Thread-79
>> org.apache.hadoop.mapred.TaskTracker
>> > - Error from unknown child task: attempt_200901251629_0002_m_000007_0.
>> > Ignored.
>> > 2009-01-25 17:47:22.714 WARN Thread-159
>> > org.apache.hadoop.mapred.TaskTracker
>> > - Error from unknown child task: attempt_200901251629_0011_m_000005_0.
>> > Ignored.
>> > 2009-01-25 17:47:22.714 WARN Thread-79
>> org.apache.hadoop.mapred.TaskTracker
>> > - Unknown child task finshed: attempt_200901251629_0002_m_000007_0.
>> > Ignored.
>> > 2009-01-25 17:47:22.715 WARN Thread-159
>> > org.apache.hadoop.mapred.TaskTracker
>> > - Unknown child task finshed: attempt_200901251629_0011_m_000005_0.
>> > Ignored.
>> > 2009-01-25 17:47:22.811 WARN Thread-119
>> org.apache.hadoop.mapred.TaskRunner
>> > - attempt_200901251629_0006_m_000007_0 Child Error
>> > java.io.IOException: Task process exit with nonzero status of 143.
>> >        at
>> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> > 2009-01-25 17:47:22.811 WARN Thread-119
>> > org.apache.hadoop.mapred.TaskTracker
>> > - Error from unknown child task: attempt_200901251629_0006_m_000007_0.
>> > Ignored.
>> > 2009-01-25 17:47:22.812 WARN Thread-119
>> > org.apache.hadoop.mapred.TaskTracker
>> > - Unknown child task finshed: attempt_200901251629_0006_m_000007_0.
>> > Ignored.
>> > 2009-01-25 17:47:22.960 WARN Thread-89
>> org.apache.hadoop.mapred.TaskRunner
>> > -
>> > attempt_200901251629_0003_m_000007_0 Child Error
>> > java.io.IOException: Task process exit with nonzero status of 143.
>> >        at
>> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> > 2009-01-25 17:47:22.961 WARN Thread-89
>> org.apache.hadoop.mapred.TaskTracker
>> > - Error from unknown child task: attempt_200901251629_0003_m_000007_0.
>> > Ignored.
>> > 2009-01-25 17:47:22.962 WARN Thread-89
>> org.apache.hadoop.mapred.TaskTracker
>> > - Unknown child task finshed: attempt_200901251629_0003_m_000007_0.
>> > Ignored.
>> > 2009-01-25 17:47:23.060 WARN Thread-129
>> org.apache.hadoop.mapred.TaskRunner
>> > - attempt_200901251629_0008_m_000003_0 Child Error
>> > java.io.IOException: Task process exit with nonzero status of 143.
>> >        at
>> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> > 2009-01-25 17:47:23.061 WARN Thread-129
>> > org.apache.hadoop.mapred.TaskTracker
>> > - Error from unknown child task: attempt_200901251629_0008_m_000003_0.
>> > Ignored.
>> > 2009-01-25 17:47:23.062 WARN Thread-129
>> > org.apache.hadoop.mapred.TaskTracker
>> > - Unknown child task finshed: attempt_200901251629_0008_m_000003_0.
>> > Ignored.
>> > 2009-01-25 17:47:23.279 WARN Thread-99
>> org.apache.hadoop.mapred.TaskRunner
>> > -
>> > attempt_200901251629_0004_m_000007_0 Child Error
>> > java.io.IOException: Task process exit with nonzero status of 143.
>> >        at
>> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> > 2009-01-25 17:47:23.279 WARN Thread-99
>> org.apache.hadoop.mapred.TaskTracker
>> > - Error from unknown child task: attempt_200901251629_0004_m_000007_0.
>> > Ignored.
>> > 2009-01-25 17:47:23.280 WARN Thread-99
>> org.apache.hadoop.mapred.TaskTracker
>> > - Unknown child task finshed: attempt_200901251629_0004_m_000007_0.
>> > Ignored.
>> > 2009-01-25 17:47:23.360 WARN Thread-139
>> org.apache.hadoop.mapred.TaskRunner
>> > - attempt_200901251629_0009_m_000004_0 Child Error
>> > java.io.IOException: Task process exit with nonzero status of 143.
>> >        at
>> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> > 2009-01-25 17:47:23.360 WARN Thread-139
>> > org.apache.hadoop.mapred.TaskTracker
>> > - Error from unknown child task: attempt_200901251629_0009_m_000004_0.
>> > Ignored.
>> > 2009-01-25 17:47:23.361 WARN Thread-139
>> > org.apache.hadoop.mapred.TaskTracker
>> > - Unknown child task finshed: attempt_200901251629_0009_m_000004_0.
>> > Ignored.
>> > 2009-01-25 17:47:23.445 WARN Thread-149
>> org.apache.hadoop.mapred.TaskRunner
>> > - attempt_200901251629_0010_m_000004_0 Child Error
>> > java.io.IOException: Task process exit with nonzero status of 143.
>> >        at
>> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> > 2009-01-25 17:47:23.445 WARN Thread-149
>> > org.apache.hadoop.mapred.TaskTracker
>> > - Error from unknown child task: attempt_200901251629_0010_m_000004_0.
>> > Ignored.
>> > 2009-01-25 17:47:23.446 WARN Thread-149
>> > org.apache.hadoop.mapred.TaskTracker
>> > - Unknown child task finshed: attempt_200901251629_0010_m_000004_0.
>> > Ignored.
>> > 2009-01-25 17:47:23.463 WARN Thread-109
>> org.apache.hadoop.mapred.TaskRunner
>> > - attempt_200901251629_0005_m_000007_0 Child Error
>> > java.io.IOException: Task process exit with nonzero status of 143.
>> >        at
>> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
>> > 2009-01-25 17:47:23.464 WARN Thread-109
>> > org.apache.hadoop.mapred.TaskTracker
>> > - Error from unknown child task: attempt_200901251629_0005_m_000007_0.
>> > Ignored.
>> > 2009-01-25 17:47:23.465 WARN Thread-109
>> > org.apache.hadoop.mapred.TaskTracker
>> > - Unknown child task finshed: attempt_200901251629_0005_m_000007_0.
>> > Ignored.
>> > 2009-01-25 17:48:15.315 INFO main org.apache.hadoop.mapred.TaskTracker -
>> > LaunchTaskAction: attempt_200901251629_0012_m_000004_0
>> > 2009-01-25 17:48:28.088 INFO main org.apache.hadoop.mapred.MapTask -
>> > numReduceTasks: 0
>> > 2009-01-25 17:48:34.097 INFO IPC Server handler 0 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.07784176% 347 docs, 0 errors, 69.4 docs/s
>> > 2009-01-25 17:48:37.099 INFO IPC Server handler 1 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.13955688% 673 docs, 0 errors, 84.1 docs/s
>> > 2009-01-25 17:48:40.100 INFO IPC Server handler 0 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.20145722% 1003 docs, 0 errors, 91.2 docs/s
>> > 2009-01-25 17:48:43.102 INFO IPC Server handler 1 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.26416454% 1343 docs, 0 errors, 95.9 docs/s
>> > 2009-01-25 17:48:46.104 INFO IPC Server handler 0 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.32754657% 1673 docs, 0 errors, 98.4 docs/s
>> > 2009-01-25 17:48:49.105 INFO IPC Server handler 1 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.39229122% 2025 docs, 0 errors, 101.3 docs/s
>> > 2009-01-25 17:48:52.106 INFO IPC Server handler 0 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.44708738% 2315 docs, 0 errors, 100.7 docs/s
>> > 2009-01-25 17:48:55.108 INFO IPC Server handler 1 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.5117659% 2658 docs, 0 errors, 102.2 docs/s
>> > 2009-01-25 17:48:58.109 INFO IPC Server handler 0 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.5795401% 3019 docs, 0 errors, 104.1 docs/s
>> > 2009-01-25 17:49:01.111 INFO IPC Server handler 1 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.6468777% 3377 docs, 0 errors, 105.5 docs/s
>> > 2009-01-25 17:49:04.112 INFO IPC Server handler 0 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.71514136% 3741 docs, 0 errors, 106.9 docs/s
>> > 2009-01-25 17:49:07.114 INFO IPC Server handler 1 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.78037554% 4094 docs, 0 errors, 107.7 docs/s
>> > 2009-01-25 17:49:10.115 INFO IPC Server handler 0 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.84660184% 4448 docs, 0 errors, 108.5 docs/s
>> > 2009-01-25 17:49:13.117 INFO IPC Server handler 1 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.9146803% 4810 docs, 0 errors, 109.3 docs/s
>> > 2009-01-25 17:49:16.118 INFO IPC Server handler 0 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 0.9521195% 5044 docs, 0 errors, 107.3 docs/s
>> > 2009-01-25 17:49:18.626 INFO IPC Server handler 1 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 1.0% 5364 docs, 0 errors, 107.3 docs/s
>> > 2009-01-25 17:49:18.631 INFO IPC Server handler 0 on 32274
>> > org.apache.hadoop.mapred.TaskTracker - Task
>> > attempt_200901251629_0012_m_000004_0 is done.
>> > 2009-01-25 17:49:18.631 INFO main org.apache.hadoop.mapred.TaskRunner -
>> > Task
>> > 'attempt_200901251629_0012_m_000004_0' done.
>> > 2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > 1.0% 5364 docs, 0 errors, 107.3 docs/s
>> > 2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
>> > org.apache.hadoop.mapred.TaskTracker -
>> attempt_200901251629_0012_m_000004_0
>> > Ignoring status-update since task is 'done'
>> > 2009-01-25 17:49:35.582 INFO taskCleanup
>> > org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for job:
>> > job_200901251629_0012
>> > 2009-01-25 17:49:35.582 INFO taskCleanup
>> > org.apache.hadoop.mapred.TaskRunner
>> > - attempt_200901251629_0012_m_000004_0 done; removing files.
>> >
>> > # Still processes left even though the TaskTracker said: Received
>> > 'KillJobAction' for job: job_200901251629_0012
>> > root@mapreduce2:~# jps
>> > 10732 Jps
>> > 10634 TaskTracker$Child
>> > 8660 DataNode
>> > 8824 TaskTracker
>> > 8730 SecondaryNameNode
>> > 25060 Bootstrap
>> > root@mapreduce2:~# date
>> > Sun Jan 25 17:51:48 CET 2009
>> > root@mapreduce2:~#
>> >
>> > On Sun, Jan 25, 2009 at 5:42 PM, Marcus Herou <
>> marcus.herou@tailsweep.com
>> > >wrote:
>> >
>> > > Hi.
>> > >
>> > > Today I noticed when I ran a Solr Indexing job through our Hadoop
>> cluster
>> > > that the master MySQL database where screaming about "Too Many
>> > Connections".
>> > >
>> > > I wondered how that could happen so I logged into my Hadoop machines
>> and
>> > > searched through the logs. Nothing strange there. Then I just did a
>> jps:
>> > >
>> > > root@mapreduce1:~# jps
>> > > 10701 TaskTracker$Child
>> > > 9567 NameNode
>> > > 5435 TaskTracker$Child
>> > > 31801 Bootstrap
>> > > 7349 TaskTracker$Child
>> > > 6197 TaskTracker$Child
>> > > 7761 TaskTracker$Child
>> > > 10453 TaskTracker$Child
>> > > 11232 TaskTracker$Child
>> > > 11113 TaskTracker$Child
>> > > 9688 DataNode
>> > > 10877 TaskTracker$Child
>> > > 6504 TaskTracker$Child
>> > > 10236 TaskTracker$Child
>> > > 9852 TaskTracker
>> > > 6515 TaskTracker$Child
>> > > 11396 TaskTracker$Child
>> > > 11741 Jps
>> > > 6191 TaskTracker$Child
>> > > 10981 TaskTracker$Child
>> > > 7742 TaskTracker$Child
>> > > 5946 TaskTracker$Child
>> > > 11315 TaskTracker$Child
>> > > 8112 TaskTracker$Child
>> > > 11580 TaskTracker$Child
>> > > 11490 TaskTracker$Child
>> > > 5687 TaskTracker$Child
>> > > 5927 TaskTracker$Child
>> > > 27144 WrapperSimpleApp
>> > > 7368 TaskTracker$Child
>> > >
>> > > Damn! Each Child have it's own DataSource (dbcp pool) tweaked down so
>> it
>> > > only can have one active connection to any shard at any time.
>> > > Background: I ran out of connections during the Christmas holidays
>> since
>> > I
>> > > have 60 shards (10 per MySQL machine) and each required a DB-Pool
>> which
>> > > allowed too many active+idle connections.
>> > >
>> > > Anyway I have no active jobs at the moment so the children should have
>> > died
>> > > by themselves.
>> > > Fortunately I have a little nice script which kills the bastards: jps
>> > > |egrep "TaskTracker.+" | awk '{print $1}'|xargs kill
>> > > I will probably put that in a cronjob which kills long running
>> > children...
>> > >
>> > > Anyway, how can this happen ? Am I doing something really stupid along
>> > the
>> > > way ?
>> > > Hard facts:
>> > > Ubuntu Hardy-Heron, 2.6.24-19-server
>> > > java version "1.6.0_06"
>> > > Hadoop-0.18.2
>> > > It's my own classes which fires the jobs through JobClient
>> > > (JobClient.runJob(job))
>> > > I feed the jar to hadoop by issuing: job.setJar(jarFile); (comes from
>> a
>> > > bash script)
>> > > I feed deps into hadoop by issuing: job.set("tmpjars", jarFiles);
>> (comes
>> > by
>> > > parsing external CLASSPATH ENV in bash)
>> > >
>> > > The client do not complain, se example output below (I write no data
>> to
>> > > HDFS ((HDFS bytes written=774)), since I mostly use it for crawling
>> and
>> > all
>> > > crawlers/indexers access my sharding db structure directly without
>> > > intermediate storage):
>> > > 2009-01-25 17:12:11.175 INFO main
>> > org.apache.hadoop.mapred.FileInputFormat
>> > > - Total input paths to process : 1
>> > > 2009-01-25 17:12:11.176 INFO main
>> > org.apache.hadoop.mapred.FileInputFormat
>> > > - Total input paths to process : 1
>> > > 2009-01-25 17:12:11.437 INFO main org.apache.hadoop.mapred.JobClient -
>> > > Running job: job_200901251629_0011
>> > > 2009-01-25 17:12:12.439 INFO main org.apache.hadoop.mapred.JobClient -
>> >  map
>> > > 0% reduce 0%
>> > > 2009-01-25 17:12:35.481 INFO main org.apache.hadoop.mapred.JobClient -
>> >  map
>> > > 6% reduce 0%
>> > > 2009-01-25 17:12:40.493 INFO main org.apache.hadoop.mapred.JobClient -
>> >  map
>> > > 21% reduce 0%
>> > > 2009-01-25 17:12:45.502 INFO main org.apache.hadoop.mapred.JobClient -
>> >  map
>> > > 31% reduce 0%
>> > > 2009-01-25 17:12:50.511 INFO main org.apache.hadoop.mapred.JobClient -
>> >  map
>> > > 51% reduce 0%
>> > > 2009-01-25 17:12:55.520 INFO main org.apache.hadoop.mapred.JobClient -
>> >  map
>> > > 67% reduce 0%
>> > > 2009-01-25 17:13:00.533 INFO main org.apache.hadoop.mapred.JobClient -
>> >  map
>> > > 72% reduce 0%
>> > > 2009-01-25 17:13:05.543 INFO main org.apache.hadoop.mapred.JobClient -
>> >  map
>> > > 84% reduce 0%
>> > > 2009-01-25 17:13:10.552 INFO main org.apache.hadoop.mapred.JobClient -
>> >  map
>> > > 95% reduce 0%
>> > > 2009-01-25 17:13:15.560 INFO main org.apache.hadoop.mapred.JobClient -
>> >  map
>> > > 98% reduce 0%
>> > > 2009-01-25 17:13:20.568 INFO main org.apache.hadoop.mapred.JobClient -
>> > Job
>> > > complete: job_200901251629_0011
>> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
>> > > Counters: 7
>> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
>> > > File Systems
>> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
>> > > HDFS bytes read=2741143
>> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
>> > > HDFS bytes written=774
>> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
>> > > Job Counters
>> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
>> > > Rack-local map tasks=9
>> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
>> > > Launched map tasks=9
>> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
>> > > Map-Reduce Framework
>> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
>> > > Map input records=48314
>> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
>> > > Map input bytes=2732424
>> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
>> > > Map output records=0
>> > >
>> > > Any suggestions or pointers would be greatly appreciated. Hmm Coming
>> to
>> > > think about something. I start X threads from inside Hadoop almost
>> cut'n
>> > > pasted from Nutch.
>> > > If a thread somehow would linger, would Hadoop not be able to shutdown
>> > even
>> > > though there is nothing more to read from the RecordReader ?
>> > >
>> > > Kindly
>> > >
>> > > //Marcus
>> > >
>> > > --
>> > > Marcus Herou CTO and co-founder Tailsweep AB
>> > > +46702561312
>> > > marcus.herou@tailsweep.com
>> > > http://www.tailsweep.com/
>> > > http://blogg.tailsweep.com/
>> > >
>> >
>> >
>> >
>> > --
>> > Marcus Herou CTO and co-founder Tailsweep AB
>> > +46702561312
>> > marcus.herou@tailsweep.com
>> > http://www.tailsweep.com/
>> > http://blogg.tailsweep.com/
>> >
>>
>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Lingering TaskTracker$Child

Posted by Marcus Herou <ma...@tailsweep.com>.
Thanks!

So by you experience would this be good enough ? (Notice the System.exit(0))


I implement the MapRunnable interface.

 public void run(RecordReader<LongWritable, Text> recordReader,
OutputCollector<WritableComparable, WritableComparable> outputCollector,
Reporter reporter) throws IOException
    {
        this.recordReader = recordReader;
        this.outputCollector = outputCollector;
        this.reporter = reporter;
        int threads =
Integer.valueOf(this.getConf().get(getClass().getName()+".threads", "10"));
        log.info("Starting with "+threads + " threads");
        long timeout =
Long.valueOf(this.getConf().get(getClass().getName()+".timeout", "600000"));

        for (int i = 0; i < threads; i++)
        {
            // spawn threads
            new FetcherThread().start();
        }
        do
        {                                          // wait for threads to
exit
            try
            {
                Thread.sleep(1000);
            } catch (InterruptedException e) {}

            reportStatus();

            // some requests seem to hang, despite all intentions
            synchronized (this)
            {
                if ((System.currentTimeMillis() - lastRequestStart) >
timeout)
                {
                    if (log.isWarnEnabled())
                    {
                        log.warn("Aborting with "+activeThreads+" hung
threads.");
                    }
                    return;
                }
            }

        } while (activeThreads > 0);
        log.info("All threads seem to be done, exiting");
        System.exit(0);
    }

On Sun, Jan 25, 2009 at 5:57 PM, jason hadoop <ja...@gmail.com>wrote:

> We had trouble like that with some jobs, when the child ran additional
> threads that were not set at daemon priority. These hold the Child JVM from
> exiting.
> JMX was the cause in our case, but we have seen our JNI jobs do it also.
> In the end we made a local mod that forced a System.exit in the finally
> block of the Child main.
>
>
> On Sun, Jan 25, 2009 at 8:53 AM, Marcus Herou <marcus.herou@tailsweep.com
> >wrote:
>
> > Some extra info, apparently the child exits with a status of 143.
> >
> > 2009-01-25 17:13:11.110 INFO IPC Server handler 0 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0011_m_000005_0
> > 1.0% 5364 docs, 0 errors, 124.7 docs/s
> > 2009-01-25 17:13:11.114 INFO IPC Server handler 1 on 32274
> > org.apache.hadoop.mapred.TaskTracker - Task
> > attempt_200901251629_0011_m_000005_0 is done.
> > 2009-01-25 17:13:11.116 INFO main org.apache.hadoop.mapred.TaskRunner -
> > Task
> > 'attempt_200901251629_0011_m_000005_0' done.
> > 2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0011_m_000005_0
> > 1.0% 5364 docs, 0 errors, 124.7 docs/s
> > 2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0011_m_000005_0
> > Ignoring status-update since task is 'done'
> > 2009-01-25 17:13:24.996 INFO taskCleanup
> > org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for job:
> > job_200901251629_0011
> > 2009-01-25 17:13:24.996 INFO taskCleanup
> > org.apache.hadoop.mapred.TaskRunner
> > - attempt_200901251629_0011_m_000005_0 done; removing files.
> > 2009-01-25 17:47:22.668 WARN Thread-23
> org.apache.hadoop.mapred.TaskRunner
> > -
> > attempt_200901251629_0001_m_000006_0 Child Error
> > java.io.IOException: Task process exit with nonzero status of 143.
> >        at
> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> > 2009-01-25 17:47:22.669 WARN Thread-23
> org.apache.hadoop.mapred.TaskTracker
> > - Error from unknown child task: attempt_200901251629_0001_m_000006_0.
> > Ignored.
> > 2009-01-25 17:47:22.671 WARN Thread-23
> org.apache.hadoop.mapred.TaskTracker
> > - Unknown child task finshed: attempt_200901251629_0001_m_000006_0.
> > Ignored.
> > 2009-01-25 17:47:22.713 WARN Thread-79
> org.apache.hadoop.mapred.TaskRunner
> > -
> > attempt_200901251629_0002_m_000007_0 Child Error
> > java.io.IOException: Task process exit with nonzero status of 143.
> >        at
> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> > 2009-01-25 17:47:22.713 WARN Thread-159
> org.apache.hadoop.mapred.TaskRunner
> > - attempt_200901251629_0011_m_000005_0 Child Error
> > java.io.IOException: Task process exit with nonzero status of 143.
> >        at
> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> > 2009-01-25 17:47:22.713 WARN Thread-79
> org.apache.hadoop.mapred.TaskTracker
> > - Error from unknown child task: attempt_200901251629_0002_m_000007_0.
> > Ignored.
> > 2009-01-25 17:47:22.714 WARN Thread-159
> > org.apache.hadoop.mapred.TaskTracker
> > - Error from unknown child task: attempt_200901251629_0011_m_000005_0.
> > Ignored.
> > 2009-01-25 17:47:22.714 WARN Thread-79
> org.apache.hadoop.mapred.TaskTracker
> > - Unknown child task finshed: attempt_200901251629_0002_m_000007_0.
> > Ignored.
> > 2009-01-25 17:47:22.715 WARN Thread-159
> > org.apache.hadoop.mapred.TaskTracker
> > - Unknown child task finshed: attempt_200901251629_0011_m_000005_0.
> > Ignored.
> > 2009-01-25 17:47:22.811 WARN Thread-119
> org.apache.hadoop.mapred.TaskRunner
> > - attempt_200901251629_0006_m_000007_0 Child Error
> > java.io.IOException: Task process exit with nonzero status of 143.
> >        at
> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> > 2009-01-25 17:47:22.811 WARN Thread-119
> > org.apache.hadoop.mapred.TaskTracker
> > - Error from unknown child task: attempt_200901251629_0006_m_000007_0.
> > Ignored.
> > 2009-01-25 17:47:22.812 WARN Thread-119
> > org.apache.hadoop.mapred.TaskTracker
> > - Unknown child task finshed: attempt_200901251629_0006_m_000007_0.
> > Ignored.
> > 2009-01-25 17:47:22.960 WARN Thread-89
> org.apache.hadoop.mapred.TaskRunner
> > -
> > attempt_200901251629_0003_m_000007_0 Child Error
> > java.io.IOException: Task process exit with nonzero status of 143.
> >        at
> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> > 2009-01-25 17:47:22.961 WARN Thread-89
> org.apache.hadoop.mapred.TaskTracker
> > - Error from unknown child task: attempt_200901251629_0003_m_000007_0.
> > Ignored.
> > 2009-01-25 17:47:22.962 WARN Thread-89
> org.apache.hadoop.mapred.TaskTracker
> > - Unknown child task finshed: attempt_200901251629_0003_m_000007_0.
> > Ignored.
> > 2009-01-25 17:47:23.060 WARN Thread-129
> org.apache.hadoop.mapred.TaskRunner
> > - attempt_200901251629_0008_m_000003_0 Child Error
> > java.io.IOException: Task process exit with nonzero status of 143.
> >        at
> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> > 2009-01-25 17:47:23.061 WARN Thread-129
> > org.apache.hadoop.mapred.TaskTracker
> > - Error from unknown child task: attempt_200901251629_0008_m_000003_0.
> > Ignored.
> > 2009-01-25 17:47:23.062 WARN Thread-129
> > org.apache.hadoop.mapred.TaskTracker
> > - Unknown child task finshed: attempt_200901251629_0008_m_000003_0.
> > Ignored.
> > 2009-01-25 17:47:23.279 WARN Thread-99
> org.apache.hadoop.mapred.TaskRunner
> > -
> > attempt_200901251629_0004_m_000007_0 Child Error
> > java.io.IOException: Task process exit with nonzero status of 143.
> >        at
> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> > 2009-01-25 17:47:23.279 WARN Thread-99
> org.apache.hadoop.mapred.TaskTracker
> > - Error from unknown child task: attempt_200901251629_0004_m_000007_0.
> > Ignored.
> > 2009-01-25 17:47:23.280 WARN Thread-99
> org.apache.hadoop.mapred.TaskTracker
> > - Unknown child task finshed: attempt_200901251629_0004_m_000007_0.
> > Ignored.
> > 2009-01-25 17:47:23.360 WARN Thread-139
> org.apache.hadoop.mapred.TaskRunner
> > - attempt_200901251629_0009_m_000004_0 Child Error
> > java.io.IOException: Task process exit with nonzero status of 143.
> >        at
> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> > 2009-01-25 17:47:23.360 WARN Thread-139
> > org.apache.hadoop.mapred.TaskTracker
> > - Error from unknown child task: attempt_200901251629_0009_m_000004_0.
> > Ignored.
> > 2009-01-25 17:47:23.361 WARN Thread-139
> > org.apache.hadoop.mapred.TaskTracker
> > - Unknown child task finshed: attempt_200901251629_0009_m_000004_0.
> > Ignored.
> > 2009-01-25 17:47:23.445 WARN Thread-149
> org.apache.hadoop.mapred.TaskRunner
> > - attempt_200901251629_0010_m_000004_0 Child Error
> > java.io.IOException: Task process exit with nonzero status of 143.
> >        at
> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> > 2009-01-25 17:47:23.445 WARN Thread-149
> > org.apache.hadoop.mapred.TaskTracker
> > - Error from unknown child task: attempt_200901251629_0010_m_000004_0.
> > Ignored.
> > 2009-01-25 17:47:23.446 WARN Thread-149
> > org.apache.hadoop.mapred.TaskTracker
> > - Unknown child task finshed: attempt_200901251629_0010_m_000004_0.
> > Ignored.
> > 2009-01-25 17:47:23.463 WARN Thread-109
> org.apache.hadoop.mapred.TaskRunner
> > - attempt_200901251629_0005_m_000007_0 Child Error
> > java.io.IOException: Task process exit with nonzero status of 143.
> >        at
> org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> > 2009-01-25 17:47:23.464 WARN Thread-109
> > org.apache.hadoop.mapred.TaskTracker
> > - Error from unknown child task: attempt_200901251629_0005_m_000007_0.
> > Ignored.
> > 2009-01-25 17:47:23.465 WARN Thread-109
> > org.apache.hadoop.mapred.TaskTracker
> > - Unknown child task finshed: attempt_200901251629_0005_m_000007_0.
> > Ignored.
> > 2009-01-25 17:48:15.315 INFO main org.apache.hadoop.mapred.TaskTracker -
> > LaunchTaskAction: attempt_200901251629_0012_m_000004_0
> > 2009-01-25 17:48:28.088 INFO main org.apache.hadoop.mapred.MapTask -
> > numReduceTasks: 0
> > 2009-01-25 17:48:34.097 INFO IPC Server handler 0 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.07784176% 347 docs, 0 errors, 69.4 docs/s
> > 2009-01-25 17:48:37.099 INFO IPC Server handler 1 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.13955688% 673 docs, 0 errors, 84.1 docs/s
> > 2009-01-25 17:48:40.100 INFO IPC Server handler 0 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.20145722% 1003 docs, 0 errors, 91.2 docs/s
> > 2009-01-25 17:48:43.102 INFO IPC Server handler 1 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.26416454% 1343 docs, 0 errors, 95.9 docs/s
> > 2009-01-25 17:48:46.104 INFO IPC Server handler 0 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.32754657% 1673 docs, 0 errors, 98.4 docs/s
> > 2009-01-25 17:48:49.105 INFO IPC Server handler 1 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.39229122% 2025 docs, 0 errors, 101.3 docs/s
> > 2009-01-25 17:48:52.106 INFO IPC Server handler 0 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.44708738% 2315 docs, 0 errors, 100.7 docs/s
> > 2009-01-25 17:48:55.108 INFO IPC Server handler 1 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.5117659% 2658 docs, 0 errors, 102.2 docs/s
> > 2009-01-25 17:48:58.109 INFO IPC Server handler 0 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.5795401% 3019 docs, 0 errors, 104.1 docs/s
> > 2009-01-25 17:49:01.111 INFO IPC Server handler 1 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.6468777% 3377 docs, 0 errors, 105.5 docs/s
> > 2009-01-25 17:49:04.112 INFO IPC Server handler 0 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.71514136% 3741 docs, 0 errors, 106.9 docs/s
> > 2009-01-25 17:49:07.114 INFO IPC Server handler 1 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.78037554% 4094 docs, 0 errors, 107.7 docs/s
> > 2009-01-25 17:49:10.115 INFO IPC Server handler 0 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.84660184% 4448 docs, 0 errors, 108.5 docs/s
> > 2009-01-25 17:49:13.117 INFO IPC Server handler 1 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.9146803% 4810 docs, 0 errors, 109.3 docs/s
> > 2009-01-25 17:49:16.118 INFO IPC Server handler 0 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 0.9521195% 5044 docs, 0 errors, 107.3 docs/s
> > 2009-01-25 17:49:18.626 INFO IPC Server handler 1 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 1.0% 5364 docs, 0 errors, 107.3 docs/s
> > 2009-01-25 17:49:18.631 INFO IPC Server handler 0 on 32274
> > org.apache.hadoop.mapred.TaskTracker - Task
> > attempt_200901251629_0012_m_000004_0 is done.
> > 2009-01-25 17:49:18.631 INFO main org.apache.hadoop.mapred.TaskRunner -
> > Task
> > 'attempt_200901251629_0012_m_000004_0' done.
> > 2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > 1.0% 5364 docs, 0 errors, 107.3 docs/s
> > 2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
> > org.apache.hadoop.mapred.TaskTracker -
> attempt_200901251629_0012_m_000004_0
> > Ignoring status-update since task is 'done'
> > 2009-01-25 17:49:35.582 INFO taskCleanup
> > org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for job:
> > job_200901251629_0012
> > 2009-01-25 17:49:35.582 INFO taskCleanup
> > org.apache.hadoop.mapred.TaskRunner
> > - attempt_200901251629_0012_m_000004_0 done; removing files.
> >
> > # Still processes left even though the TaskTracker said: Received
> > 'KillJobAction' for job: job_200901251629_0012
> > root@mapreduce2:~# jps
> > 10732 Jps
> > 10634 TaskTracker$Child
> > 8660 DataNode
> > 8824 TaskTracker
> > 8730 SecondaryNameNode
> > 25060 Bootstrap
> > root@mapreduce2:~# date
> > Sun Jan 25 17:51:48 CET 2009
> > root@mapreduce2:~#
> >
> > On Sun, Jan 25, 2009 at 5:42 PM, Marcus Herou <
> marcus.herou@tailsweep.com
> > >wrote:
> >
> > > Hi.
> > >
> > > Today I noticed when I ran a Solr Indexing job through our Hadoop
> cluster
> > > that the master MySQL database where screaming about "Too Many
> > Connections".
> > >
> > > I wondered how that could happen so I logged into my Hadoop machines
> and
> > > searched through the logs. Nothing strange there. Then I just did a
> jps:
> > >
> > > root@mapreduce1:~# jps
> > > 10701 TaskTracker$Child
> > > 9567 NameNode
> > > 5435 TaskTracker$Child
> > > 31801 Bootstrap
> > > 7349 TaskTracker$Child
> > > 6197 TaskTracker$Child
> > > 7761 TaskTracker$Child
> > > 10453 TaskTracker$Child
> > > 11232 TaskTracker$Child
> > > 11113 TaskTracker$Child
> > > 9688 DataNode
> > > 10877 TaskTracker$Child
> > > 6504 TaskTracker$Child
> > > 10236 TaskTracker$Child
> > > 9852 TaskTracker
> > > 6515 TaskTracker$Child
> > > 11396 TaskTracker$Child
> > > 11741 Jps
> > > 6191 TaskTracker$Child
> > > 10981 TaskTracker$Child
> > > 7742 TaskTracker$Child
> > > 5946 TaskTracker$Child
> > > 11315 TaskTracker$Child
> > > 8112 TaskTracker$Child
> > > 11580 TaskTracker$Child
> > > 11490 TaskTracker$Child
> > > 5687 TaskTracker$Child
> > > 5927 TaskTracker$Child
> > > 27144 WrapperSimpleApp
> > > 7368 TaskTracker$Child
> > >
> > > Damn! Each Child have it's own DataSource (dbcp pool) tweaked down so
> it
> > > only can have one active connection to any shard at any time.
> > > Background: I ran out of connections during the Christmas holidays
> since
> > I
> > > have 60 shards (10 per MySQL machine) and each required a DB-Pool which
> > > allowed too many active+idle connections.
> > >
> > > Anyway I have no active jobs at the moment so the children should have
> > died
> > > by themselves.
> > > Fortunately I have a little nice script which kills the bastards: jps
> > > |egrep "TaskTracker.+" | awk '{print $1}'|xargs kill
> > > I will probably put that in a cronjob which kills long running
> > children...
> > >
> > > Anyway, how can this happen ? Am I doing something really stupid along
> > the
> > > way ?
> > > Hard facts:
> > > Ubuntu Hardy-Heron, 2.6.24-19-server
> > > java version "1.6.0_06"
> > > Hadoop-0.18.2
> > > It's my own classes which fires the jobs through JobClient
> > > (JobClient.runJob(job))
> > > I feed the jar to hadoop by issuing: job.setJar(jarFile); (comes from a
> > > bash script)
> > > I feed deps into hadoop by issuing: job.set("tmpjars", jarFiles);
> (comes
> > by
> > > parsing external CLASSPATH ENV in bash)
> > >
> > > The client do not complain, se example output below (I write no data to
> > > HDFS ((HDFS bytes written=774)), since I mostly use it for crawling and
> > all
> > > crawlers/indexers access my sharding db structure directly without
> > > intermediate storage):
> > > 2009-01-25 17:12:11.175 INFO main
> > org.apache.hadoop.mapred.FileInputFormat
> > > - Total input paths to process : 1
> > > 2009-01-25 17:12:11.176 INFO main
> > org.apache.hadoop.mapred.FileInputFormat
> > > - Total input paths to process : 1
> > > 2009-01-25 17:12:11.437 INFO main org.apache.hadoop.mapred.JobClient -
> > > Running job: job_200901251629_0011
> > > 2009-01-25 17:12:12.439 INFO main org.apache.hadoop.mapred.JobClient -
> >  map
> > > 0% reduce 0%
> > > 2009-01-25 17:12:35.481 INFO main org.apache.hadoop.mapred.JobClient -
> >  map
> > > 6% reduce 0%
> > > 2009-01-25 17:12:40.493 INFO main org.apache.hadoop.mapred.JobClient -
> >  map
> > > 21% reduce 0%
> > > 2009-01-25 17:12:45.502 INFO main org.apache.hadoop.mapred.JobClient -
> >  map
> > > 31% reduce 0%
> > > 2009-01-25 17:12:50.511 INFO main org.apache.hadoop.mapred.JobClient -
> >  map
> > > 51% reduce 0%
> > > 2009-01-25 17:12:55.520 INFO main org.apache.hadoop.mapred.JobClient -
> >  map
> > > 67% reduce 0%
> > > 2009-01-25 17:13:00.533 INFO main org.apache.hadoop.mapred.JobClient -
> >  map
> > > 72% reduce 0%
> > > 2009-01-25 17:13:05.543 INFO main org.apache.hadoop.mapred.JobClient -
> >  map
> > > 84% reduce 0%
> > > 2009-01-25 17:13:10.552 INFO main org.apache.hadoop.mapred.JobClient -
> >  map
> > > 95% reduce 0%
> > > 2009-01-25 17:13:15.560 INFO main org.apache.hadoop.mapred.JobClient -
> >  map
> > > 98% reduce 0%
> > > 2009-01-25 17:13:20.568 INFO main org.apache.hadoop.mapred.JobClient -
> > Job
> > > complete: job_200901251629_0011
> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> > > Counters: 7
> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> > > File Systems
> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> > > HDFS bytes read=2741143
> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> > > HDFS bytes written=774
> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> > > Job Counters
> > > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> > > Rack-local map tasks=9
> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> > > Launched map tasks=9
> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> > > Map-Reduce Framework
> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> > > Map input records=48314
> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> > > Map input bytes=2732424
> > > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> > > Map output records=0
> > >
> > > Any suggestions or pointers would be greatly appreciated. Hmm Coming to
> > > think about something. I start X threads from inside Hadoop almost
> cut'n
> > > pasted from Nutch.
> > > If a thread somehow would linger, would Hadoop not be able to shutdown
> > even
> > > though there is nothing more to read from the RecordReader ?
> > >
> > > Kindly
> > >
> > > //Marcus
> > >
> > > --
> > > Marcus Herou CTO and co-founder Tailsweep AB
> > > +46702561312
> > > marcus.herou@tailsweep.com
> > > http://www.tailsweep.com/
> > > http://blogg.tailsweep.com/
> > >
> >
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
> >
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Lingering TaskTracker$Child

Posted by jason hadoop <ja...@gmail.com>.
We had trouble like that with some jobs, when the child ran additional
threads that were not set at daemon priority. These hold the Child JVM from
exiting.
JMX was the cause in our case, but we have seen our JNI jobs do it also.
In the end we made a local mod that forced a System.exit in the finally
block of the Child main.


On Sun, Jan 25, 2009 at 8:53 AM, Marcus Herou <ma...@tailsweep.com>wrote:

> Some extra info, apparently the child exits with a status of 143.
>
> 2009-01-25 17:13:11.110 INFO IPC Server handler 0 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0011_m_000005_0
> 1.0% 5364 docs, 0 errors, 124.7 docs/s
> 2009-01-25 17:13:11.114 INFO IPC Server handler 1 on 32274
> org.apache.hadoop.mapred.TaskTracker - Task
> attempt_200901251629_0011_m_000005_0 is done.
> 2009-01-25 17:13:11.116 INFO main org.apache.hadoop.mapred.TaskRunner -
> Task
> 'attempt_200901251629_0011_m_000005_0' done.
> 2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0011_m_000005_0
> 1.0% 5364 docs, 0 errors, 124.7 docs/s
> 2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0011_m_000005_0
> Ignoring status-update since task is 'done'
> 2009-01-25 17:13:24.996 INFO taskCleanup
> org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for job:
> job_200901251629_0011
> 2009-01-25 17:13:24.996 INFO taskCleanup
> org.apache.hadoop.mapred.TaskRunner
> - attempt_200901251629_0011_m_000005_0 done; removing files.
> 2009-01-25 17:47:22.668 WARN Thread-23 org.apache.hadoop.mapred.TaskRunner
> -
> attempt_200901251629_0001_m_000006_0 Child Error
> java.io.IOException: Task process exit with nonzero status of 143.
>        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> 2009-01-25 17:47:22.669 WARN Thread-23 org.apache.hadoop.mapred.TaskTracker
> - Error from unknown child task: attempt_200901251629_0001_m_000006_0.
> Ignored.
> 2009-01-25 17:47:22.671 WARN Thread-23 org.apache.hadoop.mapred.TaskTracker
> - Unknown child task finshed: attempt_200901251629_0001_m_000006_0.
> Ignored.
> 2009-01-25 17:47:22.713 WARN Thread-79 org.apache.hadoop.mapred.TaskRunner
> -
> attempt_200901251629_0002_m_000007_0 Child Error
> java.io.IOException: Task process exit with nonzero status of 143.
>        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> 2009-01-25 17:47:22.713 WARN Thread-159 org.apache.hadoop.mapred.TaskRunner
> - attempt_200901251629_0011_m_000005_0 Child Error
> java.io.IOException: Task process exit with nonzero status of 143.
>        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> 2009-01-25 17:47:22.713 WARN Thread-79 org.apache.hadoop.mapred.TaskTracker
> - Error from unknown child task: attempt_200901251629_0002_m_000007_0.
> Ignored.
> 2009-01-25 17:47:22.714 WARN Thread-159
> org.apache.hadoop.mapred.TaskTracker
> - Error from unknown child task: attempt_200901251629_0011_m_000005_0.
> Ignored.
> 2009-01-25 17:47:22.714 WARN Thread-79 org.apache.hadoop.mapred.TaskTracker
> - Unknown child task finshed: attempt_200901251629_0002_m_000007_0.
> Ignored.
> 2009-01-25 17:47:22.715 WARN Thread-159
> org.apache.hadoop.mapred.TaskTracker
> - Unknown child task finshed: attempt_200901251629_0011_m_000005_0.
> Ignored.
> 2009-01-25 17:47:22.811 WARN Thread-119 org.apache.hadoop.mapred.TaskRunner
> - attempt_200901251629_0006_m_000007_0 Child Error
> java.io.IOException: Task process exit with nonzero status of 143.
>        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> 2009-01-25 17:47:22.811 WARN Thread-119
> org.apache.hadoop.mapred.TaskTracker
> - Error from unknown child task: attempt_200901251629_0006_m_000007_0.
> Ignored.
> 2009-01-25 17:47:22.812 WARN Thread-119
> org.apache.hadoop.mapred.TaskTracker
> - Unknown child task finshed: attempt_200901251629_0006_m_000007_0.
> Ignored.
> 2009-01-25 17:47:22.960 WARN Thread-89 org.apache.hadoop.mapred.TaskRunner
> -
> attempt_200901251629_0003_m_000007_0 Child Error
> java.io.IOException: Task process exit with nonzero status of 143.
>        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> 2009-01-25 17:47:22.961 WARN Thread-89 org.apache.hadoop.mapred.TaskTracker
> - Error from unknown child task: attempt_200901251629_0003_m_000007_0.
> Ignored.
> 2009-01-25 17:47:22.962 WARN Thread-89 org.apache.hadoop.mapred.TaskTracker
> - Unknown child task finshed: attempt_200901251629_0003_m_000007_0.
> Ignored.
> 2009-01-25 17:47:23.060 WARN Thread-129 org.apache.hadoop.mapred.TaskRunner
> - attempt_200901251629_0008_m_000003_0 Child Error
> java.io.IOException: Task process exit with nonzero status of 143.
>        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> 2009-01-25 17:47:23.061 WARN Thread-129
> org.apache.hadoop.mapred.TaskTracker
> - Error from unknown child task: attempt_200901251629_0008_m_000003_0.
> Ignored.
> 2009-01-25 17:47:23.062 WARN Thread-129
> org.apache.hadoop.mapred.TaskTracker
> - Unknown child task finshed: attempt_200901251629_0008_m_000003_0.
> Ignored.
> 2009-01-25 17:47:23.279 WARN Thread-99 org.apache.hadoop.mapred.TaskRunner
> -
> attempt_200901251629_0004_m_000007_0 Child Error
> java.io.IOException: Task process exit with nonzero status of 143.
>        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> 2009-01-25 17:47:23.279 WARN Thread-99 org.apache.hadoop.mapred.TaskTracker
> - Error from unknown child task: attempt_200901251629_0004_m_000007_0.
> Ignored.
> 2009-01-25 17:47:23.280 WARN Thread-99 org.apache.hadoop.mapred.TaskTracker
> - Unknown child task finshed: attempt_200901251629_0004_m_000007_0.
> Ignored.
> 2009-01-25 17:47:23.360 WARN Thread-139 org.apache.hadoop.mapred.TaskRunner
> - attempt_200901251629_0009_m_000004_0 Child Error
> java.io.IOException: Task process exit with nonzero status of 143.
>        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> 2009-01-25 17:47:23.360 WARN Thread-139
> org.apache.hadoop.mapred.TaskTracker
> - Error from unknown child task: attempt_200901251629_0009_m_000004_0.
> Ignored.
> 2009-01-25 17:47:23.361 WARN Thread-139
> org.apache.hadoop.mapred.TaskTracker
> - Unknown child task finshed: attempt_200901251629_0009_m_000004_0.
> Ignored.
> 2009-01-25 17:47:23.445 WARN Thread-149 org.apache.hadoop.mapred.TaskRunner
> - attempt_200901251629_0010_m_000004_0 Child Error
> java.io.IOException: Task process exit with nonzero status of 143.
>        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> 2009-01-25 17:47:23.445 WARN Thread-149
> org.apache.hadoop.mapred.TaskTracker
> - Error from unknown child task: attempt_200901251629_0010_m_000004_0.
> Ignored.
> 2009-01-25 17:47:23.446 WARN Thread-149
> org.apache.hadoop.mapred.TaskTracker
> - Unknown child task finshed: attempt_200901251629_0010_m_000004_0.
> Ignored.
> 2009-01-25 17:47:23.463 WARN Thread-109 org.apache.hadoop.mapred.TaskRunner
> - attempt_200901251629_0005_m_000007_0 Child Error
> java.io.IOException: Task process exit with nonzero status of 143.
>        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> 2009-01-25 17:47:23.464 WARN Thread-109
> org.apache.hadoop.mapred.TaskTracker
> - Error from unknown child task: attempt_200901251629_0005_m_000007_0.
> Ignored.
> 2009-01-25 17:47:23.465 WARN Thread-109
> org.apache.hadoop.mapred.TaskTracker
> - Unknown child task finshed: attempt_200901251629_0005_m_000007_0.
> Ignored.
> 2009-01-25 17:48:15.315 INFO main org.apache.hadoop.mapred.TaskTracker -
> LaunchTaskAction: attempt_200901251629_0012_m_000004_0
> 2009-01-25 17:48:28.088 INFO main org.apache.hadoop.mapred.MapTask -
> numReduceTasks: 0
> 2009-01-25 17:48:34.097 INFO IPC Server handler 0 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.07784176% 347 docs, 0 errors, 69.4 docs/s
> 2009-01-25 17:48:37.099 INFO IPC Server handler 1 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.13955688% 673 docs, 0 errors, 84.1 docs/s
> 2009-01-25 17:48:40.100 INFO IPC Server handler 0 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.20145722% 1003 docs, 0 errors, 91.2 docs/s
> 2009-01-25 17:48:43.102 INFO IPC Server handler 1 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.26416454% 1343 docs, 0 errors, 95.9 docs/s
> 2009-01-25 17:48:46.104 INFO IPC Server handler 0 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.32754657% 1673 docs, 0 errors, 98.4 docs/s
> 2009-01-25 17:48:49.105 INFO IPC Server handler 1 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.39229122% 2025 docs, 0 errors, 101.3 docs/s
> 2009-01-25 17:48:52.106 INFO IPC Server handler 0 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.44708738% 2315 docs, 0 errors, 100.7 docs/s
> 2009-01-25 17:48:55.108 INFO IPC Server handler 1 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.5117659% 2658 docs, 0 errors, 102.2 docs/s
> 2009-01-25 17:48:58.109 INFO IPC Server handler 0 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.5795401% 3019 docs, 0 errors, 104.1 docs/s
> 2009-01-25 17:49:01.111 INFO IPC Server handler 1 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.6468777% 3377 docs, 0 errors, 105.5 docs/s
> 2009-01-25 17:49:04.112 INFO IPC Server handler 0 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.71514136% 3741 docs, 0 errors, 106.9 docs/s
> 2009-01-25 17:49:07.114 INFO IPC Server handler 1 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.78037554% 4094 docs, 0 errors, 107.7 docs/s
> 2009-01-25 17:49:10.115 INFO IPC Server handler 0 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.84660184% 4448 docs, 0 errors, 108.5 docs/s
> 2009-01-25 17:49:13.117 INFO IPC Server handler 1 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.9146803% 4810 docs, 0 errors, 109.3 docs/s
> 2009-01-25 17:49:16.118 INFO IPC Server handler 0 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 0.9521195% 5044 docs, 0 errors, 107.3 docs/s
> 2009-01-25 17:49:18.626 INFO IPC Server handler 1 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 1.0% 5364 docs, 0 errors, 107.3 docs/s
> 2009-01-25 17:49:18.631 INFO IPC Server handler 0 on 32274
> org.apache.hadoop.mapred.TaskTracker - Task
> attempt_200901251629_0012_m_000004_0 is done.
> 2009-01-25 17:49:18.631 INFO main org.apache.hadoop.mapred.TaskRunner -
> Task
> 'attempt_200901251629_0012_m_000004_0' done.
> 2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> 1.0% 5364 docs, 0 errors, 107.3 docs/s
> 2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
> org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
> Ignoring status-update since task is 'done'
> 2009-01-25 17:49:35.582 INFO taskCleanup
> org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for job:
> job_200901251629_0012
> 2009-01-25 17:49:35.582 INFO taskCleanup
> org.apache.hadoop.mapred.TaskRunner
> - attempt_200901251629_0012_m_000004_0 done; removing files.
>
> # Still processes left even though the TaskTracker said: Received
> 'KillJobAction' for job: job_200901251629_0012
> root@mapreduce2:~# jps
> 10732 Jps
> 10634 TaskTracker$Child
> 8660 DataNode
> 8824 TaskTracker
> 8730 SecondaryNameNode
> 25060 Bootstrap
> root@mapreduce2:~# date
> Sun Jan 25 17:51:48 CET 2009
> root@mapreduce2:~#
>
> On Sun, Jan 25, 2009 at 5:42 PM, Marcus Herou <marcus.herou@tailsweep.com
> >wrote:
>
> > Hi.
> >
> > Today I noticed when I ran a Solr Indexing job through our Hadoop cluster
> > that the master MySQL database where screaming about "Too Many
> Connections".
> >
> > I wondered how that could happen so I logged into my Hadoop machines and
> > searched through the logs. Nothing strange there. Then I just did a jps:
> >
> > root@mapreduce1:~# jps
> > 10701 TaskTracker$Child
> > 9567 NameNode
> > 5435 TaskTracker$Child
> > 31801 Bootstrap
> > 7349 TaskTracker$Child
> > 6197 TaskTracker$Child
> > 7761 TaskTracker$Child
> > 10453 TaskTracker$Child
> > 11232 TaskTracker$Child
> > 11113 TaskTracker$Child
> > 9688 DataNode
> > 10877 TaskTracker$Child
> > 6504 TaskTracker$Child
> > 10236 TaskTracker$Child
> > 9852 TaskTracker
> > 6515 TaskTracker$Child
> > 11396 TaskTracker$Child
> > 11741 Jps
> > 6191 TaskTracker$Child
> > 10981 TaskTracker$Child
> > 7742 TaskTracker$Child
> > 5946 TaskTracker$Child
> > 11315 TaskTracker$Child
> > 8112 TaskTracker$Child
> > 11580 TaskTracker$Child
> > 11490 TaskTracker$Child
> > 5687 TaskTracker$Child
> > 5927 TaskTracker$Child
> > 27144 WrapperSimpleApp
> > 7368 TaskTracker$Child
> >
> > Damn! Each Child have it's own DataSource (dbcp pool) tweaked down so it
> > only can have one active connection to any shard at any time.
> > Background: I ran out of connections during the Christmas holidays since
> I
> > have 60 shards (10 per MySQL machine) and each required a DB-Pool which
> > allowed too many active+idle connections.
> >
> > Anyway I have no active jobs at the moment so the children should have
> died
> > by themselves.
> > Fortunately I have a little nice script which kills the bastards: jps
> > |egrep "TaskTracker.+" | awk '{print $1}'|xargs kill
> > I will probably put that in a cronjob which kills long running
> children...
> >
> > Anyway, how can this happen ? Am I doing something really stupid along
> the
> > way ?
> > Hard facts:
> > Ubuntu Hardy-Heron, 2.6.24-19-server
> > java version "1.6.0_06"
> > Hadoop-0.18.2
> > It's my own classes which fires the jobs through JobClient
> > (JobClient.runJob(job))
> > I feed the jar to hadoop by issuing: job.setJar(jarFile); (comes from a
> > bash script)
> > I feed deps into hadoop by issuing: job.set("tmpjars", jarFiles); (comes
> by
> > parsing external CLASSPATH ENV in bash)
> >
> > The client do not complain, se example output below (I write no data to
> > HDFS ((HDFS bytes written=774)), since I mostly use it for crawling and
> all
> > crawlers/indexers access my sharding db structure directly without
> > intermediate storage):
> > 2009-01-25 17:12:11.175 INFO main
> org.apache.hadoop.mapred.FileInputFormat
> > - Total input paths to process : 1
> > 2009-01-25 17:12:11.176 INFO main
> org.apache.hadoop.mapred.FileInputFormat
> > - Total input paths to process : 1
> > 2009-01-25 17:12:11.437 INFO main org.apache.hadoop.mapred.JobClient -
> > Running job: job_200901251629_0011
> > 2009-01-25 17:12:12.439 INFO main org.apache.hadoop.mapred.JobClient -
>  map
> > 0% reduce 0%
> > 2009-01-25 17:12:35.481 INFO main org.apache.hadoop.mapred.JobClient -
>  map
> > 6% reduce 0%
> > 2009-01-25 17:12:40.493 INFO main org.apache.hadoop.mapred.JobClient -
>  map
> > 21% reduce 0%
> > 2009-01-25 17:12:45.502 INFO main org.apache.hadoop.mapred.JobClient -
>  map
> > 31% reduce 0%
> > 2009-01-25 17:12:50.511 INFO main org.apache.hadoop.mapred.JobClient -
>  map
> > 51% reduce 0%
> > 2009-01-25 17:12:55.520 INFO main org.apache.hadoop.mapred.JobClient -
>  map
> > 67% reduce 0%
> > 2009-01-25 17:13:00.533 INFO main org.apache.hadoop.mapred.JobClient -
>  map
> > 72% reduce 0%
> > 2009-01-25 17:13:05.543 INFO main org.apache.hadoop.mapred.JobClient -
>  map
> > 84% reduce 0%
> > 2009-01-25 17:13:10.552 INFO main org.apache.hadoop.mapred.JobClient -
>  map
> > 95% reduce 0%
> > 2009-01-25 17:13:15.560 INFO main org.apache.hadoop.mapred.JobClient -
>  map
> > 98% reduce 0%
> > 2009-01-25 17:13:20.568 INFO main org.apache.hadoop.mapred.JobClient -
> Job
> > complete: job_200901251629_0011
> > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> > Counters: 7
> > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> > File Systems
> > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> > HDFS bytes read=2741143
> > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> > HDFS bytes written=774
> > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> > Job Counters
> > 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> > Rack-local map tasks=9
> > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> > Launched map tasks=9
> > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> > Map-Reduce Framework
> > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> > Map input records=48314
> > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> > Map input bytes=2732424
> > 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> > Map output records=0
> >
> > Any suggestions or pointers would be greatly appreciated. Hmm Coming to
> > think about something. I start X threads from inside Hadoop almost cut'n
> > pasted from Nutch.
> > If a thread somehow would linger, would Hadoop not be able to shutdown
> even
> > though there is nothing more to read from the RecordReader ?
> >
> > Kindly
> >
> > //Marcus
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
> >
>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>

Re: Lingering TaskTracker$Child

Posted by Marcus Herou <ma...@tailsweep.com>.
Some extra info, apparently the child exits with a status of 143.

2009-01-25 17:13:11.110 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0011_m_000005_0
1.0% 5364 docs, 0 errors, 124.7 docs/s
2009-01-25 17:13:11.114 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - Task
attempt_200901251629_0011_m_000005_0 is done.
2009-01-25 17:13:11.116 INFO main org.apache.hadoop.mapred.TaskRunner - Task
'attempt_200901251629_0011_m_000005_0' done.
2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0011_m_000005_0
1.0% 5364 docs, 0 errors, 124.7 docs/s
2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0011_m_000005_0
Ignoring status-update since task is 'done'
2009-01-25 17:13:24.996 INFO taskCleanup
org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for job:
job_200901251629_0011
2009-01-25 17:13:24.996 INFO taskCleanup org.apache.hadoop.mapred.TaskRunner
- attempt_200901251629_0011_m_000005_0 done; removing files.
2009-01-25 17:47:22.668 WARN Thread-23 org.apache.hadoop.mapred.TaskRunner -
attempt_200901251629_0001_m_000006_0 Child Error
java.io.IOException: Task process exit with nonzero status of 143.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
2009-01-25 17:47:22.669 WARN Thread-23 org.apache.hadoop.mapred.TaskTracker
- Error from unknown child task: attempt_200901251629_0001_m_000006_0.
Ignored.
2009-01-25 17:47:22.671 WARN Thread-23 org.apache.hadoop.mapred.TaskTracker
- Unknown child task finshed: attempt_200901251629_0001_m_000006_0. Ignored.
2009-01-25 17:47:22.713 WARN Thread-79 org.apache.hadoop.mapred.TaskRunner -
attempt_200901251629_0002_m_000007_0 Child Error
java.io.IOException: Task process exit with nonzero status of 143.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
2009-01-25 17:47:22.713 WARN Thread-159 org.apache.hadoop.mapred.TaskRunner
- attempt_200901251629_0011_m_000005_0 Child Error
java.io.IOException: Task process exit with nonzero status of 143.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
2009-01-25 17:47:22.713 WARN Thread-79 org.apache.hadoop.mapred.TaskTracker
- Error from unknown child task: attempt_200901251629_0002_m_000007_0.
Ignored.
2009-01-25 17:47:22.714 WARN Thread-159 org.apache.hadoop.mapred.TaskTracker
- Error from unknown child task: attempt_200901251629_0011_m_000005_0.
Ignored.
2009-01-25 17:47:22.714 WARN Thread-79 org.apache.hadoop.mapred.TaskTracker
- Unknown child task finshed: attempt_200901251629_0002_m_000007_0. Ignored.
2009-01-25 17:47:22.715 WARN Thread-159 org.apache.hadoop.mapred.TaskTracker
- Unknown child task finshed: attempt_200901251629_0011_m_000005_0. Ignored.
2009-01-25 17:47:22.811 WARN Thread-119 org.apache.hadoop.mapred.TaskRunner
- attempt_200901251629_0006_m_000007_0 Child Error
java.io.IOException: Task process exit with nonzero status of 143.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
2009-01-25 17:47:22.811 WARN Thread-119 org.apache.hadoop.mapred.TaskTracker
- Error from unknown child task: attempt_200901251629_0006_m_000007_0.
Ignored.
2009-01-25 17:47:22.812 WARN Thread-119 org.apache.hadoop.mapred.TaskTracker
- Unknown child task finshed: attempt_200901251629_0006_m_000007_0. Ignored.
2009-01-25 17:47:22.960 WARN Thread-89 org.apache.hadoop.mapred.TaskRunner -
attempt_200901251629_0003_m_000007_0 Child Error
java.io.IOException: Task process exit with nonzero status of 143.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
2009-01-25 17:47:22.961 WARN Thread-89 org.apache.hadoop.mapred.TaskTracker
- Error from unknown child task: attempt_200901251629_0003_m_000007_0.
Ignored.
2009-01-25 17:47:22.962 WARN Thread-89 org.apache.hadoop.mapred.TaskTracker
- Unknown child task finshed: attempt_200901251629_0003_m_000007_0. Ignored.
2009-01-25 17:47:23.060 WARN Thread-129 org.apache.hadoop.mapred.TaskRunner
- attempt_200901251629_0008_m_000003_0 Child Error
java.io.IOException: Task process exit with nonzero status of 143.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
2009-01-25 17:47:23.061 WARN Thread-129 org.apache.hadoop.mapred.TaskTracker
- Error from unknown child task: attempt_200901251629_0008_m_000003_0.
Ignored.
2009-01-25 17:47:23.062 WARN Thread-129 org.apache.hadoop.mapred.TaskTracker
- Unknown child task finshed: attempt_200901251629_0008_m_000003_0. Ignored.
2009-01-25 17:47:23.279 WARN Thread-99 org.apache.hadoop.mapred.TaskRunner -
attempt_200901251629_0004_m_000007_0 Child Error
java.io.IOException: Task process exit with nonzero status of 143.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
2009-01-25 17:47:23.279 WARN Thread-99 org.apache.hadoop.mapred.TaskTracker
- Error from unknown child task: attempt_200901251629_0004_m_000007_0.
Ignored.
2009-01-25 17:47:23.280 WARN Thread-99 org.apache.hadoop.mapred.TaskTracker
- Unknown child task finshed: attempt_200901251629_0004_m_000007_0. Ignored.
2009-01-25 17:47:23.360 WARN Thread-139 org.apache.hadoop.mapred.TaskRunner
- attempt_200901251629_0009_m_000004_0 Child Error
java.io.IOException: Task process exit with nonzero status of 143.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
2009-01-25 17:47:23.360 WARN Thread-139 org.apache.hadoop.mapred.TaskTracker
- Error from unknown child task: attempt_200901251629_0009_m_000004_0.
Ignored.
2009-01-25 17:47:23.361 WARN Thread-139 org.apache.hadoop.mapred.TaskTracker
- Unknown child task finshed: attempt_200901251629_0009_m_000004_0. Ignored.
2009-01-25 17:47:23.445 WARN Thread-149 org.apache.hadoop.mapred.TaskRunner
- attempt_200901251629_0010_m_000004_0 Child Error
java.io.IOException: Task process exit with nonzero status of 143.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
2009-01-25 17:47:23.445 WARN Thread-149 org.apache.hadoop.mapred.TaskTracker
- Error from unknown child task: attempt_200901251629_0010_m_000004_0.
Ignored.
2009-01-25 17:47:23.446 WARN Thread-149 org.apache.hadoop.mapred.TaskTracker
- Unknown child task finshed: attempt_200901251629_0010_m_000004_0. Ignored.
2009-01-25 17:47:23.463 WARN Thread-109 org.apache.hadoop.mapred.TaskRunner
- attempt_200901251629_0005_m_000007_0 Child Error
java.io.IOException: Task process exit with nonzero status of 143.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
2009-01-25 17:47:23.464 WARN Thread-109 org.apache.hadoop.mapred.TaskTracker
- Error from unknown child task: attempt_200901251629_0005_m_000007_0.
Ignored.
2009-01-25 17:47:23.465 WARN Thread-109 org.apache.hadoop.mapred.TaskTracker
- Unknown child task finshed: attempt_200901251629_0005_m_000007_0. Ignored.
2009-01-25 17:48:15.315 INFO main org.apache.hadoop.mapred.TaskTracker -
LaunchTaskAction: attempt_200901251629_0012_m_000004_0
2009-01-25 17:48:28.088 INFO main org.apache.hadoop.mapred.MapTask -
numReduceTasks: 0
2009-01-25 17:48:34.097 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.07784176% 347 docs, 0 errors, 69.4 docs/s
2009-01-25 17:48:37.099 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.13955688% 673 docs, 0 errors, 84.1 docs/s
2009-01-25 17:48:40.100 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.20145722% 1003 docs, 0 errors, 91.2 docs/s
2009-01-25 17:48:43.102 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.26416454% 1343 docs, 0 errors, 95.9 docs/s
2009-01-25 17:48:46.104 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.32754657% 1673 docs, 0 errors, 98.4 docs/s
2009-01-25 17:48:49.105 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.39229122% 2025 docs, 0 errors, 101.3 docs/s
2009-01-25 17:48:52.106 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.44708738% 2315 docs, 0 errors, 100.7 docs/s
2009-01-25 17:48:55.108 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.5117659% 2658 docs, 0 errors, 102.2 docs/s
2009-01-25 17:48:58.109 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.5795401% 3019 docs, 0 errors, 104.1 docs/s
2009-01-25 17:49:01.111 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.6468777% 3377 docs, 0 errors, 105.5 docs/s
2009-01-25 17:49:04.112 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.71514136% 3741 docs, 0 errors, 106.9 docs/s
2009-01-25 17:49:07.114 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.78037554% 4094 docs, 0 errors, 107.7 docs/s
2009-01-25 17:49:10.115 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.84660184% 4448 docs, 0 errors, 108.5 docs/s
2009-01-25 17:49:13.117 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.9146803% 4810 docs, 0 errors, 109.3 docs/s
2009-01-25 17:49:16.118 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
0.9521195% 5044 docs, 0 errors, 107.3 docs/s
2009-01-25 17:49:18.626 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
1.0% 5364 docs, 0 errors, 107.3 docs/s
2009-01-25 17:49:18.631 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - Task
attempt_200901251629_0012_m_000004_0 is done.
2009-01-25 17:49:18.631 INFO main org.apache.hadoop.mapred.TaskRunner - Task
'attempt_200901251629_0012_m_000004_0' done.
2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
1.0% 5364 docs, 0 errors, 107.3 docs/s
2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_000004_0
Ignoring status-update since task is 'done'
2009-01-25 17:49:35.582 INFO taskCleanup
org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for job:
job_200901251629_0012
2009-01-25 17:49:35.582 INFO taskCleanup org.apache.hadoop.mapred.TaskRunner
- attempt_200901251629_0012_m_000004_0 done; removing files.

# Still processes left even though the TaskTracker said: Received
'KillJobAction' for job: job_200901251629_0012
root@mapreduce2:~# jps
10732 Jps
10634 TaskTracker$Child
8660 DataNode
8824 TaskTracker
8730 SecondaryNameNode
25060 Bootstrap
root@mapreduce2:~# date
Sun Jan 25 17:51:48 CET 2009
root@mapreduce2:~#

On Sun, Jan 25, 2009 at 5:42 PM, Marcus Herou <ma...@tailsweep.com>wrote:

> Hi.
>
> Today I noticed when I ran a Solr Indexing job through our Hadoop cluster
> that the master MySQL database where screaming about "Too Many Connections".
>
> I wondered how that could happen so I logged into my Hadoop machines and
> searched through the logs. Nothing strange there. Then I just did a jps:
>
> root@mapreduce1:~# jps
> 10701 TaskTracker$Child
> 9567 NameNode
> 5435 TaskTracker$Child
> 31801 Bootstrap
> 7349 TaskTracker$Child
> 6197 TaskTracker$Child
> 7761 TaskTracker$Child
> 10453 TaskTracker$Child
> 11232 TaskTracker$Child
> 11113 TaskTracker$Child
> 9688 DataNode
> 10877 TaskTracker$Child
> 6504 TaskTracker$Child
> 10236 TaskTracker$Child
> 9852 TaskTracker
> 6515 TaskTracker$Child
> 11396 TaskTracker$Child
> 11741 Jps
> 6191 TaskTracker$Child
> 10981 TaskTracker$Child
> 7742 TaskTracker$Child
> 5946 TaskTracker$Child
> 11315 TaskTracker$Child
> 8112 TaskTracker$Child
> 11580 TaskTracker$Child
> 11490 TaskTracker$Child
> 5687 TaskTracker$Child
> 5927 TaskTracker$Child
> 27144 WrapperSimpleApp
> 7368 TaskTracker$Child
>
> Damn! Each Child have it's own DataSource (dbcp pool) tweaked down so it
> only can have one active connection to any shard at any time.
> Background: I ran out of connections during the Christmas holidays since I
> have 60 shards (10 per MySQL machine) and each required a DB-Pool which
> allowed too many active+idle connections.
>
> Anyway I have no active jobs at the moment so the children should have died
> by themselves.
> Fortunately I have a little nice script which kills the bastards: jps
> |egrep "TaskTracker.+" | awk '{print $1}'|xargs kill
> I will probably put that in a cronjob which kills long running children...
>
> Anyway, how can this happen ? Am I doing something really stupid along the
> way ?
> Hard facts:
> Ubuntu Hardy-Heron, 2.6.24-19-server
> java version "1.6.0_06"
> Hadoop-0.18.2
> It's my own classes which fires the jobs through JobClient
> (JobClient.runJob(job))
> I feed the jar to hadoop by issuing: job.setJar(jarFile); (comes from a
> bash script)
> I feed deps into hadoop by issuing: job.set("tmpjars", jarFiles); (comes by
> parsing external CLASSPATH ENV in bash)
>
> The client do not complain, se example output below (I write no data to
> HDFS ((HDFS bytes written=774)), since I mostly use it for crawling and all
> crawlers/indexers access my sharding db structure directly without
> intermediate storage):
> 2009-01-25 17:12:11.175 INFO main org.apache.hadoop.mapred.FileInputFormat
> - Total input paths to process : 1
> 2009-01-25 17:12:11.176 INFO main org.apache.hadoop.mapred.FileInputFormat
> - Total input paths to process : 1
> 2009-01-25 17:12:11.437 INFO main org.apache.hadoop.mapred.JobClient -
> Running job: job_200901251629_0011
> 2009-01-25 17:12:12.439 INFO main org.apache.hadoop.mapred.JobClient -  map
> 0% reduce 0%
> 2009-01-25 17:12:35.481 INFO main org.apache.hadoop.mapred.JobClient -  map
> 6% reduce 0%
> 2009-01-25 17:12:40.493 INFO main org.apache.hadoop.mapred.JobClient -  map
> 21% reduce 0%
> 2009-01-25 17:12:45.502 INFO main org.apache.hadoop.mapred.JobClient -  map
> 31% reduce 0%
> 2009-01-25 17:12:50.511 INFO main org.apache.hadoop.mapred.JobClient -  map
> 51% reduce 0%
> 2009-01-25 17:12:55.520 INFO main org.apache.hadoop.mapred.JobClient -  map
> 67% reduce 0%
> 2009-01-25 17:13:00.533 INFO main org.apache.hadoop.mapred.JobClient -  map
> 72% reduce 0%
> 2009-01-25 17:13:05.543 INFO main org.apache.hadoop.mapred.JobClient -  map
> 84% reduce 0%
> 2009-01-25 17:13:10.552 INFO main org.apache.hadoop.mapred.JobClient -  map
> 95% reduce 0%
> 2009-01-25 17:13:15.560 INFO main org.apache.hadoop.mapred.JobClient -  map
> 98% reduce 0%
> 2009-01-25 17:13:20.568 INFO main org.apache.hadoop.mapred.JobClient - Job
> complete: job_200901251629_0011
> 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> Counters: 7
> 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> File Systems
> 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> HDFS bytes read=2741143
> 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> HDFS bytes written=774
> 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> Job Counters
> 2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
> Rack-local map tasks=9
> 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> Launched map tasks=9
> 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> Map-Reduce Framework
> 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> Map input records=48314
> 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> Map input bytes=2732424
> 2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
> Map output records=0
>
> Any suggestions or pointers would be greatly appreciated. Hmm Coming to
> think about something. I start X threads from inside Hadoop almost cut'n
> pasted from Nutch.
> If a thread somehow would linger, would Hadoop not be able to shutdown even
> though there is nothing more to read from the RecordReader ?
>
> Kindly
>
> //Marcus
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/