Posted to common-dev@hadoop.apache.org by "Christian Kunz (JIRA)" <ji...@apache.org> on 2007/10/18 20:33:51 UTC

[jira] Created: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

extensive map tasks failures because of SocketTimeoutException during statusUpdate
----------------------------------------------------------------------------------

                 Key: HADOOP-2076
                 URL: https://issues.apache.org/jira/browse/HADOOP-2076
             Project: Hadoop
          Issue Type: Bug
    Affects Versions: 0.16.0
         Environment: Oct 17 #718 nightly build with patches 2033 and 2048
            Reporter: Christian Kunz


A job with 3600 tasks on a cluster of 1350 nodes (up to 3 tasks per node) shows extensive map task failures because of connection timeouts at the end of the task (the C++ application using the pipes interface completed successfully).
More than 600 tasks failed, slowing down the job because of retries. Only a portion of the tasks fail because of the timeout issue, but they spawn further failures because retries and speculatively executed tasks cannot even get a connection and fail after just a few seconds.

The JobTracker is running with 60 handlers. We allow up to 10 attempts per map.
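
For reference, assuming the stock property names of this Hadoop version, those two settings correspond to the following hadoop-site.xml entries (the values shown are ours):

  <property>
    <name>mapred.job.tracker.handler.count</name>
    <value>60</value>
  </property>
  <property>
    <name>mapred.map.max.attempts</name>
    <value>10</value>
  </property>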

I attach the log of a task that failed because of the timeout (including a thread dump) and the log of a task that could not start at all.

2007-10-18 15:58:41,743 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2007-10-18 15:58:41,827 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 3600
2007-10-18 16:12:28,918 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2007-10-18 16:12:28,920 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2007-10-18 17:43:00,785 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
	at org.apache.hadoop.ipc.Client.call(Client.java:484)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
	at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
	at java.lang.Thread.run(Thread.java:619)

2007-10-18 17:44:03,833 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
	at org.apache.hadoop.ipc.Client.call(Client.java:484)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
	at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
	at java.lang.Thread.run(Thread.java:619)

2007-10-18 17:45:06,838 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
	at org.apache.hadoop.ipc.Client.call(Client.java:484)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
	at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
	at java.lang.Thread.run(Thread.java:619)

2007-10-18 17:45:40,258 INFO org.apache.hadoop.mapred.TaskRunner: Process Thread Dump: Communication exception
8 active threads
Thread 13 (Comm thread for task_200710172336_0016_m_000071_0):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 4128
  Stack:
    sun.management.ThreadImpl.getThreadInfo0(Native Method)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
    org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:114)
    org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:162)
    org.apache.hadoop.mapred.Task$1.run(Task.java:315)
    java.lang.Thread.run(Thread.java:619)
Thread 12 (org.apache.hadoop.dfs.DFSClient$LeaseChecker@141b571):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 6403
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
    java.lang.Thread.run(Thread.java:619)
Thread 9 (IPC Client connection to /127.0.0.1:49458):
  State: RUNNABLE
  Blocked count: 21
  Waited count: 2063
  Stack:
    java.net.SocketInputStream.socketRead0(Native Method)
    java.net.SocketInputStream.read(SocketInputStream.java:129)
    java.io.FilterInputStream.read(FilterInputStream.java:116)
    org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:181)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    java.io.DataInputStream.readInt(DataInputStream.java:370)
    org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)
Thread 8 (org.apache.hadoop.io.ObjectWritable Connection Culler):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 6402
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)
Thread 4 (Signal Dispatcher):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
Thread 3 (Finalizer):
  State: WAITING
  Blocked count: 398
  Waited count: 2270
  Waiting on java.lang.ref.ReferenceQueue$Lock@c278b5
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
    java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
Thread 2 (Reference Handler):
  State: WAITING
  Blocked count: 257
  Waited count: 2269
  Waiting on java.lang.ref.Reference$Lock@1c66ec7
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
Thread 1 (main):
  State: RUNNABLE
  Blocked count: 1
  Waited count: 10
  Stack:
    java.io.FileInputStream.readBytes(Native Method)
    java.io.FileInputStream.read(FileInputStream.java:199)
    org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:105)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    java.io.DataInputStream.read(DataInputStream.java:132)
    org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:378)
    org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:200)
    org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:234)
    org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
    org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
    org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:157)
    org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:378)
    org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java:359)
    org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.seek(ChecksumFileSystem.java:254)
    org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:37)
    org.apache.hadoop.io.SequenceFile$Reader.seek(SequenceFile.java:1793)
    org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1217)
    org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1142)

2007-10-18 17:45:40,258 WARN org.apache.hadoop.mapred.TaskRunner: Last retry, killing task_200710172336_0016_m_000071_0


Log of task that could not start:
2007-10-18 17:43:55,766 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 1 time(s).
2007-10-18 17:43:56,768 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 2 time(s).
2007-10-18 17:43:57,770 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 3 time(s).
2007-10-18 17:43:58,772 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 4 time(s).
2007-10-18 17:43:59,774 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 5 time(s).
2007-10-18 17:44:00,776 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 6 time(s).
2007-10-18 17:44:01,778 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 7 time(s).
2007-10-18 17:44:02,780 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 8 time(s).
2007-10-18 17:44:03,783 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 9 time(s).
2007-10-18 17:44:04,785 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 10 time(s).
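
The 10 connection attempts above match the stock default of ipc.client.connect.max.retries. Raising it would only paper over the startup failures, but for completeness the knob looks like this in hadoop-site.xml:

  <property>
    <name>ipc.client.connect.max.retries</name>
    <value>30</value>  <!-- illustrative value; the default is 10 -->
  </property>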

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Paranjpye updated HADOOP-2076:
-------------------------------------

    Fix Version/s: 0.15.1

> extensive map tasks failures because of SocketTimeoutException during statusUpdate
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-2076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2076
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>         Environment: Oct 17 #718 nightly build with patches 2033 and 2048
>            Reporter: Christian Kunz
>            Priority: Blocker
>             Fix For: 0.15.1


[jira] Commented: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540257 ] 

Christian Kunz commented on HADOOP-2076:
----------------------------------------

A couple of days ago I ran the job successfully with 20 (instead of 30) Jetty threads, although with a long tail because of the socket connection timeouts.
To save time I would rather not repeat that job right now, because we have made some progress in a batch of jobs and are further down the line.
Once we hit the problem again (it requires a job with substantial intermediate output), I will try your suggestion.



[jira] Commented: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539742 ] 

Christian Kunz commented on HADOOP-2076:
----------------------------------------

I looked at the system resources while a node exhibits the timeout problem. At the end of the map phase, nodes that are just finishing their map tasks can be completely overwhelmed: the final merge of map output takes on the order of an hour in our case, so there is a good chance that all map tasks on a node do their merging at the same time, while reduce tasks are already fetching remote map output. We see a load factor > 10, several processes at relatively high CPU, and some disks at high utilization, with at least one at 100%. No wonder we get socket timeouts.

Making tasks retry on socket connection timeouts would only be a band-aid. Rather, the TaskTracker should monitor system resources and throttle the reduce tasks (and maybe even the map tasks) during contention.
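
As an illustration of the throttling idea, a minimal sketch (hypothetical, not TaskTracker code; the class name, threshold, and backoff values are invented) could poll the JVM's system load average before admitting more work:

  import java.lang.management.ManagementFactory;
  import java.lang.management.OperatingSystemMXBean;

  // Hypothetical sketch of load-based throttling; not part of Hadoop.
  public class LoadThrottle {
    private static final double MAX_LOAD = 8.0;   // assumed saturation threshold
    private static final long BACKOFF_MS = 5000;  // assumed backoff interval

    /** Block until the host's load average drops below the threshold. */
    public static void awaitCapacity() throws InterruptedException {
      OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
      // getSystemLoadAverage() returns a negative value where unsupported,
      // in which case the loop never runs and we do not throttle at all.
      while (os.getSystemLoadAverage() > MAX_LOAD) {
        Thread.sleep(BACKOFF_MS);
      }
    }
  }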



[jira] Commented: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536082 ] 

Christian Kunz commented on HADOOP-2076:
----------------------------------------

Owen, I am not convinced that this is related to HADOOP-2070.

As a matter of fact, as far as I can see the original patch for HADOOP-1788 flushed the buffer when the application is done, but missed two cases: when the application does not use a Java InputFormat, and when it gets aborted. That is why the first 3 of the 4 unit tests were successful; the 4th unit test does not use a Java InputFormat and therefore failed. Without any flushing at all, I would have expected every unit test to fail.

Our application uses a Java InputFormat and has its input flushed (it actually completes). Also, the symptom of missing flushing would be that the task itself times out.

The current issue seems to be related to bad communication between the task and the TaskTracker. I checked the TaskTracker log around the same time and found that it was running out of Jetty threads:

2007-10-18 19:27:43,719 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on SocketListener0@0.0.0.0:50060

2007-10-18 19:27:44,432 INFO org.apache.hadoop.mapred.TaskTracker: task_200710172336_0016_m_000045_1 <status msg>
2007-10-18 19:27:44,432 INFO org.apache.hadoop.mapred.TaskTracker: task_200710172336_0016_r_000889_1 <statusmsg> > reduce
2007-10-18 19:27:44,437 INFO org.apache.hadoop.mapred.TaskTracker: task_200710172336_0016_r_003314_0 0.70760524% reduce > reduce
2007-10-18 19:27:44,997 INFO org.apache.hadoop.mapred.TaskTracker: task_200710172336_0016_r_003312_0 0.6736239% reduce > reduce
2007-10-18 19:27:45,430 INFO org.apache.hadoop.mapred.TaskTracker: task_200710172336_0016_m_001001_1 0<statusmsg>

2007-10-18 19:27:45,498 WARN org.mortbay.http.SocketListener: OUT OF THREADS: SocketListener0@0.0.0.0:50060

2007-10-18 19:27:47,436 INFO org.apache.hadoop.mapred.TaskTracker: task_200710172336_0016_m_000045_1 <statusmsg>
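
The "(40-40+0)<1" in the LOW ON THREADS message is Jetty reporting that all 40 listener threads are busy. If I read hadoop-default.xml correctly, that pool is sized by tasktracker.http.threads (default 40), so one knob, at the cost of more memory per TaskTracker, would be:

  <property>
    <name>tasktracker.http.threads</name>
    <value>80</value>  <!-- illustrative value; the default is 40 -->
  </property>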




[jira] Commented: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541849 ] 

Christian Kunz commented on HADOOP-2076:
----------------------------------------

We changed the maximum number of tasks per node to 2, but currently scaling issues with the NameNode are obscuring the issue here (also, we still run with 20 Jetty threads).

It would be nice if the user were not forced to reduce the number of tasks when the node could handle as many simultaneous applications as it has CPUs, just not the post-processing inside the framework (simultaneous merging of map output and remote fetching by reduce tasks). Would it be possible to optionally let the reduce tasks hold off fetching while the map tasks do their final merge, and/or to serialize the final merges of the map tasks, but only when resource contention is detected?
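
To make the serialization idea concrete, here is a minimal sketch (hypothetical, not Hadoop API; the class name is invented, and the gate would have to live in a per-node daemon such as the TaskTracker):

  import java.util.concurrent.Semaphore;

  // Hypothetical sketch: allow at most one final merge per node at a time.
  public class MergeGate {
    private static final Semaphore MERGE_SLOT = new Semaphore(1, true); // fair FIFO

    public static void runFinalMerge(Runnable merge) throws InterruptedException {
      MERGE_SLOT.acquire();    // wait for the node-wide merge slot
      try {
        merge.run();           // the actual final merge work
      } finally {
        MERGE_SLOT.release();  // free the slot for the next map task
      }
    }
  }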



[jira] Updated: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-2076:
-----------------------------------

    Component/s: mapred
       Priority: Blocker  (was: Major)



[jira] Commented: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536011 ] 

Owen O'Malley commented on HADOOP-2076:
---------------------------------------

Christian, please try a build with HADOOP-2070, which was committed last night. I suspect you are seeing the same problem.



[jira] Updated: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nigel Daley updated HADOOP-2076:
--------------------------------

    Fix Version/s:     (was: 0.15.1)
                   0.16.0
         Priority: Major  (was: Blocker)

Determined not to be a blocker. Assigning to 0.16.



[jira] Commented: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542283 ] 

Christian Kunz commented on HADOOP-2076:
----------------------------------------

My apologies for the late response.
We will try to get by with using more mappers, each producing less output.
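
To make this workaround concrete: it amounts to raising the mapred.map.tasks hint so the input is split into more, smaller maps, each writing less output. A minimal sketch of a Java driver doing this with the JobConf API of this era (the count of 10000 and the argument paths are placeholders, not values from this job; a pipes job would set the same property in its configuration instead):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MoreMapsDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MoreMapsDriver.class);
        conf.setJobName("more-maps-less-output");
        // Hint for more, smaller map tasks, so each one writes less
        // output and spends less time in its final status updates.
        // This is only a hint; the actual count depends on the splits.
        conf.setNumMapTasks(10000);
        conf.setNumReduceTasks(3600);
        // Usual job wiring (mapper/reducer classes, formats) goes here;
        // the input and output paths below are placeholders.
        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(new Path(args[1]));
        JobClient.runJob(conf);
      }
    }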



[jira] Commented: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542068 ] 

Devaraj Das commented on HADOOP-2076:
-------------------------------------

Christian, since this issue doesn't break anything and, as you pointed out, things work with a different configuration, I think that we should downgrade the Priority to 'Major'. The fix for this issue seems to involve quite a lot of benchmarking work and IMO should not block a release.



[jira] Commented: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541701 ] 

Devaraj Das commented on HADOOP-2076:
-------------------------------------

Some thoughts: if we add task monitoring/throttling, the tasktracker can monitor its running tasks and, based on the load on system resources, decide whether or not to ask for a new task (as part of the heartbeat). However, for your case (and for tasktrackers running long-running tasks in general), this may or may not work, since the system only goes into that state towards the middle of the tasks' lives. The remaining options there are to kill some currently running task(s) or to reduce the (nice) priority of some. This runtime tuning could be made an optional feature, since the speculation/analysis it involves might adversely affect the performance of some jobs.

The other option, as I commented earlier, is to control the number of running tasks per job on a tasktracker. That could be configured by the user, and it would determine whether a new task for that job is assigned to a tasktracker when the latter asks for one. The runtime tuning is probably the best approach, but the per-job cap is simpler and should generally help (and maybe we can implement both).

Thoughts?
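
To make the simpler second option concrete, here is a rough sketch of the admission check a tasktracker could run before asking for another task in its heartbeat. Every name here (TaskAdmission, maxRunningTasksPerJob, maxLoadAverage) is hypothetical and only illustrates the two options above; none of it is existing Hadoop API or configuration:

    import java.util.HashMap;
    import java.util.Map;

    /**
     * Hypothetical admission check combining the two options discussed
     * above: a runtime throttle on observed system load, and a static,
     * user-configurable cap on running tasks per job per tasktracker.
     */
    public class TaskAdmission {
      private final int maxRunningTasksPerJob; // per-job cap (option 2)
      private final double maxLoadAverage;     // load throttle (option 1)

      public TaskAdmission(int maxRunningTasksPerJob, double maxLoadAverage) {
        this.maxRunningTasksPerJob = maxRunningTasksPerJob;
        this.maxLoadAverage = maxLoadAverage;
      }

      /**
       * Decide, as part of the heartbeat, whether to ask for a new task.
       *
       * @param runningPerJob running-task counts on this node, keyed by job id
       * @param jobId         job the JobTracker would like to schedule here
       * @param loadAverage   current 1-minute load average of the node
       */
      public boolean canAcceptTask(Map<String, Integer> runningPerJob,
                                   String jobId, double loadAverage) {
        if (loadAverage > maxLoadAverage) {
          return false; // node already overloaded; skip this heartbeat
        }
        Integer running = runningPerJob.get(jobId);
        int count = (running == null) ? 0 : running.intValue();
        return count < maxRunningTasksPerJob;
      }

      // Tiny usage example: the job already has 2 running tasks on this
      // node, so with a cap of 2 the check prints "false".
      public static void main(String[] args) {
        TaskAdmission admission = new TaskAdmission(2, 8.0);
        Map<String, Integer> running = new HashMap<String, Integer>();
        running.put("job_200710172336_0016", 2);
        System.out.println(
            admission.canAcceptTask(running, "job_200710172336_0016", 3.5));
      }
    }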

> extensive map tasks failures because of SocketTimeoutException during statusUpdate
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-2076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2076
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>         Environment: Oct 17 #718 nightly build with patches 2033 and 2048
>            Reporter: Christian Kunz
>            Priority: Blocker
>             Fix For: 0.15.1
>
>
> A job with 3600 tasks on a cluster of 1350 nodes (up 3 tasks per node) shows extensive map tasks failures because of connection timeouts at the end of the task (c++ application using pipes interface completed successfully)
> More than 600 tasks failed, slowing down the job because of retries. Only a portion of the tasks fail because of the timeout issue, but they spawn other failures because retries and speculatively executed tasks cannot even get  a connection and fail just after a few seconds.
> JobTracker is running with 60 handlers. We allow up to 10 attempts for maps.
> I attach the log of a task failing because of timeout (which includes a thread dump), and the log of one task which could not start.
> 2007-10-18 15:58:41,743 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> 2007-10-18 15:58:41,827 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 3600
> 2007-10-18 16:12:28,918 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> 2007-10-18 16:12:28,920 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2007-10-18 17:43:00,785 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
> 	at org.apache.hadoop.ipc.Client.call(Client.java:484)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> 	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
> 	at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> 	at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:44:03,833 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
> 	at org.apache.hadoop.ipc.Client.call(Client.java:484)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> 	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
> 	at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> 	at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:06,838 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
> 	at org.apache.hadoop.ipc.Client.call(Client.java:484)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> 	at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
> 	at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> 	at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:40,258 INFO org.apache.hadoop.mapred.TaskRunner: Process Thread Dump: Communication exception
> 8 active threads
> Thread 13 (Comm thread for task_200710172336_0016_m_000071_0):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 4128
>   Stack:
>     sun.management.ThreadImpl.getThreadInfo0(Native Method)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>     org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:114)
>     org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:162)
>     org.apache.hadoop.mapred.Task$1.run(Task.java:315)
>     java.lang.Thread.run(Thread.java:619)
> Thread 12 (org.apache.hadoop.dfs.DFSClient$LeaseChecker@141b571):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 6403
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
>     java.lang.Thread.run(Thread.java:619)
> Thread 9 (IPC Client connection to /127.0.0.1:49458):
>   State: RUNNABLE
>   Blocked count: 21
>   Waited count: 2063
>   Stack:
>     java.net.SocketInputStream.socketRead0(Native Method)
>     java.net.SocketInputStream.read(SocketInputStream.java:129)
>     java.io.FilterInputStream.read(FilterInputStream.java:116)
>     org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:181)
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     java.io.DataInputStream.readInt(DataInputStream.java:370)
>     org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)
> Thread 8 (org.apache.hadoop.io.ObjectWritable Connection Culler):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 6402
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)
> Thread 4 (Signal Dispatcher):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
> Thread 3 (Finalizer):
>   State: WAITING
>   Blocked count: 398
>   Waited count: 2270
>   Waiting on java.lang.ref.ReferenceQueue$Lock@c278b5
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>     java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> Thread 2 (Reference Handler):
>   State: WAITING
>   Blocked count: 257
>   Waited count: 2269
>   Waiting on java.lang.ref.Reference$Lock@1c66ec7
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> Thread 1 (main):
>   State: RUNNABLE
>   Blocked count: 1
>   Waited count: 10
>   Stack:
>     java.io.FileInputStream.readBytes(Native Method)
>     java.io.FileInputStream.read(FileInputStream.java:199)
>     org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:105)
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>     java.io.DataInputStream.read(DataInputStream.java:132)
>     org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:378)
>     org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:200)
>     org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:234)
>     org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
>     org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
>     org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:157)
>     org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:378)
>     org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java:359)
>     org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.seek(ChecksumFileSystem.java:254)
>     org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:37)
>     org.apache.hadoop.io.SequenceFile$Reader.seek(SequenceFile.java:1793)
>     org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1217)
>     org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1142)
> 2007-10-18 17:45:40,258 WARN org.apache.hadoop.mapred.TaskRunner: Last retry, killing task_200710172336_0016_m_000071_0
> Log of task that could not start:
> 2007-10-18 17:43:55,766 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 1 time(s).
> 2007-10-18 17:43:56,768 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 2 time(s).
> 2007-10-18 17:43:57,770 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 3 time(s).
> 2007-10-18 17:43:58,772 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 4 time(s).
> 2007-10-18 17:43:59,774 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 5 time(s).
> 2007-10-18 17:44:00,776 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 6 time(s).
> 2007-10-18 17:44:01,778 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 7 time(s).
> 2007-10-18 17:44:02,780 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 8 time(s).
> 2007-10-18 17:44:03,783 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 9 time(s).
> 2007-10-18 17:44:04,785 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 10 time(s).
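
The Communication exceptions in the first log above are spaced roughly 63 seconds apart, which is consistent with the default 60-second IPC call timeout plus a short gap between attempts. As a stopgap while the contention is diagnosed, the task-side RPC timeout can be raised in hadoop-site.xml; a hedged example (ipc.client.timeout and its 60000 ms default are taken from the hadoop-default.xml of this era):

  <property>
    <name>ipc.client.timeout</name>
    <value>120000</value>
    <description>Milliseconds the IPC client waits for an RPC response
    before throwing SocketTimeoutException (default 60000).</description>
  </property>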

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-2076:
-----------------------------------


With input from Owen and Devaraj, I will try using fewer Jetty threads to reduce contention and see whether that helps.
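
For reference, the Jetty pool on each TaskTracker is sized by tasktracker.http.threads (default 40); a minimal hadoop-site.xml override, assuming that era's property name, would be:

  <property>
    <name>tasktracker.http.threads</name>
    <value>30</value>
    <description>Number of Jetty worker threads the TaskTracker uses
    to serve map output to reducers (default 40).</description>
  </property>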



[jira] Commented: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540089 ] 

Devaraj Das commented on HADOOP-2076:
-------------------------------------

Christian, for this job, could you please set the tasks per node to 2 and see whether it improves the situation?
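
A hedged example of that override, assuming the pre-split property name mapred.tasktracker.tasks.maximum from this era:

  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>2</value>
    <description>Maximum number of tasks run simultaneously on one
    TaskTracker (the cluster above allowed 3).</description>
  </property>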



[jira] Commented: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during statusUpdate

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539627 ] 

Christian Kunz commented on HADOOP-2076:
----------------------------------------

Even with 30 Jetty threads (instead of the default 40) this still happens frequently, generating a very long tail for the job. I noticed that during that time the nodes running map tasks are disk-bound.
I will now try 20 Jetty threads, but the protocol between Task and TaskTracker should be fixed to use retries.
I am making this a blocker because it slows us down tremendously.
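
A rough sketch of the kind of retry the umbilical call could use (illustrative only; the StatusUpdater interface and the 3-attempt backoff policy are assumptions, not the actual TaskUmbilicalProtocol API):

  import java.io.IOException;
  import java.net.SocketTimeoutException;

  /** Hypothetical stand-in for the umbilical call that times out. */
  interface StatusUpdater {
    void statusUpdate(String taskId) throws IOException;
  }

  class RetryingStatusReporter {
    private static final int MAX_ATTEMPTS = 3;    // assumed policy
    private static final long BACKOFF_MS = 1000L; // assumed policy

    private final StatusUpdater umbilical;

    RetryingStatusReporter(StatusUpdater umbilical) {
      this.umbilical = umbilical;
    }

    /** Retries transient timeouts instead of failing on the first one. */
    void statusUpdate(String taskId) throws IOException {
      for (int attempt = 1; ; attempt++) {
        try {
          umbilical.statusUpdate(taskId); // the RPC that currently times out
          return;
        } catch (SocketTimeoutException e) {
          if (attempt >= MAX_ATTEMPTS) {
            throw e; // only now should the failure count against the task
          }
          try {
            Thread.sleep(BACKOFF_MS * attempt); // linear backoff between tries
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting to retry");
          }
        }
      }
    }
  }

Retrying a few times before surfacing the exception would keep one slow TaskTracker from burning one of the 10 allowed map attempts.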
