Posted to common-dev@hadoop.apache.org by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org> on 2008/03/18 09:43:25 UTC

[jira] Created: (HADOOP-3039) Runtime exceptions not killing job

Runtime exceptions not killing job
----------------------------------

                 Key: HADOOP-3039
                 URL: https://issues.apache.org/jira/browse/HADOOP-3039
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.16.0
            Reporter: Amareshwari Sriramadasu
            Priority: Blocker
             Fix For: 0.16.2, 0.17.0


Previously, if a map or reduce task threw a runtime exception such as an NPE, the task, and ultimately the job, would fail in short order. In 0.16.0, when the reduce tasks started throwing NPEs, the tasks just hung; eventually they timed out and were killed. A task should be killed immediately if it throws an NPE.

Thread dump shows:
"DestroyJavaVM" prio=10 tid=0x0805f800 nid=0x6b5a waiting on condition [0x00000000..0xbfffcc90]
   java.lang.Thread.State: RUNNABLE

"Thread-12" prio=10 tid=0x083f1400 nid=0x6b87 in Object.wait() [0xa2f37000..0xa2f37eb0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0xa3af62a0> (a java.util.LinkedList)
	at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1680)
	- locked <0xa3af62a0> (a java.util.LinkedList)

"Comm thread for task_200803181240_0001_r_000000_0" daemon prio=10 tid=0x0841f000 nid=0x6b6f waiting on condition [0xa307c000..0xa307c130]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hadoop.mapred.Task$1.run(Task.java:283)
	at java.lang.Thread.run(Unknown Source)

"org.apache.hadoop.dfs.DFSClient$LeaseChecker@edf3f6" daemon prio=10 tid=0x083fc400 nid=0x6b6d waiting on condition [0xa30cd000..0xa30cd1b0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:626)
	at java.lang.Thread.run(Unknown Source)

"IPC Client connection to localhost/127.0.0.1:9000" daemon prio=10 tid=0x083f6800 nid=0x6b6c in Object.wait() [0xa311d000..0xa311e030]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0xa4ac0860> (a org.apache.hadoop.ipc.Client$Connection)
	at java.lang.Object.wait(Object.java:485)
	at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:247)
	- locked <0xa4ac0860> (a org.apache.hadoop.ipc.Client$Connection)
	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:286)

It looks like the task JVM is waiting for the DataStreamer thread to exit.
When I called streamer.setDaemon(true), the behavior was fine.
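
The daemon-flag behavior can be reproduced outside Hadoop with a minimal sketch. Everything in it is hypothetical and only for illustration: the class DaemonStreamerSketch, the method newStreamer, and the thread name are not from DFSClient. It assumes only standard JVM semantics: the JVM does not exit while a non-daemon thread is still alive, even after the main thread has died from an uncaught exception.

```java
import java.util.LinkedList;

class DaemonStreamerSketch {
    // Hypothetical stand-in for a streamer thread: it waits on a queue that
    // nothing ever fills, much like the waiting "Thread-12" in the dump.
    static Thread newStreamer(LinkedList<byte[]> queue, boolean daemon) {
        Thread t = new Thread(() -> {
            synchronized (queue) {
                while (queue.isEmpty()) {
                    try {
                        queue.wait(1000);
                    } catch (InterruptedException e) {
                        return; // stop when interrupted
                    }
                }
            }
        }, "streamer-sketch");
        t.setDaemon(daemon); // must be set before start()
        return t;
    }

    public static void main(String[] args) {
        boolean daemon = args.length > 0 && Boolean.parseBoolean(args[0]);
        newStreamer(new LinkedList<>(), daemon).start();
        // Simulate the task failing: the main thread dies here.
        throw new NullPointerException("simulated task failure");
    }
}
```

Run with the argument false, the process should print the stack trace and then stay alive because the non-daemon thread is still waiting. Run with true, it should exit right after the stack trace.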



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-3039) Runtime exceptions not killing job

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581797#action_12581797 ] 

Devaraj Das commented on HADOOP-3039:
-------------------------------------

This shouldn't require a test case: it only changes a DFSClient thread, and a test would involve timing issues around detecting job failures.

[jira] Updated: (HADOOP-3039) Runtime exceptions not killing job

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-3039:
--------------------------------------------

    Attachment: patch-3039.txt

Patch adding streamer.setDaemon(true) in DFSClient.

[jira] Updated: (HADOOP-3039) Runtime exceptions not killing job

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-3039:
--------------------------------------------

    Status: Patch Available  (was: Open)

[jira] Updated: (HADOOP-3039) Runtime exceptions not killing job

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-3039:
--------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Amareshwari!

[jira] Commented: (HADOOP-3039) Runtime exceptions not killing job

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581923#action_12581923 ] 

Hudson commented on HADOOP-3039:
--------------------------------

Integrated in Hadoop-trunk #441 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/441/])

[jira] Commented: (HADOOP-3039) Runtime exceptions not killing job

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581491#action_12581491 ] 

Hadoop QA commented on HADOOP-3039:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12378473/patch-3039.txt
against trunk revision 619744.

    @author +1.  The patch does not contain any @author tags.

    tests included -1.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new javac compiler warnings.

    release audit +1.  The applied patch does not generate any new release audit warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2031/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2031/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2031/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2031/console

This message is automatically generated.

[jira] Assigned: (HADOOP-3039) Runtime exceptions not killing job

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Paranjpye reassigned HADOOP-3039:
----------------------------------------

    Assignee: Amareshwari Sriramadasu

[jira] Commented: (HADOOP-3039) Runtime exceptions not killing job

Posted by "Rick Cox (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581284#action_12581284 ] 

Rick Cox commented on HADOOP-3039:
----------------------------------

We're running into this too (in streaming jobs).

Another workaround is to add a {{FileSystem.closeAll()}} to TaskTracker.java just after closing the {{metricsContext}}.

That said, my take from reading FileSystem.java is that the shutdown hook (ClientFinalizer) was intended to do this, but it isn't running because the DataStreamer is a non-daemon thread. This would imply that adding {{streamer.setDaemon(true)}} is a better solution.
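
To illustrate the shutdown-hook point, here is a minimal sketch under stated assumptions. It is not Hadoop code: ShutdownHookSketch, newCleanupHook, and the thread names are hypothetical stand-ins for ClientFinalizer and the DataStreamer. It relies only on the documented behavior of Runtime.addShutdownHook: hooks run when the JVM begins shutting down, which on normal exit happens after the last non-daemon thread has finished (or System.exit is called).

```java
import java.util.concurrent.atomic.AtomicBoolean;

class ShutdownHookSketch {
    // Hypothetical cleanup hook in the spirit of a client finalizer:
    // it records that cleanup ran and logs a line.
    static Thread newCleanupHook(AtomicBoolean closed) {
        return new Thread(() -> {
            closed.set(true);
            System.err.println("cleanup hook ran");
        }, "cleanup-hook-sketch");
    }

    public static void main(String[] args) {
        boolean daemon = args.length > 0 && Boolean.parseBoolean(args[0]);
        Runtime.getRuntime().addShutdownHook(newCleanupHook(new AtomicBoolean(false)));

        final Object lock = new Object();
        Thread worker = new Thread(() -> {
            synchronized (lock) {
                while (true) {
                    try {
                        lock.wait(); // never notified in this sketch
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        }, "worker-sketch");
        worker.setDaemon(daemon); // must be set before start()
        worker.start();
        // main returns here. With a non-daemon worker, shutdown never
        // begins, so the hook above does not get a chance to run.
    }
}
```

Run with the argument false, the process should keep running and never print the hook's message. Run with true, the JVM should shut down once main returns and the hook should run.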

[jira] Commented: (HADOOP-3039) Runtime exceptions not killing job

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581783#action_12581783 ] 

dhruba borthakur commented on HADOOP-3039:
------------------------------------------

+1 Code looks good.
