Posted to common-dev@hadoop.apache.org by "Jason (JIRA)" <ji...@apache.org> on 2008/08/21 23:19:44 UTC

[jira] Created: (HADOOP-3994) There is little information provided when the TaskTracker kills a Task that has not reported with the timeout (600 sec) interval - this patch provides a stack trace of the task

There is little information provided when the TaskTracker kills a Task that has not reported with the timeout (600 sec) interval - this patch provides a stack trace of the task 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                 Key: HADOOP-3994
                 URL: https://issues.apache.org/jira/browse/HADOOP-3994
             Project: Hadoop Core
          Issue Type: New Feature
          Components: mapred
    Affects Versions: 0.16.0
            Reporter: Jason
            Priority: Minor
         Attachments: 0.16_patch

When a task is killed for not reporting, sometimes there is an obvious programming error, and sometimes the reason the task didn't report is unclear.
This patch will cause the TaskTracker to try to generate a stack trace of the offending task before the task is killed.
Given how opaque process control is in Java, a program is run to generate the stack trace, using the PID extracted from the undocumented UNIXProcess class.

The attached patch is against 0.16.0, as that is the release we use.
This will only work on Unix machines -- or rather, on JVMs that use the java.lang.UNIXProcess implementation for the Java Process object.
The script that generates the stack trace is very Linux-specific.
The code changes will run without failure on JVMs where the UNIXProcess class is not available, but no stack trace will be generated.
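
The PID lookup described above can be sketched roughly as follows. This is not the code from the attached patch; the class name ProcessPidSketch and the method pidOf are hypothetical, and the sketch assumes that the JVM's Process implementation keeps the PID in a private field named pid, as Sun's UNIXProcess does. Any failure, including a SecurityException, is turned into a return value of -1 so that the caller can simply skip the stack trace.

```java
import java.lang.reflect.Field;

/** Hypothetical helper for illustration; not the class from the attached patch. */
final class ProcessPidSketch {
    private ProcessPidSketch() {}

    /**
     * Best-effort PID lookup by reflection.
     * Assumption: the runtime class of the Process declares a numeric field named "pid".
     * Returns -1 whenever the PID cannot be obtained, and never throws.
     */
    static long pidOf(Process process) {
        if (process == null) {
            return -1L;
        }
        try {
            Field pidField = process.getClass().getDeclaredField("pid");
            pidField.setAccessible(true);
            Object value = pidField.get(process);
            return (value instanceof Number) ? ((Number) value).longValue() : -1L;
        } catch (ReflectiveOperationException | RuntimeException e) {
            // No such field, access denied, SecurityException, or (on Java 9+)
            // InaccessibleObjectException: report "unknown" instead of failing.
            return -1L;
        }
    }
}
```

A caller would treat -1 as "no stack trace available" and go ahead with the kill as before. On Java 9 and later, Process.pid() is part of the public API and makes this kind of reflection unnecessary.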


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3994) There is little information provided when the TaskTracker kills a Task that has not reported with the timeout (600 sec) interval - this patch provides a stack trace of the task

Posted by "Jason (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason updated HADOOP-3994:
--------------------------

    Attachment: 0.16_patch

Patch against 0.16



[jira] Commented: (HADOOP-3994) There is little information provided when the TaskTracker kills a Task that has not reported within the timeout (600 sec) interval - this patch provides a stack trace of the task

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624500#action_12624500 ] 

Steve Loughran commented on HADOOP-3994:
----------------------------------------

This could be really useful; anything to get the PID of a forked process would be handy. As you note, UNIXProcess is undocumented and only likely to surface on Sun-derived JVMs; the other risk is instability of their private code. But it would be useful in other places in the Apache portfolio.

* All code that deals with this class should live outside TaskRunner, in a separate class for use on demand.
* The class should include a check that reports whether the operation is going to work.
* To test, fork a process that sleeps for 30s or so and, before that sleep has finished, try to get a stack dump.
* I could imagine a kill() method being useful too.





[jira] Commented: (HADOOP-3994) There is little information provided when the TaskTracker kills a Task that has not reported within the timeout (600 sec) interval - this patch provides a stack trace of the task

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624651#action_12624651 ] 

Steve Loughran commented on HADOOP-3994:
----------------------------------------

Jason> I suspect having an equivalent for this would be straightforward, but I don't currently develop under Windows, so I didn't try to implement one.

There is no direct equivalent to kill -QUIT on Windows, I think. And it makes testing harder. Probably best to stick to Unix systems.
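
For reference, a minimal sketch of the Unix-only approach: once a PID is known, sending SIGQUIT with the kill command asks a HotSpot JVM to print a thread dump to its own standard output. The class and method names here (ThreadDumpSignalSketch, quitCommand, requestThreadDump) are hypothetical and are not taken from the patch, whose script may work differently. The platform check is a simple guess based on the os.name system property.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not taken from the attached patch. */
final class ThreadDumpSignalSketch {
    private ThreadDumpSignalSketch() {}

    /** Rough platform check: assume any non-Windows OS has a usable kill command. */
    static boolean isLikelySupported() {
        String os = System.getProperty("os.name", "").toLowerCase(Locale.ROOT);
        return !os.contains("windows");
    }

    /** Builds the command line that sends SIGQUIT to the given PID. */
    static List<String> quitCommand(long pid) {
        if (pid <= 0) {
            throw new IllegalArgumentException("pid must be positive: " + pid);
        }
        return Arrays.asList("kill", "-QUIT", Long.toString(pid));
    }

    /** Returns true only if kill ran and exited with status 0; never throws. */
    static boolean requestThreadDump(long pid) {
        if (pid <= 0 || !isLikelySupported()) {
            return false;
        }
        try {
            Process kill = new ProcessBuilder(quitCommand(pid)).inheritIO().start();
            return kill.waitFor() == 0;
        } catch (IOException e) {
            return false; // kill command missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The dump is written by the target JVM, not by kill, so the caller would need to give the task a moment to write it before terminating the process.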

Because the code uses reflection, there's a risk it won't work under a security manager; the code should catch SecurityExceptions. But I'd be happier with a bit of reflection abuse than another native library.

Vinod> Also can we do anything similar to get more information when streaming/pipe tasks time out too?

It's not so easy if they are native code; they won't have Java stacks.


Vinod> As a side note, the Java getPid() bug (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4244896) is past its 9-year celebrations.

It is not alone; try searching for: happy birthday site:bugs.sun.com



[jira] Updated: (HADOOP-3994) There is little information provided when the TaskTracker kills a Task that has not reported within the timeout (600 sec) interval - this patch provides a stack trace of the task

Posted by "Jason (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason updated HADOOP-3994:
--------------------------

    Summary: There is little information provided when the TaskTracker kills a Task that has not reported within the timeout (600 sec) interval - this patch provides a stack trace of the task   (was: There is little information provided when the TaskTracker kills a Task that has not reported with the timeout (600 sec) interval - this patch provides a stack trace of the task )



[jira] Commented: (HADOOP-3994) There is little information provided when the TaskTracker kills a Task that has not reported within the timeout (600 sec) interval - this patch provides a stack trace of the task

Posted by "Jason (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624507#action_12624507 ] 

Jason commented on HADOOP-3994:
-------------------------------

The Windows JVM equivalent is java.lang.ProcessImpl, which keeps the process handle in the private long field handle.

I suspect having an equivalent for this would be straightforward, but I don't currently develop under Windows, so I didn't try to implement one.

The code currently uses reflection to get at the private variables, which is definitely fragile. I tried very hard to ensure that there would be no crashes if something was not as expected.




[jira] Commented: (HADOOP-3994) There is little information provided when the TaskTracker kills a Task that has not reported within the timeout (600 sec) interval - this patch provides a stack trace of the task

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624582#action_12624582 ] 

Vinod Kumar Vavilapalli commented on HADOOP-3994:
-------------------------------------------------

Elsewhere, on HADOOP-3581 (yet to be committed), we used a different method of obtaining the pid of the task being launched. Just before the task is launched, the launching shell prints its own pid to a pid file (echo $$ > pidfile), and then the task is exec'ed, so it keeps that pid. Later the pid file is read and the pid is used for memory management of the process. I guess the same pidfile can be used for this issue too. This method works everywhere the shell feature works.
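
The pid-file approach can be sketched as below. This is only an illustration of the idea and not the HADOOP-3581 code; the names PidFileSketch, wrapWithPidFile and readPid are hypothetical. The wrapper assumes a POSIX sh: the shell writes its own pid ($$) to the file passed as $0 and then exec's the task command passed as "$@", so the task runs under the pid that was recorded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not the HADOOP-3581 implementation. */
final class PidFileSketch {
    private PidFileSketch() {}

    /**
     * Wraps a task command so that a POSIX shell records its pid and then
     * replaces itself with the task. The pid file is passed as $0 and the
     * task command as "$@", which avoids quoting them into the script text.
     */
    static List<String> wrapWithPidFile(String pidFile, String... taskCommand) {
        List<String> command = new ArrayList<String>();
        command.add("sh");
        command.add("-c");
        command.add("echo $$ > \"$0\" && exec \"$@\"");
        command.add(pidFile);
        command.addAll(Arrays.asList(taskCommand));
        return command;
    }

    /** Reads the pid back; returns -1 if the file is missing or malformed. */
    static long readPid(Path pidFile) {
        try {
            String text = new String(Files.readAllBytes(pidFile), StandardCharsets.US_ASCII).trim();
            long pid = Long.parseLong(text);
            return pid > 0 ? pid : -1L;
        } catch (IOException | NumberFormatException e) {
            return -1L;
        }
    }
}
```

The list returned by wrapWithPidFile can be handed to ProcessBuilder directly. Unlike the reflection route, this does not depend on JVM internals, but it does require launching the task through a shell.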

But I agree in general that a getPid() method would be good to have.

bq. [..] UNIXProcess is undocumented and only likely to surface on sun-derived JVMs; the other risk is instability of their private code [..]
We can write native code to avoid the above. But yes, this implies adding another native library.

As a side note, the Java getPid() bug (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4244896) is past its 9-year celebrations :).

Also, can we do anything similar to get more information when streaming/pipe tasks time out too?
