Posted to common-issues@hadoop.apache.org by "Roman Shaposhnik (JIRA)" <ji...@apache.org> on 2012/05/03 18:54:53 UTC

[jira] [Created] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Roman Shaposhnik created HADOOP-8353:
----------------------------------------

             Summary: hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop
                 Key: HADOOP-8353
                 URL: https://issues.apache.org/jira/browse/HADOOP-8353
             Project: Hadoop Common
          Issue Type: Improvement
          Components: scripts
    Affects Versions: 0.23.1
            Reporter: Roman Shaposhnik
            Assignee: Roman Shaposhnik
             Fix For: 2.0.0


The stop action is implemented as a simple SIGTERM sent to the JVM. There's a time delay between when the action is called and when the process actually exits. This can be misleading to the callers of the *-daemon.sh scripts, since they expect the stop action to return only once the process has actually stopped.

I suggest we augment the stop action with a time-delay check for the process status and a SIGKILL once the delay has expired.
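
To make the proposal concrete, here is a minimal sketch of a stop routine that sends SIGTERM, polls the process status, and falls back to SIGKILL once the delay expires. It is not the attached patch: the function name stop_with_timeout, the one-second polling interval, and the 5-second default are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the patch): stop a process by PID,
# waiting up to a timeout before escalating to SIGKILL.
stop_with_timeout() {
  local pid="$1"
  local timeout="${2:-5}"   # seconds; the default of 5 is an assumption
  local waited=0

  # Nothing to do if the process is already gone.
  if ! kill -0 "$pid" 2>/dev/null; then
    echo "no process $pid to stop"
    return 0
  fi

  # SIGTERM: ask the process to shut down gracefully.
  kill "$pid" 2>/dev/null || true

  # Poll once per second until the process exits or the timeout expires.
  while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$timeout" ]; do
    sleep 1
    waited=$((waited + 1))
  done

  if kill -0 "$pid" 2>/dev/null; then
    echo "process $pid did not stop gracefully after $timeout seconds: sending SIGKILL"
    kill -9 "$pid" 2>/dev/null || true
  else
    echo "process $pid stopped gracefully"
  fi
}
```

A caller would invoke it as, for example, stop_with_timeout "$(cat /path/to/daemon.pid)" 10, where the pid file path is a placeholder. Polling rather than sleeping for the whole timeout lets the stop action return as soon as the process is gone.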

I understand that sending SIGKILL is a measure of last resort and is generally frowned upon among init.d script writers, but the excuse we have for Hadoop is that it is engineered to be a fault-tolerant system, and thus there's no danger of putting the system into an inconsistent state by a violent SIGKILL. Of course, the time delay will be long enough to make a SIGKILL event a rare condition.

Finally, there's always the option of an exponential back-off type of solution if we decide that the SIGKILL timeout is too short.
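
One possible reading of the back-off idea, sketched under assumptions: after SIGTERM, the script re-checks the process at intervals that double each time (1, 2, 4, ... seconds) until a total budget is spent, and only then sends SIGKILL. The names backoff_delays and stop_with_backoff, the starting delay, and the 30-second budget are hypothetical and do not come from the patch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a doubling sequence of delays whose sum
# equals the budget (the last delay is truncated to fit).
backoff_delays() {
  local delay="${1:-1}"
  local budget="${2:-30}"
  local waited=0

  [ "$delay" -ge 1 ] || delay=1   # guard against a zero or negative start
  while [ "$waited" -lt "$budget" ]; do
    if [ $((waited + delay)) -gt "$budget" ]; then
      delay=$((budget - waited))
    fi
    echo "$delay"
    waited=$((waited + delay))
    delay=$((delay * 2))
  done
}

# Hypothetical helper: SIGTERM, then check with exponentially growing
# pauses, then SIGKILL if the process outlives the whole budget.
stop_with_backoff() {
  local pid="$1"
  local budget="${2:-30}"   # total seconds to wait; an assumption
  local delay

  kill "$pid" 2>/dev/null || { echo "no process $pid to stop"; return 0; }

  for delay in $(backoff_delays 1 "$budget"); do
    kill -0 "$pid" 2>/dev/null || break
    sleep "$delay"
  done

  if kill -0 "$pid" 2>/dev/null; then
    echo "process $pid still running after $budget seconds: sending SIGKILL"
    kill -9 "$pid" 2>/dev/null || true
  else
    echo "process $pid stopped"
  fi
}
```

With a budget of 10 seconds, backoff_delays 1 10 prints 1, 2, 4 and 3, one per line. Short early pauses keep the common case fast, while the growing pauses allow a generous overall budget without polling every second.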

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Aaron T. Myers (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272550#comment-13272550 ] 

Aaron T. Myers commented on HADOOP-8353:
----------------------------------------

Patch looks good to me. Roman, can you comment on what testing you did of this patch?

+1 pending Jenkins and an answer to the above question.
                

        

[jira] [Commented] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273398#comment-13273398 ] 

Hudson commented on HADOOP-8353:
--------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #2305 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2305/])
    HADOOP-8353. hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop. Contributed by Roman Shaposhnik. (Revision 1337251)

     Result = SUCCESS
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1337251
Files : 
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/bin/hadoop-daemon.sh
* /hadoop/common/trunk/hadoop-mapreduce-project/bin/mr-jobhistory-daemon.sh
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/bin/yarn-daemon.sh

                

        

[jira] [Commented] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Roman Shaposhnik (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272597#comment-13272597 ] 

Roman Shaposhnik commented on HADOOP-8353:
------------------------------------------

As far as testing goes -- I just manually replaced the existing scripts in a pseudo-distributed Hadoop deployment and ran a couple of start/stop commands.
                

        

[jira] [Commented] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Aaron T. Myers (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272448#comment-13272448 ] 

Aaron T. Myers commented on HADOOP-8353:
----------------------------------------

bq. The unfortunate truth there is that everything else in that script has a YARN_ prefix (I suspect because it was copied from the yarn-daemon.sh). I'd rather keep things consistent, but if you really think this lonely var should be MR_ prefixed – please let me know.

Got it. Makes sense the way you have it.

bq. HDFS daemons use hadoop-daemon.sh. In fact, at this point it can be safely called hdfs-daemon.sh since I don't think anything else is really using it.

Got it. Thanks for the explanation.
                

       

[jira] [Commented] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273982#comment-13273982 ] 

Hudson commented on HADOOP-8353:
--------------------------------

Integrated in Hadoop-Hdfs-trunk #1041 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1041/])
    HADOOP-8353. hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop. Contributed by Roman Shaposhnik. (Revision 1337251)

     Result = FAILURE
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1337251
Files : 
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/bin/hadoop-daemon.sh
* /hadoop/common/trunk/hadoop-mapreduce-project/bin/mr-jobhistory-daemon.sh
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/bin/yarn-daemon.sh

                

        

[jira] [Commented] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Aaron T. Myers (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271851#comment-13271851 ] 

Aaron T. Myers commented on HADOOP-8353:
----------------------------------------

Patch looks pretty good to me. A few questions:

# Perhaps we should make the message more verbose in the event we fall back to `kill -9' ? I'm thinking something along the lines of "Daemon did not stop gracefully after signaling it X seconds ago. Trying `kill -9 $TARGET_PID'"
# Does it definitely make sense to call the env var "YARN_STOP_TIMEOUT" in the MR job history server? Should it not be "MR_STOP_TIMEOUT" ?
# Do similar changes not need to be made for the HDFS daemons?
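
The first two points can be illustrated with a small sketch: reading the timeout from an environment variable with a fallback default, and printing the more verbose message before falling back to kill -9. The variable name YARN_STOP_TIMEOUT is the one discussed above; the helper names resolve_stop_timeout and kill_fallback_message and the default of 5 seconds are assumptions made for illustration, not the contents of the patch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the caller's environment wins, otherwise fall back
# to a default. The default of 5 seconds is an assumption.
resolve_stop_timeout() {
  echo "${YARN_STOP_TIMEOUT:-5}"
}

# Hypothetical helper: build the verbose message suggested in point 1.
kill_fallback_message() {
  local name="$1"
  local timeout="$2"
  local pid="$3"
  echo "$name did not stop gracefully after signaling it $timeout seconds ago. Trying kill -9 $pid"
}
```

For example, kill_fallback_message historyserver "$(resolve_stop_timeout)" 1234 would print a line naming the daemon, the timeout that elapsed, and the PID about to be killed, which gives an operator enough context to investigate why the graceful stop failed.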
                

        

[jira] [Updated] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Aaron T. Myers (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron T. Myers updated HADOOP-8353:
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 2.0.0
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

I've just committed this to trunk, branch-2, and branch-2.0.0-alpha.

Thanks a lot for the contribution, Roman.
                

        

[jira] [Commented] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273422#comment-13273422 ] 

Hudson commented on HADOOP-8353:
--------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #2248 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2248/])
    HADOOP-8353. hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop. Contributed by Roman Shaposhnik. (Revision 1337251)

     Result = ABORTED
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1337251
Files : 
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/bin/hadoop-daemon.sh
* /hadoop/common/trunk/hadoop-mapreduce-project/bin/mr-jobhistory-daemon.sh
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/bin/yarn-daemon.sh

                

        

[jira] [Commented] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273402#comment-13273402 ] 

Hudson commented on HADOOP-8353:
--------------------------------

Integrated in Hadoop-Common-trunk-Commit #2231 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2231/])
    HADOOP-8353. hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop. Contributed by Roman Shaposhnik. (Revision 1337251)

     Result = SUCCESS
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1337251
Files : 
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/bin/hadoop-daemon.sh
* /hadoop/common/trunk/hadoop-mapreduce-project/bin/mr-jobhistory-daemon.sh
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/bin/yarn-daemon.sh

                

        

[jira] [Commented] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Aaron T. Myers (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272592#comment-13272592 ] 

Aaron T. Myers commented on HADOOP-8353:
----------------------------------------

I'm confident the test failure is unrelated.
                

        

[jira] [Updated] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Roman Shaposhnik (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Shaposhnik updated HADOOP-8353:
-------------------------------------

    Attachment: HADOOP-8353.patch.txt
    

        

[jira] [Commented] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272591#comment-13272591 ] 

Hadoop QA commented on HADOOP-8353:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12526380/HADOOP-8353-2.patch.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests in hadoop-common-project/hadoop-common:

                  org.apache.hadoop.fs.viewfs.TestViewFsTrash

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/978//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/978//console

This message is automatically generated.
                
> hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop
> -------------------------------------------------------------
>
>                 Key: HADOOP-8353
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8353
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: scripts
>    Affects Versions: 0.23.1
>            Reporter: Roman Shaposhnik
>            Assignee: Roman Shaposhnik
>         Attachments: HADOOP-8353-2.patch.txt, HADOOP-8353.patch.txt


[jira] [Commented] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270733#comment-13270733 ] 

Hadoop QA commented on HADOOP-8353:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12526026/HADOOP-8353.patch.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in hadoop-common-project/hadoop-common.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/960//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/960//console

This message is automatically generated.
                
> hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop
> -------------------------------------------------------------
>
>                 Key: HADOOP-8353
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8353
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: scripts
>    Affects Versions: 0.23.1
>            Reporter: Roman Shaposhnik
>            Assignee: Roman Shaposhnik
>             Fix For: 2.0.0
>
>         Attachments: HADOOP-8353.patch.txt


[jira] [Updated] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Roman Shaposhnik (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Shaposhnik updated HADOOP-8353:
-------------------------------------

    Status: Patch Available  (was: Open)
    
> hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop
> -------------------------------------------------------------
>
>                 Key: HADOOP-8353
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8353
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: scripts
>    Affects Versions: 0.23.1
>            Reporter: Roman Shaposhnik
>            Assignee: Roman Shaposhnik
>             Fix For: 2.0.0
>
>         Attachments: HADOOP-8353.patch.txt


[jira] [Commented] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Roman Shaposhnik (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272442#comment-13272442 ] 

Roman Shaposhnik commented on HADOOP-8353:
------------------------------------------

@Aaron,

bq. Perhaps we should make the message more verbose 

Agreed. I'll modify the patch to make it more obvious.

bq. "YARN_STOP_TIMEOUT" in the MR job history server

The unfortunate truth there is that *everything* else in that script has a YARN_ prefix (I suspect because it was copied from yarn-daemon.sh). I'd rather keep things consistent, but if you really think this lonely var should be MR_ prefixed -- please let me know.

bq. Do similar changes not need to be made for the HDFS daemons?

HDFS daemons use hadoop-daemon.sh. In fact, at this point it could safely be called hdfs-daemon.sh, since I don't think anything else is really using it.
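
For readers following the thread, the stop behavior under discussion (SIGTERM, a bounded wait, then SIGKILL with a clear message) can be sketched roughly as follows. This is not the attached patch: the function name stop_daemon, its arguments, the 5-second default and the message wording are assumptions made for illustration. Only the YARN_STOP_TIMEOUT variable name is taken from the discussion above.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the contents of HADOOP-8353.patch.txt.
# stop_daemon PIDFILE [TIMEOUT_SECONDS]
# Sends SIGTERM, waits up to TIMEOUT_SECONDS for the process to exit,
# and falls back to SIGKILL if it is still running.
stop_daemon() {
  local pidfile=$1
  # The 5-second default is an assumed value.
  local timeout=${2:-${YARN_STOP_TIMEOUT:-5}}
  local pid waited=0

  if [ ! -f "$pidfile" ]; then
    echo "no pid file $pidfile; nothing to stop"
    return 0
  fi
  pid=$(cat "$pidfile")

  if kill -0 "$pid" 2>/dev/null; then
    echo "stopping process $pid"
    kill "$pid" 2>/dev/null
    # Poll once per second until the process is gone or the timeout expires.
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$timeout" ]; do
      sleep 1
      waited=$((waited + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
      echo "process $pid did not stop after $timeout seconds: killing with kill -9"
      kill -9 "$pid" 2>/dev/null
    fi
  else
    echo "process $pid is not running"
  fi
  rm -f "$pidfile"
}
```

With this shape, the caller gets control back only after the process has exited or has been sent SIGKILL, which is the guarantee the issue asks for. The explicit message on the SIGKILL path reflects the request for more verbose output.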
                
> hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop
> -------------------------------------------------------------
>
>                 Key: HADOOP-8353
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8353
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: scripts
>    Affects Versions: 0.23.1
>            Reporter: Roman Shaposhnik
>            Assignee: Roman Shaposhnik
>             Fix For: 2.0.0
>
>         Attachments: HADOOP-8353.patch.txt


[jira] [Commented] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274004#comment-13274004 ] 

Hudson commented on HADOOP-8353:
--------------------------------

Integrated in Hadoop-Mapreduce-trunk #1077 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1077/])
    HADOOP-8353. hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop. Contributed by Roman Shaposhnik. (Revision 1337251)

     Result = SUCCESS
atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1337251
Files : 
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/bin/hadoop-daemon.sh
* /hadoop/common/trunk/hadoop-mapreduce-project/bin/mr-jobhistory-daemon.sh
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/bin/yarn-daemon.sh

                
> hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop
> -------------------------------------------------------------
>
>                 Key: HADOOP-8353
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8353
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: scripts
>    Affects Versions: 0.23.1
>            Reporter: Roman Shaposhnik
>            Assignee: Roman Shaposhnik
>             Fix For: 2.0.0
>
>         Attachments: HADOOP-8353-2.patch.txt, HADOOP-8353.patch.txt


[jira] [Updated] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Roman Shaposhnik (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Shaposhnik updated HADOOP-8353:
-------------------------------------

    Attachment: HADOOP-8353-2.patch.txt

Patch with updated message attached.
                
> hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop
> -------------------------------------------------------------
>
>                 Key: HADOOP-8353
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8353
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: scripts
>    Affects Versions: 0.23.1
>            Reporter: Roman Shaposhnik
>            Assignee: Roman Shaposhnik
>             Fix For: 2.0.0
>
>         Attachments: HADOOP-8353-2.patch.txt, HADOOP-8353.patch.txt


[jira] [Updated] (HADOOP-8353) hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop

Posted by "Aaron T. Myers (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron T. Myers updated HADOOP-8353:
-----------------------------------

    Target Version/s: 2.0.0
       Fix Version/s:     (was: 2.0.0)
    
> hadoop-daemon.sh and yarn-daemon.sh can be misleading on stop
> -------------------------------------------------------------
>
>                 Key: HADOOP-8353
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8353
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: scripts
>    Affects Versions: 0.23.1
>            Reporter: Roman Shaposhnik
>            Assignee: Roman Shaposhnik
>         Attachments: HADOOP-8353-2.patch.txt, HADOOP-8353.patch.txt
