Posted to common-dev@hadoop.apache.org by "Raghu Angadi (JIRA)" <ji...@apache.org> on 2008/10/30 20:16:44 UTC

[jira] Created: (HADOOP-4552) Deadlock in RPC Server

Deadlock in RPC Server
----------------------

                 Key: HADOOP-4552
                 URL: https://issues.apache.org/jira/browse/HADOOP-4552
             Project: Hadoop Core
          Issue Type: Bug
    Affects Versions: 0.16.3
            Reporter: Raghu Angadi
            Assignee: Raghu Angadi



The RPC server can get into a deadlock, especially when clients or the server are network starved. The deadlock is between the RPC responder thread, which checks whether there are any connections to be purged, and an RPC handler thread, which tries to queue a response to be written by the responder.

This was first observed in [this thread|http://www.nabble.com/TaskTrackers-disengaging-from-JobTracker-to20234317.html].
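
To make the description concrete, here is a minimal sketch of this kind of lock-order deadlock. It is not the Hadoop code: the two monitors, the thread names, and the class name are hypothetical stand-ins, chosen only to mirror a "responder" and a "handler" that take the same two locks in opposite order. The JVM's own deadlock detector is then used to confirm that the threads are stuck.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

/** Illustrative only: plain monitors stand in for the real locks. */
class LockOrderDeadlockSketch {

    /** Starts two daemon threads that take two fresh locks in opposite order. */
    public static void startOpposedThreads() {
        final Object selectorKeys = new Object();   // hypothetical stand-in
        final Object responseQueue = new Object();  // hypothetical stand-in
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        Thread responder = new Thread(
            () -> lockInOrder(selectorKeys, responseQueue, bothHoldFirstLock),
            "sketch-responder");
        Thread handler = new Thread(
            () -> lockInOrder(responseQueue, selectorKeys, bothHoldFirstLock),
            "sketch-handler");
        // Daemon threads, so the stuck pair does not keep the JVM alive.
        responder.setDaemon(true);
        handler.setDaemon(true);
        responder.start();
        handler.start();
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await(); // wait until the other thread holds its first lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // Never reached: the other thread owns this monitor and is
                // itself waiting for the monitor held here.
            }
        }
    }

    /** Polls the JVM deadlock detector; returns how many threads it reports. */
    public static int countDeadlockedThreads(long timeoutMillis) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            long[] ids = threads.findMonitorDeadlockedThreads();
            if (ids != null) {
                return ids.length;
            }
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        startOpposedThreads();
        System.out.println("deadlocked threads: " + countDeadlockedThreads(5000));
    }
}
```

A thread dump (for example from jstack) taken in this state would show each thread blocked on the monitor the other one owns.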

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4552) Deadlock in RPC Server

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4552:
---------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 0.19.1)
                       (was: 0.20.0)
                   0.19.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

I just committed this.



[jira] Updated: (HADOOP-4552) Deadlock in RPC Server

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4552:
---------------------------------

    Attachment: deadlock-example.txt

This is the thread stack dump that Aaron sent. Note that "Thread 9" and "Thread 37" block each other.

Regarding locking on Selectors, please read the short "Concurrency" section of the Javadoc for Selector: http://java.sun.com/javase/6/docs/api/java/nio/channels/Selector.html
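
For readers who have not looked at that section, the small sketch below shows the object it is about: the key set returned by Selector.keys(). The Java 6 Javadoc linked above describes selection operations as synchronizing on the selector, its key set, and its selected-key set; later JDKs may behave differently. The class name is hypothetical, and a Pipe is used only to get a selectable channel without touching the network.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

/** Illustrative only: inspects the key set of a freshly opened Selector. */
class SelectorKeySetSketch {

    /** Registers one channel, then reports the key count and whether the set can be edited. */
    public static String inspect() {
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open();
            try {
                // A channel must be non-blocking before it can be registered.
                pipe.source().configureBlocking(false);
                SelectionKey key = pipe.source().register(selector, SelectionKey.OP_READ);

                int count;
                // Code that walks the key set from another thread typically
                // holds this monitor while doing so.
                synchronized (selector.keys()) {
                    count = selector.keys().size();
                }

                boolean directlyModifiable = true;
                try {
                    selector.keys().remove(key);
                } catch (UnsupportedOperationException e) {
                    // The key set is not directly modifiable by callers.
                    directlyModifiable = false;
                }
                return "keys=" + count + ", directlyModifiable=" + directlyModifiable;
            } finally {
                pipe.source().close();
                pipe.sink().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(inspect());
    }
}
```

Because the key set is a shared monitor, any code that holds it while acquiring a second lock has to agree on lock order with every other thread that takes both.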




[jira] Updated: (HADOOP-4552) Deadlock in RPC Server

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4552:
---------------------------------

    Component/s: ipc
    Description: 
The RPC server can get into a deadlock, especially when clients or the server are network starved. The deadlock is between the RPC responder thread, which checks whether there are any connections to be purged, and an RPC handler thread, which tries to queue a response to be written by the responder.

This was first observed in [this thread|http://www.nabble.com/TaskTrackers-disengaging-from-JobTracker-to20234317.html].

  was:

The RPC server can get into a deadlock, especially when clients or the server are network starved. The deadlock is between the RPC responder thread, which checks whether there are any connections to be purged, and an RPC handler thread, which tries to queue a response to be written by the responder.

This was first observed in [this thread|http://www.nabble.com/TaskTrackers-disengaging-from-JobTracker-to20234317.html].




[jira] Commented: (HADOOP-4552) Deadlock in RPC Server

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644937#action_12644937 ] 

Hadoop QA commented on HADOOP-4552:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12393108/HADOOP-4552.patch
  against trunk revision 709609.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3524/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3524/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3524/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3524/console

This message is automatically generated.



[jira] Commented: (HADOOP-4552) Deadlock in RPC Server

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647158#action_12647158 ] 

Hudson commented on HADOOP-4552:
--------------------------------

Integrated in Hadoop-trunk #659 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/659/])
    . Fix a deadlock in RPC server. (Raghu Angadi)




[jira] Updated: (HADOOP-4552) Deadlock in RPC Server

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4552:
---------------------------------

    Fix Version/s: 0.20.0
                   0.19.1

Thanks, Aaron. I propose this for 0.19.1 and 0.20.0. It is not a blocker for most users.



[jira] Commented: (HADOOP-4552) Deadlock in RPC Server

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644555#action_12644555 ] 

Aaron Kimball commented on HADOOP-4552:
---------------------------------------

As the reporter of the thread mentioned above: we have been running with this patch for a few days now and have not had to restart mapreduce since. This seems to have fixed the issue. (Thanks!)



[jira] Updated: (HADOOP-4552) Deadlock in RPC Server

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4552:
---------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-4552) Deadlock in RPC Server

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646444#action_12646444 ] 

Konstantin Shvachko commented on HADOOP-4552:
---------------------------------------------

+1 This looks right to me.
I would include more information about the issue in this jira; otherwise it is hard to understand the problem.
The stack trace mentioned in the mail thread should be attached, and a link to the Selector documentation would be helpful.



[jira] Updated: (HADOOP-4552) Deadlock in RPC Server

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-4552:
---------------------------------

    Attachment: HADOOP-4552.patch


The attached patch moves timeout processing out of the lock around the low-level selector keys. The purging loop runs only once every few minutes, so an extra iteration is OK.
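
The general shape of such a change can be sketched as follows. This is not the patch itself: the class, the map standing in for the selector key set, and the second lock are all hypothetical. The idea shown is to copy what is needed while holding the first lock, release it, and only then do the per-call work that needs the second lock, so the two are never held together. The extra pass over the copy is the cost of doing it this way.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: purge stale calls without nesting two locks. */
class PurgeOutsideLockSketch {
    // Stand-in for the selector key set: call id -> time the call was queued.
    private final Map<String, Long> queuedAt = new HashMap<>();
    // Stand-in for the lock a handler holds while queuing a response.
    private final Object responseQueueLock = new Object();

    public void add(String callId, long queuedAtMillis) {
        synchronized (queuedAt) {
            queuedAt.put(callId, queuedAtMillis);
        }
    }

    public int size() {
        synchronized (queuedAt) {
            return queuedAt.size();
        }
    }

    /** Removes and returns the ids of calls queued more than maxAgeMillis ago. */
    public List<String> purgeOlderThan(long nowMillis, long maxAgeMillis) {
        Map<String, Long> snapshot;
        synchronized (queuedAt) {
            // First pass: copy only. No other lock is taken in here.
            snapshot = new HashMap<>(queuedAt);
        }

        List<String> purged = new ArrayList<>();
        for (Map.Entry<String, Long> entry : snapshot.entrySet()) {
            // Second pass: the first lock is no longer held, so taking the
            // response-queue lock here cannot form a cycle with it.
            if (nowMillis - entry.getValue() > maxAgeMillis) {
                synchronized (responseQueueLock) {
                    purged.add(entry.getKey());
                }
            }
        }

        synchronized (queuedAt) {
            queuedAt.keySet().removeAll(purged);
        }
        return purged;
    }

    public static void main(String[] args) {
        PurgeOutsideLockSketch sketch = new PurgeOutsideLockSketch();
        sketch.add("old", 0L);
        sketch.add("fresh", 900L);
        System.out.println(sketch.purgeOlderThan(1000L, 500L));
    }
}
```

The snapshot can be slightly stale by the time it is processed, which is acceptable for work that runs only once every few minutes.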
