Posted to dev@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2009/08/08 20:19:14 UTC

[jira] Created: (HBASE-1754) indefinite hang in IPC under some circumstances

indefinite hang in IPC under some circumstances
-----------------------------------------------

                 Key: HBASE-1754
                 URL: https://issues.apache.org/jira/browse/HBASE-1754
             Project: Hadoop HBase
          Issue Type: Bug
            Reporter: Andrew Purtell


If a regionserver crashes while the client is engaged in IPC with it at a vulnerable point in the TCP FSM (ESTABLISHED, no outstanding data to send), the IPC will be stuck waiting forever, until the regionserver is restarted and the connection is reset at the TCP level. However, it is not possible to restart the regionserver if the client is colocated with it on the same host, because the OS will consider port 60020 bound and in use unless the client is forcibly killed. Killing some types of applications -- especially long-running processes which can't redo work from a checkpoint but must start over from the beginning -- can be very painful. Investigate whether TCP keepalives can be enabled at the IPC level.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-1754) use TCP keepalives

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-1754:
----------------------------------

         Priority: Minor  (was: Major)
    Fix Version/s: 0.21.0
                   0.20.0
          Summary: use TCP keepalives  (was: indefinite hang in IPC under some circumstances)

> use TCP keepalives
> ------------------
>
>                 Key: HBASE-1754
>                 URL: https://issues.apache.org/jira/browse/HBASE-1754
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Minor
>             Fix For: 0.20.0, 0.21.0
>
>         Attachments: HBASE-1754.patch
>
>
> If a regionserver crashes while the client is engaged in IPC with it at a vulnerable point in the TCP FSM (ESTABLISHED, no outstanding data to send), the IPC will be stuck waiting "forever" (> 12 hours, etc.). This hoses the client, especially if it is trying to look up a region in META. Worse, it is not possible to restart the regionserver if the hung client is colocated with it on the same host, because the OS will consider port 60020 bound and in use, unless the client is forcibly killed. Killing some types of applications -- especially long running processes which can't redo work from a checkpoint but must start over from the beginning -- can be very painful. Investigate if TCP keepalives can be enabled at the IPC level. 


[jira] Updated: (HBASE-1754) indefinite hang in IPC under some circumstances

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-1754:
----------------------------------

    Assignee: Andrew Purtell
      Status: Patch Available  (was: Open)

> indefinite hang in IPC under some circumstances
> -----------------------------------------------
>
>                 Key: HBASE-1754
>                 URL: https://issues.apache.org/jira/browse/HBASE-1754
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>         Attachments: HBASE-1754.patch
>
>
> If a regionserver crashes while the client is engaged in IPC with it at a vulnerable point in the TCP FSM (ESTABLISHED, no outstanding data to send), the IPC will be stuck waiting "forever" (> 12 hours, etc.). This hoses the client, especially if it is trying to look up a region in META. Worse, it is not possible to restart the regionserver if the hung client is colocated with it on the same host, because the OS will consider port 60020 bound and in use, unless the client is forcibly killed. Killing some types of applications -- especially long running processes which can't redo work from a checkpoint but must start over from the beginning -- can be very painful. Investigate if TCP keepalives can be enabled at the IPC level. 


[jira] Updated: (HBASE-1754) indefinite hang in IPC under some circumstances

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-1754:
----------------------------------

    Description: If a regionserver crashes while the client is engaged in IPC with it at a vulnerable point in the TCP FSM (ESTABLISHED, no outstanding data to send), the IPC will be stuck waiting "forever" (> 12 hours, etc.). This hoses the client, especially if it is trying to look up a region in META. Worse, it is not possible to restart the regionserver if the hung client is colocated with it on the same host, because the OS will consider port 60020 bound and in use, unless the client is forcibly killed. Killing some types of applications -- especially long running processes which can't redo work from a checkpoint but must start over from the beginning -- can be very painful. Investigate if TCP keepalives can be enabled at the IPC level.   (was: If a regionserver crashes while the client is engaged in IPC with it at a vulnerable point in the TCP FSM (ESTABLISHED, no outstanding data to send), the IPC will be stuck waiting forever until the regionserver is restarted and at the TCP level the connection will be reset. However, it is not possible to restart the regionserver if the client is colocated with it on the same host, because the OS will consider port 60020 bound and in use, unless the client is forcibly killed. Killing some types of applications -- especially long running processes which can't redo work from a checkpoint but must start over from the beginning -- can be very painful. Investigate if TCP keepalives can be enabled at the IPC level. )

> indefinite hang in IPC under some circumstances
> -----------------------------------------------
>
>                 Key: HBASE-1754
>                 URL: https://issues.apache.org/jira/browse/HBASE-1754
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>
> If a regionserver crashes while the client is engaged in IPC with it at a vulnerable point in the TCP FSM (ESTABLISHED, no outstanding data to send), the IPC will be stuck waiting "forever" (> 12 hours, etc.). This hoses the client, especially if it is trying to look up a region in META. Worse, it is not possible to restart the regionserver if the hung client is colocated with it on the same host, because the OS will consider port 60020 bound and in use, unless the client is forcibly killed. Killing some types of applications -- especially long running processes which can't redo work from a checkpoint but must start over from the beginning -- can be very painful. Investigate if TCP keepalives can be enabled at the IPC level. 


[jira] Updated: (HBASE-1754) indefinite hang in IPC under some circumstances

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-1754:
----------------------------------

    Attachment: HBASE-1754.patch

The attached patch enables TCP keepalives by default and allows them to be turned off via hbase-site.xml.
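For readers without the patch at hand, the idea can be sketched roughly as follows. This is not the contents of HBASE-1754.patch. It is a minimal illustration that assumes the setting is read as a boolean defaulting to true and applied through the standard java.net.Socket.setKeepAlive call. The class name, the property key "example.ipc.tcpkeepalive", and the use of java.util.Properties in place of the HBase configuration object are all hypothetical.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketException;
import java.util.Properties;

final class KeepAliveSketch {
    // Hypothetical property key, for illustration only; the real key
    // used by the patch is not shown in this thread.
    static final String KEEPALIVE_KEY = "example.ipc.tcpkeepalive";

    // Keepalives are on unless the property is explicitly set to "false".
    static boolean keepAliveEnabled(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(KEEPALIVE_KEY, "true"));
    }

    // Sets SO_KEEPALIVE on the socket according to the configuration.
    // The option can be set before the socket is connected.
    static void applyKeepAlive(Socket socket, Properties conf) throws SocketException {
        socket.setKeepAlive(keepAliveEnabled(conf));
    }

    public static void main(String[] args) throws IOException {
        Properties conf = new Properties(); // nothing set, so the default applies
        try (Socket socket = new Socket()) {
            applyKeepAlive(socket, conf);
            System.out.println("SO_KEEPALIVE=" + socket.getKeepAlive());
        }
    }
}
```

Note that SO_KEEPALIVE only asks the OS to probe an idle connection. How long it takes to notice a dead peer depends on OS-level settings; on Linux the first probe is sent after two hours of idle time by default (net.ipv4.tcp_keepalive_time), so a hung client may still wait a long time unless those settings are tuned.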

> indefinite hang in IPC under some circumstances
> -----------------------------------------------
>
>                 Key: HBASE-1754
>                 URL: https://issues.apache.org/jira/browse/HBASE-1754
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>         Attachments: HBASE-1754.patch
>
>
> If a regionserver crashes while the client is engaged in IPC with it at a vulnerable point in the TCP FSM (ESTABLISHED, no outstanding data to send), the IPC will be stuck waiting "forever" (> 12 hours, etc.). This hoses the client, especially if it is trying to look up a region in META. Worse, it is not possible to restart the regionserver if the hung client is colocated with it on the same host, because the OS will consider port 60020 bound and in use, unless the client is forcibly killed. Killing some types of applications -- especially long running processes which can't redo work from a checkpoint but must start over from the beginning -- can be very painful. Investigate if TCP keepalives can be enabled at the IPC level. 


[jira] Updated: (HBASE-1754) use TCP keepalives

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-1754:
----------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed to branch and trunk.

> use TCP keepalives
> ------------------
>
>                 Key: HBASE-1754
>                 URL: https://issues.apache.org/jira/browse/HBASE-1754
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Minor
>             Fix For: 0.20.0, 0.21.0
>
>         Attachments: HBASE-1754.patch
>
>
> If a regionserver crashes while the client is engaged in IPC with it at a vulnerable point in the TCP FSM (ESTABLISHED, no outstanding data to send), the IPC will be stuck waiting "forever" (> 12 hours, etc.). This hoses the client, especially if it is trying to look up a region in META. Worse, it is not possible to restart the regionserver if the hung client is colocated with it on the same host, because the OS will consider port 60020 bound and in use, unless the client is forcibly killed. Killing some types of applications -- especially long running processes which can't redo work from a checkpoint but must start over from the beginning -- can be very painful. Investigate if TCP keepalives can be enabled at the IPC level. 
