
Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2010/07/19 18:53:49 UTC

[jira] Created: (HBASE-2849) Clients stuck in loop doing "NIOServerCnxn: Closed socket connection"

Clients stuck in loop doing "NIOServerCnxn: Closed socket connection"
---------------------------------------------------------------------

                 Key: HBASE-2849
                 URL: https://issues.apache.org/jira/browse/HBASE-2849
             Project: HBase
          Issue Type: Bug
            Reporter: stack
             Fix For: 0.90.0


Someone made mention of this loop last week but I don't think I filed an issue.  Here is another instance, again from a secret hbase admirer:

" It seems that when Zookeeper
dies and restarts, all client applications need to be restarted too.
I just restarted HBase in non-distributed mode (which includes a ZK)
and now the TSD can't reconnect to ZK unless I restart it too.  I'm
stuck in this loop:

2010-07-19 00:13:05,725 INFO
org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection
for client /127.0.0.1:55153 (no session established for
client)
2010-07-19 00:13:07,052 INFO
org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection
from /127.0.0.1:55154
2010-07-19 00:13:07,053 INFO
org.apache.zookeeper.server.NIOServerCnxn: Refusing session request
for client /127.0.0.1:55154 as it has seen zxid 0xf5 our last zxid is
0xd7 client must try another
server"
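
Reading the last log line literally, the server refuses the session because the client reports having seen a transaction id (zxid) newer than anything the server itself has processed. Below is a minimal sketch of that comparison using the two values from the log. The class and method names are hypothetical, and this is an illustration of the check as the message describes it, not ZooKeeper's actual code.

```java
/** Hypothetical helper that mirrors the comparison described by the log message. */
final class ZxidCheck {
    private ZxidCheck() {}

    /**
     * Returns true when a server whose last zxid is serverLastZxid should refuse
     * a client that has already seen clientLastZxid, i.e. when the client is
     * ahead of the server.
     */
    static boolean shouldRefuse(long clientLastZxid, long serverLastZxid) {
        return clientLastZxid > serverLastZxid;
    }

    public static void main(String[] args) {
        // Values taken from the log above: client has seen 0xf5, server is at 0xd7.
        System.out.println(shouldRefuse(0xf5L, 0xd7L));
    }
}
```

As long as the client keeps presenting 0xf5 to a server that is still at 0xd7, every retry ends the same way, which matches the loop in the log.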

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-2849) Clients stuck in loop doing "NIOServerCnxn: Closed socket connection"

Posted by "Benoit Sigoure (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891848#action_12891848 ] 

Benoit Sigoure commented on HBASE-2849:
---------------------------------------

http://hadoop.apache.org/zookeeper/docs/r3.3.1/api/org/apache/zookeeper/ZooKeeper.html
bq. If for some reason, the client fails to send heart beats to the server for a prolonged period of time (exceeding the sessionTimeout value, for instance), the server will expire the session, and the session ID will become invalid. The client object will no longer be usable. To make ZooKeeper API calls, the application must create a new client object.

So apparently, a new {{ZooKeeper}} object must be created when the session becomes invalid.  This sounds like a bad API; I'm not sure why they did it this way.  In HBase's source code, it seems that the only place that creates a {{ZooKeeper}} instance is {{ZooKeeperWrapper#reconnectToZk}}.  This method, although it's public, is only called from 3 other methods in that class: the constructor, {{exists}} and {{deleteUnassignedRegion}}.  The last of these, {{deleteUnassignedRegion}}, is only used by the master.  The second, {{exists}}, is only called from the following locations:
* {{ZKUnassignedWatcher}}'s constructor.  This is only used in the master.
* {{RSZookeeperUpdater#startRegionCloseEvent}}.  This is only used in the region server.
* {{ZooKeeperWrapper#createOrUpdateUnassignedRegion}}.  This is only used by the master's {{RegionManager}}.
* {{ZooKeeperWrapper#createUnassignedRegion}} and {{ZooKeeperWrapper#updateUnassignedRegion}}.  Those two methods, even though they're public, are only called from {{ZooKeeperWrapper#createOrUpdateUnassignedRegion}}, which itself is only used by the master's {{RegionManager}}.

In other words, for someone writing an HBase application, only a single {{ZooKeeper}} instance ever gets created: the one made when the {{ZooKeeperWrapper}} is instantiated.  Any failure that causes the client's session to become invalid is unrecoverable with the current code, and the client has to be killed and restarted.

Jonathan, is the work being done for the master rewrite branch going to address this issue?  Bear in mind that here I'm concerned about HBase *client* applications.
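
The recovery discussed above (creating a new client object once the session has expired) can be sketched as a small wrapper.  This is only an illustration under stated assumptions: {{SessionClient}}, {{RecoveringClient}} and {{FakeClient}} are hypothetical names, not ZooKeeper or HBase classes, and the path used in the usage note is just an example string.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a ZooKeeper-like client handle; not the real API. */
interface SessionClient {
    boolean isExpired();
    String read(String path);
}

/** Wraps a client and replaces it with a fresh one once its session has expired. */
final class RecoveringClient {
    private final Supplier<SessionClient> factory;
    private SessionClient current;
    private int clientsCreated;

    RecoveringClient(Supplier<SessionClient> factory) {
        this.factory = factory;
        this.current = factory.get();
        this.clientsCreated = 1;
    }

    /** Reads through the current client, creating a new one first if the old session expired. */
    synchronized String read(String path) {
        if (current.isExpired()) {
            // An expired handle can never be reused, so build a new one.
            current = factory.get();
            clientsCreated++;
        }
        return current.read(path);
    }

    synchronized int clientsCreated() {
        return clientsCreated;
    }
}

/** In-memory fake used only to exercise the wrapper. */
final class FakeClient implements SessionClient {
    private boolean expired;

    void expire() {
        expired = true;
    }

    @Override
    public boolean isExpired() {
        return expired;
    }

    @Override
    public String read(String path) {
        if (expired) {
            throw new IllegalStateException("session expired");
        }
        return "data@" + path;
    }
}
```

With a factory that hands out {{FakeClient}} instances, expiring the first one and then calling {{read("/example")}} again goes through a second, freshly created client instead of failing forever.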



[jira] Updated: (HBASE-2849) HBase clients cannot recover when their ZooKeeper session becomes invalid

Posted by "Benoit Sigoure (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure updated HBASE-2849:
----------------------------------

    Attachment: 0001-HBASE-2849-Have-HBase-clients-recover-from-ZooKeeper.patch

Patch that fixes the issue.  There was actually some logic in {{HConnectionManager}}, which I didn't notice earlier, that attempts to deal with ZK failures and reconnect when needed, but the code wasn't doing the right thing and didn't work when there was a disconnection between the HBase client and the ZK quorum.  So the patch is rather simple and consists of fixing the existing logic in {{HConnectionManager.ClientZKWatcher}}.

I tested this by starting a long-running HBase application, killing the whole ZooKeeper ensemble and restarting it.  The application experiences a hiccup while ZK is unavailable and recovers automatically soon after the ZK quorum is back online.  Someone else is more than welcome to write a unit test that simulates this scenario if they feel like it.
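
The manual test described above could be approximated in a unit test along the following lines.  This is a rough sketch, not part of the patch: {{FakeEnsemble}} and {{ResilientApp}} are hypothetical in-memory stand-ins, and the model assumes that every restart of the ensemble invalidates sessions opened before it, which matches the symptom reported here but is not true of ZooKeeper restarts in general.

```java
/** Hypothetical in-memory stand-in for a ZooKeeper ensemble; not a real test utility. */
final class FakeEnsemble {
    private boolean up = true;
    private int epoch = 1; // bumped on every restart, invalidating older sessions

    synchronized void kill() {
        up = false;
    }

    synchronized void restart() {
        up = true;
        epoch++;
    }

    /** Opens a session, or fails if the ensemble is down. */
    synchronized int connect() {
        if (!up) {
            throw new IllegalStateException("ensemble unavailable");
        }
        return epoch;
    }

    /** Reads a value, failing if the ensemble is down or the session predates a restart. */
    synchronized String get(int session, String path) {
        if (!up) {
            throw new IllegalStateException("ensemble unavailable");
        }
        if (session != epoch) {
            throw new IllegalStateException("session invalid");
        }
        return "value:" + path;
    }
}

/** Client that opens a new session whenever a call fails, then retries once. */
final class ResilientApp {
    private final FakeEnsemble ensemble;
    private int session;

    ResilientApp(FakeEnsemble ensemble) {
        this.ensemble = ensemble;
        this.session = ensemble.connect();
    }

    /** Returns the value, or null while the ensemble is unreachable (the "hiccup"). */
    String tryGet(String path) {
        try {
            return ensemble.get(session, path);
        } catch (IllegalStateException first) {
            try {
                session = ensemble.connect(); // new session replaces the invalid one
                return ensemble.get(session, path);
            } catch (IllegalStateException second) {
                return null;
            }
        }
    }
}
```

A test would read once, call {{kill()}}, check that {{tryGet}} returns null, call {{restart()}}, and then check that the same {{ResilientApp}} instance returns data again without being recreated.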



[jira] Updated: (HBASE-2849) HBase clients cannot recover when their ZooKeeper session becomes invalid

Posted by "Benoit Sigoure (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure updated HBASE-2849:
----------------------------------

        Summary: HBase clients cannot recover when their ZooKeeper session becomes invalid  (was: Clients stuck in loop doing "NIOServerCnxn: Closed socket connection")
           Tags: reliability, zookeeper
       Priority: Critical  (was: Major)
    Component/s: client



[jira] Commented: (HBASE-2849) Clients stuck in loop doing "NIOServerCnxn: Closed socket connection"

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889934#action_12889934 ] 

Jonathan Gray commented on HBASE-2849:
--------------------------------------

Client+ZK interaction just got a makeover in the master rewrite branch.  It now handles master failovers properly and re-initializes ZK connections, but I'm not sure it will survive a ZK restart.  Any chance of a unit test that reproduces this?



[jira] Commented: (HBASE-2849) HBase clients cannot recover when their ZooKeeper session becomes invalid

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891851#action_12891851 ] 

Jonathan Gray commented on HBASE-2849:
--------------------------------------

It would be possible for the client to try to reconnect after its session has expired.  It's not built into the way I have it now, but it's possible to add it.



[jira] Updated: (HBASE-2849) HBase clients cannot recover when their ZooKeeper session becomes invalid

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2849:
-------------------------

          Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
      Resolution: Fixed

Committed.  Thanks for the 'duh' patch, Benoît.



[jira] Updated: (HBASE-2849) HBase clients cannot recover when their ZooKeeper session becomes invalid

Posted by "Benoit Sigoure (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure updated HBASE-2849:
----------------------------------

               Status: Patch Available  (was: Open)
    Affects Version/s: 0.89.20100621



[jira] Assigned: (HBASE-2849) HBase clients cannot recover when their ZooKeeper session becomes invalid

Posted by "Benoit Sigoure (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure reassigned HBASE-2849:
-------------------------------------

    Assignee: Benoit Sigoure



[jira] Updated: (HBASE-2849) Clients stuck in loop doing "NIOServerCnxn: Closed socket connection"

Posted by "Benoit Sigoure (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure updated HBASE-2849:
----------------------------------

    Description: 
Someone made mention of this loop last week but I don't think I filed an issue.  Here is another instance, again from a secret hbase admirer:

"It seems that when Zookeeper dies and restarts, all client applications need to be restarted too. I just restarted HBase in non-distributed mode (which includes a ZK) and now my application can't reconnect to ZK unless I restart it too.  I'm stuck in this loop:

{code}
2010-07-19 00:13:05,725 INFO org.apache.zookeeper.server.NIOServerCnxn:
  Closed socket connection for client /127.0.0.1:55153 (no session established for client)
2010-07-19 00:13:07,052 INFO org.apache.zookeeper.server.NIOServerCnxn:
  Accepted socket connection from /127.0.0.1:55154
2010-07-19 00:13:07,053 INFO org.apache.zookeeper.server.NIOServerCnxn:
  Refusing session request for client /127.0.0.1:55154 as it has seen zxid 0xf5 our last zxid is 0xd7
  client must try another server
{code}
"

  was:
Someone made mention of this loop last week but I don't think I filed an issue.  Here is another instance, again from a secret hbase admirer:

" It seems that when Zookeeper
dies and restarts, all client applications need to be restarted too.
I just restarted HBase in non-distributed mode (which includes a ZK)
and now the TSD can't reconnect to ZK unless I restart it too.  I'm
stuck in this loop:

2010-07-19 00:13:05,725 INFO
org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection
for client /127.0.0.1:55153 (no session established for
client)
2010-07-19 00:13:07,052 INFO
org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection
from /127.0.0.1:55154
2010-07-19 00:13:07,053 INFO
org.apache.zookeeper.server.NIOServerCnxn: Refusing session request
for client /127.0.0.1:55154 as it has seen zxid 0xf5 our last zxid is
0xd7 client must try another
server"


Reformatting a little bit.

