Posted to user@hbase.apache.org by Ted Tuttle <te...@mentacapital.com> on 2016/04/12 02:08:53 UTC

unstable cluster

Hello -

We've started experiencing regular failures of our HBase cluster.  For the last week we've had nightly failures about 1 hour after a heavy batch process starts.

In the logs below we see the failure starting at 2016-04-11 03:11 in the ZooKeeper, master and region server logs:

zookeeper:  http://pastebin.com/kf7ja22K

region server: http://pastebin.com/tduJgKqq

master:  http://pastebin.com/0szhi0bJ

The master log seems most interesting.  Here we see problems connecting to ZooKeeper, then a number of region servers dying in quick succession.  From the log evidence, it appears ZooKeeper is not responding, rather than the more typical case of a GC pause causing an isolated RS to abort.

Any insights on what may be happening here?

Best,
Ted

Re: unstable cluster

Posted by Ted Yu <yu...@gmail.com>.

From region server log:

2016-04-11 03:11:51,589 WARN org.apache.zookeeper.ClientCnxnSocket:
Connected to an old server; r-o mode will be unavailable
2016-04-11 03:11:51,589 INFO org.apache.zookeeper.ClientCnxn: Unable to
reconnect to ZooKeeper service, session 0x52ee1452fec5ac has expired,
closing socket connection

From ZooKeeper log:

2016-04-11 03:11:27,323 - INFO  [CommitProcessor:0:NIOServerCnxn@1435] -
Closed socket connection for client /172.20.67.19:58404 which had sessionid
0x52ee1452fec71f
2016-04-11 03:11:53,301 - INFO  [CommitProcessor:0:NIOServerCnxn@1435] -
Closed socket connection for client /172.20.67.13:32946 which had sessionid
0x52ee1452fec6ea

Note the 26 second gap.
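
The gap can be checked directly from the two ZooKeeper timestamps quoted above. A minimal Python sketch follows; the helper name gap_seconds is made up for illustration, and the timestamp format is assumed from the log lines as pasted:

```python
from datetime import datetime

# Timestamp layout assumed from the quoted ZooKeeper log lines:
# "YYYY-MM-DD HH:MM:SS,mmm" (comma before the milliseconds).
FMT = "%Y-%m-%d %H:%M:%S,%f"


def gap_seconds(earlier: str, later: str) -> float:
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(earlier, FMT)
    t1 = datetime.strptime(later, FMT)
    return (t1 - t0).total_seconds()


# The two "Closed socket connection" entries from the ZooKeeper log.
gap = gap_seconds("2016-04-11 03:11:27,323", "2016-04-11 03:11:53,301")
print(f"gap between entries: {gap:.3f} s")
```

This only measures the silence between two consecutive log entries on one server; it does not by itself say why the server was quiet.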

What do you see in the logs of the other two ZooKeeper servers?

Thanks
