Posted to user@hbase.apache.org by Srinidhi Muppalla <sr...@trulia.com> on 2018/08/31 19:19:48 UTC

HBase unable to connect to zookeeper error

Hello all,

Our production application has recently seen a sharp spike in the following exception, together with very long read times against our HBase cluster:

org.apache.hadoop.hbase.shaded.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/meta-region-server
    at org.apache.hadoop.hbase.shaded.org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.hadoop.hbase.shaded.org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:623)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionState(MetaTableLocator.java:487)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionLocation(MetaTableLocator.java:168)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:605)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:585)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:564)
    at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1211)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1178)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1152)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1357)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1181)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:305)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
    at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:326)
    at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:301)
    at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:166)
    at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:161)
    at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:794)
    at [trace truncated]

The error is not happening consistently: some reads to our table still succeed, so I have been unable to narrow the issue down to a single configuration or connectivity failure.

Things I’ve tried so far:

1. Using 'hbase zkcli' to connect to our ZooKeeper server from the master instance. It connects successfully, and when running 'ls' the "/hbase/meta-region-server" znode is present (a programmatic version of the same check is sketched below).
2. Checking the number of connections open against our ZooKeeper instance via the HBase web UI. The count is currently 162, and I double-checked our HBase config: 'hbase.zookeeper.property.maxClientCnxns' is set to 300.
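
For reference, here is a minimal sketch of that znode check done directly
against the ZooKeeper client API. The quorum string "zk-host:2181" and the
30s session timeout are placeholders, not our real settings:

    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class MetaZnodeCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder quorum string; substitute the real ensemble.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> { });
            Stat stat = new Stat();
            // This call fails with ConnectionLossException under the same
            // condition seen in the stack trace above.
            byte[] data = zk.getData("/hbase/meta-region-server", false, stat);
            System.out.println("znode present: " + data.length + " bytes");
            zk.close();
        }
    }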

Any insight into the cause or other steps that I could take to debug this issue would be greatly appreciated.

Thank you,
Srinidhi


Re: HBase unable to connect to zookeeper error

Posted by Josh Elser <el...@apache.org>.
If it were related to maxClientCnxns, you would see sessions being 
torn down and recreated in HBase on that node, as well as a clear 
message in the ZK server log saying it is denying requests because 
the number of outstanding connections from that host exceeds the limit.
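
For reference, the limit in question is the one set in hbase-site.xml. 
The snippet below just mirrors the value you reported; note it only 
applies to a ZK ensemble that HBase itself manages:

    <!-- Per-host connection cap forwarded to the HBase-managed ZK
         servers. 300 matches the value from your config. -->
    <property>
      <name>hbase.zookeeper.property.maxClientCnxns</name>
      <value>300</value>
    </property>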

ConnectionLoss is a transient ZooKeeper state; more often than not, I 
see it manifest as a result of unplanned pauses in HBase itself. 
Typically these are JVM garbage-collection pauses; other times they 
come from Linux kernel/OS-level stalls. The former you can diagnose 
via the standard JVM GC logging mechanisms, the latter usually via 
your syslog or dmesg.
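
If GC logging isn't already enabled, hbase-env.sh ships with a 
commented-out SERVER_GC_OPTS line you can adapt. A Java 8-era sketch, 
with an example log path:

    # hbase-env.sh -- illustrative GC logging flags; adjust the path.
    export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime \
      -Xloggc:/var/log/hbase/gc.log"

PrintGCApplicationStoppedTime is especially useful here because it 
records all safepoint stalls, not just collections.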

When looking for unexpected pauses, remember that you also need to look 
at what was happening in ZK. A JVM GC pause in ZK would exhibit the same 
kind of symptoms in HBase.
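
A quick way to spot-check the ZK side is the 'stat' four-letter 
command; the hostname below is a placeholder:

    $ echo stat | nc zk-host.example.com 2181

Watch the "Latency min/avg/max" line and the connection count: a large 
max latency lines up with pauses on the ZK server itself.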

One final suggestion is to correlate the errors with other batch jobs 
(e.g. YARN, Spark) that may be running on the same node. It's possible 
that the node is not experiencing any explicit problem, but some 
transient workload happens to run at those times and slows things down.

Have fun digging!

On 8/31/18 3:19 PM, Srinidhi Muppalla wrote:
> [original message quoted in full above; snipped]