You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by "Jingxuan Fu (Jira)" <ji...@apache.org> on 2022/05/09 08:56:00 UTC

[jira] [Created] (HBASE-27016) when “zookeeper.session.timeout” is set to a negative number, HMaster cannot stop normally

Jingxuan Fu created HBASE-27016:
-----------------------------------

             Summary: when “zookeeper.session.timeout” is set to a negative number, HMaster cannot stop normally
                 Key: HBASE-27016
                 URL: https://issues.apache.org/jira/browse/HBASE-27016
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.2.2
         Environment: HBase 2.2.2
os.name=Linux
os.arch=amd64
os.version=5.4.0-72-generic
java.version=1.8.0_211
java.vendor=Oracle Corporation
            Reporter: Jingxuan Fu
            Assignee: Jingxuan Fu


In hbase- default.xml
<property>
    <name>zookeeper.session.timeout</name>
    <value>90000</value>
    <description>ZooKeeper session timeout in milliseconds. It is used in two different ways.
        First, this value is used in the ZK client that HBase uses to connect to the ensemble.
        It is also used by HBase when it starts a ZK server and it is passed as the 'maxSessionTimeout'.
        See https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#ch_zkSessions.
        For example, if an HBase region server connects to a ZK ensemble that's also managed
        by HBase, then the session timeout will be the one specified by this configuration.
        But, a region server that connects to an ensemble managed with a different configuration
        will be subjected that ensemble's maxSessionTimeout. So, even though HBase might propose
        using 90 seconds, the ensemble can have a max timeout lower than this and it will take
        precedence. The current default maxSessionTimeout that ZK ships with is 40 seconds, which is lower than
        HBase's.
    </description>
</property>
The impact caused by setting zookeeper.session.timeout to negative number is different in local mode and pseudo-distributed mode.

After starting hbase in local mode, hmaster will replace the sessiontimeout configured in this start master() with the default value of 10000 when it is judged as local mode, so setting zookeeper.session.timeout to a negative number in hbase-site.xml will have no effect on the hbase startup process.
private int startMaster() {
  Configuration conf = getConf();
  TraceUtil.initTracer(conf);
​
  try {
    // If 'local', defer to LocalHBaseCluster instance.  Starts master
    // and regionserver both in the one JVM.
    if (LocalHBaseCluster.isLocal(conf)) {
      ...
      int localZKClusterSessionTimeout;
      localZKClusterSessionTimeout = conf.getInt(HConstants.ZK_SESSION_TIMEOUT + ".localHBaseCluster", 10*1000);
      conf.setInt(HConstants.ZK_SESSION_TIMEOUT, localZKClusterSessionTimeout);
      LOG.info("Starting a zookeeper cluster");
      ...
      }
Take the example of setting zookeeper.session.timeout to -1 in hbase-site.xml to start hbase in local mode normally.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./start-hbase.sh 
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
running master, logging to /home/hadoop/hbase-2.2.2-ori/bin/../logs/hbase-hadoop-master-ljq1.out
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ jps
25667 HMaster
21859 Jps
2022-05-09 11:16:57,571 INFO  [master/ljq1:16000:becomeActiveMaster] master.HMaster: Master has completed initialization 8.179sec
However, next, when the hbase shell as a client initiates a request for basic database operations to the started HBase, the zookeeper.session.timeout is negative causing a ConnectionLoss error and any requested operation is invalid as it recreates the conf to load the configuration.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./hbase shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.2, re6513a76c91cceda95dad7af246ac81d46fa2589, Sat Oct 19 10:10:12 UTC 2019
Took 0.0212 seconds                                                                                                                                                                                                                     
hbase(main):001:0> status
ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
For usage try 'help "status"'
Took 34.2850 seconds                                                                                                                                                                                                                    
hbase(main):002:0> create 'aaaaa','a11111','a222222'
ERROR: KeeperErrorCode = ConnectionLoss for /hbase/master
For usage try 'help "create"'
Took 33.2029 seconds
The log shows that a socket connection was received, but consistently throws a warning that no additional data could be read from the client session and that the client may have closed the socket.

Next, use the official termination method . /stop-hbase.sh to shut down hbase, it will also recreate the conf and load the configuration when shutting down. zookeeper.session.timeout will be negative and the terminal will be stuck in an infinite wait, and eventually the process will have to be terminated by kill -9, which is the official termination method . /stop-hbase.sh to shut down hbase is not effective.
hadoop@ljq1:~/hbase-2.2.2-ori/bin$ ./stop-hbase.sh 
stopping hbase.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................force stopping stop-master-command with kill -9 25667
Using the official termination method . /stop-hbase.sh when shutting down hbase, the log shows that the stop request was received as well as the socket connection, but still no additional data can be read from the client session, probably with the warning that the client has closed the socket.
2022年 05月 09日 星期一 14:35:07 CST Stopping hbase (via master)
2022-05-09 14:35:11,588 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33222
2022-05-09 14:35:12,709 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33224
2022-05-09 14:35:13,810 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33226
2022-05-09 14:35:14,912 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33228
2022-05-09 14:35:16,013 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33230
2022-05-09 14:35:17,115 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33232
2022-05-09 14:35:18,217 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33234
2022-05-09 14:35:19,321 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33236
...
2022-05-09 14:36:18,846 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:33344
2022-05-09 14:36:19,033 WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket
  at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
  at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
  at java.base/java.lang.Thread.run(Thread.java:829)
2022-05-09 14:36:19,034 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33344 (no session established for client)
2022-05-09 14:36:19,034 WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket
  at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
  at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
  at java.base/java.lang.Thread.run(Thread.java:829)
2022-05-09 14:36:19,035 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:33342 (no session established for client)
2022-05-09 14:36:19,035 WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket
  at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
  at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
  at java.base/java.lang.Thread.run(Thread.java:829)
Although, before finally using kill -9 to force a shutdown, the logs show that the reason for not being able to shut down properly may be because the zookeeper client session timed out and the client session did not receive a message from the server within 0ms and was closing the socket connection and trying to reconnect.
2022-05-09 14:55:13,486 INFO  [main] zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=-1 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@1cf6d1be
2022-05-09 14:55:13,589 INFO  [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2022-05-09 14:55:13,621 WARN  [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 0ms for sessionid 0x0
2022-05-09 14:55:13,622 INFO  [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 0ms for sessionid 0x0, closing socket connection and attempting reconnect
2022-05-09 14:55:13,737 INFO  [main] zookeeper.ZooKeeper: Session: 0x0 closed
2022-05-09 14:55:13,739 ERROR [main] util.ServerCommandLine: Failed to run
java.lang.IllegalArgumentException: Invalid path string "/hbase/rs//hbase/rs/ljq1,16020,1652066208559" caused by empty node name specified @10
  at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:99)
  at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
  at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:166)
  at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1257)
  at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeFailSilent(ZKUtil.java:1249)
  at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:188)
  at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:144)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
  at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
  at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2942)
2022-05-09 14:55:13,747 INFO  [main-EventThread] zookeeper.ClientCnxn: EventThread shut down for session: 0x0
It can be speculated that the process of shutting down hbase lacks a mechanism to determine and handle the impact of a negative zookeeper.session.timeout, resulting in the process of repeatedly closing the socket connection and trying to reconnect, without actually performing the operation of terminating hbase.

And in local mode, the set value of zookeeper.session.timeout in hbase-site.xml is replaced with the default value in the startup phase, while the set value is used in the run phase and shutdown, and this configuration item is used inconsistently in different phases.

 

In pseudo-distributed mode, when zookeeper.session.timeout is set to a negative number, in the startup hbase phase, unlike the local mode replaces zookeeper.session.timeout with the default value of 10000, which can be checked in time for the exception caused by the negative value of the field. The log shows that the connection is closed after several attempts to reconnect after the client and the collection establish a connection judgment timeout, and then the zookeeper client event thread is closed , further leading to the failure to create the regionserver and the master node exits.

The phenomenon observed throughout the process is that at first the Hbase daemons can be started normally, but after a period of time only the process HQuorumPeer remains, and the function cannot be used normally.

 

To summarize, for the effect of setting zookeeper.session.timeout to a negative number, hbase can be started normally in local mode, but no valid hbase database operations can be initiated, and using the official termination method . /stop-hbase.sh does not work to shut down hbase. In pseudo-distributed mode, hbase can be started normally but some of the daemons exit after a period of time and the functionality cannot be used normally.

In both modes, there is a lack of judgment on the impact of negative zookeeper.session.timeout and the related handling mechanism is not perfect, and there is a lack of clear and effective information, which makes it impossible to locate and handle exceptions in time.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)