Posted to user@hawq.apache.org by Xiang Dai <xi...@iluvatar.ai> on 2018/11/15 06:46:04 UTC

Restart fails when HA is enabled

Init succeeds:
20181115:14:41:26:052197 hawq_init:test-computing:gpadmin-[INFO]:-Prepare to do 'hawq init'
20181115:14:41:26:052197 hawq_init:test-computing:gpadmin-[INFO]:-You can find log in:
20181115:14:41:26:052197 hawq_init:test-computing:gpadmin-[INFO]:-/home/gpadmin/hawqAdminLogs/hawq_init_20181115.log
20181115:14:41:26:052197 hawq_init:test-computing:gpadmin-[INFO]:-GPHOME is set to:
20181115:14:41:26:052197 hawq_init:test-computing:gpadmin-[INFO]:-/usr/local/apache-hawq
20181115:14:41:26:052197 hawq_init:test-computing:gpadmin-[INFO]:-Init hawq with args: ['init', 'cluster']
20181115:14:41:26:052197 hawq_init:test-computing:gpadmin-[ERROR]:-Warning: Permanently added '192.168.60.18' (ECDSA) to the list of known hosts.
20181115:14:41:26:052197 hawq_init:test-computing:gpadmin-[INFO]:-Check if hdfs path is available
20181115:14:41:26:052197 hawq_init:test-computing:gpadmin-[WARNING]:-2018-11-15 14:41:26.668731, p52400, th140043878250816, WARNING the number of nodes in pipeline is 2 [dx-storage2.novalocal(192.168.60.17), dx-storage.novalocal(192.168.60.24)], is less than the expected number of replica 3 for block [block pool ID: BP-1340065686-192.168.60.24-1542263049168 block ID 1073741827_1003] file /hawq/default_filespace/testFile
20181115:14:41:26:052197 hawq_init:test-computing:gpadmin-[INFO]:-2 segment hosts defined
20181115:14:41:26:052197 hawq_init:test-computing:gpadmin-[INFO]:-Set default_hash_table_bucket_number as: 12
20181115:14:41:29:052197 hawq_init:test-computing:gpadmin-[INFO]:-Start to init master node: '192.168.60.27'
20181115:14:41:49:052197 hawq_init:test-computing:gpadmin-[INFO]:-20181115:14:41:49:052570 hawqinit.sh:test-computing:gpadmin-[INFO]:-Loading hawq_toolkit...
20181115:14:41:49:052197 hawq_init:test-computing:gpadmin-[INFO]:-Master init successfully
20181115:14:41:49:052197 hawq_init:test-computing:gpadmin-[INFO]:-Start to init standby master: '192.168.60.18'
20181115:14:41:49:052197 hawq_init:test-computing:gpadmin-[INFO]:-This might take a couple of minutes, please wait...
20181115:14:41:52:007176 hawqinit.sh:test-computing2:gpadmin-[INFO]:-HAWQ master stopped
20181115:14:41:52:007176 hawqinit.sh:test-computing2:gpadmin-[INFO]:-Sync files to standby from master
20181115:14:41:55:007176 hawqinit.sh:test-computing2:gpadmin-[INFO]:-Update pg_hba configuration
Warning: Permanently added '192.168.60.18' (ECDSA) to the list of known hosts.
20181115:14:41:56:007176 hawqinit.sh:test-computing2:gpadmin-[INFO]:-Standby ip address is 192.168.60.18
20181115:14:41:56:007176 hawqinit.sh:test-computing2:gpadmin-[INFO]:-Start hawq master
20181115:14:41:58:007176 hawqinit.sh:test-computing2:gpadmin-[INFO]:-HAWQ master started
20181115:14:41:58:007176 hawqinit.sh:test-computing2:gpadmin-[INFO]:-Try to remove existing standby from catalog
20181115:14:41:58:007176 hawqinit.sh:test-computing2:gpadmin-[INFO]:-Register standby to master successfully
20181115:14:42:00:007176 hawqinit.sh:test-computing2:gpadmin-[INFO]:-HAWQ master stopped
20181115:14:42:05:007176 hawqinit.sh:test-computing2:gpadmin-[INFO]:-HAWQ standby started
20181115:14:42:11:007176 hawqinit.sh:test-computing2:gpadmin-[INFO]:-HAWQ master started
20181115:14:42:12:052197 hawq_init:test-computing:gpadmin-[INFO]:-Init standby successfully
20181115:14:42:12:052197 hawq_init:test-computing:gpadmin-[INFO]:-Init segments in list: ['192.168.60.27', '192.168.60.18']
20181115:14:42:12:052197 hawq_init:test-computing:gpadmin-[INFO]:-Total segment number is: 2
.................
20181115:14:42:29:052197 hawq_init:test-computing:gpadmin-[INFO]:-2 of 2 segments init successfully
20181115:14:42:29:052197 hawq_init:test-computing:gpadmin-[INFO]:-Segments init successfully on nodes '['192.168.60.27', '192.168.60.18']'
20181115:14:42:29:052197 hawq_init:test-computing:gpadmin-[INFO]:-Init HAWQ cluster successfully

But the state check reports both segments as failed:

[gpadmin@test-computing ~]$ hawq state
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:-- HAWQ instance status summary
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:------------------------------------------------------
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:--   Master instance                                = Active
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:--   Master standby                                 = 192.168.60.18
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:--   Standby master state                           = Standby host passive
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:--   Total segment instance count from config file  = 2
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:------------------------------------------------------
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:--   Segment Status
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:------------------------------------------------------
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:--   Total segments count from catalog      = 0
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:--   Total segment valid (at master)        = 0
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:--   Total segment failures (at master)     = 2
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:--   Total number of postmaster.pid files missing   = 2
20181115:14:44:25:054941 hawq_state:test-computing:gpadmin-[INFO]:--   Total number of postmaster.pid files found     = 0

Then the restart fails:
[gpadmin@test-computing ~]$ hawq restart cluster -a
20181115:14:45:03:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Prepare to do 'hawq stop'
20181115:14:45:03:055070 hawq_stop:test-computing:gpadmin-[INFO]:-You can find log in:
20181115:14:45:03:055070 hawq_stop:test-computing:gpadmin-[INFO]:-/home/gpadmin/hawqAdminLogs/hawq_stop_20181115.log
20181115:14:45:03:055070 hawq_stop:test-computing:gpadmin-[INFO]:-GPHOME is set to:
20181115:14:45:03:055070 hawq_stop:test-computing:gpadmin-[INFO]:-/usr/local/apache-hawq
20181115:14:45:03:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Stop hawq with args: ['stop', 'cluster']
20181115:14:45:03:055070 hawq_stop:test-computing:gpadmin-[WARNING]:-Failed to connect to database, cannot get hawq_acl_type
20181115:14:45:03:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Stop hawq cluster
20181115:14:45:04:055070 hawq_stop:test-computing:gpadmin-[INFO]:-There are 0 connections to the database
20181115:14:45:04:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Commencing Master instance shutdown with mode='smart'
20181115:14:45:04:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Master host=192.168.60.27
20181115:14:45:04:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Stop hawq master
20181115:14:45:05:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Master stopped successfully
20181115:14:45:05:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Stop hawq standby master
20181115:14:45:06:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Standby master stopped successfully
20181115:14:45:06:055070 hawq_stop:test-computing:gpadmin-[WARNING]:-Try to stop RPS when hawq_acl_type is unknown
20181115:14:45:06:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Stop Ranger plugin service
20181115:14:45:06:055070 hawq_stop:test-computing:gpadmin-[INFO]:-bash: /usr/local/apache-hawq/ranger/bin/rps.sh: No such file or directory
20181115:14:45:06:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Stop hawq segment
20181115:14:45:06:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Stop segments in list: ['192.168.60.27', '192.168.60.18']
20181115:14:45:08:055070 hawq_stop:test-computing:gpadmin-[WARNING]:-HAWQ process is not running on 192.168.60.18, skip
20181115:14:45:08:055070 hawq_stop:test-computing:gpadmin-[WARNING]:-HAWQ process is not running on 192.168.60.27, skip
20181115:14:45:08:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Total segment number is: 2

20181115:14:45:08:055070 hawq_stop:test-computing:gpadmin-[INFO]:-0 of 0 segments stop successfully, 2 segments stop skipped
20181115:14:45:08:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Segments stopped successfully
20181115:14:45:08:055070 hawq_stop:test-computing:gpadmin-[INFO]:-Cluster stopped successfully
20181115:14:45:08:055248 hawq_start:test-computing:gpadmin-[INFO]:-Prepare to do 'hawq start'
20181115:14:45:08:055248 hawq_start:test-computing:gpadmin-[INFO]:-You can find log in:
20181115:14:45:08:055248 hawq_start:test-computing:gpadmin-[INFO]:-/home/gpadmin/hawqAdminLogs/hawq_start_20181115.log
20181115:14:45:08:055248 hawq_start:test-computing:gpadmin-[INFO]:-GPHOME is set to:
20181115:14:45:08:055248 hawq_start:test-computing:gpadmin-[INFO]:-/usr/local/apache-hawq
20181115:14:45:08:055248 hawq_start:test-computing:gpadmin-[INFO]:-Start hawq with args: ['start', 'cluster']
20181115:14:45:08:055248 hawq_start:test-computing:gpadmin-[INFO]:-Gathering information and validating the environment...
20181115:14:45:08:055248 hawq_start:test-computing:gpadmin-[INFO]:-Start all the nodes in hawq cluster
20181115:14:45:08:055248 hawq_start:test-computing:gpadmin-[INFO]:-Starting standby master '192.168.60.18'
20181115:14:45:08:055248 hawq_start:test-computing:gpadmin-[INFO]:-Start standby master service
20181115:14:45:08:008250 hawqstandbywatch.py:test-computing2:gpadmin-[INFO]:-Checking standby master status
20181115:14:45:08:008250 hawqstandbywatch.py:test-computing2:gpadmin-[INFO]:-Monitoring logs
20181115:14:45:11:008250 hawqstandbywatch.py:test-computing2:gpadmin-[INFO]:-checking if syncmaster is running
20181115:14:45:12:008250 hawqstandbywatch.py:test-computing2:gpadmin-[INFO]:-syncmaster appears ok, pid 8237
20181115:14:45:12:055248 hawq_start:test-computing:gpadmin-[INFO]:-Standby master started successfully
20181115:14:45:12:055248 hawq_start:test-computing:gpadmin-[INFO]:-Starting master node '192.168.60.27'
20181115:14:45:12:055248 hawq_start:test-computing:gpadmin-[INFO]:-Start master service
20181115:14:45:13:055248 hawq_start:test-computing:gpadmin-[INFO]:-Checking if standby is synced with master
20181115:14:45:13:055248 hawq_start:test-computing:gpadmin-[ERROR]:-Failed to connect to database, this script can only be run when the database is up
Traceback (most recent call last):
  File "/usr/local/apache-hawq/bin/hawq_ctl", line 1459, in <module>
    start_hawq(opts, hawq_dict)
  File "/usr/local/apache-hawq/bin/hawq_ctl", line 1233, in start_hawq
    instance.run()
  File "/usr/local/apache-hawq/bin/hawq_ctl", line 765, in run
    check_return_code(self._start_all_nodes())
  File "/usr/local/apache-hawq/bin/hawq_ctl", line 701, in _start_all_nodes
    check_return_code(self.start_master(), logger, "Master start failed, exit", \
  File "/usr/local/apache-hawq/bin/hawq_ctl", line 618, in start_master
    sync_result = self._check_standby_sync()
  File "/usr/local/apache-hawq/bin/hawq_ctl", line 671, in _check_standby_sync
    for row in rows:
UnboundLocalError: local variable 'rows' referenced before assignment
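
The UnboundLocalError at the end of the traceback looks like a secondary failure: the "Failed to connect to database" error is logged first, and the loop over 'rows' then runs even though 'rows' was never assigned. Below is a minimal sketch of that pattern and one possible guard. It is not the actual hawq_ctl source; the function names, the row layout, and the "synced" state value are all hypothetical and only illustrate how the error can arise.

```python
import logging

logger = logging.getLogger("standby_sync_sketch")


def run_query_failing():
    # Hypothetical stand-in for the catalog query; simulates an unreachable master.
    raise ConnectionError("Failed to connect to database")


def check_standby_sync_buggy(run_query):
    # Mirrors the suspected pattern: the exception is logged, but execution continues.
    try:
        rows = run_query()
    except Exception as exc:
        logger.error("Failed to connect to database: %s", exc)
    # If run_query() raised, 'rows' was never bound, so this line raises
    # UnboundLocalError instead of reporting the real connection problem.
    for row in rows:
        if row[0] != "synced":  # "synced" is a made-up state value
            return False
    return True


def check_standby_sync_guarded(run_query):
    # Possible guard: return early when the query fails, so the caller sees
    # "unknown" (None) rather than a crash in the loop.
    try:
        rows = run_query()
    except Exception as exc:
        logger.error("Failed to connect to database: %s", exc)
        return None
    return all(row[0] == "synced" for row in rows)
```

With a guard like this the script would stop at the connection error itself, which appears to be the real problem here (the master is not accepting connections after start), rather than at the loop.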

Is it not possible to use start/restart cluster when HA is configured?