Posted to issues@impala.apache.org by "Joe McDonnell (Jira)" <ji...@apache.org> on 2019/11/13 22:10:00 UTC
[jira] [Resolved] (IMPALA-9150) Restarting minicluster breaks HBase on CDH GBN 1582079

     [ https://issues.apache.org/jira/browse/IMPALA-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe McDonnell resolved IMPALA-9150.
-----------------------------------
    Fix Version/s: Impala 3.4.0
       Resolution: Fixed

> Restarting minicluster breaks HBase on CDH GBN 1582079
> ------------------------------------------------------
>
>                 Key: IMPALA-9150
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9150
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.4.0
>            Reporter: Joe McDonnell
>            Priority: Blocker
>             Fix For: Impala 3.4.0
>
>
> On the most recent CDH GBN (1582079), restarting HBase using our normal scripts (testdata/bin/kill-hbase.sh / testdata/bin/run-hbase.sh) results in an unusable HBase. Our testdata/bin/kill-hbase.sh script uses the kill-java-service.sh script:
> {code:java}
> "$DIR"/kill-java-service.sh -c HRegionServer -c HMaster -c HQuorumPeer -s 2
> {code}
> This kills the region servers before the master. On CDH GBN 1582079, the master gets unhappy:
> {noformat}
> 19/11/10 16:40:17 INFO master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [localhost,16022,1573402351656]
> 19/11/10 16:40:17 INFO master.ServerManager: Processing expiration of localhost,16022,1573402351656 on localhost,16000,1573402349553
> ... same for other region servers ...
> 19/11/10 16:40:17 INFO procedure.ServerCrashProcedure: Start pid=102, state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure server=localhost,16022,1573402351656, splitWal=true, meta=false
> ... same for other region servers ...
> 19/11/10 16:40:17 INFO master.SplitLogManager: hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting dir is empty, no logs to split.
> 19/11/10 16:40:17 INFO master.SplitLogManager: Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in [hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting] in 0ms
> ... more stuff ...
> 19/11/10 16:40:17 ERROR procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=102, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=localhost,16022,1573402351656, splitWal=true, meta=false
> java.lang.NullPointerException
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createAssignProcedures(AssignmentManager.java:646)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:601)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:571)
>     at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:188)
>     at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:59)
>     at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
>     at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:965)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1742)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
> {noformat}
> Then, when the master starts up again, it remains unhappy:
> {noformat}
> 19/11/10 16:50:58 WARN master.HMaster: hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT online; state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428, server=localhost,16022,1573402351656}; ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern until region onlined.
> ... more of this ...
> 19/11/10 16:59:28 WARN master.HMaster: hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT online; state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428, server=localhost,16022,1573402351656}; ServerCrashProcedures=false. Master startup cannot progress, in holding-pattern until region onlined.
> 19/11/10 17:05:46 ERROR master.HMaster: Master failed to complete initialization after 900000ms. Please consider submitting a bug report including a thread dump of this process.
> {noformat}
> This continues for an indefinite amount of time.
> Current workaround: Use HBase's bin/stop-hbase.sh script rather than our testdata/bin/kill-hbase.sh script. I do not see the problem when using that script, as it does a more graceful shutdown. We should look into changing testdata/bin/kill-hbase.sh to use bin/stop-hbase.sh.
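
A minimal sketch of what that change could look like, assuming the HBase install root and the path to kill-java-service.sh are passed in as arguments. The function name stop_hbase, its two arguments, and the fall-back-to-hard-kill behavior are hypothetical and only for illustration; the only pieces taken from the report are bin/stop-hbase.sh and the existing kill-java-service.sh invocation.

```shell
# Sketch only: stop HBase gracefully, falling back to the existing hard kill.
# stop_hbase, its arguments, and the fallback are hypothetical; only
# bin/stop-hbase.sh and the kill-java-service.sh arguments come from the report.
stop_hbase() {
  hbase_home="$1"     # assumed: root directory of the HBase install
  kill_script="$2"    # assumed: path to kill-java-service.sh
  stop_script="${hbase_home}/bin/stop-hbase.sh"

  # Prefer HBase's own stop script, which the report found to shut down
  # more gracefully than killing the Java processes.
  if [ -x "${stop_script}" ] && "${stop_script}"; then
    echo "stopped HBase via stop-hbase.sh"
    return 0
  fi

  # Fallback (assumption): the current hard kill, with the same arguments
  # that kill-hbase.sh uses today.
  echo "stop-hbase.sh unavailable or failed; falling back to kill-java-service.sh"
  "${kill_script}" -c HRegionServer -c HMaster -c HQuorumPeer -s 2
}
```

Inside testdata/bin/kill-hbase.sh this might be called with the HBase install directory and "$DIR"/kill-java-service.sh. Whether a fallback to the hard kill is desirable at all is an open question, since the hard kill is what triggers the broken state on this GBN.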


