Posted to issues@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/05/02 17:40:00 UTC

[jira] [Moved] (HBASE-22352) use a system table as an alternative proc store

     [ https://issues.apache.org/jira/browse/HBASE-22352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin moved HIVE-21676 to HBASE-22352:
-------------------------------------------------

        Key: HBASE-22352  (was: HIVE-21676)
    Project: HBase  (was: Hive)

> use a system table as an alternative proc store
> -----------------------------------------------
>
>                 Key: HBASE-22352
>                 URL: https://issues.apache.org/jira/browse/HBASE-22352
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> We keep hitting these issues:
> {noformat}
> 2019-04-30 23:41:52,164 INFO  [master/master:17000:becomeActiveMaster] procedure2.ProcedureExecutor: Starting 16 core workers (bigger of cpus/4 or 16) with max (burst) worker count=160
> 2019-04-30 23:41:52,171 INFO  [master/master:17000:becomeActiveMaster] util.FSHDFSUtils: Recover lease on dfs file .../MasterProcWALs/pv2-00000000000000000481.log
> 2019-04-30 23:41:52,176 INFO  [master/master:17000:becomeActiveMaster] util.FSHDFSUtils: Recovered lease, attempt=0 on file=.../MasterProcWALs/pv2-00000000000000000481.log after 5ms
> 2019-04-30 23:41:52,288 INFO  [master/master:17000:becomeActiveMaster] util.FSHDFSUtils: Recover lease on dfs file .../MasterProcWALs/pv2-00000000000000000482.log
> 2019-04-30 23:41:52,289 INFO  [master/master:17000:becomeActiveMaster] util.FSHDFSUtils: Recovered lease, attempt=0 on file=.../MasterProcWALs/pv2-00000000000000000482.log after 1ms
> 2019-04-30 23:41:52,373 INFO  [master/master:17000:becomeActiveMaster] wal.WALProcedureStore: Rolled new Procedure Store WAL, id=483
> 2019-04-30 23:41:52,375 INFO  [master/master:17000:becomeActiveMaster] procedure2.ProcedureExecutor: Recovered WALProcedureStore lease in 206msec
> 2019-04-30 23:41:52,782 INFO  [master/master:17000:becomeActiveMaster] wal.ProcedureWALFormatReader: Read 1556 entries in .../MasterProcWALs/pv2-00000000000000000482.log
> 2019-04-30 23:41:55,370 INFO  [master/master:17000:becomeActiveMaster] wal.ProcedureWALFormatReader: Read 28113 entries in .../MasterProcWALs/pv2-00000000000000000481.log
> 2019-04-30 23:41:55,384 ERROR [master/master:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 166, max stack id is 181, root procedure is Procedure(pid=289380, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
> 2019-04-30 23:41:55,384 ERROR [master/master:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 178, max stack id is 181, root procedure is Procedure(pid=289380, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
> 2019-04-30 23:41:55,389 ERROR [master/master:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 359, max stack id is 360, root procedure is Procedure(pid=285640, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
> {noformat}
> After this, the affected procedures are lost and the cluster is stuck permanently.
> The log shows no errors writing these files and no issues reading them from HDFS, so this is purely a data loss issue in the store's own structure.
> I was thinking about debugging it, but on second thought, all we are trying to store is a protobuf (PB) blob, by key.
> Coincidentally, we have an "HBase" facility that we already deploy, that does just that... and it even has a WAL implementation. I don't know why we cannot use it for procedure state and have to invent another complex implementation of a KV store inside a KV store.
> In all/most cases, we don't even support rollback and use the latest state, but if we need multiple versions, this HBase product even supports that! 
> I think we should add an hbase:proc table that would be maintained similarly to meta. That part, especially given the existing code for meta, should be much simpler than a separate store implementation.
> This should be pluggable and optional via the ProcStore interface (made more abstract as relevant: update state, scan state, get).
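
To make the proposed abstraction concrete, here is a minimal sketch of what a store reduced to "update state, scan state, get" might look like. Everything in it is an assumption for illustration: ProcStateStore and InMemoryProcStateStore are hypothetical names, not the existing HBase procedure store API, and the in-memory sorted map merely stands in for an hbase:proc table keyed by procedure id. It keeps only the latest state per procedure, matching the "use the latest state" case described in the issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical interface (an assumption, not the real HBase API): the store is
// reduced to the three operations named in the issue.
interface ProcStateStore {
  // Insert or overwrite the serialized state blob for a procedure id.
  void update(long procId, byte[] state);

  // Return the latest state blob for a procedure id, if any.
  Optional<byte[]> get(long procId);

  // Return the procedure ids in [startProcId, stopProcId), in ascending order.
  List<Long> scan(long startProcId, long stopProcId);
}

// Hypothetical in-memory stand-in for a table-backed store. A real
// implementation would issue puts, gets, and scans against a system table.
final class InMemoryProcStateStore implements ProcStateStore {
  // Sorted by procedure id, the way rows in a table are sorted by row key.
  private final ConcurrentSkipListMap<Long, byte[]> rows = new ConcurrentSkipListMap<>();

  @Override
  public void update(long procId, byte[] state) {
    // Copy the blob so later changes by the caller cannot alter stored state.
    rows.put(procId, state.clone());
  }

  @Override
  public Optional<byte[]> get(long procId) {
    byte[] state = rows.get(procId);
    return state == null ? Optional.empty() : Optional.of(state.clone());
  }

  @Override
  public List<Long> scan(long startProcId, long stopProcId) {
    // Start key inclusive, stop key exclusive.
    return new ArrayList<>(rows.subMap(startProcId, true, stopProcId, false).keySet());
  }
}
```

With an interface of this shape, the existing WAL-based store and a table-backed store could be swapped behind the same three calls, which is the pluggability the issue asks for.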



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)