Posted to dev@hive.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/05/02 02:04:00 UTC
[jira] [Created] (HIVE-21676) use a system table as an alternative proc store

Sergey Shelukhin created HIVE-21676:
---------------------------------------

             Summary: use a system table as an alternative proc store
                 Key: HIVE-21676
                 URL: https://issues.apache.org/jira/browse/HIVE-21676
             Project: Hive
          Issue Type: Bug
            Reporter: Sergey Shelukhin


We keep hitting these issues:
{noformat}
2019-04-30 23:41:52,164 INFO  [master/master:17000:becomeActiveMaster] procedure2.ProcedureExecutor: Starting 16 core workers (bigger of cpus/4 or 16) with max (burst) worker count=160
2019-04-30 23:41:52,171 INFO  [master/master:17000:becomeActiveMaster] util.FSHDFSUtils: Recover lease on dfs file .../MasterProcWALs/pv2-00000000000000000481.log
2019-04-30 23:41:52,176 INFO  [master/master:17000:becomeActiveMaster] util.FSHDFSUtils: Recovered lease, attempt=0 on file=.../MasterProcWALs/pv2-00000000000000000481.log after 5ms
2019-04-30 23:41:52,288 INFO  [master/master:17000:becomeActiveMaster] util.FSHDFSUtils: Recover lease on dfs file .../MasterProcWALs/pv2-00000000000000000482.log
2019-04-30 23:41:52,289 INFO  [master/master:17000:becomeActiveMaster] util.FSHDFSUtils: Recovered lease, attempt=0 on file=.../MasterProcWALs/pv2-00000000000000000482.log after 1ms
2019-04-30 23:41:52,373 INFO  [master/master:17000:becomeActiveMaster] wal.WALProcedureStore: Rolled new Procedure Store WAL, id=483
2019-04-30 23:41:52,375 INFO  [master/master:17000:becomeActiveMaster] procedure2.ProcedureExecutor: Recovered WALProcedureStore lease in 206msec
2019-04-30 23:41:52,782 INFO  [master/master:17000:becomeActiveMaster] wal.ProcedureWALFormatReader: Read 1556 entries in .../MasterProcWALs/pv2-00000000000000000482.log
2019-04-30 23:41:55,370 INFO  [master/master:17000:becomeActiveMaster] wal.ProcedureWALFormatReader: Read 28113 entries in .../MasterProcWALs/pv2-00000000000000000481.log
2019-04-30 23:41:55,384 ERROR [master/master:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 166, max stack id is 181, root procedure is Procedure(pid=289380, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
2019-04-30 23:41:55,384 ERROR [master/master:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 178, max stack id is 181, root procedure is Procedure(pid=289380, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
2019-04-30 23:41:55,389 ERROR [master/master:17000:becomeActiveMaster] wal.WALProcedureTree: Missing stack id 359, max stack id is 360, root procedure is Procedure(pid=285640, ppid=-1, class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
{noformat}

After this, the affected procedures are lost and the cluster is stuck permanently.
The log shows no errors writing these files and no issues reading them from HDFS, so it's purely a data loss issue in the procedure store's own structure.

I was thinking about debugging it, but on second thought, what we are trying to do is store PB state by key.
Coincidentally, we have an "HBase" facility that we already deploy, that does just that... and it even has a WAL implementation. I don't know why we cannot use it for procedure state and have to invent another complex implementation of a KV store inside a KV store.
In all/most cases, we don't even support rollback and use the latest state, but if we need multiple versions, this HBase product even supports that! 
I think we should add an hbase:proc table that would be maintained similarly to meta. Maintaining it, especially given the existing code for meta, should be much simpler than a separate store implementation.
This should be pluggable and optional via the ProcStore interface (made more abstract as relevant: update state, scan state, get).
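
As a rough illustration of the abstraction described above, here is a minimal Java sketch of a store that keeps only the latest serialized state per procedure id and offers the three operations named (update state, scan state, get). All names here (ProcStateStore, InMemoryProcStateStore) are hypothetical and are not existing HBase APIs; the in-memory sorted map merely stands in for the proposed hbase:proc table, and the state is treated as opaque bytes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical interface for illustration only; not part of HBase.
interface ProcStateStore {
    // Store (or overwrite) the latest serialized state for a procedure.
    void update(long procId, byte[] serializedState);

    // Fetch the latest state for one procedure, if any.
    Optional<byte[]> get(long procId);

    // Return all stored states ordered by procedure id (e.g. for recovery on startup).
    List<Map.Entry<Long, byte[]>> scan();
}

// Assumption: a sorted in-memory map stands in for a table keyed by procedure id.
final class InMemoryProcStateStore implements ProcStateStore {
    private final ConcurrentSkipListMap<Long, byte[]> rows = new ConcurrentSkipListMap<>();

    @Override
    public void update(long procId, byte[] serializedState) {
        // Defensive copy so later changes by the caller do not alter stored state.
        rows.put(procId, serializedState.clone());
    }

    @Override
    public Optional<byte[]> get(long procId) {
        byte[] value = rows.get(procId);
        return value == null ? Optional.empty() : Optional.of(value.clone());
    }

    @Override
    public List<Map.Entry<Long, byte[]>> scan() {
        List<Map.Entry<Long, byte[]>> out = new ArrayList<>();
        for (Map.Entry<Long, byte[]> e : rows.entrySet()) {
            out.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue().clone()));
        }
        return out;
    }
}
```

In this sketch a second update for the same procedure id simply replaces the earlier value, which matches the "use the latest state" case; keeping multiple versions would need a different value type.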


