Posted to issues@hbase.apache.org by "Fabrice Rabaute (Jira)" <ji...@apache.org> on 2020/02/12 23:40:00 UTC

[jira] [Commented] (HBASE-23282) HBCKServerCrashProcedure for 'Unknown Servers'

    [ https://issues.apache.org/jira/browse/HBASE-23282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035800#comment-17035800 ] 

Fabrice Rabaute commented on HBASE-23282:
-----------------------------------------

Hi,

 

I'm having an issue where an "Unknown Server" is reported. I upgraded from 2.2.1 to 2.2.3, but I still get this Unknown Server even after running an SCP.

My region info is as follows:

 

 
{code:java}
COLUMN CELL 
...
 info:server timestamp=1581391440980, value=regionserver-2.hbase.hbase.svc.cluster.local:16020 
 info:serverstartcode timestamp=1581391440980, value=1573519312100 
 info:sn timestamp=1581549272576, value=regionserver-0.hbase.hbase.svc.cluster.local,16020,1581546727391 
 info:state timestamp=1581549272576, value=OPENING
....
 
{code}
 

I don't know what server/serverstartcode/sn mean, but they don't seem to match: the start codes are different. Is that expected?

 

In the HBCK UI, I have this info for the reported "Inconsistent Regions":

 
{code:java}
encoded region: 353ab75c788cd0f77027706900453c49
location in META: regionserver-2.hbase.hbase.svc.cluster.local,16020,1581546563369
{code}
 

 

I have this info for the "Unknown Servers" reported:

 
{code:java}
RegionInfo: 353ab75c788cd0f77027706900453c49
ServerName: regionserver-2.hbase.hbase.svc.cluster.local,16020,1573519312100
{code}
 

 

So, based on this data, three different regionservers are reported for this region.
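
To make the comparison concrete, here is a minimal sketch in plain Python (not the HBase API) that lines up the values shown above. It assumes a server name has the form host,port,startcode, and that info:server and info:serverstartcode together identify one server instance. The helper names are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class ServerName(NamedTuple):
    """One regionserver instance: same host/port with a new startcode is a new instance."""
    host: str
    port: int
    startcode: int


def parse_server_name(text: str) -> ServerName:
    """Parse the 'host,port,startcode' form shown in info:sn and the HBCK report."""
    host, port, startcode = text.strip().split(",")
    return ServerName(host, int(port), int(startcode))


def from_meta_columns(server: str, serverstartcode: str) -> ServerName:
    """Combine info:server ('host:port') with info:serverstartcode (assumed pairing)."""
    host, port = server.rsplit(":", 1)
    return ServerName(host, int(port), int(serverstartcode))


# Values copied from the region info and the HBCK report above.
meta_server = from_meta_columns(
    "regionserver-2.hbase.hbase.svc.cluster.local:16020", "1573519312100")
meta_sn = parse_server_name(
    "regionserver-0.hbase.hbase.svc.cluster.local,16020,1581546727391")
hbck_meta_location = parse_server_name(
    "regionserver-2.hbase.hbase.svc.cluster.local,16020,1581546563369")
hbck_unknown_server = parse_server_name(
    "regionserver-2.hbase.hbase.svc.cluster.local,16020,1573519312100")

# A set collapses identical instances, leaving the distinct ones.
distinct = {meta_server, meta_sn, hbck_meta_location, hbck_unknown_server}
for name in sorted(distinct):
    print(name)
```

Under these assumptions, the info:server/info:serverstartcode pair and the reported "Unknown Server" are the same instance, so the four values collapse to three distinct server instances.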

 

Is there an automated or manual procedure to recover from such a state?

 

Thanks.

 

 

> HBCKServerCrashProcedure for 'Unknown Servers'
> ----------------------------------------------
>
>                 Key: HBASE-23282
>                 URL: https://issues.apache.org/jira/browse/HBASE-23282
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck2, proc-v2
>    Affects Versions: 2.2.2
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0, 2.3.0, 2.2.3
>
>
> With an overdriving, sustained load, I can fairly easily manufacture an hbase:meta table that references servers that are no longer in the live list nor are members of deadservers; i.e. 'Unknown Servers'.  The new 'HBCK Report' UI in Master has a section where it lists 'Unknown Servers' if any in hbase:meta.
> Once in this state, the repair is awkward. Our assign/unassign Procedure is particularly dogged about insisting that we confirm close/open of Regions when it is going about its business, which is well and good if the server is in the live/dead sets, but when it is an 'Unknown Server', we invariably end up trying to confirm against a no-longer-present server (More on this in follow-on issues).
> What is wanted is queuing of a ServerCrashProcedure for each 'Unknown Server'. It would split any WALs (there shouldn't be any if server was restarted) and ideally it would cancel out any assigns and reassign regions off the 'Unknown Server'.  But the 'normal' SCP consults the in-memory cluster state figuring what Regions were on the crashed server... And 'Unknown Servers' don't have state in in-master memory Maps of Servers to Regions or  in DeadServers list which works fine for the usual case.
> Suggestion here is that hbck2 be able to drive in a special SCP, one which would get list of Regions by scanning hbase:meta rather than asking Master memory; an HBCKSCP.
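
As a rough illustration of the idea in the description quoted above, the sketch below models "scan hbase:meta and keep the servers that are neither live nor dead" with plain Python dictionaries and sets. It is not the HBCKSCP implementation and does not use any HBase API. The function name, the data layout, and the sample server names are all assumptions made for the example.

```python
from typing import Dict, List, Set


def find_unknown_servers(
    meta_locations: Dict[str, str],
    live_servers: Set[str],
    dead_servers: Set[str],
) -> Dict[str, List[str]]:
    """Map each server referenced in meta, but absent from both the live and
    dead sets, to the regions that reference it.

    meta_locations maps encoded region name -> 'host,port,startcode'
    (a stand-in for the result of scanning hbase:meta).
    """
    unknown: Dict[str, List[str]] = {}
    for region, server in meta_locations.items():
        if server not in live_servers and server not in dead_servers:
            unknown.setdefault(server, []).append(region)
    return unknown


# Hypothetical cluster state, for illustration only.
meta_locations = {
    "region-a": "rs-1.example,16020,100",  # old instance, restarted since
    "region-b": "rs-1.example,16020,200",  # current instance
    "region-c": "rs-2.example,16020,150",  # known-dead instance
}
live_servers = {"rs-1.example,16020,200"}
dead_servers = {"rs-2.example,16020,150"}

unknown = find_unknown_servers(meta_locations, live_servers, dead_servers)
print(unknown)
```

The point of the sketch is that the region list comes from the meta rows themselves, so it still works when the Master holds no in-memory state for the server in question.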


