You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2017/02/16 17:42:41 UTC
[jira] [Commented] (HBASE-17653) HBASE-17624 rsgroup synchronizations will (distributed) deadlock

    [ https://issues.apache.org/jira/browse/HBASE-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870345#comment-15870345 ] 

stack commented on HBASE-17653:
-------------------------------

[~toffer] Here is feedback that made me dig in on synchronization to try and tell a coherent story around it. From [~busbey] internal review... as part of an attempt at a backport:

{code}
hbase-rsgroup/src/main/java/org/apache/hadoop/hbase/rsgroup/RSGroupAdminServer.java
Line 2:

nit: incorrect header.
Line 348:

nit: doesn't the synchronized above mean only one balance will run at a time? so these RIT would be from e.g. server failure? Couldn't those happen after this point?
Line 405:

Throughout this class we're inconsistent with locking on the object returned from this call. From reading both this class' use, the interface RSGroupInfoManager, and the default implementation it's not clear to me what correct synchronized handling is supposed to look like.

However, we synchronize for removing groups but not adding them and sometimes when retrieving them, so my intuition is that something is incorrect.
{code}

and.....


{code}
hbase-rsgroup/src/main/java/org/apache/hadoop/hbase/rsgroup/RSGroupInfoManagerImpl.java
Line 2:

nit: incorrect header
Line 148:

what are we trying to protect by synchronizing these methods?

if it's rsGroupMap, we still have methods that make use of the map without a proper synchronization call.

reading through things, I think it's trying to keep updates to the reference for rsGroupMap atomic (since I *think* all the things written there are unmodifiable, which would make the unsynchronized access safe). It would be much clearer if we did that via an AtomicReference or atleast documented that this is what our intention is for all these synchronized methods.
{code}

Then, in rb [~Apache9] questioned synchronization here https://reviews.apache.org/r/56570/diff/1?file=1630642#file1630642line221  RSGroupInfoManagerImpl is used by RSGroupAdminServer. It holds the monitor for RSGroupInfoManagerImpl across a set of operations making the reviewer think this method does not need synchronization.

Let me look back at the original and with your readers-should-not-be-blocked prescription, come up w/ a patch for here...

> HBASE-17624 rsgroup synchronizations will (distributed) deadlock
> ----------------------------------------------------------------
>
>                 Key: HBASE-17653
>                 URL: https://issues.apache.org/jira/browse/HBASE-17653
>             Project: HBase
>          Issue Type: Bug
>          Components: rsgroup
>            Reporter: stack
>            Assignee: stack
>
> Follow-on from HBASE-17624. HBASE-17624 made it so one thread only has access to the rsgroup administrator. In tail of HBASE-17624 [~toffer] describes scenario under which we  may end up in a deadlock (distributed). Let me repeat [~toffer] comment...
> {code}
> Both read/write access can't be single threaded. Consider the situation:
> 1. move_rsgroup_servers is called
> 2. while #1 is happening rsgroup region is in transition (rpc thread in #1 holds monitor lock)
> 3. while #2 is happening meta is in transition.
> Balancer tries to figure out plan for meta region tries to get monitor lock but can't. rpc thread task won't release monitor lock since rsgroup region never gets assigned. rsgroup region never gets assigned because it can't update meta with new state.
> There's a good chance this can be reproduce just by moving both rsgroup and meta region onto the same RS and call move_rsgoup_servers on the same RS.
> A bunch different actors will query from group affiliation so we can't have writes block reads.
> ....
> In the code prior to this patch the getter methods that retrieve group information (getRSGroup, ofTable, OfServer, etc) don't require the monitor lock so the deadlock cycle is broken.
> ....
> The methods that does mutations and updates to zk and hbase:rsgroup are synchronized appropriately. Point me to where the incoherence is?
> {code}
> This issue is about testing/fixing/restoring rsgroup access. Will be back.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)