You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Matteo Bertozzi (JIRA)" <ji...@apache.org> on 2012/12/28 15:20:13 UTC

[jira] [Commented] (HBASE-7451) [snapshot] regionserver will deadlock when GlobalSnapshotOperation timeout happens

    [ https://issues.apache.org/jira/browse/HBASE-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540469#comment-13540469 ] 

Matteo Bertozzi commented on HBASE-7451:
----------------------------------------

Thanks for taking a look at the code.

In the last months that code was refactored and simplified, and now the 3phase-commit doesn't exist anymore. Take a look at "HBASE-7212 Globally Barriered Procedure Mechanism" that is the replacement for that (review here: https://reviews.apache.org/r/8240/). 

globally consistent snapshot was also put aside, and it will not be in the first snapshot cut.

You can find the full branch containing the new procedure mechanism and the new online snapshot code here: https://github.com/jmhsieh/hbase/commits/snapshots-work
                
> [snapshot] regionserver will deadlock when GlobalSnapshotOperation timeout happens
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-7451
>                 URL: https://issues.apache.org/jira/browse/HBASE-7451
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>            Reporter: terry zhang
>            Assignee: terry zhang
>
> Hi Matteo Bertozzi and Jesse Yates, My observation is base on code in github : https://github.com/matteobertozzi/hbase/
> If we create a snapshot and meet regionserver timeout. Rs will be lock and can not put any data. Please take a look at log below :
> // regionserver snapshot timeout
> org.apache.hadoop.hbase.server.commit.distributed.DistributedCommitException: org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException: Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
> at org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.wrap(DistributedThreePhaseCommitErrorDispatcher.java:135)
> at org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.operationTimeout(DistributedThreePhaseCommitErrorDispatcher.java:71)
> at org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:92)
> at org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:89)
> at org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer$1.run(OperationAttemptTimer.java:71)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
> Caused by: org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException: Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
> ... 3 more
> 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
> 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: Cleanup snapshot - handled in sub-tasks on error
> 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.serv
> //Waiting for 'commit allowed' latch and do not exist
> 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
> 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: Cleanup snapshot - handled in sub-tasks on error
> 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running finish phase.
> 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: Finish snapshot - handling in subtasks on error
> 2012-12-26 18:44:57,212 WARN org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer: Timer already marked completed, ignoring!
> 2012-12-26 18:45:01,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:06,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:11,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:16,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:17,002 INFO org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Received children changed event:/hbase-TERRY-73/online-snapshot/prepare
> 2012-12-26 18:45:17,002 INFO org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Recieved start event.
> 2012-12-26 18:45:17,002 DEBUG org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Looking for new operations under znode:/hbase-TERRY-73/online-snapshot/prepare
> 2012-12-26 18:45:17,003 INFO org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Received children changed event:/hbase-TERRY-73/online-snapshot/abort
> 2012-12-26 18:45:17,003 INFO org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Recieved abort event.
> 2012-12-26 18:45:17,003 DEBUG org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController: Checking for aborted operations on node:/hbase-TERRY-73/online-snapshot/abort
> 2012-12-26 18:45:21,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:26,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:31,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:36,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:41,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:43,481 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=11.77 MB, free=1.39 GB, max=1.4 GB, blocks=5, accesses=96, hits=91, hitRatio=94.79%, cachingAccesses=96, cachingHits=91, cachingHitsRatio=94.79%, evictions=0, evicted=0, evictedPerRun=NaN
> 2012-12-26 18:45:46,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:51,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:56,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:01,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:06,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:11,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:16,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:21,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:26,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:31,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:36,996 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:41,996 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 'commit allowed' latch. (sleep:5000 ms)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira