Posted to issues@hbase.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2018/07/06 00:02:00 UTC
[jira] [Commented] (HBASE-20796) STUCK RIT though region successfully assigned

    [ https://issues.apache.org/jira/browse/HBASE-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534305#comment-16534305 ] 

Hadoop QA commented on HBASE-20796:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  3s{color} | {color:red} HBASE-20796 does not apply to master. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/0.7.0/precommit-patchnames for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HBASE-20796 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12930471/0001-Test.patch |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/13518/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> STUCK RIT though region successfully assigned
> ---------------------------------------------
>
>                 Key: HBASE-20796
>                 URL: https://issues.apache.org/jira/browse/HBASE-20796
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 3.0.0, 2.0.2, 2.1.1
>
>         Attachments: 0001-Test.patch, HBASE-20796.branch-2.0.001.patch
>
>
> This is a good one. We keep logging messages like this:
> {code}
> 2018-06-26 12:32:24,859 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=vd0410.X.Y.com,22101,1529611445046, table=IntegrationTestBigLinkedList_20180525080406, region=e10b35d49528e2453a04c7038e3393d7
> {code}
> ...though the region is successfully assigned.
> Story:
>  * Dispatch an assign 2018-06-26 12:31:27,390 INFO org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Dispatch pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046
>  * It gets stuck 2018-06-26 12:32:29,860 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046, table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2 (Because the server was killed)
>  * We stay STUCK for a while.
>  * The Master notices that the server has crashed and starts a ServerCrashProcedure (SCP).
>  * SCP kills ongoing assign: 2018-06-26 12:32:54,809 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=371105 found RIT pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046
>  * The kill brings on a retry ... 2018-06-26 12:32:54,810 WARN org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote call failed pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046; exception=ServerCrashProcedure pid=371105, server=vd0410.X.Y.Z,22101,1529611445046
>  * Which eventually succeeds; the region is successfully deployed to a new server: 2018-06-26 12:32:55,429 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=370829, ppid=370391, state=SUCCESS; AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2 in 1mins, 35.379sec
>  * But then it looks like the RPC was still ongoing, and it broke in the following way: 2018-06-26 12:33:06,378 WARN org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote call failed pid=370829, ppid=370391, state=SUCCESS; AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN, location=vc0614.halxg.cloudera.com,22101,1529611443424; exception=Call to vd0410.X.Y.Z/10.10.10.10:22101 failed on local exception: org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer (Notice how the region state is OPEN and the procedure state is SUCCESS).
>  * Then it says: 2018-06-26 12:33:06,380 INFO org.apache.hadoop.hbase.master.assignment.AssignProcedure: Retry=1 of max=10; pid=370829, ppid=370391, state=SUCCESS; AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN, location=vc0614.X.Y.Z,22101,1529611443424
>  * And finally...  2018-06-26 12:34:10,727 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=null, table=IntegrationTestBigLinkedList_20180612114844, region=f69ccf7d9178ce166b515e0e2ef019d2
> A restart of the Master got rid of the STUCK complaints.
> This is interesting because the stuck RPC and the successful reassign are both riding on the same pid.
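
The pattern described above (a STUCK warning logged for a region after its AssignProcedure already finished with SUCCESS) can be searched for in a Master log. The following is only a rough sketch, not part of HBase: the function name `stuck_after_success` and both regular expressions are assumptions derived from the log lines quoted in this issue, and other HBase versions may format these messages differently.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the Master log lines quoted in this issue.
TS = r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"

# Assumed patterns, modelled only on the messages quoted above.
STUCK_RE = re.compile(
    TS + r" WARN .*STUCK Region-In-Transition rit=(?P<rit>\w+), "
    r"location=(?P<location>.*?), table=(?P<table>[^,\s]+), "
    r"region=(?P<region>[0-9a-f]+)"
)
FINISHED_RE = re.compile(
    TS + r" INFO .*Finished pid=(?P<pid>\d+),.*state=SUCCESS; "
    r"AssignProcedure table=(?P<table>[^,\s]+), region=(?P<region>[0-9a-f]+)"
)


def parse_ts(text):
    """Parse a log timestamp such as '2018-06-26 12:32:55,429'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def stuck_after_success(lines):
    """Return (region, rit, location, seconds_since_success) for every
    STUCK warning logged after the same region's assign finished."""
    finished = {}  # region -> time its AssignProcedure reported SUCCESS
    hits = []
    for line in lines:
        m = FINISHED_RE.search(line)
        if m:
            finished[m["region"]] = parse_ts(m["ts"])
            continue
        m = STUCK_RE.search(line)
        if m and m["region"] in finished:
            stuck_at = parse_ts(m["ts"])
            done_at = finished[m["region"]]
            if stuck_at > done_at:
                delay = (stuck_at - done_at).total_seconds()
                hits.append((m["region"], m["rit"], m["location"], delay))
    return hits


# Three of the log lines quoted in the issue description.
SAMPLE_LOG = [
    "2018-06-26 12:32:29,860 WARN org.apache.hadoop.hbase.master.assignment."
    "AssignmentManager: STUCK Region-In-Transition rit=OPENING, "
    "location=vd0410.X.Y.Z,22101,1529611445046, "
    "table=IntegrationTestBigLinkedList_20180612114844, "
    "region=f69ccf7d9178ce166b515e0e2ef019d2",
    "2018-06-26 12:32:55,429 INFO org.apache.hadoop.hbase.procedure2."
    "ProcedureExecutor: Finished pid=370829, ppid=370391, state=SUCCESS; "
    "AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, "
    "region=f69ccf7d9178ce166b515e0e2ef019d2 in 1mins, 35.379sec",
    "2018-06-26 12:34:10,727 WARN org.apache.hadoop.hbase.master.assignment."
    "AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=null, "
    "table=IntegrationTestBigLinkedList_20180612114844, "
    "region=f69ccf7d9178ce166b515e0e2ef019d2",
]

if __name__ == "__main__":
    for region, rit, location, delay in stuck_after_success(SAMPLE_LOG):
        print(f"{region}: STUCK rit={rit} location={location} "
              f"{delay:.3f}s after SUCCESS")
```

Only STUCK warnings that come after the SUCCESS line are reported, so the earlier rit=OPENING warning (logged while the server was actually dead) is ignored and the later rit=OFFLINE one is flagged.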



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)