You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by Guanghao Zhang <zg...@gmail.com> on 2020/10/22 06:12:41 UTC

Procedure schedule slowly when run branch-2.2 UT TestRegionReplicaFailover

Run the same UT TestRegionReplicaFailover on my local PC, mvn clean test
-Dtest=TestRegionReplicaFailover, branch-2.2 takes 8 mins but branch-2.3
only needs 2 mins.

I found the problem is related to procedure schedule. See the below log:
2020-10-21 13:52:28,097 INFO  [PEWorker-1]
procedure2.ProcedureExecutor(1427): Finished pid=296, ppid=45,
state=SUCCESS;
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.6250sec
2020-10-21 13:52:28,538 INFO  [PEWorker-3]
procedure2.ProcedureExecutor(1427): Finished pid=45, ppid=20,
state=SUCCESS; TransitRegionStateProcedure table=testLotsOfRegionRepli2,
region=50703895da3cb8c942d3197600d549bc, ASSIGN in 59.4330sec

The real assign procedure only cost 1.6 seconds but
the TransitRegionStateProcedure cost 59.4 seconds. The pid=45 procedure was
initialized at 2020-10-21 13:51:28,666. It was added to TableQueue
at 2020-10-21 13:51:28,789. But took xlock to run at 2020-10-21
13:52:24,761. See the below log:
2020-10-21 13:51:28,789 DEBUG [PEWorker-4]
procedure.MasterProcedureScheduler(352): Add
TableQueue(testLotsOfRegionRepli2, xlock=true (20) sharedLock=0 size=25) to
run queue because: pid=45, ppid=20,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=testLotsOfRegionRepli2,
region=50703895da3cb8c942d3197600d549bc, ASSIGN has the excusive lock access
2020-10-21 13:52:24,761 INFO  [PEWorker-2]
procedure.MasterProcedureScheduler(737): Took xlock for pid=45, ppid=20,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=testLotsOfRegionRepli2,
region=50703895da3cb8c942d3197600d549bc, ASSIGN


But when I tried this UT on another PC, it only cost 2 mins, which is the
same with branch-2.3. It is weird. Any ideas about this? Thanks.

Re: Procedure schedule slowly when run branch-2.2 UT TestRegionReplicaFailover

Posted by Guanghao Zhang <zg...@gmail.com>.
More procedure workers (hbase.master.procedure.threads 5 => 16) can reduce
the time from 8 mins to 6 mins. But branch-2.3 used the same 5 procedure
workers.....

Guanghao Zhang <zg...@gmail.com> 于2020年10月22日周四 下午2:22写道:

> And I am sure that it is not the lock problem, because there is no
> "Waiting on xlock for" log.
>
> LOG.info("Waiting on xlock for {} held by pid={}", procedure,
>               regionLocks[i].getExclusiveLockProcIdOwner());
>
> Guanghao Zhang <zg...@gmail.com> 于2020年10月22日周四 下午2:12写道:
>
>> Run the same UT TestRegionReplicaFailover on my local PC, mvn clean test
>> -Dtest=TestRegionReplicaFailover, branch-2.2 takes 8 mins but branch-2.3
>> only needs 2 mins.
>>
>> I found the problem is related to procedure schedule. See the below log:
>> 2020-10-21 13:52:28,097 INFO  [PEWorker-1]
>> procedure2.ProcedureExecutor(1427): Finished pid=296, ppid=45,
>> state=SUCCESS;
>> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.6250sec
>> 2020-10-21 13:52:28,538 INFO  [PEWorker-3]
>> procedure2.ProcedureExecutor(1427): Finished pid=45, ppid=20,
>> state=SUCCESS; TransitRegionStateProcedure table=testLotsOfRegionRepli2,
>> region=50703895da3cb8c942d3197600d549bc, ASSIGN in 59.4330sec
>>
>> The real assign procedure only cost 1.6 seconds but
>> the TransitRegionStateProcedure cost 59.4 seconds. The pid=45 procedure was
>> initialized at 2020-10-21 13:51:28,666. It was added to TableQueue
>> at 2020-10-21 13:51:28,789. But took xlock to run at 2020-10-21
>> 13:52:24,761. See the below log:
>> 2020-10-21 13:51:28,789 DEBUG [PEWorker-4]
>> procedure.MasterProcedureScheduler(352): Add
>> TableQueue(testLotsOfRegionRepli2, xlock=true (20) sharedLock=0 size=25) to
>> run queue because: pid=45, ppid=20,
>> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
>> TransitRegionStateProcedure table=testLotsOfRegionRepli2,
>> region=50703895da3cb8c942d3197600d549bc, ASSIGN has the excusive lock access
>> 2020-10-21 13:52:24,761 INFO  [PEWorker-2]
>> procedure.MasterProcedureScheduler(737): Took xlock for pid=45, ppid=20,
>> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
>> TransitRegionStateProcedure table=testLotsOfRegionRepli2,
>> region=50703895da3cb8c942d3197600d549bc, ASSIGN
>>
>>
>> But when I tried this UT on another PC, it only cost 2 mins, which is the
>> same with branch-2.3. It is weird. Any ideas about this? Thanks.
>>
>

Re: Procedure schedule slowly when run branch-2.2 UT TestRegionReplicaFailover

Posted by Guanghao Zhang <zg...@gmail.com>.
And I am sure that it is not the lock problem, because there is no "Waiting
on xlock for" log.

LOG.info("Waiting on xlock for {} held by pid={}", procedure,
              regionLocks[i].getExclusiveLockProcIdOwner());

Guanghao Zhang <zg...@gmail.com> 于2020年10月22日周四 下午2:12写道:

> Run the same UT TestRegionReplicaFailover on my local PC, mvn clean test
> -Dtest=TestRegionReplicaFailover, branch-2.2 takes 8 mins but branch-2.3
> only needs 2 mins.
>
> I found the problem is related to procedure schedule. See the below log:
> 2020-10-21 13:52:28,097 INFO  [PEWorker-1]
> procedure2.ProcedureExecutor(1427): Finished pid=296, ppid=45,
> state=SUCCESS;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.6250sec
> 2020-10-21 13:52:28,538 INFO  [PEWorker-3]
> procedure2.ProcedureExecutor(1427): Finished pid=45, ppid=20,
> state=SUCCESS; TransitRegionStateProcedure table=testLotsOfRegionRepli2,
> region=50703895da3cb8c942d3197600d549bc, ASSIGN in 59.4330sec
>
> The real assign procedure only cost 1.6 seconds but
> the TransitRegionStateProcedure cost 59.4 seconds. The pid=45 procedure was
> initialized at 2020-10-21 13:51:28,666. It was added to TableQueue
> at 2020-10-21 13:51:28,789. But took xlock to run at 2020-10-21
> 13:52:24,761. See the below log:
> 2020-10-21 13:51:28,789 DEBUG [PEWorker-4]
> procedure.MasterProcedureScheduler(352): Add
> TableQueue(testLotsOfRegionRepli2, xlock=true (20) sharedLock=0 size=25) to
> run queue because: pid=45, ppid=20,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=testLotsOfRegionRepli2,
> region=50703895da3cb8c942d3197600d549bc, ASSIGN has the excusive lock access
> 2020-10-21 13:52:24,761 INFO  [PEWorker-2]
> procedure.MasterProcedureScheduler(737): Took xlock for pid=45, ppid=20,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=testLotsOfRegionRepli2,
> region=50703895da3cb8c942d3197600d549bc, ASSIGN
>
>
> But when I tried this UT on another PC, it only cost 2 mins, which is the
> same with branch-2.3. It is weird. Any ideas about this? Thanks.
>