Posted to issues@hbase.apache.org by "Yu Li (JIRA)" <ji...@apache.org> on 2016/03/04 09:13:40 UTC

[jira] [Commented] (HBASE-15213) Fix increment performance regression caused by HBASE-8763 on branch-1.0

    [ https://issues.apache.org/jira/browse/HBASE-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179544#comment-15179544 ] 

Yu Li commented on HBASE-15213:
-------------------------------

We encountered the same issue in our stress testing on 1.1.2; the blocked threads in the jstack below led me to this JIRA:
{noformat}
"B.defaultRpcServer.handler=126,queue=9,port=16020" daemon prio=10 tid=0x00007f55b9fb6000 nid=0x30af9 in Object.wait() [0x00000000441e8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.waitForPreviousTransactionsComplete(MultiVersionConsistencyControl.java:224)
        - locked <0x00000001fccf05f8> (a java.util.LinkedList)
        at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.completeMemstoreInsertWithSeqNum(MultiVersionConsistencyControl.java:127)
        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3231)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2847)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2787)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:706)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:668)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2062)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2183)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:102)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
{noformat}

However, after checking the code changes and the discussion here, I suspect the fix has broken the correctness of the MVCC mechanism. Allow me to quote the statement given by [~junegunn]:
{quote}
* Let's say we have 10 concurrent increment transactions from T1 to T10, each with WriteEntry W1 to W10 in the queue.
* Assume that T1 finishes later than the other transactions, T2 ~ T10 go into wait-loop.
* T1 finishes, removes W1 from the queue, and notifies the other waiting transactions.
* W2 can only be removed if the handler for T2 wakes up and checks the head of the queue, but in the worst case, it can be the last one to wake up.
* So for example, T3 wakes up but W3 is not at the head so waits again, then T10 wakes up to no avail, T9, T7, .... finally T2
{quote}
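
To make the scenario above concrete, here is a minimal sketch of a head-of-queue wait-loop with exactly these characteristics (my own illustrative names, not the actual branch-1.0 code): every completion notifies all waiters on the queue monitor, but only the handler owning the head entry can remove it, so each notifyAll lets at most one waiter make progress.
{code}
import java.util.LinkedList;

class HeadOfQueueWaitSketch {
  static class WriteEntry {
    private boolean completed;
    void markCompleted() { completed = true; }
  }

  private final LinkedList<WriteEntry> writeQueue = new LinkedList<>();

  WriteEntry beginMemstoreInsert() {
    synchronized (writeQueue) {
      WriteEntry e = new WriteEntry();
      writeQueue.addLast(e);
      return e;
    }
  }

  void waitForPreviousTransactionsComplete(WriteEntry my) throws InterruptedException {
    synchronized (writeQueue) {
      // T2..T10 all park here; every notifyAll wakes all of them, but only the
      // handler owning the current head entry can actually remove it.
      while (writeQueue.peekFirst() != my) {
        writeQueue.wait();
      }
      my.markCompleted();          // marked complete only after the wait-loop
      writeQueue.removeFirst();    // only the owning handler removes its own entry
      writeQueue.notifyAll();      // wakes everyone; only the new head's owner proceeds
    }
  }
}
{code}
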
The patch does resolve the issue mentioned in the quote: if T2 is not done, the wakeup can jump ahead to T3. But I'm afraid this is *incorrect*. MVCC is meant to guarantee that the transaction that arrives first (T2 in this case) becomes visible to readers first, right?

What's more, in {{waitForPreviousTransactionsComplete(WriteEntry waitedEntry)}} we now have:
{code}
  public void waitForPreviousTransactionsComplete(WriteEntry waitedEntry) {
    boolean interrupted = false;
    WriteEntry w = waitedEntry;
    w.markCompleted();
    ...
{code}
I think we should never mark the entry as completed before all previous transactions complete.

However, the jstack proves the lock-contention issue truly exists. After digging into and comparing the code in 0.98.12 and the master branch, I think the real issue is that we now wait on {{writeQueue}} rather than {{readWaiters}}, which causes a lot more contention (and if you check the code, you can see that we still call {{readWaiters#notifyAll}} in {{advanceMemstore}} even though nothing waits on it any more...). I think the reason branch-1.2 and master don't suffer from this issue is that HBASE-12751 also quietly fixed this part (waiting on readWaiters instead of writeQueue).
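
For reference, here is a rough sketch of this idea (simplified, with made-up names; this is not the actual HBASE-12751 or branch-1.2 code): handlers that need to wait park on a dedicated {{readWaiters}} monitor and are woken once the read point catches up, so the {{writeQueue}} monitor is only held while the queue itself is being manipulated, not while handlers sleep.
{code}
import java.util.LinkedList;

class ReadWaitersSketch {
  static class WriteEntry {
    final long writeNumber;
    boolean completed;
    WriteEntry(long n) { writeNumber = n; }
  }

  private final LinkedList<WriteEntry> writeQueue = new LinkedList<>();
  private final Object readWaiters = new Object();
  private volatile long readPoint;

  // Mark the entry complete, advance the read point over the completed prefix of
  // the queue, then wake the handlers parked on readWaiters (not on writeQueue).
  void advanceMemstore(WriteEntry e) {
    synchronized (writeQueue) {
      e.completed = true;
      while (!writeQueue.isEmpty() && writeQueue.peekFirst().completed) {
        readPoint = writeQueue.removeFirst().writeNumber;
      }
    }
    synchronized (readWaiters) {
      readWaiters.notifyAll();
    }
  }

  // Park on readWaiters until the read point has passed the given sequence number,
  // instead of contending on the writeQueue monitor while sleeping.
  void waitForRead(long seq) throws InterruptedException {
    synchronized (readWaiters) {
      while (readPoint < seq) {
        readWaiters.wait();
      }
    }
  }
}
{code}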

I've already prepared a fix and am now testing it. [~stack], [~chenheng] and [~junegunn], please let me know your thoughts on my analysis. If you agree with my conclusion, I'll open another JIRA to attach my patch. Thanks in advance, and please forgive me if I've stated anything stupid.

> Fix increment performance regression caused by HBASE-8763 on branch-1.0
> -----------------------------------------------------------------------
>
>                 Key: HBASE-15213
>                 URL: https://issues.apache.org/jira/browse/HBASE-15213
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Performance
>            Reporter: Junegunn Choi
>            Assignee: Junegunn Choi
>             Fix For: 1.1.4, 1.0.4
>
>         Attachments: 15157v3.branch-1.1.patch, HBASE-15213-increment.png, HBASE-15213.branch-1.0.patch, HBASE-15213.v1.branch-1.0.patch
>
>
> This is an attempt to fix the increment performance regression caused by HBASE-8763 on branch-1.0.
> I'm aware that hbase.increment.fast.but.narrow.consistency was added to branch-1.0 (HBASE-15031) to address the issue, and that separate work is ongoing on the master branch, but anyway, this is my take on the problem.
> I read through HBASE-14460 and HBASE-8763; it wasn't clear to me what caused the slowdown, but I could indeed reproduce the performance regression.
> Test setup:
> - Server: 4-core Xeon 2.4GHz Linux server running mini cluster (100 handlers, JDK 1.7)
> - Client: Another box of the same spec
> - Increments on random 10k records on a single-region table, recreated every time
> Increment throughput (TPS):
> || Num threads || Before HBASE-8763 (d6cc2fb) || branch-1.0 || branch-1.0 (narrow-consistency) ||
> || 1            | 2661                         | 2486        | 2359  |
> || 2            | 5048                         | 5064        | 4867  |
> || 4            | 7503                         | 8071        | 8690  |
> || 8            | 10471                        | 10886       | 13980 |
> || 16           | 15515                        | 9418        | 18601 |
> || 32           | 17699                        | 5421        | 20540 |
> || 64           | 20601                        | 4038        | 25591 |
> || 96           | 19177                        | 3891        | 26017 |
> We can clearly observe that the throughput degrades as we increase the number of concurrent requests, which led me to believe there is severe context-switching overhead, and I could indirectly confirm that suspicion with the cs column in vmstat output. branch-1.0 shows a much higher number of context switches even with much lower throughput.
> Here are the observations:
> - A WriteEntry in the writeQueue can only be removed by the very handler that put it, and only when it is at the front of the queue and marked complete.
> - Since a WriteEntry is marked complete after the wait-loop, only one entry can be removed at a time.
> - This stringent condition causes O(N^2) context switches, where N is the number of concurrent handlers processing requests.
> So what I tried here is to mark the WriteEntry complete before we go into the wait-loop. With this change, multiple WriteEntries can be shifted at a time without context switches. I changed writeQueue to a LinkedHashSet since a fast containment check is needed now that a WriteEntry can be removed by any handler.
> The numbers look good; throughput is virtually identical to the pre-HBASE-8763 era.
> || Num threads || branch-1.0 with fix ||
> || 1            | 2459                 |
> || 2            | 4976                 |
> || 4            | 8033                 |
> || 8            | 12292                |
> || 16           | 15234                |
> || 32           | 16601                |
> || 64           | 19994                |
> || 96           | 20052                |
> So what do you think about it? Please let me know if I'm missing anything.
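
For illustration only, a rough sketch of the shifting idea described in the quoted report above (made-up names; this is not the actual HBASE-15213 patch): each entry is marked complete before the wait-loop, and whichever handler reaches the queue first shifts every completed entry off the front, with a LinkedHashSet keeping the containment check cheap now that any handler may remove any entry.
{code}
import java.util.Iterator;
import java.util.LinkedHashSet;

class ShiftingQueueSketch {
  static class WriteEntry {
    boolean completed;
  }

  // LinkedHashSet keeps insertion order and gives cheap containment checks,
  // which matters once any handler may remove any completed entry.
  private final LinkedHashSet<WriteEntry> writeQueue = new LinkedHashSet<>();

  WriteEntry beginMemstoreInsert() {
    synchronized (writeQueue) {
      WriteEntry e = new WriteEntry();
      writeQueue.add(e);
      return e;
    }
  }

  void waitForPreviousTransactionsComplete(WriteEntry my) throws InterruptedException {
    synchronized (writeQueue) {
      my.completed = true;                 // completed *before* the wait-loop
      shiftCompletedHead();                // shift every completed entry we can
      while (writeQueue.contains(my)) {    // fast containment check
        writeQueue.wait();
        shiftCompletedHead();
      }
    }
  }

  // Remove every completed entry at the front of the queue, regardless of which
  // handler owns it, and wake the other waiters if anything was shifted.
  // Must be called while holding the writeQueue monitor.
  private void shiftCompletedHead() {
    boolean shifted = false;
    for (Iterator<WriteEntry> it = writeQueue.iterator(); it.hasNext(); ) {
      WriteEntry head = it.next();
      if (!head.completed) break;
      it.remove();
      shifted = true;
    }
    if (shifted) {
      writeQueue.notifyAll();
    }
  }
}
{code}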



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)