Posted to issues@hbase.apache.org by "Xu Cang (JIRA)" <ji...@apache.org> on 2018/06/01 03:35:00 UTC
[jira] [Commented] (HBASE-18116) Replication source in-memory accounting should not include bulk transfer hfiles

    [ https://issues.apache.org/jira/browse/HBASE-18116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16497509#comment-16497509 ] 

Xu Cang commented on HBASE-18116:
---------------------------------

The TestGlobalThrottler test itself is buggy.

I could make it fail by changing the quota to 121, like this:

*conf1.setInt(HConstants.REPLICATION_SOURCE_TOTAL_BUFFER_KEY, 121);*

and changing the check to 363 (3 times the quota, because there are 3 peers), like this:

*if (size > 363) {*

Then the test failed. I added some debug logging; in the output below we can see that the size exceeds the limit I set, which is 363.

2018-05-31 20:28:53,023 DEBUG [RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=46855] regionserver.ReplicationSink(239): Started replicating mutations.
2018-05-31 20:28:53,027 DEBUG [RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=46855] regionserver.ReplicationSink(243): Finished replicating mutations.
2018-05-31 20:28:53,038 INFO [RS_REFRESH_PEER-regionserver/xcang-wsl:0-0.replicationSource,peer1.replicationSource.wal-reader.xcang-wsl%2C39693%2C1527823677629,peer1] regionserver.ReplicationSourceWALReader(387): ~~~~~~~~~!!! acquireBufferQuota size is 120
2018-05-31 20:28:53,038 INFO [RS_REFRESH_PEER-regionserver/xcang-wsl:0-1.replicationSource,peer2.replicationSource.wal-reader.xcang-wsl%2C39693%2C1527823677629,peer2] regionserver.ReplicationSourceWALReader(387): ~~~~~~~~~!!! acquireBufferQuota size is 120
2018-05-31 20:28:53,038 INFO [RS_REFRESH_PEER-regionserver/xcang-wsl:0-0.replicationSource,peer3.replicationSource.wal-reader.xcang-wsl%2C39693%2C1527823677629,peer3] regionserver.ReplicationSourceWALReader(387): ~~~~~~~~~!!! acquireBufferQuota size is 120
2018-05-31 20:28:53,068 INFO [Thread-437] regionserver.TestGlobalThrottler(143): @@@@size :*480*
2018-05-31 20:28:53,068 INFO [Thread-437] regionserver.TestGlobalThrottler(148): @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@size :480
2018-05-31 20:28:53,118 INFO [Thread-437] regionserver.TestGlobalThrottler(143): @@@@size :*480*
2018-05-31 20:28:53,119 INFO [Thread-437] regionserver.TestGlobalThrottler(148): @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@size :480
2018-05-31 20:28:53,131 DEBUG [RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=46855] regionserver.ReplicationSink(239): Started replicating mutations.
2018-05-31 20:28:53,134 DEBUG [RpcServer.replication.FPBQ.Fifo.handler=1,queue=0,port=46855] regionserver.ReplicationSink(243): Finished replicating mutations.
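
To make the numbers in the log concrete, here is a minimal sketch of the modified check. The class and method names are hypothetical and are not part of the HBase test; only the values 121, 3, 363 and 480 come from this comment and its log output.

```java
public class ThrottlerCheckSketch {
    // Hypothetical helper: the modified test allows quota * number of peers.
    public static long threshold(long quota, int peers) {
        return quota * peers;
    }

    // True when the observed buffer size is above the allowed threshold.
    public static boolean exceeds(long observedSize, long threshold) {
        return observedSize > threshold;
    }

    public static void main(String[] args) {
        long limit = threshold(121, 3);          // 363, as in the modified check
        boolean failed = exceeds(480, limit);    // 480 is the size reported in the log
        System.out.println("limit=" + limit + " exceeded=" + failed);
        // prints "limit=363 exceeded=true"
    }
}
```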

I will send a patch with the fix soon. I also have another fix in ReplicationSourceShipper.java, which fixes the batchSize calculation after operations are done and deducts the correct size from totalUsedBuffer.
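
As a rough illustration of the accounting concern, the sketch below measures a batch before it is shipped and then deducts exactly that amount from a shared counter. This is not the actual ReplicationSourceShipper code: the class, its methods, and the use of a plain list of sizes are all assumptions made for the example, and the counter only stands in for totalUsedBuffer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class BufferAccountingSketch {
    // Hypothetical stand-in for the shared totalUsedBuffer counter.
    private final AtomicLong totalUsedBuffer = new AtomicLong();
    private final long quota;

    public BufferAccountingSketch(long quota) {
        this.quota = quota;
    }

    // Adds size to the counter; returns true once usage has reached the quota.
    public boolean acquire(long size) {
        return totalUsedBuffer.addAndGet(size) >= quota;
    }

    public long used() {
        return totalUsedBuffer.get();
    }

    // Measures the batch BEFORE "shipping" consumes it, then deducts that same amount.
    // Measuring after the batch has been consumed would deduct too little.
    public long ship(List<Long> batch) {
        long batchSize = 0;
        for (long entrySize : batch) {
            batchSize += entrySize;
        }
        batch.clear(); // stands in for shipping, which may consume the entries
        totalUsedBuffer.addAndGet(-batchSize);
        return batchSize;
    }

    public static void main(String[] args) {
        BufferAccountingSketch acct = new BufferAccountingSketch(363);
        List<Long> batch = new ArrayList<>(Arrays.asList(120L, 120L, 120L));
        for (long s : batch) {
            acct.acquire(s);
        }
        System.out.println("used before ship: " + acct.used()); // prints 360
        acct.ship(batch);
        System.out.println("used after ship: " + acct.used());  // prints 0
    }
}
```

The point of the sketch is only that the amount released should match the amount acquired for the same batch; otherwise the counter drifts over time.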

 

> Replication source in-memory accounting should not include bulk transfer hfiles
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-18116
>                 URL: https://issues.apache.org/jira/browse/HBASE-18116
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Andrew Purtell
>            Assignee: Xu Cang
>            Priority: Major
>             Fix For: 3.0.0, 2.1.0, 1.5.0
>
>         Attachments: HBASE-18116-branch-1.patch, HBASE-18116.master.001.patch, HBASE-18116.master.002.patch, HBASE-18116.master.003.patch
>
>
> In ReplicationSourceWALReaderThread we maintain a global quota on enqueued replication work for preventing OOM by queuing up too many edits into queues on heap. When calculating the size of a given replication queue entry, if it has associated hfiles (is a bulk load to be replicated as a batch of hfiles), we get the file sizes and include the sum. We then apply that result to the quota. This isn't quite right. Those hfiles will be pulled by the sink as a file copy, not pushed by the source. The cells in those files are not queued in memory at the source and therefore shouldn't be counted against the quota.
> Related, the sum of the hfile sizes is also included when checking if queued work exceeds the configured replication queue capacity, which is by default 64 MB. HFiles are commonly much larger than this.
> So when we encounter a bulk load replication entry, typically both the quota and capacity limits are exceeded, we break out of loops, and send right away. What is transferred on the wire via HBase RPC, though, has only a partial relationship to the calculation.
> Depending on how you look at it, it makes sense to factor hfile file sizes against replication queue capacity limits. The sink will be occupied transferring those files at the HDFS level. Anyway, this is how we have been doing it and it is too late to change now. I do not, however, think it is correct to apply hfile file sizes against a quota for in-memory state on the source. The source doesn't queue or even transfer those bytes.
> Something I noticed while working on HBASE-18027.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)