Posted to issues@hbase.apache.org by "Wellington Chevreuil (JIRA)" <ji...@apache.org> on 2018/11/09 12:00:02 UTC

[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single batch

    [ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681348#comment-16681348 ] 

Wellington Chevreuil commented on HBASE-21461:
----------------------------------------------

Uploaded an initial patch as a txt file, since I'm not sure it would apply cleanly to the proper hbase-operator-tools repository. Since this specific CP depends on hbase branch-1, maybe we should create a similar branch structure for the hbase-operator-tools repository, so that tools targeting specific hbase versions can be placed on related branches?

> Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single batch
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21461
>                 URL: https://issues.apache.org/jira/browse/HBASE-21461
>             Project: HBase
>          Issue Type: New Feature
>          Components: hbase-operator-tools, Replication
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Minor
>         Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt
>
>
> With replication-enabled deployments, it's possible that faulty ingestion clients may lead to a single WalEntry containing too many edits for the same cell. This would cause *ReplicationSink*, in the target cluster, to attempt a single batch mutation with too many operations, which in turn can lead to very large RPC requests that may not fit in the final target RS RPC queue. In this case, the messages below are seen on the target RS trying to perform the sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, attempt=4/4 failed=2ops, last exception: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size too small? on regionserver01.example.com,60020,1524334173359, tracking started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 actions: RemoteWithExtrasException: 2 times, 
> at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
> at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
> at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
> When this problem manifests, replication will be stuck and WAL files will pile up in the source cluster's WALs/oldWALs folder. The typical workaround requires manual cleanup of replication znodes in ZK, and manual WAL replay for the WAL files containing the large entry.
> This CP would handle the issue by checking for large WAL entries and splitting those into smaller batches in the *preReplicateLogEntries* method hook.
> *Additional Note*: HBASE-18027 introduced some safeguards against such large RPC requests, which may already help avoid this scenario. That is not available for 1.2 releases, though, so this CP tool may still be relevant for 1.2 clusters. It may also be worth having it to work around any potential unknown large-RPC scenarios.
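The core batching idea described above can be sketched as follows. This is a minimal, self-contained illustration only, not the attached patch: the class and method names (WalEntryBatchSplitter, splitIntoBatches, maxBatchSize) are hypothetical, and a real coprocessor would operate on HBase WAL entry/mutation types inside the hook rather than on a generic list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split one oversized batch of mutations into
// sub-batches of at most maxBatchSize operations each, so that every
// sub-batch produces an RPC small enough for the sink RS call queue.
public class WalEntryBatchSplitter {

    public static <T> List<List<T>> splitIntoBatches(List<T> mutations, int maxBatchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < mutations.size(); i += maxBatchSize) {
            // Copy each sublist so the sub-batches are independent of the
            // original list's lifetime.
            int end = Math.min(i + maxBatchSize, mutations.size());
            batches.add(new ArrayList<>(mutations.subList(i, end)));
        }
        return batches;
    }
}
```

In the actual CP, each sub-batch would then be replicated (e.g. applied via a separate batch call) instead of submitting all edits in one RPC.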



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)