Posted to issues@hbase.apache.org by "Solvannan R M (JIRA)" <ji...@apache.org> on 2019/08/03 13:08:00 UTC
[jira] [Comment Edited] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)
[ https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899460#comment-16899460 ]
Solvannan R M edited comment on HBASE-22784 at 8/3/19 1:07 PM:
---------------------------------------------------------------
Hi [~wchevreuil],
Thanks for the reply!
We have managed to extract the following information related to the issue:
*HBase DFS Usage Report:*
We can see that the oldWALs size is very high compared to the actual data size:
{code}
hbaseuser@10.216.xxx.xxx~>./hadoop-2.7.3/bin/hdfs dfs -du -h /hbasedata
0 /hbasedata/.tmp
0 /hbasedata/xxxxxx
0 /hbasedata/MasterProcWALs
50.3 G /hbasedata/WALs
0 /hbasedata/archive
0 /hbasedata/corrupt
561.4 G /hbasedata/data
0 /hbasedata/hbase
42 /hbasedata/hbase.id
7 /hbasedata/hbase.version
405.9 G /hbasedata/oldWALs
0 /hbasedata/xxxx
0 /hbasedata/xxxx
0 /hbasedata/xxxxx{code}
*Zookeeper myQueuesZnodes:*
The replication queue entries have not been cleared from ZooKeeper for a long time:
{code}
[zk: 10.216.xxx.xxx:2191,10.216.xxx.xxx:2191,10.216.xxx.xxx:2191(CONNECTED) 3] get /hbase/replication/rs/10.216.xxx.xxx,16020,1554361253037/3/10.216.xxx.xxx%2C16020%2C1554361253037.1564
Display all 179 possibilities? (y or n)
10.216.xxx.xxx%2C16020%2C1554361253037.1554638617629 10.216.xxx.xxx%2C16020%2C1554361253037.1554485372383
....
{code}
*Status 'replication' output of the regionserver:*
{code}
SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Wed May 01 07:40:55 IST 2019, Replication Lag=8158202958
        PeerID=3, AgeOfLastShippedOp=0, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Thu Jan 01 05:30:00 IST 1970, Replication Lag=1564834858373
SINK:   AgeOfLastAppliedOp=378, TimeStampsOfLastAppliedOp=Sat Aug 03 17:50:44 IST 2019
{code}
*Stacktrace of the replicationSource thread:*
We can see that the shipper thread has been parked on the empty entry batch queue for a long time:
{code:java}
"main-EventThread.replicationSource,3.replicationSource.10.216.xxx.xxx%2C16020%2C1554360804184,3" #10121291 daemon prio=5 os_prio=0 tid=0x00007f00e14ea000 nid=0x6d4c0 waiting on condition [0x00007ef7598c9000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007ef8889a1b98> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.take(ReplicationSourceWALReaderThread.java:227)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:550)
{code}
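For what it's worth, the starvation can be mimicked with a minimal, self-contained sketch (the class, method, and entry names below are made up for illustration; only LinkedBlockingQueue matches the real code): when every WAL entry is dropped by the filter, no batch is ever enqueued, so the consumer side parks in take() exactly as in the stack trace above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class FilteredBatchSketch {

    // Stand-in for ClusterMarkingEntryFilter: on a passive cluster every
    // incoming edit carries the peer's cluster id, so everything is dropped.
    static boolean passesFilter(String entry) {
        return !entry.startsWith("peer:");
    }

    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<List<String>> entryBatchQueue = new LinkedBlockingQueue<>();

        // Reader side: build a batch from the WAL, applying the filter.
        List<String> batch = new ArrayList<>();
        for (String entry : new String[] {"peer:edit1", "peer:edit2"}) {
            if (passesFilter(entry)) {
                batch.add(entry);
            }
        }

        // Everything was filtered out, so no batch is enqueued -- and the
        // lastWalPosition bookkeeping that would let the WAL leave the
        // replication queue is skipped along with it.
        if (!batch.isEmpty()) {
            entryBatchQueue.put(batch);
        }

        // Shipper side: the real thread calls take() and parks forever;
        // poll(timeout) is used here only so the sketch terminates.
        List<String> shipped = entryBatchQueue.poll(200, TimeUnit.MILLISECONDS);
        System.out.println(shipped == null ? "shipper starved" : "shipped " + shipped);
    }
}
```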
Kindly let us know if you need any specific logs for your analysis.
> OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)
> -------------------------------------------------------------------------------------
>
> Key: HBASE-22784
> URL: https://issues.apache.org/jira/browse/HBASE-22784
> Project: HBase
> Issue Type: Bug
> Components: regionserver, Replication
> Affects Versions: 1.4.9, 1.4.10
> Reporter: Solvannan R M
> Priority: Major
>
> When a cluster is passive (receiving edits only via replication) in a cyclic replication setup of two clusters, the OldWALs size keeps on growing. On analysing, we observed the following behaviour:
> # A new entry is added to the WAL (an edit replicated from the other cluster).
> # ReplicationSourceWALReaderThread (RSWALRT) reads the entry and applies the configured filters (due to the cyclic replication setup, ClusterMarkingEntryFilter discards entries originating from the other cluster).
> # As the entry is filtered to null, RSWALRT neither updates the batch stats (WALEntryBatch.lastWalPosition) nor puts it in the entryBatchQueue.
> # The ReplicationSource thread stays blocked in entryBatchQueue.take().
> # So ReplicationSource#updateLogPosition is never invoked and the WAL file is never cleared from the replication queue.
> # Hence the LogCleaner on the master doesn't delete the oldWAL files from HDFS.
> NOTE: When a new edit is added via hbase-client, the ReplicationSource thread processes it and clears the oldWAL files from the replication queues, and hence the master cleans up the WALs.
> Please provide us with a solution.
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)