You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Gary Helmling (JIRA)" <ji...@apache.org> on 2016/02/08 21:29:39 UTC

[jira] [Commented] (HBASE-15234) ReplicationLogCleaner can abort due to transient ZK issues

    [ https://issues.apache.org/jira/browse/HBASE-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137632#comment-15137632 ] 

Gary Helmling commented on HBASE-15234:
---------------------------------------

The issue here is that, while {{ReplicationLogCleaner.getDeletableFiles()}} opts to ignore the {{KeeperException}} which can be thrown and to continue, {{ReplicationQueuesClientZKImpl.getQueuesZNodeCversion()}}, which is called further down in the stack, also catches and handles the {{KeeperException}}, calling {{abort()}} on the provided Abortable (ReplicationLogCleaner).

For the log cleaner chores, do we ever want a KeeperException here to be fatal?  Either the log cleaner should warn but continue on these exceptions (which seems to be the original intent), or else it should delegate back up on the abort and fully abort the running master.  The current situation of stopping the cleaner chore just leaves the system in a bad state.

> ReplicationLogCleaner can abort due to transient ZK issues
> ----------------------------------------------------------
>
>                 Key: HBASE-15234
>                 URL: https://issues.apache.org/jira/browse/HBASE-15234
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Replication
>            Reporter: Gary Helmling
>            Assignee: Gary Helmling
>            Priority: Critical
>
> The ReplicationLogCleaner delegate for the LogCleaner chore can abort due to transient errors reading the replication znodes, leaving the log cleaner chore stopped, but the master still running.  This causes logs to build up in the oldWALs directory, which can even hit storage or file count limits in HDFS, causing problems.
> We've seen this happen in a couple of clusters when a rolling restart was performed on the zk peers (only one restarted at a time).
> The full stack trace when the log cleaner aborts is:
> {noformat}
> 16/02/02 15:22:39 WARN zookeeper.ZKUtil: replicationLogCleaner-0x1522c8b93c2fbae, quorum=XXXXXXXXXXXXXXXXXXXX, baseZNode=/hbase Unable to get data of znode /hbase/replication/rs  
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/replication/rs
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)  
>         at org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:80)  
>         at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)  
>         at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:233)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:157)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
>         at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)  
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)  
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)  
>         at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:110)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)  
> 16/02/02 15:22:39 ERROR zookeeper.ZooKeeperWatcher: replicationLogCleaner-0x1522c8b93c2fbae, quorum=XXXXXXXXXXXXXXXXXXXX, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception  
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/replication/rs
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)  
>         at org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:80)  
>         at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)  
>         at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:233)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:157)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
>         at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)  
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)  
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:110)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 16/02/02 15:22:39 WARN master.ReplicationLogCleaner: Aborting ReplicationLogCleaner because Failed to get stat of replication rs node
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/replication/rs
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)
>         at org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:80)
>         at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)
>         at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:233)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:157)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
>         at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:110)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 16/02/02 15:22:40 WARN master.ReplicationLogCleaner: Failed to read zookeeper, skipping checking deletable files
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)