Posted to issues@hbase.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2013/02/05 05:01:04 UTC
[jira] [Commented] (HBASE-7293) [replication] Remove dead sinks from ReplicationSource.currentPeers and pick new ones

    [ https://issues.apache.org/jira/browse/HBASE-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570956#comment-13570956 ] 

Hudson commented on HBASE-7293:
-------------------------------

Integrated in HBase-0.94-security-on-Hadoop-23 #11 (See [https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/11/])
    HBASE-7293 [replication] Remove dead sinks from ReplicationSource.currentPeers and pick new ones (Revision 1437238)

     Result = FAILURE
larsh : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

                
> [replication] Remove dead sinks from ReplicationSource.currentPeers and pick new ones
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-7293
>                 URL: https://issues.apache.org/jira/browse/HBASE-7293
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.3, 0.96.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Lars Hofhansl
>             Fix For: 0.96.0, 0.94.5
>
>         Attachments: 7293-0.94.txt, 7293-0.94-v2.txt, 7293-0.96.txt
>
>
> I happened to look at a log today where I saw a lot of lines like this:
> {noformat}
> 2012-12-06 23:29:08,318 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave cluster looks down: This server is in the failed servers list: sv4r20s49/10.4.20.49:10304
> 2012-12-06 23:29:15,987 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't replicate because of a local or network error: 
> java.net.ConnectException: Connection refused
> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> 	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> 	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:519)
> 	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:484)
> 	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:416)
> 	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:462)
> 	at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1150)
> 	at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1000)
> 	at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
> 	at $Proxy14.replicateLogEntries(Unknown Source)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:627)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365)
> 2012-12-06 23:29:15,988 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave cluster looks down: Connection refused
> {noformat}
> What struck me as weird is that this had been going on for some days; I would expect the RS to find new servers if it wasn't able to replicate. In reality, only a few of the chosen sink RSs were down, so the source eventually hits one that's good and is therefore never able to refresh its list of servers.
> We should remove the dead servers: retrying them is spammy and probably adds some slave lag.
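
The behaviour requested above (drop a sink as soon as it is found dead, then pick a replacement) can be sketched roughly as follows. This is only an illustration and not the committed patch: the class SinkPool and its methods reportDead, refill, pickSink and currentSinks are hypothetical names, sinks are plain strings rather than server addresses, and in this sketch a dead sink is simply forgotten instead of being rediscovered from the slave cluster later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of a sink pool that drops dead sinks; not HBase code. */
class SinkPool {
    private final List<String> knownSinks;                        // all sinks believed to be alive
    private final List<String> currentSinks = new ArrayList<>();  // subset chosen for shipping edits
    private final int poolSize;
    private final Random random;

    SinkPool(List<String> knownSinks, int poolSize, Random random) {
        this.knownSinks = new ArrayList<>(knownSinks);
        this.poolSize = poolSize;
        this.random = random;
        refill();
    }

    /** Top up the current subset with randomly chosen known sinks, up to poolSize. */
    private void refill() {
        List<String> candidates = new ArrayList<>(knownSinks);
        candidates.removeAll(currentSinks);
        Collections.shuffle(candidates, random);
        for (String candidate : candidates) {
            if (currentSinks.size() >= poolSize) {
                break;
            }
            currentSinks.add(candidate);
        }
    }

    /** Return a random sink from the current subset. */
    String pickSink() {
        if (currentSinks.isEmpty()) {
            throw new IllegalStateException("no live sinks left");
        }
        return currentSinks.get(random.nextInt(currentSinks.size()));
    }

    /** Called after a connection failure: forget the sink and pick a replacement. */
    void reportDead(String sink) {
        knownSinks.remove(sink);
        currentSinks.remove(sink);
        refill();
    }

    /** Snapshot of the sinks currently in use. */
    List<String> currentSinks() {
        return new ArrayList<>(currentSinks);
    }

    public static void main(String[] args) {
        SinkPool pool = new SinkPool(Arrays.asList("rs1", "rs2", "rs3", "rs4"), 2, new Random());
        String sink = pool.pickSink();
        pool.reportDead(sink); // e.g. after a "Connection refused"
        System.out.println("Sinks after dropping " + sink + ": " + pool.currentSinks());
    }
}
```

With something along these lines, a sink that refuses connections is retried at most once before it leaves the pool, instead of being hit again on every later random pick.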

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira