Posted to issues@hbase.apache.org by "terry zhang (JIRA)" <ji...@apache.org> on 2012/08/30 08:16:07 UTC
[jira] [Updated] (HBASE-6695) [Replication] Data will lose if RegionServer down during transferqueue

     [ https://issues.apache.org/jira/browse/HBASE-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

terry zhang updated HBASE-6695:
-------------------------------

    Description: 
While testing the Replication failover feature, we found that if a regionserver is killed while it is running transferqueue, only part of the hlog znodes are copied to the right path, because the failover process is interrupted.

Log:

2012-08-29 12:20:05,660 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue

2012-08-29 12:20:05,765 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213720708 with data 210508162
2012-08-29 12:20:05,850 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213886800 with data

2012-08-29 12:20:05,938 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213830559 with data

2012-08-29 12:20:06,055 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213775146 with data

2012-08-29 12:20:06,277 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Failed all from region=.META.,,1.1028785192, hostname=dw93.kgb.sqa.cm4, port=60020
java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at 
......
{color:red} 
This server is down .....
{color}

ZK node status:

[zk: 10.232.98.77:2181(CONNECTED) 6] ls /hbase-test3-repl/replication/rs/dw92.kgb.sqa.cm4,60020,1346210789716
[lock, 1, 1-dw89.kgb.sqa.cm4,60020,1346202436268]



{color:red} 
dw92 is down, but the znode dw92.kgb.sqa.cm4,60020,1346210789716 can't be deleted
{color}
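
The failure mode described above can be sketched with a toy in-memory model. This is not HBase code: the class TransferQueueSketch, the FakeZk map, and the crashAfter parameter are all hypothetical names made up for illustration. It assumes the claiming server takes a lock znode, copies hlog znodes one by one, and only cleans up the dead server's znode after the last copy. It also assumes that a leftover lock stops other servers from claiming the queue, which the ZK listing suggests but the report does not state outright.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory model of a replication queue transfer; not HBase code. */
public class TransferQueueSketch {

    /** Hypothetical stand-in for ZooKeeper: znode path -> data. */
    public static final class FakeZk {
        public final Map<String, String> znodes = new TreeMap<>();
    }

    /**
     * Copies hlog znodes one at a time from the dead server's queue into the
     * claimer's queue. crashAfter simulates the claimer dying after that many
     * znodes were copied (pass a negative value for "no crash").
     * Returns the number of znodes copied.
     */
    public static int transferQueue(FakeZk zk, String deadRs, String claimerRs,
                                    String peerId, int crashAfter) {
        String lock = deadRs + "/lock";
        if (zk.znodes.putIfAbsent(lock, claimerRs) != null) {
            return 0; // assumption: an existing lock blocks every other claimer
        }
        String srcPrefix = deadRs + "/" + peerId + "/";
        String dstPrefix = claimerRs + "/" + peerId + "-" + deadRs + "/";
        List<String> sources = new ArrayList<>();
        for (String path : zk.znodes.keySet()) {
            if (path.startsWith(srcPrefix)) {
                sources.add(path);
            }
        }
        int copied = 0;
        for (String src : sources) {
            if (copied == crashAfter) {
                // Simulated crash: the lock and the uncopied znodes stay behind.
                return copied;
            }
            String name = src.substring(srcPrefix.length());
            zk.znodes.put(dstPrefix + name, zk.znodes.get(src));
            copied++;
        }
        // Cleanup is only reached when the whole queue was copied.
        zk.znodes.keySet().removeIf(p -> p.startsWith(deadRs + "/"));
        return copied;
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        for (String hlog : new String[] {"hlogA", "hlogB", "hlogC", "hlogD"}) {
            zk.znodes.put("dw92/1/" + hlog, "");
        }
        // dw93 claims dw92's queue but dies after copying two of four znodes.
        int copied = transferQueue(zk, "dw92", "dw93", "1", 2);
        System.out.println("copied by dw93: " + copied);
        System.out.println("lock left behind: " + zk.znodes.containsKey("dw92/lock"));
        // A surviving server now tries to claim the same queue.
        int retried = transferQueue(zk, "dw92", "dw94", "1", -1);
        System.out.println("copied by dw94: " + retried);
    }
}
```

In this model dw93 ends up with only two of the four hlog znodes, and the stale lock keeps dw94 from picking up the rest, which matches the state shown in the ZK listing.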









    
> [Replication] Data will lose if RegionServer down during transferqueue
> ----------------------------------------------------------------------
>
>                 Key: HBASE-6695
>                 URL: https://issues.apache.org/jira/browse/HBASE-6695
>             Project: HBase
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 0.94.1
>            Reporter: terry zhang
>            Priority: Critical
>
