Posted to dev@hbase.apache.org by "churro morales (JIRA)" <ji...@apache.org> on 2015/01/07 02:04:34 UTC

[jira] [Created] (HBASE-12814) Zero downtime upgrade from 94 to 98 with replication

churro morales created HBASE-12814:
--------------------------------------

             Summary: Zero downtime upgrade from 94 to 98 with replication
                 Key: HBASE-12814
                 URL: https://issues.apache.org/jira/browse/HBASE-12814
             Project: HBase
          Issue Type: New Feature
    Affects Versions: 0.94.26, 0.98.10
            Reporter: churro morales
            Assignee: churro morales


Here at Flurry we want to upgrade our HBase clusters from 0.94 to 0.98 with no downtime while maintaining master/master replication.

Summary:
Replication between clusters is done via Thrift RPC and is configurable on a peer-by-peer basis. The one caveat is that a Thrift server starts up on every node and proxies replication requests to the ReplicationSink.


For the upgrade process:
* In hbase-site.xml, two new required configuration parameters are added, along with three optional ones:
** *Required*
*** hbase.replication.sink.enable.thrift -> true
*** hbase.replication.thrift.server.port -> <thrift_server_port>
** *Optional*
*** hbase.replication.thrift.protection {default: AUTHENTICATION}
*** hbase.replication.thrift.framed {default: false}
*** hbase.replication.thrift.compact {default: true}
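
As a rough illustration, the settings listed above might look like this in hbase-site.xml. This is only a sketch: the property names and defaults come from the list above, while the port value 9091 is a hypothetical placeholder and should be replaced with whatever port you choose for the Thrift server.

{code:xml}
<!-- Required: enable the Thrift replication sink and choose its port -->
<property>
  <name>hbase.replication.sink.enable.thrift</name>
  <value>true</value>
</property>
<property>
  <name>hbase.replication.thrift.server.port</name>
  <!-- 9091 is a hypothetical example port, not a default from the patch -->
  <value>9091</value>
</property>

<!-- Optional: shown here with the defaults listed above -->
<property>
  <name>hbase.replication.thrift.protection</name>
  <value>AUTHENTICATION</value>
</property>
<property>
  <name>hbase.replication.thrift.framed</name>
  <value>false</value>
</property>
<property>
  <name>hbase.replication.thrift.compact</name>
  <value>true</value>
</property>
{code}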

- All regionservers can be restarted in a rolling fashion (no downtime). All clusters must have the respective patch for this to work.
- The hbase shell add_peer command takes an additional parameter for the RPC protocol.
- Example: {code} add_peer '1', "hbase-101:2181:/hbase", "THRIFT" {code}

Now comes the fun part. When you want to upgrade a cluster from 0.94 to 0.98, you simply pause replication to the cluster being upgraded, do the upgrade, and un-pause replication. Once a pair of clusters replicates inbound and outbound only with clusters on the 0.98 release, you can start replicating via the native RPC protocol by adding the peer again without the _THRIFT_ parameter and subsequently deleting the peer that uses the thrift protocol. Because replication is idempotent, I don't see any issues as long as you wait for the backlog to drain after un-pausing replication.
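
The sequence described above could be sketched in the hbase shell roughly as follows. This is an assumption-laden outline rather than a tested procedure: it assumes the thrift peer was added with id '1' as in the earlier example, uses the stock disable_peer / enable_peer / remove_peer commands for pausing, resuming, and deleting a peer, and picks '2' as a hypothetical id for the new native-RPC peer.

{code}
# On each cluster that replicates TO the cluster being upgraded:
disable_peer '1'     # pause shipping; new edits keep queueing for the peer

# ... upgrade the target cluster from 0.94 to 0.98 ...

enable_peer '1'      # un-pause, then wait for the backlog to drain

# Later, once both clusters in the pair are on 0.98:
add_peer '2', "hbase-101:2181:/hbase"   # hypothetical id '2', no "THRIFT" argument
remove_peer '1'                         # delete the peer that used the thrift protocol
{code}

Adding the native peer before removing the thrift one means some edits may be shipped twice, which is why the idempotence of replication matters here.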

Special thanks to Francis Liu at Yahoo for laying the groundwork and Mr. Dave Latham for his invaluable knowledge and assistance.  



