Posted to commits@hbase.apache.org by jd...@apache.org on 2014/07/29 02:30:12 UTC

git commit: HBASE-11522 Move Replication information into the Ref Guide (Misty Stanley-Jones)

Repository: hbase
Updated Branches:
  refs/heads/master ff655e04d -> fe54e7d7a


HBASE-11522 Move Replication information into the Ref Guide (Misty Stanley-Jones)


Project: http://git-wip-us.apache.org/repos/asf/hbase/repo
Commit: http://git-wip-us.apache.org/repos/asf/hbase/commit/fe54e7d7
Tree: http://git-wip-us.apache.org/repos/asf/hbase/tree/fe54e7d7
Diff: http://git-wip-us.apache.org/repos/asf/hbase/diff/fe54e7d7

Branch: refs/heads/master
Commit: fe54e7d7ae40bc9b23540a04db1cc947428c5bdf
Parents: ff655e0
Author: Jean-Daniel Cryans <jd...@cloudera.com>
Authored: Mon Jul 28 17:26:43 2014 -0700
Committer: Jean-Daniel Cryans <jd...@cloudera.com>
Committed: Mon Jul 28 17:26:43 2014 -0700

----------------------------------------------------------------------
 src/main/docbkx/ops_mgt.xml        | 494 +++++++++++++++++++++++++++++-
 src/main/site/xdoc/replication.xml | 516 +-------------------------------
 2 files changed, 484 insertions(+), 526 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hbase/blob/fe54e7d7/src/main/docbkx/ops_mgt.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/ops_mgt.xml b/src/main/docbkx/ops_mgt.xml
index 6108b9c..5d01f82 100644
--- a/src/main/docbkx/ops_mgt.xml
+++ b/src/main/docbkx/ops_mgt.xml
@@ -1107,18 +1107,490 @@ false
   <section
     xml:id="cluster_replication">
     <title>Cluster Replication</title>
-    <para>See <link
-        xlink:href="http://hbase.apache.org/replication.html">Cluster Replication</link>. </para>
-    <note xml:id="cluster.replication.preserving.tags">
-      <title>Preserving Tags During Replication</title>
-      <para>By default, the codec used for replication between clusters strips tags, such as
-        cell-level ACLs, from cells. To prevent the tags from being stripped, you can use a
-        different codec which does not strip them. Configure
-          <code>hbase.replication.rpc.codec</code> to use
-          <literal>org.apache.hadoop.hbase.codec.KeyValueCodecWithTags</literal>, on both the source
-        and sink RegionServers involved in the replication. This option was introduced in <link
-          xlink:href="https://issues.apache.org/jira/browse/HBASE-10322">HBASE-10322</link>.</para>
+    <note>
+      <para>This information was previously available at <link
+          xlink:href="http://hbase.apache.org/replication.html">Cluster Replication</link>. </para>
     </note>
+    <para>HBase provides a replication mechanism to copy data between HBase
+      clusters. Replication can be used as a disaster recovery solution and as a mechanism for high
+      availability. You can also use replication to separate web-facing operations from back-end
+      jobs such as MapReduce.</para>
+
+    <para>In terms of architecture, HBase replication is master-push. This takes advantage of the
+      fact that each region server has its own write-ahead log (WAL). One master cluster can
+      replicate to any number of slave clusters, and each region server replicates its own stream of
+      edits. For more information on the different properties of master/slave replication and other
+      types of replication, see the article <link
+        xlink:href="http://highscalability.com/blog/2009/8/24/how-google-serves-data-from-multiple-datacenters.html">How
+        Google Serves Data From Multiple Datacenters</link>.</para>
+
+    <para>Replication is asynchronous, allowing clusters to be geographically distant or to have
+      some gaps in availability. This also means that data between master and slave clusters will
+      not be instantly consistent. Rows inserted on the master cluster are not immediately
+      available or consistent with rows on the slave clusters. The goal is eventual
+      consistency. </para>
+    
+    <para>The replication format used in this design is conceptually the same as the <firstterm><link
+          xlink:href="http://dev.mysql.com/doc/refman/5.1/en/replication-formats.html">statement-based
+          replication</link></firstterm> design used by MySQL. Instead of SQL statements, entire
+      WALEdits (consisting of multiple cell inserts coming from Put and Delete operations on the
+      clients) are replicated in order to maintain atomicity. </para>
+    
+    <para>The WALs for each region server must be kept in HDFS as long as they are needed to
+      replicate data to any slave cluster. Each region server reads from the oldest log it needs to
+      replicate and keeps track of the current position inside ZooKeeper to simplify failure
+      recovery. That position, as well as the queue of WALs to process, may be different for every
+      slave cluster.</para>
+    
+    <para>The clusters participating in replication can be of different sizes. The master
+      cluster relies on randomization to attempt to balance the stream of replication on the slave
+      clusters.</para>
+    
+    <para>HBase supports master/master and cyclic replication as well as replication to multiple
+      slaves.</para>
+    
+    <figure>
+      <title>Replication Architecture Overview</title>
+      <mediaobject>
+        <imageobject>
+          <imagedata fileref="replication_overview.png" />
+        </imageobject>
+        <textobject>
+          <para>Illustration of the replication architecture in HBase, as described in the prior
+            text.</para>
+        </textobject>
+      </mediaobject>
+    </figure>
+    
+    <formalpara>
+      <title>Enabling and Configuring Replication</title>
+      <para>See the <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements">
+          API documentation for replication</link> for information on enabling and configuring
+        replication.</para>
+    </formalpara>
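+    <para>As a rough sketch of that workflow (the host names, peer ID, table, and column family
+      below are placeholders), replication is switched on in <filename>hbase-site.xml</filename> on
+      both clusters, a peer is registered from the HBase Shell on the master cluster, and the
+      column families to replicate are scoped accordingly:</para>
+    <screen><![CDATA[
+<!-- hbase-site.xml, on both the master and slave clusters -->
+<property>
+  <name>hbase.replication</name>
+  <value>true</value>
+</property>
+
+# HBase Shell, on the master cluster
+hbase> add_peer '1', 'zk1.host.com,zk2.host.com,zk3.host.com:2181:/hbase'
+hbase> alter 'example_table', {NAME => 'example_family', REPLICATION_SCOPE => 1}
+]]></screen>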
+    
+    <section>
+      <title>Life of a WAL Edit</title>
+      <para>A single WAL edit goes through several steps in order to be replicated to a slave
+        cluster.</para>
+
+      <orderedlist>
+        <title>When the slave responds correctly:</title>
+        <listitem>
+          <para>An HBase client uses a Put or Delete operation to manipulate data in HBase.</para>
+        </listitem>
+        <listitem>
+          <para>The region server writes the request to the WAL in a way that would allow it to be
+            replayed if it were not written successfully.</para>
+        </listitem>
+        <listitem>
+          <para>If the changed cell corresponds to a column family that is scoped for replication,
+            the edit is added to the queue for replication.</para>
+        </listitem>
+        <listitem>
+          <para>In a separate thread, the edit is read from the log, as part of a batch process.
+            Only the KeyValues that are eligible for replication are kept. Replicable KeyValues are
+            part of a column family whose schema is scoped GLOBAL, are not part of a catalog such as
+              <code>hbase:meta</code>, and did not originate from the target slave cluster, in the
+            case of cyclic replication.</para>
+        </listitem>
+        <listitem>
+          <para>The edit is tagged with the master's UUID and added to a buffer. When the buffer is
+            filled, or the reader reaches the end of the file, the buffer is sent to a random region
+            server on the slave cluster.</para>
+        </listitem>
+        <listitem>
+          <para>The region server reads the edits sequentially and separates them into buffers, one
+            buffer per table. After all edits are read, each buffer is flushed using <link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html"
+              >HTable</link>, HBase's normal client. The master's UUID is preserved in the edits
+            when they are applied, in order to allow for cyclic replication.</para>
+        </listitem>
+        <listitem>
+          <para>In the master, the offset for the WAL that is currently being replicated is
+            registered in ZooKeeper.</para>
+        </listitem>
+      </orderedlist>
+      <orderedlist>
+        <title>When the slave does not respond:</title>
+        <listitem>
+          <para>The first three steps, where the edit is inserted, are identical.</para>
+        </listitem>
+        <listitem>
+          <para>Again in a separate thread, the region server reads, filters, and buffers the log
+            edits in the same way as above. The slave region server does not answer the RPC
+            call.</para>
+        </listitem>
+        <listitem>
+          <para>The master sleeps and tries again a configurable number of times (see the
+            configuration sketch after this list).</para>
+        </listitem>
+        <listitem>
+          <para>If the slave region server is still not available, the master selects a new subset
+            of region servers to replicate to, and tries again to send the buffer of edits.</para>
+        </listitem>
+        <listitem>
+          <para>Meanwhile, the WALs are rolled and stored in a queue in ZooKeeper. When logs are
+              <firstterm>archived</firstterm> by their region server (moved from the region
+            server's log directory to a central log directory), their paths are updated in the
+            in-memory queue of the replicating thread.</para>
+        </listitem>
+        <listitem>
+          <para>When the slave cluster is finally available, the buffer is applied in the same way
+            as during normal processing. The master region server will then replicate the backlog of
+            logs that accumulated during the outage.</para>
+        </listitem>
+      </orderedlist>
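+      <para>The sleep interval and retry count in the flow above are tunable in
+        <filename>hbase-site.xml</filename>. The following is an illustrative sketch only; the
+        property names <code>replication.source.sleepforretries</code> and
+        <code>replication.source.maxretriesmultiplier</code>, and the values shown, should be
+        verified against the release in use.</para>
+      <screen><![CDATA[
+<property>
+  <name>replication.source.sleepforretries</name>
+  <value>1000</value> <!-- milliseconds to sleep between retries -->
+</property>
+<property>
+  <name>replication.source.maxretriesmultiplier</name>
+  <value>10</value> <!-- cap on the multiplier applied to the sleep time -->
+</property>
+]]></screen>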
+
+
+      <note xml:id="cluster.replication.preserving.tags">
+        <title>Preserving Tags During Replication</title>
+        <para>By default, the codec used for replication between clusters strips tags, such as
+          cell-level ACLs, from cells. To prevent the tags from being stripped, you can use a
+          different codec which does not strip them. Configure
+            <code>hbase.replication.rpc.codec</code> to use
+            <literal>org.apache.hadoop.hbase.codec.KeyValueCodecWithTags</literal>, on both the
+          source and sink RegionServers involved in the replication. This option was introduced in
+            <link xlink:href="https://issues.apache.org/jira/browse/HBASE-10322"
+          >HBASE-10322</link>.</para>
+      </note>
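+      <para>Concretely, the codec setting described in the preceding note is an
+        <filename>hbase-site.xml</filename> entry on both the source and sink RegionServers, for
+        example:</para>
+      <screen><![CDATA[
+<property>
+  <name>hbase.replication.rpc.codec</name>
+  <value>org.apache.hadoop.hbase.codec.KeyValueCodecWithTags</value>
+</property>
+]]></screen>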
+    </section>
+    
+    <section>
+      <title>Replication Internals</title>
+      <variablelist>
+        <varlistentry>
+          <term>Replication State in ZooKeeper</term>
+          <listitem>
+            <para>HBase replication maintains its state in ZooKeeper. By default, the state is
+              contained in the base node <filename>/hbase/replication</filename>. This node contains
+              two child znodes: the <code>peers</code> znode and the <code>rs</code> znode. An
+              example of inspecting these znodes follows this list.</para>
+          </listitem>
+        </varlistentry>
+        <varlistentry>
+          <term>The <code>Peers</code> Znode</term>
+          <listitem>
+            <para>The <code>peers</code> znode is stored in
+                <filename>/hbase/replication/peers</filename> by default. It consists of a list of
+              all peer replication clusters, along with the status of each of them. The value of
+              each peer is its cluster key, which is provided in the HBase Shell. The cluster key
+              contains a list of ZooKeeper nodes in the cluster's quorum, the client port for the
+              ZooKeeper quorum, and the base znode for HBase on that cluster.</para>
+            <screen>
+/hbase/replication/peers
+  /1 [Value: zk1.host.com,zk2.host.com,zk3.host.com:2181:/hbase]
+  /2 [Value: zk5.host.com,zk6.host.com,zk7.host.com:2181:/hbase]            
+          </screen>
+            <para>Each peer has a child znode which indicates whether or not replication is enabled
+              on that cluster. These peer-state znodes do not contain any child znodes, but only
+              contain a Boolean value. This value is read and maintained by the
+                <code>ReplicationPeer.PeerStateTracker</code> class.</para>
+            <screen>
+/hbase/replication/peers
+  /1/peer-state [Value: ENABLED]
+  /2/peer-state [Value: DISABLED]
+          </screen>
+          </listitem>
+        </varlistentry>
+        <varlistentry>
+          <term>The <code>RS</code> Znode</term>
+          <listitem>
+            <para>The <code>rs</code> znode contains a list of WAL logs which need to be replicated.
+              This list is divided into a set of queues organized by region server and the peer
+              cluster the region server is shipping the logs to. The rs znode has one child znode
+              for each region server in the cluster. The child znode name is the region server's
+              hostname, client port, and start code. This list includes both live and dead region
+              servers.</para>
+            <screen>
+/hbase/replication/rs
+  /hostname.example.org,6020,1234
+  /hostname2.example.org,6020,2856            
+          </screen>
+            <para>Each <code>rs</code> znode contains a list of WAL replication queues, one queue
+              for each peer cluster it replicates to. These queues are represented by child znodes
+              named by the cluster ID of the peer cluster they represent.</para>
+            <screen>
+/hbase/replication/rs
+  /hostname.example.org,6020,1234
+    /1
+    /2            
+          </screen>
+            <para>Each queue has one child znode for each WAL log that still needs to be replicated.
+              The value of these child znodes is the last position that was replicated. This
+              position is updated each time a WAL entry is replicated.</para>
+            <screen>
+/hbase/replication/rs
+  /hostname.example.org,6020,1234
+    /1
+      23522342.23422 [VALUE: 254]
+      12340993.22342 [VALUE: 0]            
+          </screen>
+          </listitem>
+        </varlistentry>
+      </variablelist>
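+      <para>The layout described above can be inspected with the ZooKeeper command-line client that
+        ships with HBase. The session below is illustrative, assuming the default base znode and the
+        example region server names used earlier:</para>
+      <screen>
+$ bin/hbase zkcli
+[zk: localhost:2181(CONNECTED) 0] ls /hbase/replication
+[peers, rs]
+[zk: localhost:2181(CONNECTED) 1] ls /hbase/replication/rs
+[hostname.example.org,6020,1234, hostname2.example.org,6020,2856]
+</screen>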
+    </section>
+    <section>
+      <title>Replication Configuration Options</title>
+    <informaltable>
+      <tgroup cols="3">
+        <thead>
+          <row>
+            <entry>Option</entry>
+            <entry>Description</entry>
+            <entry>Default</entry>
+          </row>
+        </thead>
+        <tbody>
+          <row>
+            <entry><para><code>zookeeper.znode.parent</code></para></entry>
+            <entry><para>The name of the base ZooKeeper znode used for HBase</para></entry>
+            <entry><para><literal>/hbase</literal></para></entry>
+          </row>
+          <row>
+            <entry><para><code>zookeeper.znode.replication</code></para></entry>
+            <entry><para>The name of the base znode used for replication</para></entry>
+            <entry><para><literal>replication</literal></para></entry>
+          </row>
+          <row>
+            <entry><para><code>zookeeper.znode.replication.peers</code></para></entry>
+            <entry><para>The name of the <code>peer</code> znode</para></entry>
+            <entry><para><literal>peers</literal></para></entry>
+          </row>
+          <row>
+            <entry><para><code>zookeeper.znode.replication.peers.state</code></para></entry>
+            <entry><para>The name of the <code>peer-state</code> znode</para></entry>
+            <entry><para><literal>peer-state</literal></para></entry>
+          </row>
+          <row>
+            <entry><para><code>zookeeper.znode.replication.rs</code></para></entry>
+            <entry><para>The name of the <code>rs</code> znode</para></entry>
+            <entry><para><literal>rs</literal></para></entry>
+          </row>
+          <row>
+            <entry><para><code>hbase.replication</code></para></entry>
+            <entry><para>Whether replication is enabled or disabled on a given cluster</para></entry>
+            <entry><para><literal>false</literal></para></entry>
+          </row>
+          <row>
+            <entry><para><code>replication.sleep.before.failover</code></para></entry>
+            <entry><para>How many milliseconds a worker should sleep before attempting to replicate
+              a dead region server's WAL queues.</para></entry>
+            <entry><para><literal>2000</literal></para></entry>
+          </row>
+          <row>
+            <entry><para><code>replication.executor.workers</code></para></entry>
+            <entry><para>The number of dead region servers one region server should attempt to
+                fail over simultaneously.</para></entry>
+            <entry><para><literal>1</literal></para></entry>
+          </row>
+        </tbody>
+      </tgroup>
+    </informaltable>
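+      <para>These options compose into the znode paths used by replication. With the defaults
+        above, the resulting structure is:</para>
+      <screen>
+/hbase/replication/peers/{peerId}/peer-state
+/hbase/replication/rs
+</screen>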
+    </section>
+    
+    <section>
+      <title>Replication Implementation Details</title>
+      <formalpara>
+        <title>Choosing Region Servers to Replicate To</title>
+        <para>When a master cluster region server initiates a replication source to a slave cluster,
+          it first connects to the slave's ZooKeeper ensemble using the provided cluster key. It
+          then scans the <filename>rs/</filename> directory to discover all the available sinks
+          (region servers that are accepting incoming streams of edits to replicate) and randomly
+          chooses a subset of them using a configured ratio which has a default value of 10%. For
+          example, if a slave cluster has 150 machines, 15 will be chosen as potential recipients for
+          edits that this master cluster region server sends. Because this selection is performed by
+          each master region server, the probability that all slave region servers are used is very
+          high, and this method works for clusters of any size. For example, a master cluster of 10
+          machines replicating to a slave cluster of 5 machines with a ratio of 10% causes the
+          master cluster region servers to choose one machine each at random.</para>
+      </formalpara>
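+      <para>The selection ratio itself is configurable. As a sketch, assuming the governing
+        property is <code>replication.source.ratio</code> (verify the name and default against the
+        release in use), the ratio could be raised to 25% as follows:</para>
+      <screen><![CDATA[
+<property>
+  <name>replication.source.ratio</name>
+  <value>0.25</value>
+</property>
+]]></screen>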
+      <para>A ZooKeeper watcher is placed on the
+            <filename>${<replaceable>zookeeper.znode.parent</replaceable>}/rs</filename> node of the
+        slave cluster by each of the master cluster's region servers. This watch is used to monitor
+        changes in the composition of the slave cluster. When nodes are removed from the slave
+        cluster, or if nodes go down or come back up, the master cluster's region servers will
+        respond by selecting a new pool of slave region servers to replicate to.</para>
+
+      <formalpara>
+        <title>Keeping Track of Logs</title>
+
+        <para>Each master cluster region server has its own znode in the replication znodes
+          hierarchy. It contains one znode per peer cluster (if 5 slave clusters, 5 znodes are
+          created), and each of these contain a queue of WALs to process. Each of these queues will
+          track the WALs created by that region server, but they can differ in size. For example, if
+          one slave cluster becomes unavailable for some time, the WALs should not be deleted, so
+          they need to stay in the queue while the others are processed. See <xref
+            linkend="rs.failover.details"/> for an example.</para>
+      </formalpara>
+      <para>When a source is instantiated, it contains the current WAL that the region server is
+        writing to. During log rolling, the new file is added to the queue of each slave cluster's
+        znode just before it is made available. This ensures that all the sources are aware that a
+        new log exists before the region server is able to append edits into it, but this operation
+        is now more expensive. The queue items are discarded when the replication thread cannot read
+        more entries from a file (because it reached the end of the last block) and there are other
+        files in the queue. This means that if a source is up to date and replicates from the log
+        that the region server writes to, reading up to the "end" of the current file will not
+        delete the item in the queue.</para>
+      <para>A log can be archived if it is no longer used or if the number of logs exceeds
+          <code>hbase.regionserver.maxlogs</code> because the insertion rate is faster than regions
+        are flushed. When a log is archived, the source threads are notified that the path for that
+        log changed. If a particular source has already finished with an archived log, it will just
+        ignore the message. If the log is in the queue, the path will be updated in memory. If the
+        log is currently being replicated, the change will be done atomically so that the reader
+        doesn't attempt to open the file when it has already been moved. Because moving a file is a
+        NameNode operation, if the reader is currently reading the log, it won't generate any
+        exception.</para>
+      <formalpara>
+        <title>Reading, Filtering and Sending Edits</title>
+        <para>By default, a source attempts to read from a WAL and ship log entries to a sink as
+          quickly as possible. Speed is limited by the filtering of log entries: only KeyValues that
+          are scoped GLOBAL and that do not belong to catalog tables will be retained. Speed is also
+          limited by total size of the list of edits to replicate per slave, which is limited to 64
+          MB by default. With this configuration, a master cluster region server with three slaves
+          would use at most 192 MB to store data to replicate. This does not account for the data
+          which was filtered but not garbage collected.</para>
+      </formalpara>
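+      <para>The 64 MB buffer limit is also configurable. As a sketch, assuming the governing
+        property is <code>replication.source.size.capacity</code> expressed in bytes (verify the
+        name and default against the release in use):</para>
+      <screen><![CDATA[
+<property>
+  <name>replication.source.size.capacity</name>
+  <value>67108864</value> <!-- 64 MB -->
+</property>
+]]></screen>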
+      <para>Once the maximum size of edits has been buffered or the reader reaches the end of the
+        WAL, the source thread stops reading and chooses at random a sink to replicate to (from the
+        list that was generated by keeping only a subset of slave region servers). It directly
+        issues a RPC to the chosen region server and waits for the method to return. If the RPC was
+        successful, the source determines whether the current file has been emptied or it contains
+        more data which needs to be read. If the file has been emptied, the source deletes the znode
+        in the queue. Otherwise, it registers the new offset in the log's znode. If the RPC threw an
+        exception, the source will retry 10 times before trying to find a different sink.</para>
+      <formalpara>
+        <title>Cleaning Logs</title>
+        <para>If replication is not enabled, the master's log-cleaning thread deletes old logs using
+          a configured TTL. This TTL-based method does not work well with replication, because
+          archived logs which have exceeded their TTL may still be in a queue. The default behavior
+          is augmented so that if a log is past its TTL, the cleaning thread looks up every queue
+          until it finds the log, while caching queues it has found. If the log is not found in any
+          queues, the log will be deleted. The next time the cleaning process needs to look for a
+          log, it starts by using its cached list.</para>
+      </formalpara>
+      <formalpara xml:id="rs.failover.details">
+        <title>Region Server Failover</title>
+        <para>When no region servers are failing, keeping track of the logs in ZooKeeper adds no
+          value. Unfortunately, region servers do fail, and since ZooKeeper is highly available, it
+          is useful for managing the transfer of the queues in the event of a failure.</para>
+      </formalpara>
+      <para>Each of the master cluster region servers keeps a watcher on every other region server,
+        in order to be notified when one dies (just as the master does). When a failure happens,
+        they all race to create a znode called <literal>lock</literal> inside the dead region
+        server's znode that contains its queues. The region server that creates it successfully then
+        transfers all the queues to its own znode, one at a time since ZooKeeper does not support
+        renaming znodes. After queues are all transferred, they are deleted from the old location.
+        The znodes that were recovered are renamed with the ID of the slave cluster appended with
+        the name of the dead server.</para>
+      <para>Next, the master cluster region server creates one new source thread per copied queue,
+        and each of the source threads follows the read/filter/ship pattern. The main difference is
+        that those queues will never receive new data, since they do not belong to their new region
+        server. When the reader hits the end of the last log, the queue's znode is deleted and the
+        master cluster region server closes that replication source.</para>
+      <para>Given a master cluster with 3 region servers replicating to a single slave with id
+          <literal>2</literal>, the following hierarchy represents what the znodes layout could be
+        at some point in time. The region servers' znodes all contain a <literal>peers</literal>
+        znode which contains a single queue. The znode names in the queues represent the actual file
+        names on HDFS in the form
+            <literal><replaceable>address</replaceable>,<replaceable>port</replaceable>.<replaceable>timestamp</replaceable></literal>.</para>
+      <screen>
+/hbase/replication/rs/
+  1.1.1.1,60020,123456780/
+    2/
+      1.1.1.1,60020.1234  (Contains a position)
+      1.1.1.1,60020.1265
+  1.1.1.2,60020,123456790/
+    2/
+      1.1.1.2,60020.1214  (Contains a position)
+      1.1.1.2,60020.1248
+      1.1.1.2,60020.1312
+  1.1.1.3,60020,123456630/
+    2/
+      1.1.1.3,60020.1280  (Contains a position)            
+          </screen>
+      <para>Assume that 1.1.1.2 loses its ZooKeeper session. The survivors will race to create a
+        lock, and, arbitrarily, 1.1.1.3 wins. It will then start transferring all the queues to its
+        local peers znode by appending the name of the dead server. Right before 1.1.1.3 is able to
+        clean up the old znodes, the layout will look like the following:</para>
+      <screen>
+/hbase/replication/rs/
+  1.1.1.1,60020,123456780/
+    2/
+      1.1.1.1,60020.1234  (Contains a position)
+      1.1.1.1,60020.1265
+  1.1.1.2,60020,123456790/
+    lock
+    2/
+      1.1.1.2,60020.1214  (Contains a position)
+      1.1.1.2,60020.1248
+      1.1.1.2,60020.1312
+  1.1.1.3,60020,123456630/
+    2/
+      1.1.1.3,60020.1280  (Contains a position)
+
+    2-1.1.1.2,60020,123456790/
+      1.1.1.2,60020.1214  (Contains a position)
+      1.1.1.2,60020.1248
+      1.1.1.2,60020.1312            
+          </screen>
+      <para>Some time later, but before 1.1.1.3 is able to finish replicating the last WAL from
+        1.1.1.2, it dies too. Some new logs were also created in the normal queues. The last region
+        server will then try to lock 1.1.1.3's znode and will begin transferring all the queues. The
+        new layout will be:</para>
+      <screen>
+/hbase/replication/rs/
+  1.1.1.1,60020,123456780/
+    2/
+      1.1.1.1,60020.1378  (Contains a position)
+
+    2-1.1.1.3,60020,123456630/
+      1.1.1.3,60020.1325  (Contains a position)
+      1.1.1.3,60020.1401
+
+    2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
+      1.1.1.2,60020.1312  (Contains a position)
+  1.1.1.3,60020,123456630/
+    lock
+    2/
+      1.1.1.3,60020.1325  (Contains a position)
+      1.1.1.3,60020.1401
+
+    2-1.1.1.2,60020,123456790/
+      1.1.1.2,60020.1312  (Contains a position)            
+          </screen>
+      <formalpara>
+        <title>Replication Metrics</title>
+        <para>The following metrics are exposed at the global region server level and (since HBase
+          0.95) at the peer level:</para>
+      </formalpara>
+      <variablelist>
+        <varlistentry>
+          <term><code>source.sizeOfLogQueue</code></term>
+          <listitem>
+            <para>number of WALs to process (excludes the one which is being processed) at the
+              Replication source</para>
+          </listitem>
+        </varlistentry>
+        <varlistentry>
+          <term><code>source.shippedOps</code></term>
+          <listitem>
+            <para>number of mutations shipped</para>
+          </listitem>
+        </varlistentry>
+        <varlistentry>
+          <term><code>source.logEditsRead</code></term>
+          <listitem>
+            <para>number of mutations read from HLogs at the replication source</para>
+          </listitem>
+        </varlistentry>
+        <varlistentry>
+          <term><code>source.ageOfLastShippedOp</code></term>
+          <listitem>
+            <para>age of last batch that was shipped by the replication source</para>
+          </listitem>
+        </varlistentry>
+      </variablelist>
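+      <para>One way to spot-check these metrics is through the region server's JMX dump on its info
+        server. The host name below is a placeholder, and 60030 is the usual default value of
+        <code>hbase.regionserver.info.port</code>:</para>
+      <screen>
+$ curl http://rs-host.example.org:60030/jmx | grep -i source
+</screen>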
+
+    </section>
   </section>
   <section
     xml:id="ops.backup">

http://git-wip-us.apache.org/repos/asf/hbase/blob/fe54e7d7/src/main/site/xdoc/replication.xml
----------------------------------------------------------------------
diff --git a/src/main/site/xdoc/replication.xml b/src/main/site/xdoc/replication.xml
index 97aaf51..2633f08 100644
--- a/src/main/site/xdoc/replication.xml
+++ b/src/main/site/xdoc/replication.xml
@@ -26,520 +26,6 @@
     </title>
   </properties>
   <body>
-    <section name="Overview">
-      <p>
-        The replication feature of Apache HBase (TM) provides a way to copy data between HBase deployments. It
-        can serve as a disaster recovery solution and can contribute to provide
-        higher availability at the HBase layer. It can also serve more practically;
-        for example, as a way to easily copy edits from a web-facing cluster to a "MapReduce"
-        cluster which will process old and new data and ship back the results
-        automatically.
-      </p>
-      <p>
-        The basic architecture pattern used for Apache HBase replication is (HBase cluster) master-push;
-        it is much easier to keep track of what’s currently being replicated since
-        each region server has its own write-ahead-log (aka WAL or HLog), just like
-        other well known solutions like MySQL master/slave replication where
-        there’s only one bin log to keep track of. One master cluster can
-        replicate to any number of slave clusters, and each region server will
-        participate to replicate their own stream of edits. For more information
-        on the different properties of master/slave replication and other types
-        of replication, please consult <a href="http://highscalability.com/blog/2009/8/24/how-google-serves-data-from-multiple-datacenters.html">
-        How Google Serves Data From Multiple Datacenters</a>.
-      </p>
-      <p>
-        The replication is done asynchronously, meaning that the clusters can
-        be geographically distant, the links between them can be offline for
-        some time, and rows inserted on the master cluster won’t be
-        available at the same time on the slave clusters (eventual consistency).
-      </p>
-      <p>
-        The replication format used in this design is conceptually the same as
-        <a href="http://dev.mysql.com/doc/refman/5.1/en/replication-formats.html">
-        MySQL’s statement-based replication </a>. Instead of SQL statements, whole
-        WALEdits (consisting of multiple cell inserts coming from the clients'
-        Put and Delete) are replicated in order to maintain atomicity.
-      </p>
-      <p>
-        The HLogs from each region server are the basis of HBase replication,
-        and must be kept in HDFS as long as they are needed to replicate data
-        to any slave cluster. Each RS reads from the oldest log it needs to
-        replicate and keeps the current position inside ZooKeeper to simplify
-        failure recovery. That position can be different for every slave 
-        cluster, same for the queue of HLogs to process.
-      </p>
-      <p>
-        The clusters participating in replication can be of asymmetric sizes
-        and the master cluster will do its “best effort” to balance the stream
-        of replication on the slave clusters by relying on randomization.
-      </p>
-      <p>
-        As of version 0.92, Apache HBase supports master/master and cyclic
-        replication as well as replication to multiple slaves.
-      </p>
-      <img src="images/replication_overview.png"/>
-    </section>
-    <section name="Enabling replication">
-      <p>
-        The guide on enabling and using cluster replication is contained
-        in the API documentation shipped with your Apache HBase distribution.
-      </p>
-      <p>
-        The most up-to-date documentation is
-        <a href="apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements">
-        available at this address</a>.
-      </p>
-    </section>
-    <section name="Life of a log edit">
-      <p>
-        The following sections describe the life of a single edit going from a
-        client that communicates with a master cluster all the way to a single
-        slave cluster.
-      </p>
-      <section name="Normal processing">
-        <p>
-          The client uses an API that sends a Put, Delete or ICV to a region
-          server. The key values are transformed into a WALEdit by the region
-          server and is inspected by the replication code that, for each family
-          that is scoped for replication, adds the scope to the edit. The edit
-          is appended to the current WAL and is then applied to its MemStore.
-        </p>
-        <p>
-          In a separate thread, the edit is read from the log (as part of a batch)
-          and only the KVs that are replicable are kept (that is, that they are part
-          of a family scoped GLOBAL in the family's schema, non-catalog so not
-          hbase:meta or -ROOT-, and did not originate in the target slave cluster - in
-          case of cyclic replication).
-        </p>
-        <p>
-          The edit is then tagged with the master's cluster UUID.
-          When the buffer is filled, or the reader hits the end of the file,
-          the buffer is sent to a random region server on the slave cluster.
-        </p>
-        <p>
-          Synchronously, the region server that receives the edits reads them
-          sequentially and separates each of them into buffers, one per table.
-          Once all edits are read, each buffer is flushed using HTable, the normal
-          HBase client.The master's cluster UUID is retained in the edits applied at 
-          the slave cluster in order to allow cyclic replication.
-        </p>
-        <p>
-          Back in the master cluster's region server, the offset for the current
-          WAL that's being replicated is registered in ZooKeeper.
-        </p>
-      </section>
-      <section name="Non-responding slave clusters">
-        <p>
-          The edit is inserted in the same way.
-        </p>
-        <p>
-          In the separate thread, the region server reads, filters and buffers
-          the log edits the same way as during normal processing. The slave
-          region server that's contacted doesn't answer to the RPC, so the master
-          region server will sleep and retry up to a configured number of times.
-          If the slave RS still isn't available, the master cluster RS will select a
-          new subset of RS to replicate to and will retry sending the buffer of
-          edits.
-        </p>
-        <p>
-          In the mean time, the WALs will be rolled and stored in a queue in
-          ZooKeeper. Logs that are archived by their region server (archiving is
-          basically moving a log from the region server's logs directory to a
-          central logs archive directory) will update their paths in the in-memory
-          queue of the replicating thread.
-        </p>
-        <p>
-          When the slave cluster is finally available, the buffer will be applied
-          the same way as during normal processing. The master cluster RS will then
-          replicate the backlog of logs.
-        </p>
-      </section>
-    </section>
-    <section name="Internals">
-      <p>
-        This section describes in depth how each of replication's internal
-        features operate.
-      </p>
-      <section name="Replication Zookeeper State">
-        <p>
-            HBase replication maintains all of its state in Zookeeper. By default, this state is 
-            contained in the base znode:
-        </p>
-        <pre>
-                /hbase/replication
-        </pre>
-        <p>
-            There are two major child znodes in the base replication znode:
-            <ul>
-            <li><b>Peers znode:</b> /hbase/replication/peers</li>
-            <li><b>RS znode:</b> /hbase/replication/rs</li>
-            </ul>
-        </p>
-        <section name="The Peers znode">
-            <p>
-                The <b>peers znode</b> contains a list of all peer replication clusters and the 
-                current replication state of those clusters. It has one child <i>peer znode</i> 
-                for each peer cluster. The <i>peer znode</i> is named with the cluster id provided 
-                by the user in the HBase shell. The value of the <i>peer znode</i> contains 
-                the peers cluster key provided by the user in the HBase Shell. The cluster key 
-                contains a list of zookeeper nodes in the clusters quorum, the client port for the 
-                zookeeper quorum, and the base znode for HBase 
-                (i.e. “zk1.host.com,zk2.host.com,zk3.host.com:2181:/hbase”).
-            </p>
-            <pre>
-                /hbase/replication/peers
-                    /1 [Value: zk1.host.com,zk2.host.com,zk3.host.com:2181:/hbase]
-                    /2 [Value: zk5.host.com,zk6.host.com,zk7.host.com:2181:/hbase]
-            </pre>
-            <p>
-                Each of these <i>peer znodes</i> has a child znode that indicates whether or not 
-                replication is enabled on that peer cluster. These <i>peer-state znodes</i> do not 
-                have child znodes and simply contain a boolean value (i.e. ENABLED or DISABLED). 
-                This value is read/maintained by the <i>ReplicationPeer.PeerStateTracker</i> class. 
-            </p>
-            <pre>
-                /hbase/replication/peers
-                    /1/peer-state [Value: ENABLED]
-                    /2/peer-state [Value: DISABLED]
-            </pre>
-        </section>
-        <section name="The RS znode">
-            <p>
-                The <b>rs znode</b> contains a list of all outstanding HLog files in the cluster 
-                that need to be replicated. The list is divided into a set of queues organized by 
-                region server and the peer cluster the region server is shipping the HLogs to. The 
-                <b>rs znode</b> has one child znode for each region server in the cluster. The child 
-                znode name is simply the regionserver name (a concatenation of the region server’s 
-                hostname, client port and start code). These region servers could either be dead or alive.
-            </p>
-            <pre>
-                /hbase/replication/rs
-                    /hostname.example.org,6020,1234
-                    /hostname2.example.org,6020,2856
-            </pre>
-            <p>
-                Within each region server znode, the region server maintains a set of HLog replication 
-                queues. Each region server has one queue for every peer cluster it replicates to. 
-                These queues are represented by child znodes named using the cluster id of the peer 
-                cluster they represent (see the peer znode section).
-            </p>
-            <pre>
-                /hbase/replication/rs
-                    /hostname.example.org,6020,1234
-                        /1
-                        /2
-            </pre>
-            <p>
-                Each queue has one child znode for every HLog that still needs to be replicated. 
-                The value of these HLog child znodes is the latest position that has been replicated. 
-                This position is updated every time a HLog entry is replicated.
-            </p>
-            <pre>
-                /hbase/replication/rs
-                    /hostname.example.org,6020,1234
-                        /1
-                            23522342.23422 [VALUE: 254]
-                            12340993.22342 [VALUE: 0]
-            </pre>
-        </section>
-      </section>
-      <section name="Configuration Parameters">
-        <section name="Zookeeper znode paths">
-            <p>
-                All of the base znode names are configurable through parameters:
-            </p>
-            <table border="1">
-                <tr>
-                    <td><b>Parameter</b></td>
-                    <td><b>Default Value</b></td>
-                </tr>
-                                <tr>
-                    <td>zookeeper.znode.parent</td>
-                    <td>/hbase</td>
-                </tr>
-                <tr>
-                    <td>zookeeper.znode.replication</td>
-                    <td>replication</td>
-                </tr>
-                <tr>
-                    <td>zookeeper.znode.replication.peers</td>
-                    <td>peers</td>
-                </tr>
-                <tr>
-                    <td>zookeeper.znode.replication.peers.state</td>
-                    <td>peer-state</td>
-                </tr>
-                <tr>
-                    <td>zookeeper.znode.replication.rs</td>
-                    <td>rs</td>
-                </tr>
-            </table>
-            <p>
-                The default replication znode structure looks like the following:
-            </p>
-            <pre>
-                /hbase/replication/peers/{peerId}/peer-state
-                /hbase/replication/rs
-            </pre>
-        </section>
-        <section name="Other parameters">
-            <ul>
-                <li><b>hbase.replication</b> (Default: false) - Controls whether replication is enabled 
-                or disabled for the cluster.</li>
-                <li><b>replication.sleep.before.failover</b> (Default: 2000) - The amount of time a failover 
-                worker waits before attempting to replicate a dead region server’s HLog queues.</li>
-                <li><b>replication.executor.workers</b> (Default: 1) - The number of dead region servers 
-                one region server should attempt to failover simultaneously.</li>
-            </ul>
-        </section>
-      </section>
-      <section name="Choosing region servers to replicate to">
-        <p>
-          When a master cluster RS initiates a replication source to a slave cluster,
-          it first connects to the slave's ZooKeeper ensemble using the provided
-          cluster key (that key is composed of the value of hbase.zookeeper.quorum,
-          zookeeper.znode.parent and hbase.zookeeper.property.clientPort). It
-          then scans the "rs" directory to discover all the available sinks
-          (region servers that are accepting incoming streams of edits to replicate)
-          and will randomly choose a subset of them using a configured
-          ratio (which has a default value of 10%). For example, if a slave
-          cluster has 150 machines, 15 will be chosen as potential recipient for
-          edits that this master cluster RS will be sending. Since this is done by all
-          master cluster RSs, the probability that all slave RSs are used is very high,
-          and this method works for clusters of any size. For example, a master cluster
-          of 10 machines replicating to a slave cluster of 5 machines with a ratio
-          of 10% means that the master cluster RSs will choose one machine each
-          at random, thus the chance of overlapping and full usage of the slave
-          cluster is higher.
-        </p>
-        <p>
-          A ZK watcher is placed on the ${zookeeper.znode.parent}/rs node of 
-          the slave cluster by each of the master cluster's region servers. 
-          This watch is used to monitor changes in the composition of the 
-          slave cluster. When nodes are removed from the slave cluster (or 
-          if nodes go down and/or come back up), the master cluster's region
-          servers will respond by selecting a new pool of slave region servers
-          to replicate to.
-        </p>
-      </section>
-      <section name="Keeping track of logs">
-        <p>
-          Every master cluster RS has its own znode in the replication znodes hierarchy.
-          It contains one znode per peer cluster (if 5 slave clusters, 5 znodes
-          are created), and each of these contain a queue
-          of HLogs to process. Each of these queues will track the HLogs created
-          by that RS, but they can differ in size. For example, if one slave
-          cluster becomes unavailable for some time then the HLogs should not be deleted,
-          thus they need to stay in the queue (while the others are processed).
-          See the section named "Region server failover" for an example.
-        </p>
-        <p>
-          When a source is instantiated, it contains the current HLog that the
-          region server is writing to. During log rolling, the new file is added
-          to the queue of each slave cluster's znode just before it's made available.
-          This ensures that all the sources are aware that a new log exists
-          before HLog is able to append edits into it, but this operations is
-          now more expensive.
-          The queue items are discarded when the replication thread cannot read
-          more entries from a file (because it reached the end of the last block)
-          and that there are other files in the queue.
-          This means that if a source is up-to-date and replicates from the log
-          that the region server writes to, reading up to the "end" of the
-          current file won't delete the item in the queue.
-        </p>
-        <p>
-          When a log is archived (because it's not used anymore or because there's
-          too many of them per hbase.regionserver.maxlogs typically because insertion
-          rate is faster than region flushing), it will notify the source threads that the path
-          for that log changed. If the a particular source was already done with
-          it, it will just ignore the message. If it's in the queue, the path
-          will be updated in memory. If the log is currently being replicated,
-          the change will be done atomically so that the reader doesn't try to
-          open the file when it's already moved. Also, moving a file is a NameNode
-          operation so, if the reader is currently reading the log, it won't
-          generate any exception.
-        </p>
-      </section>
-      <section name="Reading, filtering and sending edits">
-        <p>
-          By default, a source will try to read from a log file and ship log
-          entries as fast as possible to a sink. This is first limited by the
-          filtering of log entries; only KeyValues that are scoped GLOBAL and
-          that don't belong to catalog tables will be retained. A second limit
-          is imposed on the total size of the list of edits to replicate per slave,
-          which by default is 64MB. This means that a master cluster RS with 3 slaves
-          will use at most 192MB to store data to replicate. This doesn't account
-          the data filtered that wasn't garbage collected.
-        </p>
-        <p>
-          Once the maximum size of edits was buffered or the reader hits the end
-          of the log file, the source thread will stop reading and will choose
-          at random a sink to replicate to (from the list that was generated by
-          keeping only a subset of slave RSs). It will directly issue a RPC to
-          the chosen machine and will wait for the method to return. If it's
-          successful, the source will determine if the current file is emptied
-          or if it should continue to read from it. If the former, it will delete
-          the znode in the queue. If the latter, it will register the new offset
-          in the log's znode. If the RPC threw an exception, the source will retry
-          10 times until trying to find a different sink.
-        </p>
-      </section>
-      <section name="Cleaning logs">
-        <p>
-          If replication isn't enabled, the master's logs cleaning thread will
-          delete old logs using a configured TTL. This doesn't work well with
-          replication since archived logs passed their TTL may still be in a
-          queue. Thus, the default behavior is augmented so that if a log is
-          passed its TTL, the cleaning thread will lookup every queue until it
-          finds the log (while caching the ones it finds). If it's not found,
-          the log will be deleted. The next time it has to look for a log,
-          it will first use its cache.
-        </p>
-      </section>
-      <section name="Region server failover">
-        <p>
-          As long as region servers don't fail, keeping track of the logs in ZK
-          doesn't add any value. Unfortunately, they do fail, so since ZooKeeper
-          is highly available we can count on it and its semantics to help us
-          managing the transfer of the queues.
-        </p>
-        <p>
-          All the master cluster RSs keep a watcher on every other one of them to be
-          notified when one dies (just like the master does). When it happens,
-          they all race to create a znode called "lock" inside the dead RS' znode
-          that contains its queues. The one that creates it successfully will
-          proceed by transferring all the queues to its own znode (one by one
-          since ZK doesn't support the rename operation) and will delete all the
-          old ones when it's done. The recovered queues' znodes will be named
-          with the id of the slave cluster appended with the name of the dead
-          server. 
-        </p>
-        <p>
-          Once that is done, the master cluster RS will create one new source thread per
-          copied queue, and each of them will follow the read/filter/ship pattern.
-          The main difference is that those queues will never have new data since
-          they don't belong to their new region server, which means that when
-          the reader hits the end of the last log, the queue's znode will be
-          deleted and the master cluster RS will close that replication source.
-        </p>
-        <p>
-          For example, consider a master cluster with 3 region servers that's
-          replicating to a single slave with id '2'. The following hierarchy
-          represents what the znodes layout could be at some point in time. We
-          can see the RSs' znodes all contain a "peers" znode that contains a
-          single queue. The znode names in the queues represent the actual file
-          names on HDFS in the form "address,port.timestamp".
-        </p>
-        <pre>
-/hbase/replication/rs/
-                      1.1.1.1,60020,123456780/
-                          2/
-                              1.1.1.1,60020.1234  (Contains a position)
-                              1.1.1.1,60020.1265
-                      1.1.1.2,60020,123456790/
-                          2/
-                              1.1.1.2,60020.1214  (Contains a position)
-                              1.1.1.2,60020.1248
-                              1.1.1.2,60020.1312
-                      1.1.1.3,60020,    123456630/
-                          2/
-                              1.1.1.3,60020.1280  (Contains a position)
-        </pre>
-        <p>
-          Now let's say that 1.1.1.2 loses its ZK session. The survivors will race
-          to create a lock, and for some reasons 1.1.1.3 wins. It will then start
-          transferring all the queues to its local peers znode by appending the
-          name of the dead server. Right before 1.1.1.3 is able to clean up the
-          old znodes, the layout will look like the following:
-        </p>
-        <pre>
-/hbase/replication/rs/
-                      1.1.1.1,60020,123456780/
-                          2/
-                              1.1.1.1,60020.1234  (Contains a position)
-                              1.1.1.1,60020.1265
-                      1.1.1.2,60020,123456790/
-                          lock
-                          2/
-                              1.1.1.2,60020.1214  (Contains a position)
-                              1.1.1.2,60020.1248
-                              1.1.1.2,60020.1312
-                      1.1.1.3,60020,123456630/
-                          2/
-                              1.1.1.3,60020.1280  (Contains a position)
-
-                          2-1.1.1.2,60020,123456790/
-                              1.1.1.2,60020.1214  (Contains a position)
-                              1.1.1.2,60020.1248
-                              1.1.1.2,60020.1312
-        </pre>
-        <p>
-          Some time later, but before 1.1.1.3 is able to finish replicating the
-          last HLog from 1.1.1.2, let's say that it dies too (also some new logs
-          were created in the normal queues). The last RS will then try to lock
-          1.1.1.3's znode and will begin transferring all the queues. The new
-          layout will be:
-        </p>
-        <pre>
-/hbase/replication/rs/
-                      1.1.1.1,60020,123456780/
-                          2/
-                              1.1.1.1,60020.1378  (Contains a position)
-
-                          2-1.1.1.3,60020,123456630/
-                              1.1.1.3,60020.1325  (Contains a position)
-                              1.1.1.3,60020.1401
-
-                          2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
-                              1.1.1.2,60020.1312  (Contains a position)
-                      1.1.1.3,60020,123456630/
-                          lock
-                          2/
-                              1.1.1.3,60020.1325  (Contains a position)
-                              1.1.1.3,60020.1401
-
-                          2-1.1.1.2,60020,123456790/
-                              1.1.1.2,60020.1312  (Contains a position)
-        </pre>
-      </section>
-    </section>
-    <section name="Replication Metrics"> 
-    Following the some useful metrics which can be used to check the replication progress:
-	<ul>
-	 <li><b>source.sizeOfLogQueue:</b> number of HLogs to process (excludes the one which is being 
-		processed) at the Replication source</li>
-    	 <li><b>source.shippedOps:</b> number of mutations shipped</li>
-   	 <li><b>source.logEditsRead:</b> number of mutations read from HLogs at the replication source</li>
-    	 <li><b>source.ageOfLastShippedOp:</b> age of last batch that was shipped by the replication source</li>
-	</ul>
-	Please note that the above metrics are at the global level at this regionserver. In 0.95.0 and onwards, these
-	 metrics are also exposed per peer level.
-    </section>
-	
-    <section name="FAQ">
-      <section name="GLOBAL means replicate? Any provision to replicate only to cluster X and not to cluster Y? or is that for later?">
-        <p>
-          Yes, this is for much later.
-        </p>
-      </section>
-      <section name="You need a bulk edit shipper? Something that allows you transfer 64MB of edits in one go?">
-        <p>
-          You can use the HBase-provided utility called CopyTable from the package
-          org.apache.hadoop.hbase.mapreduce in order to have a discp-like tool to
-          bulk copy data.
-        </p>
-      </section>
-      <section name="Is it a mistake that WALEdit doesn't carry Put and Delete objects, that we have to reinstantiate not only when replicating but when replaying edits also?">
-        <p>
-          Yes, this behavior would help a lot but it's not currently available
-          in HBase (BatchUpdate had that, but it was lost in the new API).
-        </p>
-      </section>
-      <section name="Is there an issue replicating on Hadoop 1.0/1.1 when short-circuit reads are enabled?">
-        <p>
-          Yes. See <a href="https://issues.apache.org/jira/browse/HDFS-2757">HDFS-2757</a>.
-        </p>
-      </section>
-    </section>
+    <p>This information has been moved to <a href="http://hbase.apache.org/book.html#cluster_replication">the Cluster Replication</a> section of the <a href="http://hbase.apache.org/book.html">Apache HBase Reference Guide</a>.</p>
   </body>
 </document>