Posted to commits@hbase.apache.org by st...@apache.org on 2012/11/30 18:38:33 UTC

svn commit: r1415759 - in /hbase/trunk/src/docbkx: configuration.xml ops_mgt.xml

Author: stack
Date: Fri Nov 30 17:38:32 2012
New Revision: 1415759

URL: http://svn.apache.org/viewvc?rev=1415759&view=rev
Log:
Add note on 'bad disk'

Modified:
    hbase/trunk/src/docbkx/configuration.xml
    hbase/trunk/src/docbkx/ops_mgt.xml

Modified: hbase/trunk/src/docbkx/configuration.xml
URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/configuration.xml?rev=1415759&r1=1415758&r2=1415759&view=diff
==============================================================================
--- hbase/trunk/src/docbkx/configuration.xml (original)
+++ hbase/trunk/src/docbkx/configuration.xml Fri Nov 30 17:38:32 2012
@@ -941,6 +941,8 @@ index e70ebc6..96f8c27 100644
       </section>
 
       <section xml:id="recommended_configurations"><title>Recommended Configurations</title>
+          <section xml:id="recommended_configurations.zk">
+              <title>ZooKeeper Configuration</title>
           <section xml:id="zookeeper.session.timeout"><title><varname>zookeeper.session.timeout</varname></title>
           <para>The default timeout is three minutes (specified in milliseconds). This means
               that if a server crashes, it will be three minutes before the Master notices
@@ -967,6 +969,18 @@ index e70ebc6..96f8c27 100644
           <para>See <xref linkend="zookeeper"/>.
           </para>
       </section>
+      </section>
+      <section xml:id="recommended.configurations.hdfs">
+          <title>HDFS Configurations</title>
+          <section xml:id="dfs.datanode.failed.volumes.tolerated">
+              <title>dfs.datanode.failed.volumes.tolerated</title>
+              <para>This is the "...number of volumes that are allowed to fail before a datanode stops offering service. By default
+                  any volume failure will cause a datanode to shutdown" from the <filename>hdfs-default.xml</filename>
+                  description.  If you have more than three or four disks per machine, you might want to set this to 1; if you have
+                  many disks, set it to 2 or more (see the example <filename>hdfs-site.xml</filename> snippet below).
+              </para>
+          </section>
+      </section>
           <section xml:id="hbase.regionserver.handler.count"><title><varname>hbase.regionserver.handler.count</varname></title>
           <para>
           This setting defines the number of threads that are kept open to answer
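
For reference, a minimal example of the property described above as it might appear in
hdfs-site.xml on each datanode. The value 1 is illustrative only, not a recommendation for
every cluster; pick a value based on how many disks each machine carries:

    <property>
      <name>dfs.datanode.failed.volumes.tolerated</name>
      <value>1</value>
      <description>Number of volumes allowed to fail before this datanode
        stops offering service (illustrative value).</description>
    </property>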

Modified: hbase/trunk/src/docbkx/ops_mgt.xml
URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/ops_mgt.xml?rev=1415759&r1=1415758&r2=1415759&view=diff
==============================================================================
--- hbase/trunk/src/docbkx/ops_mgt.xml (original)
+++ hbase/trunk/src/docbkx/ops_mgt.xml Fri Nov 30 17:38:32 2012
@@ -380,6 +380,20 @@ false
             </para>
         </note>
         </para>
+        <section xml:id="bad.disk">
+            <title>Bad or Failing Disk</title>
+            <para>It is good to have <xref linkend="dfs.datanode.failed.volumes.tolerated" /> set if you have a decent number of disks
+            per machine, to cover the case where a disk plain dies.  But usually disks do the "John Wayne" -- that is, they take a while
+            to go down, spewing errors in <filename>dmesg</filename> as they do -- or for some reason run much slower than their
+            companions.  In this case you want to decommission the disk.  You have two options.  You can
+            <link xlink:href="http://wiki.apache.org/hadoop/FAQ#I_want_to_make_a_large_cluster_smaller_by_taking_out_a_bunch_of_nodes_simultaneously._How_can_this_be_done.3F">decommission the datanode</link>,
+            or, less disruptively in that only the bad disk's data will be re-replicated, you can stop the datanode,
+            unmount the bad volume (you cannot unmount a volume while the datanode is using it), and then restart the
+            datanode (presuming you have set <varname>dfs.datanode.failed.volumes.tolerated</varname> > 0); a rough sketch of this
+            stop/unmount/restart sequence follows below.  The regionserver will throw some errors in its logs as it recalibrates
+            where to get its data from -- it will likely roll its WAL too -- but in general, aside from some latency spikes, it should keep on chugging.
+            </para>
+        </section>
         </section>
         <section xml:id="rolling">
             <title>Rolling Restart</title>
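
For reference, a rough sketch of the stop/unmount/restart sequence described in the "Bad or
Failing Disk" section above. This assumes a Hadoop 1.x-style layout with hadoop-daemon.sh under
$HADOOP_HOME/bin, a hypothetical failing volume mounted at /data/3 on a hypothetical device sdd,
and dfs.datanode.failed.volumes.tolerated already set > 0 so the datanode tolerates the missing
volume on restart; paths, devices, and mount points will differ per cluster:

    # Check dmesg for the disk that is "doing the John Wayne" (hypothetical device sdd)
    dmesg | grep -i sdd

    # Stop only the datanode on this host; the regionserver keeps running
    $HADOOP_HOME/bin/hadoop-daemon.sh stop datanode

    # Unmount the bad volume -- this fails while the datanode still has files open on it
    umount /data/3

    # Restart the datanode; it comes back up minus the failed volume
    $HADOOP_HOME/bin/hadoop-daemon.sh start datanode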