Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/10/27 20:36:12 UTC

[Hadoop Wiki] Update of "HDFS-RAID" by PatrickKling

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "HDFS-RAID" page has been changed by PatrickKling.
The comment on this change is: Re-structuring.
http://wiki.apache.org/hadoop/HDFS-RAID?action=diff&rev1=1&rev2=2

--------------------------------------------------

- This HDFS RAID module implements a Distributed Raid File System. It is used alongwith
+ = HDFS RAID =
+ 
+ == Overview ==
+ 
+ The HDFS RAID module provides a DistributedRaidFileSystem (DRFS) that is
- an instance of the Hadoop Distributed File System (HDFS). It can be used to
+ used along with an instance of the Hadoop DistributedFileSystem (DFS). A file
- provide better protection against data corruption. It can also be used to
- reduce the total storage requirements of HDFS.
+ stored in the DRFS (the ''source file'') is divided into ''stripes'' consisting of several blocks. For each
+ stripe, a number of parity blocks are stored in the ''parity file'' corresponding to this source file.
+ This makes it possible to recompute blocks in the source file or parity file when they are lost or corrupted.
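+ 
+ For example, with a stripe length of 5 and a single XOR parity block per
+ stripe (illustrative numbers; the actual values depend on the configuration),
+ a 15-block source file is divided into 3 stripes and its parity file contains
+ 3 blocks, so any single lost block within a stripe can be recomputed from the
+ remaining blocks of that stripe.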
  
+ The main benefit of the DRFS is the increased protection
+ against data corruption it provides. Because of this protection,
+ replication levels can be lowered while maintaining the same availability guarantees, which
+ results in significant storage space savings.
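+ 
+ As a rough illustration (assuming one XOR parity block per stripe and a
+ stripe length of 10): lowering the source replication from 3 to 2 and
+ storing the parity blocks at replication 2 gives an effective replication of
+ 2 + 2/10 = 2.2 per source block, i.e. about 27% less storage than plain
+ triplication.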
- Distributed Raid File System consists of two main software components. The first component
- is the RaidNode, a daemon that creates parity files from specified HDFS files.
- The second component "raidfs" is a software that is layered over a HDFS client and it
- intercepts all calls that an application makes to the HDFS client. If HDFS encounters
- corrupted data while reading a file, the raidfs client detects it; it uses the
- relevant parity blocks to recover the corrupted data (if possible) and returns
- the data to the application. The application is completely transparent to the
- fact that parity data was used to satisfy it's read request.
  
+ == Architecture and implementation ==
- The primary use of this feature is to save disk space for HDFS files.
- HDFS typically stores data in triplicate.
- The Distributed Raid File System can be configured in such a way that a set of
- data blocks of a file are combined together to form one or more parity blocks.
- This allows one to reduce the replication factor of a HDFS file from 3 to 2
- while keeping the failure probabilty relatively same as before. This typically
- results in saving 25% to 30% of storage space in a HDFS cluster.
  
+ HDFS RAID consists of several software components:
  
+  * the DRFS client, which provides application access to the files in the DRFS and transparently recovers any corrupt or missing blocks encountered when reading a file,
+  * the RaidNode, a daemon that creates and maintains parity files for all data files stored in the DRFS,
+  * the BlockFixer, which periodically recomputes blocks that have been lost or corrupted,
+  * the RaidFsck utility, which allows the administrator to manually trigger the recomputation of missing or corrupt blocks and to check for files that have become irrecoverably corrupted.
  
- INSTALLING and CONFIGURING:
+ === DRFS client ===
  
+ The DRFS client is implemented as a layer on top of the DFS client that intercepts all incoming calls and passes
+ them on to the underlying client. Whenever the underlying DFS throws a ChecksumException or a BlockMissingException
+ (because the source file contains corrupt or missing blocks), the DRFS client catches these exceptions, locates
+ the parity file for the current source file and recomputes the missing blocks before returning them to the application.
+ 
+ It is important to note that while the DRFS client recomputes missing blocks when reading corrupt files, it does not
+ insert these missing blocks back into the file system. Instead, it discards them once the application request has been fulfilled.
+ The BlockFixer daemon and the RaidFsck tool can be used to persistently fix bad blocks.
+ 
+ === RaidNode ===
+ 
+ The RaidNode periodically scans all paths that the configuration designates for
+ storage in the DRFS. For each path, it recursively inspects all files that have more than 2 blocks
+ and selects those that have not been modified recently (by default, within the last 24 hours).
+ Once it has selected a source file, it iterates over all its stripes and creates the appropriate number of 
+ parity blocks for each stripe. The parity blocks are then concatenated together and stored as the parity file
+ corresponding to this source file. Once the parity file has been created, the replication factor for the corresponding
+ source file is lowered as specified in the configuration.
+ The RaidNode also periodically deletes parity files
+ that have become orphaned or outdated.
+ 
+ There are currently two implementations of the RaidNode:
+  * LocalRaidNode, which computes parity blocks locally at the RaidNode. Since computing parity blocks is a computationally expensive task, the scalability of this approach is limited.
+  * DistributedRaidNode, which dispatches MapReduce tasks to compute parity blocks.
+ 
+ === BlockFixer ===
+ 
+ (currently under development)
+ 
+ The BlockFixer is a daemon that runs at the RaidNode and periodically recomputes blocks that have been lost or corrupted.
+ 
+ === RaidFsck ===
+ 
+ (currently under development)
+ 
+ RaidFsck allows the administrator to manually trigger the recomputation of missing or corrupt
+ blocks and to check for files that have become irrecoverably corrupted.
+ 
+ == Using HDFS RAID ==
+ 
+ === Installation ===
+ 
- The entire code is packaged in the form of a single jar file hadoop-*-raid.jar.
+ The entire code is packaged in the form of a single jar file named `hadoop-*-raid.jar`.
- To use HDFS Raid, you need to put the above mentioned jar file on
+ To use HDFS RAID, you need to put the above-mentioned jar file into
- the CLASSPATH. The easiest way is to copy the hadoop-*-raid.jar
+ Hadoop's `CLASSPATH`. The easiest way to achieve this is to copy `hadoop-*-raid.jar`
- from HADOOP_HOME/build/contrib/raid to HADOOP_HOME/lib. Alternatively
+ from `$HADOOP_HOME/build/contrib/raid` to `$HADOOP_HOME/lib`. Alternatively
- you can modify HADOOP_CLASSPATH to include this jar, in conf/hadoop-env.sh.
+ you can modify `$HADOOP_CLASSPATH` (defined in `conf/hadoop-env.sh`) to include this jar.
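+ 
+ For example, a line like the following in `conf/hadoop-env.sh` does the job
+ (the jar name below is illustrative; substitute the one that matches your
+ Hadoop version):
+ {{{
+ export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HADOOP_HOME/lib/hadoop-0.21.0-raid.jar"
+ }}}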
  
+ === Configuration ===
+ 
- There is a single configuration file named raid.xml that describes the HDFS
+ There is a single configuration file named `raid.xml` that describes the HDFS
+ paths for which RAID should be used. A sample of this file can be found in
+ `src/contrib/raid/conf/raid.xml`. To apply the policies defined in `raid.xml`, 
+ a reference has to be added to `hdfs-site.xml`:
+ {{{
+ <property>
- path(s) that you want to raid. A sample of this file can be found in
- sc/contrib/raid/conf/raid.xml. Please edit the entries in this file to list the
- path(s) that you want to raid. Then, edit the hdfs-site.xml file for
- your installation to include a reference to this raid.xml. You can add the
- following to your hdfs-site.xml
-         <property>
-           <name>raid.config.file</name>
+   <name>raid.config.file</name>
-           <value>/mnt/hdfs/DFS/conf/raid.xml</value>
+   <value>/mnt/hdfs/DFS/conf/raid.xml</value>
-           <description>This is needed by the RaidNode </description>
+   <description>This is needed by the RaidNode </description>
-         </property>
+ </property>
+ }}}
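+ 
+ For reference, a policy entry in `raid.xml` looks roughly like the following
+ (a sketch modeled on the bundled sample; element and property names may vary
+ between versions, and the paths shown are placeholders):
+ {{{
+ <configuration>
+   <srcPath prefix="hdfs://namenode:9000/user/foo/warehouse">
+     <policy name = "warehouse-policy">
+       <property>
+         <name>targetReplication</name>
+         <value>2</value>
+         <description>Reduce the replication factor of the source file to
+           this value once its parity file has been created.
+         </description>
+       </property>
+       <property>
+         <name>metaReplication</name>
+         <value>2</value>
+         <description>The replication factor of the parity file.</description>
+       </property>
+       <property>
+         <name>modTimePeriod</name>
+         <value>86400000</value>
+         <description>Only raid files that have not been modified for this
+           many milliseconds (24 hours here).
+         </description>
+       </property>
+     </policy>
+   </srcPath>
+ </configuration>
+ }}}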
  
- Please add an entry to your hdfs-site.xml to enable hdfs clients to use the
- parity bits to recover corrupted data.
- 
-        <property>
+ In order to get clients to use the DRFS client (which transparently recovers corrupt or missing blocks when reading),
+ add the following configuration property to `hdfs-site.xml`:
+ {{{
+ <property>
-          <name>fs.hdfs.impl</name>
+   <name>fs.hdfs.impl</name>
-          <value>org.apache.hadoop.dfs.DistributedRaidFileSystem</value>
+   <value>org.apache.hadoop.dfs.DistributedRaidFileSystem</value>
-          <description>The FileSystem for hdfs: uris.</description>
+   <description>The FileSystem for hdfs: uris.</description>
-        </property>
+ </property>
+ }}}
  
- OPTIONAL CONFIGIURATION:
+ === Additional (optional) configuration ===
  
- The following properties can be set in hdfs-site.xml to further tune you configuration:
+ The following properties can be set in `hdfs-site.xml` to further tune the DRFS configuration:
  
-     Specifies the location where parity files are located.
+   Specify the location where parity files should be stored:
+   {{{
-         <property>
+   <property>
-           <name>hdfs.raid.locations</name>
+     <name>hdfs.raid.locations</name>
-           <value>hdfs://newdfs.data:8000/raid</value>
+     <value>hdfs://newdfs.data:8000/raid</value>
-           <description>The location for parity files. If this is
+     <description>The location for parity files. If this is
-           is not defined, then defaults to /raid.
+       not defined, it defaults to /raid.
-           </descrition>
+     </description>
-         </property>
+   </property>
+   }}}
  
-     Specify the parity stripe length
+   Specify the parity stripe length in blocks:
+   {{{
-         <property>
+   <property>
-           <name>hdfs.raid.stripeLength</name>
+     <name>hdfs.raid.stripeLength</name>
-           <value>10</value>
+     <value>10</value>
-           <description>The number of blocks in a file to be combined into
+     <description>The number of blocks in a file to be combined into
-           a single raid parity block. The default value is 5. The lower
+       a single raid parity block. The default value is 5. The lower
-           the number the greater is the disk space you will save when you
+       the number, the more disk space you save when you
-           enable raid.
+       enable raid.
-           </description>
+     </description>
-         </property>
+   </property>
+   }}}
  
-     Specify the size of HAR part-files
+   Specify the size of HAR part-files:
+   {{{
-         <property>
+   <property>
-           <name>raid.har.partfile.size</name>
+     <name>raid.har.partfile.size</name>
-           <value>4294967296</value>
+     <value>4294967296</value>
-           <description>The size of HAR part files that store raid parity
+     <description>The size of HAR part files that store raid parity
-          files. The default is 4GB. The higher the number the fewer the
+       files. The default is 4 GB. The higher the number, the fewer
-          number of files used to store the HAR archive.
+       files are used to store the HAR archive.
-           </description>
+     </description>
-         </property>
+   </property>
+   }}}
  
-     Specifies the block placement policy for raid
+   Specify the block placement policy for raid:
+   {{{
-         <property>
+   <property>
-           <name>dfs.block.replicator.classname</name>
+     <name>dfs.block.replicator.classname</name>
-           <value>
+     <value>
-             org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicyRaid
+       org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicyRaid
-           </value>
+     </value>
-           <description>The name of the class which specifies how to place
+     <description>The name of the class which specifies how to place
-             blocks in HDFS. The class BlockPlacementPolicyRaid will try to
+       blocks in HDFS. The class BlockPlacementPolicyRaid will try to
-             avoid co-located replicas of the same stripe. This will greatly
+       avoid co-located replicas of the same stripe. This will greatly
-             reduce the probability of raid file corruption.
+       reduce the probability of raid file corruption.
-           </descrition>
+     </description>
-         </property>
+   </property>
+   }}}
  
-     Specify the pool for fair scheduler
+   Specify the fair scheduler pool that should be used to run jobs dispatched by the DistributedBlockFixer:
+   {{{
-         <property>
+   <property>
-           <name>raid.mapred.fairscheduler.pool</name>
+     <name>raid.mapred.fairscheduler.pool</name>
-           <value>none</value>
+     <value>none</value>
-           <description>The name of the fair scheduler pool to use.</description>
+     <description>The name of the fair scheduler pool to use.</description>
-         </property>
+   </property>
+   }}}
  
-     Specify which implementation of RaidNode to use.
+   Specify which implementation of RaidNode to use (local or distributed):
+   {{{
-         <property>
+   <property>
-           <name>raid.classname</name>
+     <name>raid.classname</name>
-           <value>org.apache.hadoop.raid.DistRaidNode</value>
+     <value>org.apache.hadoop.raid.DistRaidNode</value>
-           <description>Specify which implementation of RaidNode to use
+     <description>Specify which implementation of RaidNode to use
-           (class name).
+       (class name).
-           </description>
+     </description>
-         </property>
+   </property>
+   }}}
  
-     Specify the periodicy at which the RaidNode re-calculates (if necessary)
+   Specify how frequently the RaidNode recalculates outdated or missing
-     the parity blocks
+   parity blocks:
+   {{{
-         <property>
+   <property>
-           <name>raid.policy.rescan.interval</name>
+     <name>raid.policy.rescan.interval</name>
-           <value>5000</value>
+     <value>5000</value>
-           <description>Specify the periodicity in milliseconds after which
+     <description>Specify the periodicity in milliseconds after which
-           all source paths are rescanned and parity blocks recomputed if
+       all source paths are rescanned and parity blocks recomputed if
-           necessary. By default, this value is 1 hour.
+       necessary. By default, this value is 1 hour.
-           </description>
+     </description>
-         </property>
+   </property>
+   }}}
  
-     By default, the DistributedRaidFileSystem assumes that the underlying file
+   By default, the DRFS assumes that the underlying file
-     system is the DistributedFileSystem. If you want to layer the DistributedRaidFileSystem
+   system is a DFS. To layer the DRFS
-     over some other file system, then define a property named fs.raid.underlyingfs.impl
+   over some other file system, define a property named `fs.raid.underlyingfs.impl`
-     that specifies the name of the underlying class. For example, if you want to layer
+   that specifies the name of the underlying class. For example, to layer
-     The DistributedRaidFileSystem over an instance of the NewFileSystem, then
+   the DRFS over an instance of `NewFileSystem`, use the following
+   property:
+   {{{
-         <property>
+   <property>
-           <name>fs.raid.underlyingfs.impl</name>
+     <name>fs.raid.underlyingfs.impl</name>
-           <value>org.apche.hadoop.new.NewFileSystem</value>
+     <value>org.apache.hadoop.new.NewFileSystem</value>
-           <description>Specify the filesystem that is layered immediately below the
+     <description>Specify the filesystem that is layered immediately below the
-           DistributedRaidFileSystem. By default, this value is DistributedFileSystem.
+       DistributedRaidFileSystem. By default, this value is DistributedFileSystem.
-           </description>
+     </description>
+   </property>
+   }}}
  
- ADMINISTRATION:
+ === Administration ===
  
- The Distributed Raid File System  provides support for administration at runtime without
+ The DRFS provides support for administration at runtime without
  any downtime to cluster services.  It is possible to add/delete new paths to be raided without
  interrupting any load on the cluster. If you change raid.xml, its contents will be
  reloaded within seconds and the new contents will take effect immediately.
@@ -164, +224 @@

  start-raidnode-remote.sh (and do the equivalent thing for stop-mapred.sh and
  stop-raidnode-remote.sh).
  
- Run fsckraid periodically (being developed as part of another JIRA). This valudates parity
+ Run RaidFsck periodically (being developed as part of another JIRA). This validates parity
  blocks of a file.
  
- IMPLEMENTATION:
  
- The RaidNode periodically scans all the specified paths in the configuration
- file. For each path, it recursively scans all files that have more than 2 blocks
- and that has not been modified during the last few hours (default is 24 hours).
- It picks the specified number of blocks (as specified by the stripe size),
- from the file, generates a parity block by combining them and
- stores the results as another HDFS file in the specified destination
- directory. There is a one-to-one mapping between a HDFS
- file and its parity file. The RaidNode also periodically finds parity files
- that are orphaned and deletes them.
  
- The Distributed Raid FileSystem is layered over a DistributedFileSystem
- instance intercepts all calls that go into HDFS. HDFS throws a ChecksumException
- or a BlocMissingException when a file read encounters bad data. The layered
- Distributed Raid FileSystem catches these exceptions, locates the corresponding
- parity file, extract the original data from the parity files and feeds the
- extracted data back to the application in a completely transparent way.
  
+ 
+ 
- The layered Distributed Raid FileSystem does not fix the data-loss that it
- encounters while serving data. It merely make the application transparently
- use the parity blocks to re-create the original data. A command line tool
- "fsckraid" is currently under development that will fix the corrupted files
- by extracting the data from the associated parity files. An adminstrator
- can run "fsckraid" manually as and when needed.