Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2019/11/22 23:39:20 UTC

[GitHub] [accumulo-website] ctubbsii commented on a change in pull request #213: add erasure coding warning to quickstart, and add EC guide

ctubbsii commented on a change in pull request #213: add erasure coding warning to quickstart, and add EC guide
URL: https://github.com/apache/accumulo-website/pull/213#discussion_r349835405
 
 

 ##########
 File path: _docs-2/administration/erasure-coding.md
 ##########
 @@ -0,0 +1,190 @@
+---
+title: Erasure Coding
+category: administration
+order: 9
+---
+
+With the release of version 3.0.0, Hadoop introduced the use of
+[Erasure Coding](https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html)
+(EC) in HDFS.  By default, HDFS achieves durability via block replication.
+Usually the replication count is 3, resulting in a storage overhead of 200%.
+EC provides a more space-efficient way to achieve durability.  EC behaves much
+like RAID 5 or 6: for *k* blocks of data, *m* blocks of parity data are
+generated, from which the original data can be recovered in the event of disk
+or node failures (erasures, in EC parlance).  A typical EC scheme is
+Reed-Solomon 6-3, where 6 data blocks produce 3 parity blocks, an overhead of
+only 50%.  In addition to roughly doubling the available disk space, RS-6-3 is
+also more fault tolerant: the loss of any 3 blocks in a stripe can be
+tolerated, whereas triple replication can only sustain the loss of 2 copies of
+a block.
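+
+To make the overhead comparison concrete, here is a rough back-of-the-envelope
+calculation using the figures above (illustrative only):
+
+<pre>6 blocks of data, 3x replication:  6 data + 12 extra copies = 18 blocks stored (200% overhead)
+6 blocks of data, RS-6-3:          6 data + 3 parity        =  9 blocks stored  (50% overhead)</pre>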
+
+To use EC with Accumulo, it is highly recommended that you first rebuild Hadoop
+with support for Intel's ISA-L library.  Instructions for doing so can be found
+[here](https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Enable_Intel_ISA-L).
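+
+As a quick sanity check after rebuilding, the `hadoop checknative` command
+should report ISA-L as available (the exact output format varies by Hadoop
+version):
+
+<pre>$ hadoop checknative | grep -i isa-l</pre>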
+
+### Important Warning
+As noted 
+[here](https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Limitations),
+the current EC implementation does not support hflush() or hsync().  These
+functions are no-ops, which means that erasure-coded files are not guaranteed
+to have been written to disk after a sync or flush.  For this reason, **EC
+should never be used for the Accumulo write-ahead logs.  Data loss may, and
+most likely will, occur.**  It is also recommended that tables in the
+`accumulo` namespace (e.g., `root` and `metadata`) continue to use replication.
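+
+As a concrete example, the write-ahead log directory can be pinned to
+replication using the `hdfs ec` command described in the next section (the
+`/accumulo/wal` path below assumes the default instance volume layout; adjust
+it to match your configuration):
+
+<pre>$ hdfs ec -setPolicy -path /accumulo/wal -replicate</pre>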
+
+### HDFS ec Command
+Encoding policy in HDFS is set at the directory level, with children inheriting
+policies from their parents if not explicitly set.  The encoding policy for a directory
+can be manipulated via the `hdfs ec` command, documented
+[here](https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Administrative_commands).
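+
+For example, to check which policy currently applies to a directory (the path
+shown is illustrative):
+
+<pre>$ hdfs ec -getPolicy -path /accumulo/tables</pre>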
+
+The first step is to determine which policies are configured for your HDFS instance.
+This is done via the `-listPolicies` command.  The following listing shows that there
+are 5 configured policies, of which only 3 (RS-10-4-1024k, RS-6-3-1024k, and RS-6-3-64k)
+are enabled for use.
+
+<pre>$ hdfs ec -listPolicies
 
 Review comment:
   @etseidl You should get the same block behavior with markdown with 3 backticks, as in:
   
   ```
   Line 1. This is a long line that will continue scrolling off to the right.... off to the right.... off to the right.... off to the right.... off to the right.... off to the right.... off to the right.... off to the right.... off to the right.... off to the right.... off to the right.... off to the right.... off to the right....
   Line 2. This is a shorter line
   Line 3. This is another long line... another long line... another long line... another long line... another long line... another long line... another long line... another long line... another long line... another long line... another long line... 
   ```
