Posted to hdfs-commits@hadoop.apache.org by at...@apache.org on 2013/01/30 02:52:15 UTC
svn commit: r1440245 [1/2] - in
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src:
main/docs/src/documentation/content/xdocs/ site/apt/
Author: atm
Date: Wed Jan 30 01:52:14 2013
New Revision: 1440245
URL: http://svn.apache.org/viewvc?rev=1440245&view=rev
Log:
HADOOP-9221. Convert remaining xdocs to APT. Contributed by Andy Isaacson.
Added:
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/FaultInjectFramework.apt.vm
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsPermissionsGuide.apt.vm
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsQuotaAdminGuide.apt.vm
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsUserGuide.apt.vm
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Hftp.apt.vm
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/LibHdfs.apt.vm
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/SLGUserGuide.apt.vm
Removed:
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/SLG_user_guide.xml
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/faultinject_framework.xml
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/hdfs_editsviewer.xml
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/hdfs_imageviewer.xml
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/hdfs_permissions_guide.xml
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/hdfs_quota_admin_guide.xml
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/hdfs_user_guide.xml
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/hftp.xml
hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/libhdfs.xml
Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/FaultInjectFramework.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/FaultInjectFramework.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/FaultInjectFramework.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/FaultInjectFramework.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,312 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+ ---
+ Fault Injection Framework and Development Guide
+ ---
+ ---
+ ${maven.build.timestamp}
+
+Fault Injection Framework and Development Guide
+
+%{toc|section=1|fromDepth=0}
+
+* Introduction
+
+ This guide provides an overview of the Hadoop Fault Injection (FI)
+ framework for those who will be developing their own faults (aspects).
+
+ The idea of fault injection is fairly simple: errors and exceptions are
+ infused into an application's logic to achieve higher test coverage and
+ to exercise the fault tolerance of the system. Different implementations
+ of this idea are available today. Hadoop's FI framework is built on top
+ of Aspect-Oriented Programming (AOP), as implemented by the AspectJ toolkit.
+
+* Assumptions
+
+ The current implementation of the FI framework assumes that the faults
+ it emulates are non-deterministic in nature. That is, the moment at
+ which a fault occurs is not known in advance and is decided by a coin
+ flip.
+
+* Architecture of the Fault Injection Framework
+
+ Components layout
+
+** Configuration Management
+
+ This piece of the FI framework allows you to specify how likely faults
+ are to happen. The settings can be applied either statically (in
+ advance) or at runtime. The desired level of faults in the framework
+ can be configured in two ways (see the example below):
+
+ * editing the src/aop/fi-site.xml configuration file. This file is
+ similar to other Hadoop configuration files
+
+ * setting JVM system properties through VM startup parameters or in
+ the build.properties file
+
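+ For instance, assuming a fault named <<<hdfs.datanode.BlockReceiver>>>
+ (the fault used in the example aspect later in this guide), its level
+ could be supplied either on the ant command line or recorded in
+ build.properties (the 0.10 level and the exact property line here are
+ illustrative):
+
+----
+ % ant run-test-hdfs -Dfi.hdfs.datanode.BlockReceiver=0.10
+ % echo "fi.hdfs.datanode.BlockReceiver=0.10" >> build.properties
+----
+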
+** Probability Model
+
+ This is fundamentally a coin flipper. The methods of this class draw a
+ random number between 0.0 and 1.0 and then check whether it falls
+ between 0.0 and the level configured for the fault in question. If it
+ does, the fault occurs.
+
+ Thus, to guarantee that a fault happens, set its level to 1.0. To
+ completely prevent a fault from happening, set its probability level
+ to 0.0.
+
+ Note: The default probability level is 0 (zero) unless the level is
+ changed explicitly through the configuration file or at runtime. The
+ name of the default level's configuration parameter is fi.*
+
+** Fault Injection Mechanism: AOP and AspectJ
+
+ The foundation of Hadoop's FI framework includes a cross-cutting
+ concept implemented by AspectJ. The following basic terms are important
+ to remember:
+
+ * A cross-cutting concept (aspect) is behavior, and often data, that
+ is used across the scope of a piece of software
+
+ * In AOP, the aspects provide a mechanism by which a cross-cutting
+ concern can be specified in a modular way
+
+ * Advice is the code that is executed when an aspect is invoked
+
+ * Join point (or pointcut) is a specific point within the application
+ that may or may not invoke some advice
+
+** Existing Join Points
+
+ The following readily available join points are provided by AspectJ:
+
+ * Join when a method is called
+
+ * Join during a method's execution
+
+ * Join when a constructor is invoked
+
+ * Join during a constructor's execution
+
+ * Join during aspect advice execution
+
+ * Join before an object is initialized
+
+ * Join during object initialization
+
+ * Join during static initializer execution
+
+ * Join when a class's field is referenced
+
+ * Join when a class's field is assigned
+
+ * Join when a handler is executed
+
+* Aspect Example
+
+----
+ package org.apache.hadoop.hdfs.server.datanode;
+
+ import org.apache.commons.logging.Log;
+ import org.apache.commons.logging.LogFactory;
+ import org.apache.hadoop.fi.ProbabilityModel;
+ import org.apache.hadoop.hdfs.server.datanode.DataNode;
+ import org.apache.hadoop.util.DiskChecker.*;
+
+ import java.io.IOException;
+ import java.io.OutputStream;
+ import java.io.DataOutputStream;
+
+ /**
+ * This aspect takes care of faults injected into the
+ * datanode.BlockReceiver class
+ */
+ public aspect BlockReceiverAspects {
+ public static final Log LOG = LogFactory.getLog(BlockReceiverAspects.class);
+
+ public static final String BLOCK_RECEIVER_FAULT="hdfs.datanode.BlockReceiver";
+ pointcut callReceivePacket() : call (* OutputStream.write(..))
+ && withincode (* BlockReceiver.receivePacket(..))
+ // to further limit the application of this aspect a very narrow 'target' can be used as follows
+ // && target(DataOutputStream)
+ && !within(BlockReceiverAspects +);
+
+ before () throws IOException : callReceivePacket () {
+ if (ProbabilityModel.injectCriteria(BLOCK_RECEIVER_FAULT)) {
+ LOG.info("Before the injection point");
+ Thread.dumpStack();
+ throw new DiskOutOfSpaceException ("FI: injected fault point at " +
+ thisJoinPoint.getStaticPart( ).getSourceLocation());
+ }
+ }
+ }
+----
+
+ The aspect has two main parts:
+
+ * The join point pointcut callReceivePacket(), which serves as an
+ identification mark of a specific point (in control and/or data
+ flow) in the life of an application.
+
+ * A call to the advice - before () throws IOException :
+ callReceivePacket() - will be injected (see Putting It All
+ Together) before that specific spot of the application's code.
+
+ The pointcut identifies an invocation of the java.io.OutputStream
+ class's write() method with any number of parameters and any return
+ type. This invocation should take place within the body of the
+ receivePacket() method of the BlockReceiver class. The method can have
+ any parameters and any return type. Invocations of the write() method
+ happening anywhere within the aspect BlockReceiverAspects or its
+ descendants will be ignored.
+
+ Note 1: This short example doesn't illustrate the fact that you can
+ have more than a single injection point per class. In such a case the
+ names of the faults have to be different if a developer wants to
+ trigger them separately.
+
+ Note 2: After the injection step (see Putting It All Together) you can
+ verify that the faults were properly injected by searching for ajc
+ keywords in a disassembled class file.
+
+* Fault Naming Convention and Namespaces
+
+ For the sake of a unified naming convention, the following two types of
+ names are recommended when developing new aspects:
+
+ * Activity-specific notation (when we don't care about the particular
+ location where a fault happens). In this case the name of the
+ fault is rather abstract: fi.hdfs.DiskError
+
+ * Location-specific notation. Here, the fault's name is mnemonic, as
+ in: fi.hdfs.datanode.BlockReceiver[optional location details]
+
+* Development Tools
+
+ * The Eclipse AspectJ Development Toolkit may help you when
+ developing aspects
+
+ * IntelliJ IDEA provides AspectJ weaver and Spring-AOP plugins
+
+* Putting It All Together
+
+ Faults (aspects) have to be injected (or woven) into the code before
+ they can be used. Follow these instructions:
+
+ * To weave aspects in place, use:
+
+----
+ % ant injectfaults
+----
+
+ * If you misidentified the join point of your aspect, you will see a
+ warning (similar to the one shown here) when the 'injectfaults' target
+ completes:
+
+----
+ [iajc] warning at
+ src/test/aop/org/apache/hadoop/hdfs/server/datanode/ \
+ BlockReceiverAspects.aj:44::0
+ advice defined in org.apache.hadoop.hdfs.server.datanode.BlockReceiverAspects
+ has not been applied [Xlint:adviceDidNotMatch]
+----
+
+ * It isn't an error, so the build will still report success. To
+ prepare a dev.jar file with all your faults woven in place
+ (HDFS-475 pending), use:
+
+----
+ % ant jar-fault-inject
+----
+
+ * To create test jars use:
+
+----
+ % ant jar-test-fault-inject
+----
+
+ * To run HDFS tests with faults injected use:
+
+----
+ % ant run-test-hdfs-fault-inject
+----
+
+** How to Use the Fault Injection Framework
+
+ Faults can be triggered as follows:
+
+ * During runtime:
+
+----
+ % ant run-test-hdfs -Dfi.hdfs.datanode.BlockReceiver=0.12
+----
+
+ To set a certain level, for example 25%, for all injected faults,
+ use:
+
+----
+ % ant run-test-hdfs-fault-inject -Dfi.*=0.25
+----
+
+ * From a program:
+
+----
+ package org.apache.hadoop.fs;
+
+ import org.junit.After;
+ import org.junit.Before;
+ import org.junit.Test;
+
+ public class DemoFiTest {
+   public static final String BLOCK_RECEIVER_FAULT = "hdfs.datanode.BlockReceiver";
+
+   @Before
+   public void setUp() {
+     // Set up the test's environment as required
+   }
+
+   @Test
+   public void testFI() {
+     // Trigger the fault, assuming that there's one called 'hdfs.datanode.BlockReceiver'
+     System.setProperty("fi." + BLOCK_RECEIVER_FAULT, "0.12");
+     //
+     // The main logic of your test goes here
+     //
+     // Now set the level back to 0 (zero) to prevent this fault from happening again
+     System.setProperty("fi." + BLOCK_RECEIVER_FAULT, "0.0");
+     // or delete its trigger completely
+     System.getProperties().remove("fi." + BLOCK_RECEIVER_FAULT);
+   }
+
+   @After
+   public void tearDown() {
+     // Clean up the test environment
+   }
+ }
+----
+
+ As you can see above, these two methods do the same thing: they set the
+ probability level of <<<hdfs.datanode.BlockReceiver>>> to 12%. The
+ difference, however, is that the programmatic approach provides more
+ flexibility and allows you to turn a fault off when a test no longer
+ needs it.
+
+* Additional Information and Contacts
+
+ These two sources of information are particularly interesting and worth
+ reading:
+
+ * {{http://www.eclipse.org/aspectj/doc/next/devguide/}}
+
+ * AspectJ Cookbook (ISBN-13: 978-0-596-00654-9)
+
+ If you have additional comments or questions for the author check
+ {{{https://issues.apache.org/jira/browse/HDFS-435}HDFS-435}}.
Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,106 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+
+ ---
+ Offline Edits Viewer Guide
+ ---
+ Erik Steffl
+ ---
+ ${maven.build.timestamp}
+
+Offline Edits Viewer Guide
+
+ \[ {{{./index.html}Go Back}} \]
+
+%{toc|section=1|fromDepth=0}
+
+* Overview
+
+ Offline Edits Viewer is a tool to parse the edits log file. The current
+ processors are mostly useful for conversion between different formats,
+ including XML, which is human readable and easier to edit than the
+ native binary format.
+
+ The tool can parse edits format -18 (roughly Hadoop 0.19) and later.
+ The tool operates on files only; it does not need a Hadoop cluster to
+ be running.
+
+ Input formats supported:
+
+ [[1]] <<binary>>: native binary format that Hadoop uses internally
+
+ [[2]] <<xml>>: XML format, as produced by xml processor, used if filename
+ has <<<.xml>>> (case insensitive) extension
+
+ The Offline Edits Viewer provides several output processors (unless
+ stated otherwise, the output of a processor can be converted back to
+ the original edits file):
+
+ [[1]] <<binary>>: native binary format that Hadoop uses internally
+
+ [[2]] <<xml>>: XML format
+
+ [[3]] <<stats>>: prints out statistics; this output cannot be converted
+ back to an edits file
+
+* Usage
+
+----
+ bash$ bin/hdfs oev -i edits -o edits.xml
+----
+
+*-----------------------:-----------------------------------+
+| <<Flag>> | <<Description>> |
+*-----------------------:-----------------------------------+
+|[<<<-i>>> ; <<<--inputFile>>>] <input file> | Specify the input edits log file to
+| | process. An .xml (case insensitive) extension means XML format; otherwise
+| | the binary format is assumed. Required.
+*-----------------------:-----------------------------------+
+|[<<<-o>>> ; <<<--outputFile>>>] <output file> | Specify the output filename, if the
+| | specified output processor generates one. If the specified file already
+| | exists, it is silently overwritten. Required.
+*-----------------------:-----------------------------------+
+|[<<<-p>>> ; <<<--processor>>>] <processor> | Specify the processor to apply
+| | against the edits file. Currently valid options are
+| | <<<binary>>>, <<<xml>>> (default) and <<<stats>>>.
+*-----------------------:-----------------------------------+
+|[<<<-v>>> ; <<<--verbose>>>] | Print the input and output filenames and pipe output of
+| | the processor to the console as well as the specified file. On extremely large
+| | files, this may increase processing time by an order of magnitude.
+*-----------------------:-----------------------------------+
+|[<<<-h>>> ; <<<--help>>>] | Display the tool usage and help information and exit.
+*-----------------------:-----------------------------------+
+
+* Case study: Hadoop cluster recovery
+
+ If there is a problem with the Hadoop cluster and the edits file is
+ corrupted, it is possible to save at least the part of the edits file
+ that is correct. This can be done by converting the binary edits to
+ XML, editing it manually, and then converting it back to binary. The
+ most common problem is that the edits file is missing the closing
+ record (the record that has opCode -1). This should be recognized by
+ the tool, and the XML format should be properly closed.
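+
+ A possible recovery session (the file names here are illustrative)
+ converts the binary edits to XML, repairs the XML by hand, and then
+ converts it back:
+
+----
+ bash$ bin/hdfs oev -i edits -o edits.xml -p xml
+ bash$ vi edits.xml      # fix or remove the damaged records by hand
+ bash$ bin/hdfs oev -i edits.xml -o edits.new -p binary
+----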
+
+ If there is no closing record in the XML file, you can add one after
+ the last correct record. Anything after the record with opCode -1 is
+ ignored.
+
+ Example of a closing record (with opCode -1):
+
++----
+ <RECORD>
+ <OPCODE>-1</OPCODE>
+ <DATA>
+ </DATA>
+ </RECORD>
++----
Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,418 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+ ---
+ Offline Image Viewer Guide
+ ---
+ ---
+ ${maven.build.timestamp}
+
+Offline Image Viewer Guide
+
+ \[ {{{./index.html}Go Back}} \]
+
+%{toc|section=1|fromDepth=0}
+
+* Overview
+
+ The Offline Image Viewer is a tool to dump the contents of HDFS fsimage
+ files to human-readable formats in order to allow offline analysis and
+ examination of a Hadoop cluster's namespace. The tool is able to
+ process very large image files relatively quickly, converting them to
+ one of several output formats. The tool handles the layout formats that
+ were included with Hadoop versions 16 and up. If the tool is not able
+ to process an image file, it will exit cleanly. The Offline Image
+ Viewer does not require a Hadoop cluster to be running; it is entirely
+ offline in its operation.
+
+ The Offline Image Viewer provides several output processors:
+
+ [[1]] Ls is the default output processor. It closely mimics the format of
+ the lsr command. It includes the same fields, in the same order, as
+ lsr: directory or file flag, permissions, replication, owner,
+ group, file size, modification date, and full path. Unlike the lsr
+ command, the root path is included. One important difference
+ between the output of the lsr command and this processor is that this
+ output is not sorted by directory name and contents. Rather, the
+ files are listed in the order in which they are stored in the
+ fsimage file. Therefore, it is not possible to directly compare the
+ output of the lsr command and this tool. The Ls processor uses
+ information contained within the Inode blocks to calculate file
+ sizes and ignores the -skipBlocks option.
+
+ [[2]] Indented provides a more complete view of the fsimage's contents,
+ including all of the information included in the image, such as
+ image version, generation stamp and inode- and block-specific
+ listings. This processor uses indentation to organize the output
+ in a hierarchical manner. The lsr format is suitable for easy human
+ comprehension.
+
+ [[3]] Delimited outputs one line per file, consisting of the path,
+ replication, modification time, access time, block size, number of
+ blocks, file size, namespace quota, diskspace quota, permissions,
+ username and group name. If run against an fsimage that does not
+ contain any of these fields, the field's column will be included,
+ but no data recorded. The default record delimiter is a tab, but
+ this may be changed via the -delimiter command line argument. This
+ processor is designed to create output that is easily analyzed by
+ other tools, such as Apache Pig. See the Analyzing Results
+ section for further information on using this processor to analyze
+ the contents of fsimage files.
+
+ [[4]] XML creates an XML document of the fsimage and includes all of the
+ information within the fsimage, similar to the lsr processor. The
+ output of this processor is amenable to automated processing and
+ analysis with XML tools. Due to the verbosity of the XML syntax,
+ this processor will also generate the largest amount of output.
+
+ [[5]] FileDistribution is the tool for analyzing file sizes in the
+ namespace image. In order to run the tool one should define a range
+ of integers [0, maxSize] by specifying maxSize and a step. The
+ range of integers is divided into segments of size step: [0, s[1],
+ ..., s[n-1], maxSize], and the processor calculates how many files
+ in the system fall into each segment [s[i-1], s[i]). Note that
+ files larger than maxSize always fall into the very last segment.
+ The output file is formatted as a tab-separated two-column table:
+ Size and NumFiles, where Size represents the start of the segment
+ and NumFiles is the number of files from the image whose size falls
+ into this segment. Example invocations of the Delimited and
+ FileDistribution processors follow this list.
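+
+ For example, the Delimited and FileDistribution processors might be
+ invoked as follows (the <<<-maxSize>>> and <<<-step>>> option names are
+ assumptions here; check <<<hdfs oiv --help>>> for the exact flags in
+ your release):
+
+----
+ bash$ bin/hdfs oiv -i fsimage -o fsimage.tsv -p Delimited -delimiter ','
+ bash$ bin/hdfs oiv -i fsimage -o sizes.tsv -p FileDistribution -maxSize 134217728 -step 1048576
+----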
+
+* Usage
+
+** Basic
+
+ The simplest usage of the Offline Image Viewer is to provide just an
+ input and output file, via the -i and -o command-line switches:
+
+----
+ bash$ bin/hdfs oiv -i fsimage -o fsimage.txt
+----
+
+ This will create a file named fsimage.txt in the current directory
+ using the Ls output processor. For very large image files, this process
+ may take several minutes.
+
+ One can specify which output processor to use via the command-line
+ switch -p. For instance:
+
+----
+ bash$ bin/hdfs oiv -i fsimage -o fsimage.xml -p XML
+----
+
+ or
+
+----
+ bash$ bin/hdfs oiv -i fsimage -o fsimage.txt -p Indented
+----
+
+ This will run the tool using either the XML or Indented output
+ processor, respectively.
+
+ One command-line option worth considering is -skipBlocks, which
+ prevents the tool from explicitly enumerating all of the blocks that
+ make up a file in the namespace. This is useful for file systems that
+ have very large files. Enabling this option can significantly decrease
+ the size of the resulting output, as individual blocks are not
+ included. Note, however, that the Ls processor needs to enumerate the
+ blocks and so overrides this option.
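+
+ For example, to dump an image with the Indented processor while
+ skipping per-block listings (the file names are illustrative):
+
+----
+ bash$ bin/hdfs oiv -i fsimage -o fsimage.txt -p Indented -skipBlocks
+----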
+
+** Example
+
+ Consider the following contrived namespace:
+
+----
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 21:17 /anotherDir
+ -rw-r--r-- 3 theuser supergroup 286631664 2009-03-16 21:15 /anotherDir/biggerfile
+ -rw-r--r-- 3 theuser supergroup 8754 2009-03-16 21:17 /anotherDir/smallFile
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 21:11 /mapredsystem
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 21:11 /mapredsystem/theuser
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem
+ drwx-wx-wx - theuser supergroup 0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 21:12 /one
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 21:12 /one/two
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 21:16 /user
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 21:19 /user/theuser
+----
+
+ Applying the Offline Image Viewer against this file with default
+ options would result in the following output:
+
+----
+ machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -o fsimage.txt
+
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 14:16 /
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 14:17 /anotherDir
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 14:11 /mapredsystem
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 14:12 /one
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 14:16 /user
+ -rw-r--r-- 3 theuser supergroup 286631664 2009-03-16 14:15 /anotherDir/biggerfile
+ -rw-r--r-- 3 theuser supergroup 8754 2009-03-16 14:17 /anotherDir/smallFile
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 14:11 /mapredsystem/theuser
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem
+ drwx-wx-wx - theuser supergroup 0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 14:12 /one/two
+ drwxr-xr-x - theuser supergroup 0 2009-03-16 14:19 /user/theuser
+----
+
+ Similarly, applying the Indented processor would generate output that
+ begins with:
+
+----
+ machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -p Indented -o fsimage.txt
+
+ FSImage
+ ImageVersion = -19
+ NamespaceID = 2109123098
+ GenerationStamp = 1003
+ INodes [NumInodes = 12]
+ Inode
+ INodePath =
+ Replication = 0
+ ModificationTime = 2009-03-16 14:16
+ AccessTime = 1969-12-31 16:00
+ BlockSize = 0
+ Blocks [NumBlocks = -1]
+ NSQuota = 2147483647
+ DSQuota = -1
+ Permissions
+ Username = theuser
+ GroupName = supergroup
+ PermString = rwxr-xr-x
+ ...remaining output omitted...
+----
+
+* Options
+
+*-----------------------:-----------------------------------+
+| <<Flag>> | <<Description>> |
+*-----------------------:-----------------------------------+
+| <<<-i>>>\|<<<--inputFile>>> <input file> | Specify the input fsimage file to
+| | process. Required.
+*-----------------------:-----------------------------------+
+| <<<-o>>>\|<<<--outputFile>>> <output file> | Specify the output filename, if the
+| | specified output processor generates one. If the specified file already
+| | exists, it is silently overwritten. Required.
+*-----------------------:-----------------------------------+
+| <<<-p>>>\|<<<--processor>>> <processor> | Specify the image processor to apply
+| | against the image file. Currently valid options are Ls (default), XML
+| | and Indented.
+*-----------------------:-----------------------------------+
+| <<<-skipBlocks>>> | Do not enumerate individual blocks within files. This may
+| | save processing time and output file space on namespaces with very
+| | large files. The Ls processor reads the blocks to correctly determine
+| | file sizes and ignores this option.
+*-----------------------:-----------------------------------+
+| <<<-printToScreen>>> | Pipe output of processor to console as well as specified
+| | file. On extremely large namespaces, this may increase processing time
+| | by an order of magnitude.
+*-----------------------:-----------------------------------+
+| <<<-delimiter>>> <arg>| When used in conjunction with the Delimited processor,
+| | replaces the default tab delimiter with the string specified by arg.
+*-----------------------:-----------------------------------+
+| <<<-h>>>\|<<<--help>>>| Display the tool usage and help information and exit.
+*-----------------------:-----------------------------------+
+
+* Analyzing Results
+
+ The Offline Image Viewer makes it easy to gather large amounts of data
+ about the HDFS namespace. This information can then be used to explore
+ file system usage patterns or find specific files that match arbitrary
+ criteria, along with other types of namespace analysis. The Delimited
+ image processor in particular creates output that is amenable to
+ further processing by tools such as Apache Pig. Pig is a particularly
+ good choice for analyzing these data, as it is able to deal with the
+ output generated from a small fsimage but also scales up to consume
+ data from extremely large file systems.
+
+ The Delimited image processor generates lines of text separated, by
+ default, by tabs and includes all of the fields that are common between
+ completed files and files that were still under construction when the
+ fsimage was generated. Example scripts are provided demonstrating how
+ to use this output to accomplish three tasks: determine the number of
+ files each user has created on the file system, find files that were
+ created but never accessed, and find probable duplicates of large files
+ by comparing the size of each file.
+
+ Each of the following scripts assumes you have generated an output file
+ using the Delimited processor named foo and will be storing the results
+ of the Pig analysis in a file named results.
+
+** Total Number of Files for Each User
+
+ This script processes each path within the namespace, groups them by
+ the file owner and determines the total number of files each user owns.
+
+----
+ numFilesOfEachUser.pig:
+ -- This script determines the total number of files each user has in
+ -- the namespace. Its output is of the form:
+ -- username, totalNumFiles
+
+ -- Load all of the fields from the file
+ A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
+ replication:int,
+ modTime:chararray,
+ accessTime:chararray,
+ blockSize:long,
+ numBlocks:int,
+ fileSize:long,
+ NamespaceQuota:int,
+ DiskspaceQuota:int,
+ perms:chararray,
+ username:chararray,
+ groupname:chararray);
+
+
+ -- Grab just the path and username
+ B = FOREACH A GENERATE path, username;
+
+ -- Generate the sum of the number of paths for each user
+ C = FOREACH (GROUP B BY username) GENERATE group, COUNT(B.path);
+
+ -- Save results
+ STORE C INTO '$outputFile';
+----
+
+ This script can be run with Pig using the following command:
+
+----
+ bin/pig -x local -param inputFile=../foo -param outputFile=../results ../numFilesOfEachUser.pig
+----
+
+ The output file's content will be similar to that below:
+
+----
+ bart 1
+ lisa 16
+ homer 28
+ marge 2456
+----
+
+** Files That Have Never Been Accessed
+
+ This script finds files that were created but whose access times were
+ never changed, meaning they were never opened or viewed.
+
+----
+ neverAccessed.pig:
+ -- This script generates a list of files that were created but never
+ -- accessed, based on their AccessTime
+
+ -- Load all of the fields from the file
+ A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
+ replication:int,
+ modTime:chararray,
+ accessTime:chararray,
+ blockSize:long,
+ numBlocks:int,
+ fileSize:long,
+ NamespaceQuota:int,
+ DiskspaceQuota:int,
+ perms:chararray,
+ username:chararray,
+ groupname:chararray);
+
+ -- Grab just the path and last time the file was accessed
+ B = FOREACH A GENERATE path, accessTime;
+
+ -- Drop all the paths that don't have the default assigned last-access time
+ C = FILTER B BY accessTime == '1969-12-31 16:00';
+
+ -- Drop the accessTimes, since they're all the same
+ D = FOREACH C GENERATE path;
+
+ -- Save results
+ STORE D INTO '$outputFile';
+----
+
+ This script can be run with Pig using the following command; the
+ output file's content will be a list of files that were created but
+ never viewed afterwards.
+
+----
+ bin/pig -x local -param inputFile=../foo -param outputFile=../results ../neverAccessed.pig
+----
+
+** Probable Duplicated Files Based on File Size
+
+ This script groups files together based on their size, drops any that
+ are less than 100 MB, and returns a list of the file size, number of
+ files found and a tuple of the file paths. This can be used to find
+ likely duplicates within the filesystem namespace.
+
+----
+ probableDuplicates.pig:
+ -- This script finds probable duplicate files greater than 100 MB by
+ -- grouping together files based on their byte size. Files of this size
+ -- with exactly the same number of bytes can be considered probable
+ -- duplicates, but should be checked further, either by comparing the
+ -- contents directly or by another proxy, such as a hash of the contents.
+ -- The scripts output is of the type:
+ -- fileSize numProbableDuplicates {(probableDup1), (probableDup2)}
+
+ -- Load all of the fields from the file
+ A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
+ replication:int,
+ modTime:chararray,
+ accessTime:chararray,
+ blockSize:long,
+ numBlocks:int,
+ fileSize:long,
+ NamespaceQuota:int,
+ DiskspaceQuota:int,
+ perms:chararray,
+ username:chararray,
+ groupname:chararray);
+
+ -- Grab the pathname and filesize
+ B = FOREACH A generate path, fileSize;
+
+ -- Drop files smaller than 100 MB
+ C = FILTER B by fileSize > 100L * 1024L * 1024L;
+
+ -- Gather all the files of the same byte size
+ D = GROUP C by fileSize;
+
+ -- Generate path, num of duplicates, list of duplicates
+ E = FOREACH D generate group AS fileSize, COUNT(C) as numDupes, C.path AS files;
+
+ -- Drop all the files where there are only one of them
+ F = FILTER E by numDupes > 1L;
+
+ -- Sort by the size of the files
+ G = ORDER F by fileSize;
+
+ -- Save results
+ STORE G INTO '$outputFile';
+----
+
+ This script can be run with Pig using the following command:
+
+----
+ bin/pig -x local -param inputFile=../foo -param outputFile=../results ../probableDuplicates.pig
+----
+
+ The output file's content will be similar to that below:
+
+----
+ 1077288632 2 {(/user/tennant/work1/part-00501),(/user/tennant/work1/part-00993)}
+ 1077288664 4 {(/user/tennant/work0/part-00567),(/user/tennant/work0/part-03980),(/user/tennant/work1/part-00725),(/user/eccelston/output/part-03395)}
+ 1077288668 3 {(/user/tennant/work0/part-03705),(/user/tennant/work0/part-04242),(/user/tennant/work1/part-03839)}
+ 1077288698 2 {(/user/tennant/work0/part-00435),(/user/eccelston/output/part-01382)}
+ 1077288702 2 {(/user/tennant/work0/part-03864),(/user/eccelston/output/part-03234)}
+----
+
+ Each line includes the file size in bytes that was found to be
+ duplicated, the number of duplicates found, and a list of the
+ duplicated paths. Files less than 100MB are ignored, providing a
+ reasonable likelihood that files of these exact sizes may be
+ duplicates.
Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsPermissionsGuide.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsPermissionsGuide.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsPermissionsGuide.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsPermissionsGuide.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,257 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+ ---
+ HDFS Permissions Guide
+ ---
+ ---
+ ${maven.build.timestamp}
+
+HDFS Permissions Guide
+
+ \[ {{{./index.html}Go Back}} \]
+
+%{toc|section=1|fromDepth=0}
+
+* Overview
+
+ The Hadoop Distributed File System (HDFS) implements a permissions
+ model for files and directories that shares much of the POSIX model.
+ Each file and directory is associated with an owner and a group. The
+ file or directory has separate permissions for the user that is the
+ owner, for other users that are members of the group, and for all other
+ users. For files, the r permission is required to read the file, and
+ the w permission is required to write or append to the file. For
+ directories, the r permission is required to list the contents of the
+ directory, the w permission is required to create or delete files or
+ directories, and the x permission is required to access a child of the
+ directory.
+
+ In contrast to the POSIX model, there are no setuid or setgid bits for
+ files, as there is no notion of executable files. For directories, there
+ are no setuid or setgid bits either, as a simplification. The sticky
+ bit can be set on directories, preventing anyone except the superuser,
+ directory owner or file owner from deleting or moving the files within
+ the directory. Setting the sticky bit for a file has no effect.
+ Collectively, the permissions of a file or directory are its mode. In
+ general, Unix customs for representing and displaying modes will be
+ used, including the use of octal numbers in this description. When a
+ file or directory is created, its owner is the user identity of the
+ client process, and its group is the group of the parent directory (the
+ BSD rule).
+
+ Each client process that accesses HDFS has a two-part identity composed
+ of the user name and the groups list. Whenever HDFS must do a permissions
+ check for a file or directory foo accessed by a client process,
+
+ * If the user name matches the owner of foo, then the owner
+ permissions are tested;
+ * Else if the group of foo matches any member of the groups list,
+ then the group permissions are tested;
+ * Otherwise the other permissions of foo are tested.
+
+ If a permissions check fails, the client operation fails.
+
+* User Identity
+
+ As of Hadoop 0.22, Hadoop supports two different modes of operation to
+ determine the user's identity, specified by the
+ hadoop.security.authentication property:
+
+ * <<simple>>
+
+ In this mode of operation, the identity of a client process is
+ determined by the host operating system. On Unix-like systems,
+ the user name is the equivalent of `whoami`.
+
+ * <<kerberos>>
+
+ In Kerberized operation, the identity of a client process is
+ determined by its Kerberos credentials. For example, in a
+ Kerberized environment, a user may use the kinit utility to
+ obtain a Kerberos ticket-granting-ticket (TGT) and use klist to
+ determine their current principal. When mapping a Kerberos
+ principal to an HDFS username, all components except for the
+ primary are dropped. For example, a principal
+ todd/foobar@CORP.COMPANY.COM will act as the simple username
+ todd on HDFS.
+
+ Regardless of the mode of operation, the user identity mechanism is
+ extrinsic to HDFS itself. There is no provision within HDFS for
+ creating user identities, establishing groups, or processing user
+ credentials.
+
+* Group Mapping
+
+ Once a username has been determined as described above, the list of
+ groups is determined by a group mapping service, configured by the
+ hadoop.security.group.mapping property. The default implementation,
+ org.apache.hadoop.security.ShellBasedUnixGroupsMapping, will shell out
+ to the Unix bash -c groups command to resolve a list of groups for a
+ user.
+
+ An alternate implementation, which connects directly to an LDAP server
+ to resolve the list of groups, is available via
+ org.apache.hadoop.security.LdapGroupsMapping. However, this provider
+ should only be used if the required groups reside exclusively in LDAP,
+ and are not materialized on the Unix servers. More information on
+ configuring the group mapping service is available in the Javadocs.
+
+ For HDFS, the mapping of users to groups is performed on the NameNode.
+ Thus, the host system configuration of the NameNode determines the
+ group mappings for the users.
+
+ Note that HDFS stores the user and group of a file or directory as
+ strings; there is no conversion from user and group identity numbers as
+ is conventional in Unix.
+
+* Understanding the Implementation
+
+ Each file or directory operation passes the full path name to the name
+ node, and the permissions checks are applied along the path for each
+ operation. The client framework will implicitly associate the user
+ identity with the connection to the name node, reducing the need for
+ changes to the existing client API. It has always been the case that
+ when one operation on a file succeeds, the operation might fail when
+ repeated because the file, or some directory on the path, no longer
+ exists. For instance, when the client first begins reading a file, it
+ makes a first request to the name node to discover the location of the
+ first blocks of the file. A second request made to find additional
+ blocks may fail. On the other hand, deleting a file does not revoke
+ access by a client that already knows the blocks of the file. With the
+ addition of permissions, a client's access to a file may be withdrawn
+ between requests. Again, changing permissions does not revoke the
+ access of a client that already knows the file's blocks.
+
+* Changes to the File System API
+
+ All methods that use a path parameter will throw <<<AccessControlException>>>
+ if permission checking fails.
+
+ New methods:
+
+ * <<<public FSDataOutputStream create(Path f, FsPermission permission,
+ boolean overwrite, int bufferSize, short replication, long
+ blockSize, Progressable progress) throws IOException;>>>
+
+ * <<<public boolean mkdirs(Path f, FsPermission permission) throws
+ IOException;>>>
+
+ * <<<public void setPermission(Path p, FsPermission permission) throws
+ IOException;>>>
+
+ * <<<public void setOwner(Path p, String username, String groupname)
+ throws IOException;>>>
+
+ * <<<public FileStatus getFileStatus(Path f) throws IOException;>>>
+
+ will additionally return the user, group and mode associated with the
+ path.
+
+ The mode of a new file or directory is restricted by the umask set as a
+ configuration parameter. When the existing <<<create(path, ...)>>> method
+ (without the permission parameter) is used, the mode of the new file is
+ <<<0666 & ^umask>>>. When the new <<<create(path, permission, ...)>>> method
+ (with the permission parameter P) is used, the mode of the new file is
+ <<<P & ^umask & 0666>>>. When a new directory is created with the existing
+ <<<mkdirs(path)>>>
+ method (without the permission parameter), the mode of the new
+ directory is <<<0777 & ^umask>>>. When the new <<<mkdirs(path, permission)>>>
+ method (with the permission parameter P) is used, the mode of the new
+ directory is <<<P & ^umask & 0777>>>.
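+
+ As a quick sanity check of this arithmetic (plain shell octal math
+ with the default umask of 022, not an HDFS command):
+
+----
+ bash$ printf '%o\n' $(( 0666 & ~0022 ))   # new file mode: 644
+ bash$ printf '%o\n' $(( 0777 & ~0022 ))   # new directory mode: 755
+----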
+
+* Changes to the Application Shell
+
+ New operations:
+
+ * <<<chmod [-R] mode file ...>>>
+
+ Only the owner of a file or the super-user is permitted to change
+ the mode of a file.
+
+ * <<<chgrp [-R] group file ...>>>
+
+ The user invoking chgrp must belong to the specified group and be
+ the owner of the file, or be the super-user.
+
+ * <<<chown [-R] [owner][:[group]] file ...>>>
+
+ The owner of a file may only be altered by a super-user.
+
+ * <<<ls file ...>>>
+
+ * <<<lsr file ...>>>
+
+ The output is reformatted to display the owner, group and mode.
+
+* The Super-User
+
+ The super-user is the user with the same identity as the name node
+ process itself. Loosely, if you started the name node, then you are the
+ super-user. The super-user can do anything, in that permissions checks
+ never fail for the super-user. There is no persistent notion of who was
+ the super-user; when the name node is started, the process identity
+ determines who is the super-user for now. The HDFS super-user does not
+ have to be the super-user of the name node host, nor is it necessary
+ that all clusters have the same super-user. Also, an experimenter
+ running HDFS on a personal workstation conveniently becomes that
+ installation's super-user without any configuration.
+
+ In addition, the administrator may identify a distinguished group using
+ a configuration parameter. If set, members of this group are also
+ super-users.
+
+* The Web Server
+
+ By default, the identity of the web server is a configuration
+ parameter. That is, the name node has no notion of the identity of the
+ real user, but the web server behaves as if it has the identity (user
+ and groups) of a user chosen by the administrator. Unless the chosen
+ identity matches the super-user, parts of the name space may be
+ inaccessible to the web server.
+
+* Configuration Parameters
+
+ * <<<dfs.permissions = true>>>
+
+ If true, use the permissions system as described here. If false,
+ permission checking is turned off, but all other behavior is
+ unchanged. Switching from one parameter value to the other does not
+ change the mode, owner or group of files or directories.
+ Regardless of whether permissions are on or off, chmod, chgrp and
+ chown always check permissions. These functions are only useful in
+ the permissions context, and so there is no backwards compatibility
+ issue. Furthermore, this allows administrators to reliably set
+ owners and permissions in advance of turning on regular permissions
+ checking.
+
+ * <<<dfs.web.ugi = webuser,webgroup>>>
+
+ The user name to be used by the web server. Setting this to the
+ name of the super-user allows any web client to see everything.
+ Changing this to an otherwise unused identity allows web clients to
+ see only those things visible using "other" permissions. Additional
+ groups may be added to the comma-separated list.
+
+ * <<<dfs.permissions.superusergroup = supergroup>>>
+
+ The name of the group of super-users.
+
+ * <<<fs.permissions.umask-mode = 0022>>>
+
+ The umask used when creating files and directories. For
+ configuration files, the decimal value 18 may be used.
+
+ * <<<dfs.cluster.administrators = ACL-for-admins>>>
+
+ The administrators for the cluster specified as an ACL. This
+ controls who can access the default servlets, etc. in the HDFS.
Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsQuotaAdminGuide.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsQuotaAdminGuide.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsQuotaAdminGuide.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsQuotaAdminGuide.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,118 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+ ---
+ HDFS Quotas Guide
+ ---
+ ---
+ ${maven.build.timestamp}
+
+HDFS Quotas Guide
+
+ \[ {{{./index.html}Go Back}} \]
+
+%{toc|section=1|fromDepth=0}
+
+* Overview
+
+ The Hadoop Distributed File System (HDFS) allows the administrator to
+ set quotas for the number of names used and the amount of space used
+ for individual directories. Name quotas and space quotas operate
+ independently, but the administration and implementation of the two
+ types of quotas are closely parallel.
+
+* Name Quotas
+
+ The name quota is a hard limit on the number of file and directory
+ names in the tree rooted at that directory. File and directory
+ creations fail if the quota would be exceeded. Quotas stick with
+ renamed directories; the rename operation fails if the operation would
+ result in a quota violation. The attempt to set a quota will still
+ succeed even if the directory would be in violation of the new quota. A
+ newly created directory has no associated quota. The largest quota is
+ Long.Max_Value. A quota of one forces a directory to remain empty.
+ (Yes, a directory counts against its own quota!)
+
+ Quotas are persistent with the fsimage. When starting, if the fsimage
+ is immediately in violation of a quota (perhaps the fsimage was
+ surreptitiously modified), a warning is printed for each of such
+ violations. Setting or removing a quota creates a journal entry.
+
+* Space Quotas
+
+ The space quota is a hard limit on the number of bytes used by files in
+ the tree rooted at that directory. Block allocations fail if the quota
+ would not allow a full block to be written. Each replica of a block
+ counts against the quota. Quotas stick with renamed directories; the
+ rename operation fails if the operation would result in a quota
+ violation. A newly created directory has no associated quota. The
+ largest quota is <<<Long.Max_Value>>>. A quota of zero still permits files
+ to be created, but no blocks can be added to the files. Directories don't
+ use host file system space and don't count against the space quota. The
+ host file system space used to save the file meta data is not counted
+ against the quota. Quotas are charged at the intended replication
+ factor for the file; changing the replication factor for a file will
+ credit or debit quotas.
+
+ Quotas are persistent with the fsimage. When starting, if the fsimage
+ is immediately in violation of a quota (perhaps the fsimage was
+ surreptitiously modified), a warning is printed for each of such
+ violations. Setting or removing a quota creates a journal entry.
+
+* Administrative Commands
+
+ Quotas are managed by a set of commands available only to the
+ administrator. A sample session follows the command list below.
+
+ * <<<dfsadmin -setQuota <N> <directory>...<directory> >>>
+
+ Set the name quota to be N for each directory. Best effort for each
+ directory, with faults reported if N is not a positive long
+ integer, the directory does not exist or it is a file, or the
+ directory would immediately exceed the new quota.
+
+ * <<<dfsadmin -clrQuota <directory>...<directory> >>>
+
+ Remove any name quota for each directory. Best effort for each
+ directory, with faults reported if the directory does not exist or
+ it is a file. It is not a fault if the directory has no quota.
+
+ * <<<dfsadmin -setSpaceQuota <N> <directory>...<directory> >>>
+
+ Set the space quota to be N bytes for each directory. This is a
+ hard limit on total size of all the files under the directory tree.
+ The space quota also takes replication into account, i.e. one GB of
+ data with a replication of 3 consumes 3 GB of quota. N can also be
+ specified with a binary prefix for convenience, e.g. 50g for 50
+ gigabytes and 2t for 2 terabytes. Best effort for each
+ directory, with faults reported if N is neither zero nor a positive
+ integer, the directory does not exist or it is a file, or the
+ directory would immediately exceed the new quota.
+
+ * <<<dfsadmin -clrSpaceQuota <directory>...<directory> >>>
+
+ Remove any space quota for each directory. Best effort for each
+ directory, with faults reported if the directory does not exist or
+ it is a file. It is not a fault if the directory has no quota.
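+
+ A sample administrative session might look like the following (the
+ directory name and limits are illustrative):
+
+----
+ bash$ bin/hadoop dfsadmin -setQuota 10000 /user/alice
+ bash$ bin/hadoop dfsadmin -setSpaceQuota 1t /user/alice
+ bash$ bin/hadoop dfsadmin -clrSpaceQuota /user/alice
+----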
+
+* Reporting Command
+
+ An extension to the count command of the HDFS shell reports quota
+ values and the current count of names and bytes in use, as shown in
+ the example below.
+
+ * <<<fs -count -q <directory>...<directory> >>>
+
+ With the -q option, also report the name quota value set for each
+ directory, the available name quota remaining, the space quota
+ value set, and the available space quota remaining. If the
+ directory does not have a quota set, the reported values are <<<none>>>
+ and <<<inf>>>.
Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsUserGuide.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsUserGuide.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsUserGuide.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsUserGuide.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,499 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+ ---
+ HDFS Users Guide
+ ---
+ ---
+ ${maven.build.timestamp}
+
+HDFS Users Guide
+
+%{toc|section=1|fromDepth=0}
+
+* Purpose
+
+ This document is a starting point for users working with the Hadoop
+ Distributed File System (HDFS), either as a part of a Hadoop cluster or
+ as a stand-alone general-purpose distributed file system. While HDFS is
+ designed to "just work" in many environments, a working knowledge of
+ HDFS helps greatly with configuration improvements and diagnostics on a
+ specific cluster.
+
+* Overview
+
+ HDFS is the primary distributed storage used by Hadoop applications. An
+ HDFS cluster primarily consists of a NameNode that manages the file
+ system metadata and DataNodes that store the actual data. The HDFS
+ Architecture Guide describes HDFS in detail. This user guide primarily
+ deals with the interaction of users and administrators with HDFS
+ clusters. The HDFS architecture diagram depicts basic interactions
+ among NameNode, the DataNodes, and the clients. Clients contact
+ NameNode for file metadata or file modifications and perform actual
+ file I/O directly with the DataNodes.
+
+ The following are some of the salient features that could be of
+ interest to many users.
+
+ * Hadoop, including HDFS, is well suited for distributed storage and
+ distributed processing using commodity hardware. It is fault
+ tolerant, scalable, and extremely simple to expand. MapReduce, well
+ known for its simplicity and applicability for large set of
+ distributed applications, is an integral part of Hadoop.
+
+ * HDFS is highly configurable with a default configuration well
+ suited for many installations. Most of the time, configuration
+ needs to be tuned only for very large clusters.
+
+ * Hadoop is written in Java and is supported on all major platforms.
+
+ * Hadoop supports shell-like commands to interact with HDFS directly.
+
+ * The NameNode and DataNodes have built-in web servers that make it
+ easy to check the current status of the cluster.
+
+ * New features and improvements are regularly implemented in HDFS.
+ The following is a subset of useful features in HDFS:
+
+ * File permissions and authentication.
+
+ * Rack awareness: to take a node's physical location into
+ account while scheduling tasks and allocating storage.
+
+ * Safemode: an administrative mode for maintenance.
+
+ * <<<fsck>>>: a utility to diagnose health of the file system, to find
+ missing files or blocks.
+
+ * <<<fetchdt>>>: a utility to fetch DelegationToken and store it in a
+ file on the local system.
+
+ * Rebalancer: tool to balance the cluster when the data is
+ unevenly distributed among DataNodes.
+
+ * Upgrade and rollback: after a software upgrade, it is possible
+ to rollback to HDFS' state before the upgrade in case of
+ unexpected problems.
+
+ * Secondary NameNode: performs periodic checkpoints of the
+ namespace and helps keep the size of the file containing the log of
+ HDFS modifications within certain limits at the NameNode.
+
+ * Checkpoint node: performs periodic checkpoints of the
+ namespace and helps minimize the size of the log stored at the
+ NameNode containing changes to the HDFS. It replaces the role
+ previously filled by the Secondary NameNode, though it is not yet
+ battle hardened. The NameNode allows multiple Checkpoint nodes
+ simultaneously, as long as there are no Backup nodes
+ registered with the system.
+
+ * Backup node: An extension to the Checkpoint node. In addition
+ to checkpointing it also receives a stream of edits from the
+ NameNode and maintains its own in-memory copy of the
+ namespace, which is always in sync with the active NameNode
+ namespace state. Only one Backup node may be registered with
+ the NameNode at once.
+
+* Prerequisites
+
+ The following documents describe how to install and set up a Hadoop
+ cluster:
+
+ * {{Single Node Setup}} for first-time users.
+
+ * {{Cluster Setup}} for large, distributed clusters.
+
+ The rest of this document assumes the user is able to set up and run
+ HDFS with at least one DataNode. For the purposes of this document,
+ both the NameNode and DataNode could be running on the same physical
+ machine.
+
+* Web Interface
+
+ NameNode and DataNode each run an internal web server in order to
+ display basic information about the current status of the cluster. With
+ the default configuration, the NameNode front page is at
+ <<<http://namenode-name:50070/>>>. It lists the DataNodes in the cluster and
+ basic statistics of the cluster. The web interface can also be used to
+ browse the file system (using "Browse the file system" link on the
+ NameNode front page).
+
+* Shell Commands
+
+ Hadoop includes various shell-like commands that directly interact with
+ HDFS and other file systems that Hadoop supports. The command <<<bin/hdfs dfs -help>>>
+ lists the commands supported by the Hadoop shell. Furthermore,
+ the command <<<bin/hdfs dfs -help command-name>>> displays more detailed help
+ for a command. These commands support most of the normal file system
+ operations like copying files, changing file permissions, etc. They also
+ support a few HDFS-specific operations like changing the replication of
+ files. For more information see {{{File System Shell Guide}}}.
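+
+ As a rough illustration, a short session might look like the following
+ (the paths and replication factor below are arbitrary examples, not
+ defaults):
+
+----
+   $ bin/hdfs dfs -mkdir /user/alice/input
+   $ bin/hdfs dfs -put localfile.txt /user/alice/input
+   $ bin/hdfs dfs -ls /user/alice/input
+   $ bin/hdfs dfs -setrep -w 2 /user/alice/input/localfile.txt
+----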
+
+** DFSAdmin Command
+
+ The <<<bin/hadoop dfsadmin>>> command supports a few HDFS administration
+ related operations. The <<<bin/hadoop dfsadmin -help>>> command lists all the
+ commands currently supported. For example:
+
+ * <<<-report>>>: reports basic statistics of HDFS. Some of this
+ information is also available on the NameNode front page.
+
+ * <<<-safemode>>>: though usually not required, an administrator can
+ manually enter or leave Safemode.
+
+ * <<<-finalizeUpgrade>>>: removes previous backup of the cluster made
+ during last upgrade.
+
+ * <<<-refreshNodes>>>: Updates the namenode with the set of datanodes
+ allowed to connect to the namenode. Namenodes re-read datanode
+ hostnames in the files defined by <<<dfs.hosts>>> and <<<dfs.hosts.exclude>>>.
+ Hosts defined in <<<dfs.hosts>>> are the datanodes that are part of the
+ cluster. If there are entries in <<<dfs.hosts>>>, only the hosts in it
+ are allowed to register with the namenode. Entries in
+ <<<dfs.hosts.exclude>>> are datanodes that need to be decommissioned.
+ Datanodes complete decommissioning when all the replicas from them
+ are replicated to other datanodes. Decommissioned nodes are not
+ automatically shut down and are not chosen for writing new
+ replicas.
+
+ * <<<-printTopology>>>: Print the topology of the cluster. Display a tree
+ of racks and datanodes attached to the racks as viewed by the
+ NameNode.
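+
+ As a brief, illustrative sketch:
+
+----
+   $ bin/hadoop dfsadmin -report
+   $ bin/hadoop dfsadmin -safemode get
+   $ bin/hadoop dfsadmin -printTopology
+----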
+
+ For command usage, see {{{dfsadmin}}}.
+
+* Secondary NameNode
+
+ The NameNode stores modifications to the file system as a log appended
+ to a native file system file, edits. When a NameNode starts up, it
+ reads HDFS state from an image file, fsimage, and then applies edits
+ from the edits log file. It then writes new HDFS state to the fsimage
+ and starts normal operation with an empty edits file. Since the NameNode
+ merges the fsimage and edits files only during start up, the edits log
+ file could get very large over time on a busy cluster. Another side
+ effect of a larger edits file is that the next restart of the NameNode
+ takes longer.
+
+ The secondary NameNode merges the fsimage and the edits log files
+ periodically and keeps edits log size within a limit. It is usually run
+ on a different machine than the primary NameNode since its memory
+ requirements are on the same order as the primary NameNode.
+
+ The start of the checkpoint process on the secondary NameNode is
+ controlled by two configuration parameters.
+
+ * <<<dfs.namenode.checkpoint.period>>>, set to 1 hour by default, specifies
+ the maximum delay between two consecutive checkpoints, and
+
+ * <<<dfs.namenode.checkpoint.txns>>>, set to 40000 by default, defines the
+ number of uncheckpointed transactions on the NameNode which will
+ force an urgent checkpoint, even if the checkpoint period has not
+ been reached.
+
+ The secondary NameNode stores the latest checkpoint in a directory
+ which is structured the same way as the primary NameNode's directory,
+ so that the checkpointed image is always ready to be read by the
+ primary NameNode if necessary.
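+
+ For instance, as a sketch (the subcommands below are assumed to be
+ available in your Hadoop version), the secondary NameNode can report the
+ size of the uncheckpointed edits and force a checkpoint outside the
+ normal schedule:
+
+----
+   $ bin/hdfs secondarynamenode -geteditsize
+   $ bin/hdfs secondarynamenode -checkpoint force
+----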
+
+ For command usage, see {{{secondarynamenode}}}.
+
+* Checkpoint Node
+
+ NameNode persists its namespace using two files: fsimage, which is the
+ latest checkpoint of the namespace and edits, a journal (log) of
+ changes to the namespace since the checkpoint. When a NameNode starts
+ up, it merges the fsimage and edits journal to provide an up-to-date
+ view of the file system metadata. The NameNode then overwrites fsimage
+ with the new HDFS state and begins a new edits journal.
+
+ The Checkpoint node periodically creates checkpoints of the namespace.
+ It downloads fsimage and edits from the active NameNode, merges them
+ locally, and uploads the new image back to the active NameNode. The
+ Checkpoint node usually runs on a different machine than the NameNode
+ since its memory requirements are on the same order as the NameNode.
+ The Checkpoint node is started by <<<bin/hdfs namenode -checkpoint>>> on the
+ node specified in the configuration file.
+
+ The location of the Checkpoint (or Backup) node and its accompanying
+ web interface are configured via the <<<dfs.namenode.backup.address>>> and
+ <<<dfs.namenode.backup.http-address>>> configuration variables.
+
+ The start of the checkpoint process on the Checkpoint node is
+ controlled by two configuration parameters.
+
+ * <<<dfs.namenode.checkpoint.period>>>, set to 1 hour by default, specifies
+ the maximum delay between two consecutive checkpoints
+
+ * <<<dfs.namenode.checkpoint.txns>>>, set to 40000 by default, defines the
+ number of uncheckpointed transactions on the NameNode which will
+ force an urgent checkpoint, even if the checkpoint period has not
+ been reached.
+
+ The Checkpoint node stores the latest checkpoint in a directory that is
+ structured the same as the NameNode's directory. This allows the
+ checkpointed image to be always available for reading by the NameNode
+ if necessary. See Import checkpoint.
+
+ Multiple checkpoint nodes may be specified in the cluster configuration
+ file.
+
+ For command usage, see {{{namenode}}}.
+
+* Backup Node
+
+ The Backup node provides the same checkpointing functionality as the
+ Checkpoint node, as well as maintaining an in-memory, up-to-date copy
+ of the file system namespace that is always synchronized with the
+ active NameNode state. Along with accepting a journal stream of file
+ system edits from the NameNode and persisting this to disk, the Backup
+ node also applies those edits into its own copy of the namespace in
+ memory, thus creating a backup of the namespace.
+
+ The Backup node does not need to download fsimage and edits files from
+ the active NameNode in order to create a checkpoint, as would be
+ required with a Checkpoint node or Secondary NameNode, since it already
+ has an up-to-date view of the namespace state in memory. The Backup
+ node checkpoint process is more efficient as it only needs to save the
+ namespace into the local fsimage file and reset edits.
+
+ As the Backup node maintains a copy of the namespace in memory, its RAM
+ requirements are the same as those of the NameNode.
+
+ The NameNode supports one Backup node at a time. No Checkpoint nodes
+ may be registered if a Backup node is in use. Using multiple Backup
+ nodes concurrently will be supported in the future.
+
+ The Backup node is configured in the same manner as the Checkpoint
+ node. It is started with <<<bin/hdfs namenode -backup>>>.
+
+ The location of the Backup (or Checkpoint) node and its accompanying
+ web interface are configured via the <<<dfs.namenode.backup.address>>> and
+ <<<dfs.namenode.backup.http-address>>> configuration variables.
+
+ Use of a Backup node provides the option of running the NameNode with
+ no persistent storage, delegating all responsibility for persisting the
+ state of the namespace to the Backup node. To do this, start the
+ NameNode with the <<<-importCheckpoint>>> option, along with specifying no
+ persistent edits storage directories (<<<dfs.namenode.edits.dir>>>) in
+ the NameNode configuration.
+
+ For a complete discussion of the motivation behind the creation of the
+ Backup node and Checkpoint node, see {{{https://issues.apache.org/jira/browse/HADOOP-4539}HADOOP-4539}}.
+ For command usage, see {{{namenode}}}.
+
+* Import Checkpoint
+
+ The latest checkpoint can be imported to the NameNode if all other
+ copies of the image and the edits files are lost. To do that, one
+ should (a command sketch follows this list):
+
+ * Create an empty directory specified in the <<<dfs.namenode.name.dir>>>
+ configuration variable;
+
+ * Specify the location of the checkpoint directory in the
+ configuration variable <<<dfs.namenode.checkpoint.dir>>>;
+
+ * and start the NameNode with <<<-importCheckpoint>>> option.
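+
+ As a rough command sketch, assuming <<<dfs.namenode.checkpoint.dir>>>
+ already points at the directory holding the checkpoint (the name
+ directory path below is an arbitrary example, not a default):
+
+----
+   $ mkdir -p /data/dfs/name              # empty dfs.namenode.name.dir
+   $ bin/hdfs namenode -importCheckpoint
+----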
+
+ The NameNode will upload the checkpoint from the
+ <<<dfs.namenode.checkpoint.dir>>> directory and then save it to the NameNode
+ directory(s) set in <<<dfs.namenode.name.dir>>>. The NameNode will fail if a
+ legal image is contained in <<<dfs.namenode.name.dir>>>. The NameNode
+ verifies that the image in <<<dfs.namenode.checkpoint.dir>>> is consistent,
+ but does not modify it in any way.
+
+ For command usage, see {{{namenode}}}.
+
+* Rebalancer
+
+ HDFS data might not always be placed uniformly across the DataNodes.
+ One common reason is the addition of new DataNodes to an existing cluster.
+ While placing new blocks (data for a file is stored as a series of
+ blocks), the NameNode considers various parameters before choosing the
+ DataNodes to receive these blocks. Some of the considerations are:
+
+ * Policy to keep one of the replicas of a block on the same node as
+ the node that is writing the block.
+
+ * Need to spread different replicas of a block across the racks so
+ that the cluster can survive the loss of a whole rack.
+
+ * One of the replicas is usually placed on the same rack as the node
+ writing to the file so that cross-rack network I/O is reduced.
+
+ * Spread HDFS data uniformly across the DataNodes in the cluster.
+
+ Due to multiple competing considerations, data might not be uniformly
+ placed across the DataNodes. HDFS provides a tool for administrators
+ that analyzes block placement and rebalances data across the DataNodes.
+ A brief administrator's guide for the rebalancer is attached as a PDF to
+ {{{https://issues.apache.org/jira/browse/HADOOP-1652}HADOOP-1652}}.
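+
+ As an illustrative sketch (the threshold value below is an arbitrary
+ example, not a recommendation):
+
+----
+   $ bin/hdfs balancer -threshold 5
+----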
+
+ For command usage, see {{{balancer}}}.
+
+* Rack Awareness
+
+ Typically large Hadoop clusters are arranged in racks, and network
+ traffic between different nodes within the same rack is much more
+ desirable than network traffic across racks. In addition, the NameNode
+ tries to place replicas of a block on multiple racks for improved fault
+ tolerance. Hadoop lets the cluster administrators decide which rack a
+ node belongs to through the configuration variable
+ <<<net.topology.script.file.name>>>. When this script is configured, each
+ node runs the script to determine its rack id. A default installation
+ assumes all the nodes belong to the same rack. This feature and
+ configuration is further described in the PDF attached to
+ {{{https://issues.apache.org/jira/browse/HADOOP-692}HADOOP-692}}.
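+
+ As a rough sketch, a topology script receives one or more node addresses
+ as arguments and prints one rack id per address. The subnet-to-rack
+ mapping below is purely hypothetical:
+
+----
+   #!/bin/sh
+   # Print a rack id for every address passed on the command line.
+   for node in "$@" ; do
+     case "$node" in
+       10.1.1.*) echo /dc1/rack1 ;;
+       10.1.2.*) echo /dc1/rack2 ;;
+       *)        echo /default-rack ;;
+     esac
+   done
+----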
+
+* Safemode
+
+ During start up the NameNode loads the file system state from the
+ fsimage and the edits log file. It then waits for DataNodes to report
+ their blocks so that it does not prematurely start replicating the
+ blocks even though enough replicas already exist in the cluster. During
+ this time the NameNode stays in Safemode. Safemode for the NameNode is
+ essentially a read-only mode for the HDFS cluster, where it does not
+ allow any modifications to the file system or blocks. Normally the
+ NameNode leaves Safemode automatically after the DataNodes have reported
+ that most file system blocks are available. If required, HDFS can be
+ placed in Safemode explicitly using the <<<bin/hadoop dfsadmin -safemode>>>
+ command. The NameNode front page shows whether Safemode is on or off. A
+ more detailed description and configuration is maintained as JavaDoc
+ for <<<setSafeMode()>>>.
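+
+ For example, a sketch of the usual subcommands:
+
+----
+   $ bin/hadoop dfsadmin -safemode get
+   $ bin/hadoop dfsadmin -safemode enter
+   $ bin/hadoop dfsadmin -safemode leave
+----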
+
+* fsck
+
+ HDFS supports the fsck command to check for various inconsistencies.
+ It is designed for reporting problems with various files, for example,
+ missing blocks for a file or under-replicated blocks. Unlike a
+ traditional fsck utility for native file systems, this command does not
+ correct the errors it detects. Normally the NameNode automatically
+ corrects most of the recoverable failures. By default fsck ignores open
+ files but provides an option to select all files during reporting. The
+ HDFS fsck command is not a Hadoop shell command. It can be run as
+ <<<bin/hadoop fsck>>>. For command usage, see {{{fsck}}}. fsck can be run on the
+ whole file system or on a subset of files.
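+
+ For instance, as an illustrative sketch (the user path below is an
+ arbitrary example):
+
+----
+   $ bin/hadoop fsck / -files -blocks -locations
+   $ bin/hadoop fsck /user/alice -openforwrite
+----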
+
+* fetchdt
+
+ HDFS supports the fetchdt command to fetch a Delegation Token and store
+ it in a file on the local system. This token can later be used to
+ access a secure server (the NameNode, for example) from a non-secure
+ client. The utility uses either RPC or HTTPS (over Kerberos) to get the
+ token, and thus requires Kerberos tickets to be present before the run
+ (run kinit to get the tickets). The HDFS fetchdt command is not a
+ Hadoop shell command. It can be run as <<<bin/hadoop fetchdt DTfile>>>.
+ After you get the token you can run an HDFS command without having
+ Kerberos tickets, by pointing the <<<HADOOP_TOKEN_FILE_LOCATION>>>
+ environment variable to the delegation token file. For command usage,
+ see the {{{fetchdt}}} command.
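+
+ As a rough sketch (the token file path below is an arbitrary example):
+
+----
+   $ kinit
+   $ bin/hadoop fetchdt /tmp/my.delegation.token
+   $ HADOOP_TOKEN_FILE_LOCATION=/tmp/my.delegation.token bin/hadoop fs -ls /
+----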
+
+* Recovery Mode
+
+ Typically, you will configure multiple metadata storage locations.
+ Then, if one storage location is corrupt, you can read the metadata
+ from one of the other storage locations.
+
+ However, what can you do if the only storage locations available are
+ corrupt? In this case, there is a special NameNode startup mode called
+ Recovery mode that may allow you to recover most of your data.
+
+ You can start the NameNode in recovery mode like so: <<<namenode -recover>>>
+
+ When in recovery mode, the NameNode will interactively prompt you at
+ the command line about possible courses of action you can take to
+ recover your data.
+
+ If you don't want to be prompted, you can give the <<<-force>>> option. This
+ option will force recovery mode to always select the first choice.
+ Normally, this will be the most reasonable choice.
+
+ Because Recovery mode can cause you to lose data, you should always
+ back up your edit log and fsimage before using it.
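+
+ As a rough sketch (the backup location below is an arbitrary example;
+ the metadata directory is whatever <<<dfs.namenode.name.dir>>> points to):
+
+----
+   $ cp -r /data/dfs/name /backup/dfs-name-before-recovery
+   $ bin/hdfs namenode -recover
+----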
+
+* Upgrade and Rollback
+
+ When Hadoop is upgraded on an existing cluster, as with any software
+ upgrade, it is possible there are new bugs or incompatible changes that
+ affect existing applications and were not discovered earlier. In any
+ non-trivial HDFS installation, it is not an option to lose any data,
+ let alone to restart HDFS from scratch. HDFS allows administrators to
+ go back to the earlier version of Hadoop and roll back the cluster to
+ the state it was in before the upgrade. HDFS upgrade is described in
+ more detail in the {{{Hadoop Upgrade}}} Wiki page. HDFS can have one such
+ backup at a time. Before upgrading, administrators need to remove the
+ existing backup using the <<<bin/hadoop dfsadmin -finalizeUpgrade>>>
+ command. The following briefly describes the typical upgrade procedure;
+ a consolidated command sketch follows the list:
+
+ * Before upgrading Hadoop software, finalize if there is an existing
+ backup. <<<dfsadmin -upgradeProgress status>>> can tell if the cluster
+ needs to be finalized.
+
+ * Stop the cluster and distribute new version of Hadoop.
+
+ * Run the new version with <<<-upgrade>>> option (<<<bin/start-dfs.sh -upgrade>>>).
+
+ * Most of the time, the cluster works just fine. Once the new HDFS is
+ considered to be working well (perhaps after a few days of operation),
+ finalize the upgrade. Note that until the cluster is finalized,
+ deleting the files that existed before the upgrade does not free up
+ real disk space on the DataNodes.
+
+ * If there is a need to move back to the old version,
+
+ * stop the cluster and distribute the earlier version of Hadoop.
+
+ * start the cluster with the rollback option (<<<bin/start-dfs.sh -rollback>>>).
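+
+ Put together, a typical upgrade might look roughly like the following
+ sketch (script locations may differ between Hadoop versions):
+
+----
+   $ bin/hadoop dfsadmin -upgradeProgress status
+   $ bin/hadoop dfsadmin -finalizeUpgrade     # finalize any previous upgrade
+   $ bin/stop-dfs.sh                          # stop, then install the new version
+   $ bin/start-dfs.sh -upgrade
+   $ bin/hadoop dfsadmin -finalizeUpgrade     # once the new version proves stable
+----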
+
+* File Permissions and Security
+
+ The file permissions are designed to be similar to file permissions on
+ other familiar platforms like Linux. Currently, security is limited to
+ simple file permissions. The user that starts NameNode is treated as
+ the superuser for HDFS. Future versions of HDFS will support network
+ authentication protocols like Kerberos for user authentication and
+ encryption of data transfers. The details are discussed in the
+ Permissions Guide.
+
+* Scalability
+
+ Hadoop currently runs on clusters with thousands of nodes. The
+ {{{PoweredBy}}} Wiki page lists some of the organizations that deploy Hadoop
+ on large clusters. HDFS has one NameNode for each cluster. Currently
+ the total memory available on the NameNode is the primary scalability
+ limitation. On very large clusters, increasing the average size of files
+ stored in HDFS helps with increasing cluster size without increasing
+ memory requirements on the NameNode. The default configuration may not
+ suit very large clusters. The {{{FAQ}}} Wiki page lists suggested
+ configuration improvements for large Hadoop clusters.
+
+* Related Documentation
+
+ This user guide is a good starting point for working with HDFS. While
+ the user guide continues to improve, there is a large wealth of
+ documentation about Hadoop and HDFS. The following list is a starting
+ point for further exploration:
+
+ * {{{Hadoop Site}}}: The home page for the Apache Hadoop site.
+
+ * {{{Hadoop Wiki}}}: The home page (FrontPage) for the Hadoop Wiki. Unlike
+ the released documentation, which is part of the Hadoop source tree, the
+ Hadoop Wiki is regularly edited by the Hadoop community.
+
+ * {{{FAQ}}}: The FAQ Wiki page.
+
+ * {{{Hadoop JavaDoc API}}}.
+
+ * {{{Hadoop User Mailing List}}}: core-user[at]hadoop.apache.org.
+
+ * Explore {{{src/hdfs/hdfs-default.xml}}}. It includes brief description of
+ most of the configuration variables available.
+
+ * {{{Hadoop Commands Guide}}}: Hadoop commands usage.
Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Hftp.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Hftp.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Hftp.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Hftp.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,60 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+ ---
+ HFTP Guide
+ ---
+ ---
+ ${maven.build.timestamp}
+
+HFTP Guide
+
+ \[ {{{./index.html}Go Back}} \]
+
+%{toc|section=1|fromDepth=0}
+
+* Introduction
+
+ HFTP is a Hadoop filesystem implementation that lets you read data from
+ a remote Hadoop HDFS cluster. The reads are done via HTTP, and data is
+ sourced from DataNodes. HFTP is a read-only filesystem, and will throw
+ exceptions if you try to use it to write data or modify the filesystem
+ state.
+
+ HFTP is primarily useful if you have multiple HDFS clusters with
+ different versions and you need to move data from one to another. HFTP
+ is wire-compatible even between different versions of HDFS. For
+ example, you can do things like: <<<hadoop distcp -i hftp://sourceFS:50070/src hdfs://destFS:50070/dest>>>.
+ Note that HFTP is read-only, so the destination must be an HDFS filesystem.
+ (Also, in this example, distcp should be run using the configuration of
+ the new filesystem.)
+
+ An extension, HSFTP, uses HTTPS by default. This means that data will
+ be encrypted in transit.
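+
+ As a rough sketch (the hostnames and paths below are arbitrary examples;
+ 50070 is the default NameNode HTTP port):
+
+----
+   $ bin/hadoop fs -ls hftp://source-nn.example.com:50070/user/alice
+   $ bin/hadoop distcp hftp://source-nn.example.com:50070/user/alice hdfs:///user/alice
+----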
+
+* Implementation
+
+ The code for HFTP lives in the Java class
+ <<<org.apache.hadoop.hdfs.HftpFileSystem>>>. Likewise, HSFTP is implemented
+ in <<<org.apache.hadoop.hdfs.HsftpFileSystem>>>.
+
+* Configuration Options
+
+*-----------------------:-----------------------------------+
+| <<Name>> | <<Description>> |
+*-----------------------:-----------------------------------+
+| <<<dfs.hftp.https.port>>> | the HTTPS port on the remote cluster. If not set,
+| | HFTP will fall back on <<<dfs.https.port>>>.
+*-----------------------:-----------------------------------+
+| <<<hdfs.service.host_ip:port>>> | Specifies the service name (for the security
+| | subsystem) associated with the HFTP filesystem running at ip:port.
+*-----------------------:-----------------------------------+
Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/LibHdfs.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/LibHdfs.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/LibHdfs.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/LibHdfs.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,94 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+ ---
+ C API libhdfs
+ ---
+ ---
+ ${maven.build.timestamp}
+
+C API libhdfs
+
+%{toc|section=1|fromDepth=0}
+
+* Overview
+
+ libhdfs is a JNI based C API for Hadoop's Distributed File System
+ (HDFS). It provides C APIs to a subset of the HDFS APIs to manipulate
+ HDFS files and the filesystem. libhdfs is part of the Hadoop
+ distribution and comes pre-compiled in
+ <<<${HADOOP_PREFIX}/libhdfs/libhdfs.so>>> .
+
+* The APIs
+
+ The libhdfs APIs are a subset of: {{{hadoop fs APIs}}}.
+
+ The header file for libhdfs describes each API in detail and is
+ available in <<<${HADOOP_PREFIX}/src/c++/libhdfs/hdfs.h>>>
+
+* A Sample Program
+
+----
+ \#include "hdfs.h"
+
+ int main(int argc, char **argv) {
+
+ hdfsFS fs = hdfsConnect("default", 0);
+ const char* writePath = "/tmp/testfile.txt";
+ hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY|O_CREAT, 0, 0, 0);
+ if(!writeFile) {
+ fprintf(stderr, "Failed to open %s for writing!\n", writePath);
+ exit(-1);
+ }
+ char* buffer = "Hello, World!";
+ tSize num_written_bytes = hdfsWrite(fs, writeFile, (void*)buffer, strlen(buffer)+1);
+ if (hdfsFlush(fs, writeFile)) {
+ fprintf(stderr, "Failed to 'flush' %s\n", writePath);
+ exit(-1);
+ }
+ hdfsCloseFile(fs, writeFile);
+ }
+----
+
+* How To Link With The Library
+
+ See the Makefile for <<<hdfs_test.c>>> in the libhdfs source directory
+ (<<<${HADOOP_PREFIX}/src/c++/libhdfs/Makefile>>>) or something like:
+ <<<gcc above_sample.c -I${HADOOP_PREFIX}/src/c++/libhdfs -L${HADOOP_PREFIX}/libhdfs -lhdfs -o above_sample>>>
+
+* Common Problems
+
+ The most common problem is that the <<<CLASSPATH>>> is not set properly when
+ calling a program that uses libhdfs. Make sure you set it to all the
+ Hadoop jars needed to run Hadoop itself. Currently, there is no way to
+ programmatically generate the classpath, but a good bet is to include
+ all the jar files in <<<${HADOOP_PREFIX}>>> and <<<${HADOOP_PREFIX}/lib>>> as well
+ as the right configuration directory containing <<<hdfs-site.xml>>>.
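+
+ As a rough sketch of building such a classpath, assuming
+ <<<HADOOP_CONF_DIR>>> points at the directory containing <<<hdfs-site.xml>>>
+ (the exact jar locations vary between Hadoop releases):
+
+----
+   $ export CLASSPATH=$(echo ${HADOOP_PREFIX}/*.jar ${HADOOP_PREFIX}/lib/*.jar | tr ' ' ':')
+   $ export CLASSPATH=${CLASSPATH}:${HADOOP_CONF_DIR}
+   $ ./above_sample
+----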
+
+* Thread Safe
+
+ libhdfs is thread safe.
+
+ * Concurrency and Hadoop FS "handles"
+
+ The Hadoop FS implementation includes a FS handle cache which
+ caches based on the URI of the namenode along with the user
+ connecting. So, all calls to <<<hdfsConnect>>> will return the same
+ handle but calls to <<<hdfsConnectAsUser>>> with different users will
+ return different handles. But, since HDFS client handles are
+ completely thread safe, this has no bearing on concurrency.
+
+ * Concurrency and libhdfs/JNI
+
+ The libhdfs calls to JNI should always be creating thread local
+ storage, so (in theory), libhdfs should be as thread safe as the
+ underlying calls to the Hadoop FS.