Posted to common-commits@hadoop.apache.org by aa...@apache.org on 2015/03/01 02:09:36 UTC

[1/2] hadoop git commit: HADOOP-10976. Backport "moving the source code of hadoop-tools docs to the directory under hadoop-tools" to branch-2. Contributed by Masatake Iwasaki.

Repository: hadoop
Updated Branches:
  refs/heads/branch-2 0b0be0056 -> f7a724ca9


http://git-wip-us.apache.org/repos/asf/hadoop/blob/f7a724ca/hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm
----------------------------------------------------------------------
diff --git a/hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm b/hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm
new file mode 100644
index 0000000..447e515
--- /dev/null
+++ b/hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm
@@ -0,0 +1,512 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+#set ( $H3 = '###' )
+
+DistCp Version 2 Guide
+=====================
+
+---
+
+ - [Overview](#Overview)
+ - [Usage](#Usage)
+     - [Basic Usage](#Basic_Usage)
+     - [Update and Overwrite](#Update_and_Overwrite)
+     - [raw Namespace Extended Attribute Preservation](#raw_Namespace_Extended_Attribute_Preservation)
+ - [Command Line Options](#Command_Line_Options)
+ - [Architecture of DistCp](#Architecture_of_DistCp)
+     - [DistCp Driver](#DistCp_Driver)
+     - [Copy-listing Generator](#Copy-listing_Generator)
+     - [InputFormats and MapReduce Components](#InputFormats_and_MapReduce_Components)
+ - [Appendix](#Appendix)
+     - [Map sizing](#Map_sizing)
+     - [Copying Between Versions of HDFS](#Copying_Between_Versions_of_HDFS)
+     - [MapReduce and other side-effects](#MapReduce_and_other_side-effects)
+     - [SSL Configurations for HSFTP sources](#SSL_Configurations_for_HSFTP_sources)
+ - [Frequently Asked Questions](#Frequently_Asked_Questions)
+
+---
+
+Overview
+--------
+
+  DistCp Version 2 (distributed copy) is a tool used for large
+  inter/intra-cluster copying. It uses MapReduce to effect its distribution,
+  error handling and recovery, and reporting. It expands a list of files and
+  directories into input to map tasks, each of which will copy a partition of
+  the files specified in the source list.
+
+  [The erstwhile implementation of DistCp](http://hadoop.apache.org/docs/r1.2.1/distcp.html)
+  has its share of quirks
+  and drawbacks, both in its usage, as well as its extensibility and
+  performance. The purpose of the DistCp refactor was to fix these
+  shortcomings, enabling it to be used and extended programmatically. New
+  paradigms have been introduced to improve runtime and setup performance,
+  while simultaneously retaining the legacy behaviour as default.
+
+  This document aims to describe the design of the new DistCp, its new
+  features, their optimal use, and any deviations from the legacy implementation.
+
+Usage
+-----
+
+$H3 Basic Usage
+
+  The most common invocation of DistCp is an inter-cluster copy:
+
+    bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
+    hdfs://nn2:8020/bar/foo
+
+  This will expand the namespace under `/foo/bar` on nn1 into a temporary file,
+  partition its contents among a set of map tasks, and start a copy on each
+  NodeManager from nn1 to nn2.
+
+  One can also specify multiple source directories on the command line:
+
+    bash$ hadoop distcp hdfs://nn1:8020/foo/a \
+    hdfs://nn1:8020/foo/b \
+    hdfs://nn2:8020/bar/foo
+
+  Or, equivalently, from a file using the -f option:
+
+    bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
+    hdfs://nn2:8020/bar/foo
+
+  Where `srclist` contains
+
+    hdfs://nn1:8020/foo/a
+    hdfs://nn1:8020/foo/b
+
+  When copying from multiple sources, DistCp will abort the copy with an error
+  message if two sources collide, but collisions at the destination are
+  resolved per the [options](#Command_Line_Options) specified. By default,
+  files already existing at the destination are skipped (i.e. not replaced by
+  the source file). A count of skipped files is reported at the end of each
+  job, but it may be inaccurate if a copier failed for some subset of its
+  files, but succeeded on a later attempt.
+
+  It is important that each NodeManager can reach and communicate with both the
+  source and destination file systems. For HDFS, both the source and
+  destination must be running the same version of the protocol or use a
+  backwards-compatible protocol; see
+  [Copying Between Versions](#Copying_Between_Versions_of_HDFS).
+
+  After a copy, it is recommended that one generates and cross-checks a listing
+  of the source and destination to verify that the copy was truly successful.
+  Since DistCp employs both Map/Reduce and the FileSystem API, issues in or
+  between any of the three could adversely and silently affect the copy. Some
+  have had success running with `-update` enabled to perform a second pass, but
+  users should be acquainted with its semantics before attempting this.
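+
+  For example, one simple cross-check is to compare the directory, file and
+  byte counts reported for the two trees (a sketch only; the namenode
+  addresses and paths are placeholders, and this does not verify checksums):
+
+    bash$ hdfs dfs -count hdfs://nn1:8020/foo/bar
+    bash$ hdfs dfs -count hdfs://nn2:8020/bar/foo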
+
+  It's also worth noting that if another client is still writing to a source
+  file, the copy will likely fail. Attempting to overwrite a file being written
+  at the destination should also fail on HDFS. If a source file is (re)moved
+  before it is copied, the copy will fail with a FileNotFoundException.
+
+  Please refer to the detailed Command Line Reference for information on all
+  the options available in DistCp.
+
+$H3 Update and Overwrite
+
+  `-update` is used to copy files from the source that don't exist at the
+  target or that differ from the target version. `-overwrite` overwrites
+  target files that already exist at the target.
+
+  The Update and Overwrite options warrant special attention since their
+  handling of source-paths varies from the defaults in a very subtle manner.
+  Consider a copy from `/source/first/` and `/source/second/` to `/target/`,
+  where the source paths have the following contents:
+
+    hdfs://nn1:8020/source/first/1
+    hdfs://nn1:8020/source/first/2
+    hdfs://nn1:8020/source/second/10
+    hdfs://nn1:8020/source/second/20
+
+  When DistCp is invoked without `-update` or `-overwrite`, the DistCp defaults
+  would create directories `first/` and `second/`, under `/target`. Thus:
+
+    distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
+
+  would yield the following contents in `/target`:
+
+    hdfs://nn2:8020/target/first/1
+    hdfs://nn2:8020/target/first/2
+    hdfs://nn2:8020/target/second/10
+    hdfs://nn2:8020/target/second/20
+
+  When either `-update` or `-overwrite` is specified, the **contents** of the
+  source-directories are copied to target, and not the source directories
+  themselves. Thus:
+
+    distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
+
+  would yield the following contents in `/target`:
+
+    hdfs://nn2:8020/target/1
+    hdfs://nn2:8020/target/2
+    hdfs://nn2:8020/target/10
+    hdfs://nn2:8020/target/20
+
+  By extension, if both source folders contained a file with the same name
+  (say, `0`), then both sources would map an entry to `/target/0` at the
+  destination. Rather than permit this conflict, DistCp will abort.
+
+  Now, consider the following copy operation:
+
+    distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
+
+  With sources/sizes:
+
+    hdfs://nn1:8020/source/first/1 32
+    hdfs://nn1:8020/source/first/2 32
+    hdfs://nn1:8020/source/second/10 64
+    hdfs://nn1:8020/source/second/20 32
+
+  And destination/sizes:
+
+    hdfs://nn2:8020/target/1 32
+    hdfs://nn2:8020/target/10 32
+    hdfs://nn2:8020/target/20 64
+
+  Will effect:
+
+    hdfs://nn2:8020/target/1 32
+    hdfs://nn2:8020/target/2 32
+    hdfs://nn2:8020/target/10 64
+    hdfs://nn2:8020/target/20 32
+
+  `1` is skipped because the file-length and contents match. `2` is copied
+  because it doesn't exist at the target. `10` and `20` are overwritten since
+  the contents don't match the source.
+
+  If `-overwrite` is used instead of `-update`, `1` is overwritten as well.
+
+$H3 raw Namespace Extended Attribute Preservation
+
+  This section only applies to HDFS.
+
+  If the target and all of the source pathnames are in the /.reserved/raw
+  hierarchy, then 'raw' namespace extended attributes will be preserved.
+  'raw' xattrs are used by the system for internal functions such as encryption
+  meta data. They are only visible to users when accessed through the
+  /.reserved/raw hierarchy.
+
+  raw xattrs are preserved based solely on whether /.reserved/raw prefixes are
+  supplied. The -p (preserve, see below) flag does not impact preservation of
+  raw xattrs.
+
+  To prevent raw xattrs from being preserved, simply do not use the
+  /.reserved/raw prefix on any of the source and target paths.
+
+  If the /.reserved/raw prefix is specified on only a subset of the source and
+  target paths, an error will be displayed and a non-0 exit code returned.
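+
+  For example, a copy that preserves raw xattrs might look like the following
+  (a sketch; the cluster addresses and paths are placeholders):
+
+    bash$ hadoop distcp hdfs://nn1:8020/.reserved/raw/src hdfs://nn2:8020/.reserved/raw/dest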
+
+Command Line Options
+--------------------
+
+Flag              | Description                          | Notes
+----------------- | ------------------------------------ | --------
+`-p[rbugpcax]` | Preserve r: replication number b: block size u: user g: group p: permission c: checksum-type a: ACL x: XAttr | Modification times are not preserved. Also, when `-update` is specified, status updates will **not** be synchronized unless the file sizes also differ (i.e. unless the file is re-created). If -pa is specified, DistCp preserves the permissions also because ACLs are a super-set of permissions.
+`-i` | Ignore failures | As explained in the Appendix, this option will keep more accurate statistics about the copy than the default case. It also preserves logs from failed copies, which can be valuable for debugging. Finally, a failing map will not cause the job to fail before all splits are attempted.
+`-log <logdir>` | Write logs to \<logdir\> | DistCp keeps logs of each file it attempts to copy as map output. If a map fails, the log output will not be retained if it is re-executed.
+`-m <num_maps>` | Maximum number of simultaneous copies | Specify the number of maps to copy data. Note that more maps may not necessarily improve throughput.
+`-overwrite` | Overwrite destination | If a map fails and `-i` is not specified, all the files in the split, not only those that failed, will be recopied. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully.
+`-update` | Overwrite if source and destination differ in size, blocksize, or checksum | As noted in the preceding, this is not a "sync" operation. The criteria examined are the source and destination file sizes, blocksizes, and checksums; if they differ, the source file replaces the destination file. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully.
+`-f <urilist_uri>` | Use list at \<urilist_uri\> as src list | This is equivalent to listing each source on the command line. The `urilist_uri` list should be a fully qualified URI.
+`-filelimit <n>` | Limit the total number of files to be <= n | **Deprecated!** Ignored in the new DistCp.
+`-sizelimit <n>` | Limit the total size to be <= n bytes | **Deprecated!** Ignored in the new DistCp.
+`-delete` | Delete the files existing in the dst but not in src | The deletion is done by FS Shell, so the trash will be used, if it is enabled.
+`-strategy {dynamic|uniformsize}` | Choose the copy-strategy to be used in DistCp. | By default, uniformsize is used. (i.e. Maps are balanced on the total size of files copied by each map. Similar to legacy.) If "dynamic" is specified, `DynamicInputFormat` is used instead. (This is described in the Architecture section, under InputFormats.)
+`-bandwidth` | Specify bandwidth per map, in MB/second. | Each map will be restricted to consume only the specified bandwidth. This is not always exact. The map throttles back its bandwidth consumption during a copy, such that the **net** bandwidth used tends towards the specified value.
+`-atomic {-tmp <tmp_dir>}` | Specify atomic commit, with optional tmp directory. | `-atomic` instructs DistCp to copy the source data to a temporary target location, and then move the temporary target to the final-location atomically. Data will either be available at final target in a complete and consistent form, or not at all. Optionally, `-tmp` may be used to specify the location of the tmp-target. If not specified, a default is chosen. **Note:** tmp_dir must be on the final target cluster.
+`-mapredSslConf <ssl_conf_file>` | Specify SSL Config file, to be used with HSFTP source | When using the hsftp protocol with a source, the security-related properties may be specified in a config-file and passed to DistCp. \<ssl_conf_file\> needs to be in the classpath.
+`-async` | Run DistCp asynchronously. Quits as soon as the Hadoop Job is launched. | The Hadoop Job-id is logged, for tracking.
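+
+  As an illustration of combining several of these flags (a sketch; the hosts,
+  paths and limits are placeholders, not recommendations), the following runs
+  an incremental copy that preserves user, group and permissions, uses the
+  dynamic strategy with at most 40 maps, caps each map at 10 MB/s, and writes
+  logs to a directory on the target cluster:
+
+    bash$ hadoop distcp -update -pugp -strategy dynamic -m 40 -bandwidth 10 \
+    -log hdfs://nn2:8020/logs/distcp \
+    hdfs://nn1:8020/source hdfs://nn2:8020/target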
+
+Architecture of DistCp
+----------------------
+
+  The components of the new DistCp may be classified into the following
+  categories:
+
+  * DistCp Driver
+  * Copy-listing generator
+  * Input-formats and Map-Reduce components
+
+$H3 DistCp Driver
+
+  The DistCp Driver components are responsible for:
+
+  * Parsing the arguments passed to the DistCp command on the command-line,
+    via:
+
+     * OptionsParser, and
+     * DistCpOptionsSwitch
+
+  * Assembling the command arguments into an appropriate DistCpOptions object,
+    and initializing DistCp. These arguments include:
+
+     * Source-paths
+     * Target location
+     * Copy options (e.g. whether to update-copy, overwrite, which
+       file-attributes to preserve, etc.)
+
+  * Orchestrating the copy operation by:
+
+     * Invoking the copy-listing-generator to create the list of files to be
+       copied.
+     * Setting up and launching the Hadoop Map-Reduce Job to carry out the
+       copy.
+     * Based on the options, either returning a handle to the Hadoop MR Job
+       immediately, or waiting till completion.
+
+  The parser-elements are exercised only from the command-line (or if
+  DistCp::run() is invoked). The DistCp class may also be used
+  programmatically, by constructing the DistCpOptions object, and initializing
+  a DistCp object appropriately.
+
+$H3 Copy-listing Generator
+
+  The copy-listing-generator classes are responsible for creating the list of
+  files/directories to be copied from source. They examine the contents of the
+  source-paths (files/directories, including wild-cards), and record all paths
+  that need copy into a SequenceFile, for consumption by the DistCp Hadoop
+  Job. The main classes in this module include:
+
+  1. CopyListing: The interface that should be implemented by any
+     copy-listing-generator implementation. Also provides the factory method by
+     which the concrete CopyListing implementation is chosen.
+  2. SimpleCopyListing: An implementation of CopyListing that accepts multiple
+     source paths (files/directories), and recursively lists all the individual
+     files and directories under each, for copy.
+  3. GlobbedCopyListing: Another implementation of CopyListing that expands
+     wild-cards in the source paths.
+  4. FileBasedCopyListing: An implementation of CopyListing that reads the
+     source-path list from a specified file.
+
+  Based on whether a source-file-list is specified in the DistCpOptions, the
+  source-listing is generated in one of the following ways:
+
+  1. If there's no source-file-list, the GlobbedCopyListing is used. All
+     wild-cards are expanded, and all the expansions are forwarded to the
+     SimpleCopyListing, which in turn constructs the listing (via recursive
+     descent of each path).
+  2. If a source-file-list is specified, the FileBasedCopyListing is used.
+     Source-paths are read from the specified file, and then forwarded to the
+     GlobbedCopyListing. The listing is then constructed as described above.
+
+  One may customize the method by which the copy-listing is constructed by
+  providing a custom implementation of the CopyListing interface. The behaviour
+  of DistCp differs here from the legacy DistCp, in how paths are considered
+  for copy.
+
+  The legacy implementation only lists those paths that must definitely be
+  copied on to target. E.g. if a file already exists at the target (and
+  `-overwrite` isn't specified), the file isn't even considered in the
+  MapReduce Copy Job. Determining this during setup (i.e. before the MapReduce
+  Job) involves file-size and checksum-comparisons that are potentially
+  time-consuming.
+
+  The new DistCp postpones such checks until the MapReduce Job, thus reducing
+  setup time. Performance is enhanced further since these checks are
+  parallelized across multiple maps.
+
+$H3 InputFormats and MapReduce Components
+
+  The InputFormats and MapReduce components are responsible for the actual copy
+  of files and directories from the source to the destination path. The
+  listing-file created during copy-listing generation is consumed at this
+  point, when the copy is carried out. The classes of interest here include:
+
+  * **UniformSizeInputFormat:**
+    This implementation of org.apache.hadoop.mapreduce.InputFormat provides
+    equivalence with Legacy DistCp in balancing load across maps. The aim of
+    the UniformSizeInputFormat is to make each map copy roughly the same number
+    of bytes. To this end, the listing file is split into groups of paths, such
+    that the sum of file-sizes in each InputSplit is nearly equal to that of
+    every other split. The splitting isn't always perfect, but its trivial
+    implementation keeps the setup-time low.
+
+  * **DynamicInputFormat and DynamicRecordReader:**
+    The DynamicInputFormat implements org.apache.hadoop.mapreduce.InputFormat,
+    and is new to DistCp. The listing-file is split into several "chunk-files",
+    the exact number of chunk-files being a multiple of the number of maps
+    requested for in the Hadoop Job. Each map task is "assigned" one of the
+    chunk-files (by renaming the chunk to the task's id), before the Job is
+    launched.
+    Paths are read from each chunk using the DynamicRecordReader, and
+    processed in the CopyMapper. After all the paths in a chunk are processed,
+    the current chunk is deleted and a new chunk is acquired. The process
+    continues until no more chunks are available.
+    This "dynamic" approach allows faster map-tasks to consume more paths than
+    slower ones, thus speeding up the DistCp job overall.
+
+  * **CopyMapper:**
+    This class implements the physical file-copy. The input-paths are checked
+    against the input-options (specified in the Job's Configuration), to
+    determine whether a file needs copy. A file will be copied only if at least
+    one of the following is true:
+
+     * A file with the same name doesn't exist at target.
+     * A file with the same name exists at target, but has a different file
+       size.
+     * A file with the same name exists at target, but has a different
+       checksum, and `-skipcrccheck` isn't mentioned.
+     * A file with the same name exists at target, but `-overwrite` is
+       specified.
+     * A file with the same name exists at target, but differs in block-size
+       (and block-size needs to be preserved).
+
+  * **CopyCommitter:** This class is responsible for the commit-phase of the
+    DistCp job, including:
+
+     * Preservation of directory-permissions (if specified in the options)
+     * Clean-up of temporary-files, work-directories, etc.
+
+Appendix
+--------
+
+$H3 Map sizing
+
+  By default, DistCp makes an attempt to size each map comparably so that each
+  copies roughly the same number of bytes. Note that files are the finest level
+  of granularity, so increasing the number of simultaneous copiers (i.e. maps)
+  may not always increase the number of simultaneous copies nor the overall
+  throughput.
+
+  The new DistCp also provides a strategy to "dynamically" size maps, allowing
+  faster data-nodes to copy more bytes than slower nodes. With `-strategy
+  dynamic` (explained in the Architecture section), rather than assigning a
+  fixed set of source-files to each map-task, files are split into several sets.
+  The number of sets exceeds the number of maps, usually by a factor of 2-3.
+  Each map picks up and copies all files listed in a chunk. When a chunk is
+  exhausted, a new chunk is acquired and processed, until no more chunks
+  remain.
+
+  By not assigning a source-path to a fixed map, faster map-tasks (i.e.
+  data-nodes) are able to consume more chunks, and thus copy more data, than
+  slower nodes. While this distribution isn't uniform, it is fair with regard
+  to each mapper's capacity.
+
+  The dynamic-strategy is implemented by the DynamicInputFormat. It provides
+  superior performance under most conditions.
+
+  Tuning the number of maps to the size of the source and destination clusters,
+  the size of the copy, and the available bandwidth is recommended for
+  long-running and regularly run jobs.
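+
+  For instance, a large, regularly run copy might be tuned as follows (a
+  sketch; the map count, bandwidth cap and paths are placeholders to be tuned
+  per cluster):
+
+    bash$ hadoop distcp -strategy dynamic -m 128 -bandwidth 50 \
+    hdfs://nn1:8020/data/warehouse hdfs://nn2:8020/data/warehouse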
+
+$H3 Copying Between Versions of HDFS
+
+  For copying between two different versions of Hadoop, one will usually use
+  HftpFileSystem. This is a read-only FileSystem, so DistCp must be run on the
+  destination cluster (more specifically, on NodeManagers that can write to the
+  destination cluster). Each source is specified as
+  `hftp://<dfs.http.address>/<path>` (the default `dfs.http.address` is
+  `<namenode>:50070`).
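+
+  For example, to pull data from an older cluster while running DistCp on the
+  newer, destination cluster (a sketch; host names and paths are placeholders):
+
+    bash$ hadoop distcp hftp://nn1.example.com:50070/foo/bar \
+    hdfs://nn2.example.com:8020/bar/foo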
+
+$H3 MapReduce and other side-effects
+
+  As has been mentioned in the preceding, should a map fail to copy one of its
+  inputs, there will be several side-effects.
+
+  * Unless `-overwrite` is specified, files successfully copied by a previous
+    map on a re-execution will be marked as "skipped".
+  * If a map fails `mapreduce.map.maxattempts` times, the remaining map tasks
+    will be killed (unless `-i` is set).
+  * If `mapreduce.map.speculative` is set final and true, the result of the
+    copy is undefined.
+
+$H3 SSL Configurations for HSFTP sources
+
+  To use an HSFTP source (i.e. using the hsftp protocol), an SSL configuration
+  file needs to be specified (via the `-mapredSslConf` option). This must
+  specify 3 parameters:
+
+  * `ssl.client.truststore.location`: The local-filesystem location of the
+    trust-store file, containing the certificate for the NameNode.
+  * `ssl.client.truststore.type`: (Optional) The format of the trust-store
+    file.
+  * `ssl.client.truststore.password`: (Optional) Password for the trust-store
+    file.
+
+  The following is an example of the contents of an SSL
+  configuration file:
+
+    <configuration>
+      <property>
+        <name>ssl.client.truststore.location</name>
+        <value>/work/keystore.jks</value>
+        <description>Truststore to be used by clients like distcp. Must be specified.</description>
+      </property>
+
+      <property>
+        <name>ssl.client.truststore.password</name>
+        <value>changeme</value>
+        <description>Optional. Default value is "".</description>
+      </property>
+
+      <property>
+        <name>ssl.client.truststore.type</name>
+        <value>jks</value>
+        <description>Optional. Default value is "jks".</description>
+      </property>
+    </configuration>
+
+  The SSL configuration file must be in the class-path of the DistCp program.
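+
+  With such a file on the classpath, an HSFTP copy could then be invoked as
+  follows (a sketch; the config file name, hosts and paths are placeholders):
+
+    bash$ hadoop distcp -mapredSslConf ssl-client.xml \
+    hsftp://nn1.example.com:50470/foo/bar hdfs://nn2.example.com:8020/bar/foo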
+
+Frequently Asked Questions
+--------------------------
+
+  1. **Why does -update not create the parent source-directory under a pre-existing target directory?**
+     The behaviour of `-update` and `-overwrite` is described in detail in the
+     Usage section of this document. In short, if either option is used with a
+     pre-existing destination directory, the **contents** of each source
+     directory are copied over, rather than the source-directory itself. This
+     behaviour is consistent with the legacy DistCp implementation as well.
+
+  2. **How does the new DistCp differ in semantics from the Legacy DistCp?**
+
+     * Files that are skipped during copy used to also have their
+       file-attributes (permissions, owner/group info, etc.) unchanged, when
+       copied with Legacy DistCp. These are now updated, even if the file-copy
+       is skipped.
+     * Empty root directories among the source-path inputs were not created at
+       the target, in Legacy DistCp. These are now created.
+
+  3. **Why does the new DistCp use more maps than legacy DistCp?**
+     Legacy DistCp works by figuring out what files need to be actually copied
+     to target before the copy-job is launched, and then launching as many maps
+     as required for copy. So if a majority of the files need to be skipped
+     (because they already exist, for example), fewer maps will be needed. As a
+     consequence, the time spent in setup (i.e. before the M/R job) is higher.
+     The new DistCp calculates only the contents of the source-paths. It
+     doesn't try to filter out what files can be skipped. That decision is put
+     off till the M/R job runs. This is much faster (vis-a-vis execution-time),
+     but the number of maps launched will be as specified in the `-m` option,
+     or 20 (default) if unspecified.
+
+  4. **Why does DistCp not run faster when more maps are specified?**
+     At present, the smallest unit of work for DistCp is a file, i.e., a file
+     is processed by only one map. Increasing the number of maps to a value
+     exceeding the number of files would yield no performance benefit. The
+     number of maps launched would equal the number of files.
+
+  5. **Why does DistCp run out of memory?**
+     If the number of individual files/directories being copied from the source
+     path(s) is extremely large (e.g. 1,000,000 paths), DistCp might run out of
+     memory while determining the list of paths for copy. This is not unique to
+     the new DistCp implementation.
+     To get around this, consider changing the `-Xmx` JVM heap-size parameters,
+     as follows:
+
+         bash$ export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m"
+         bash$ hadoop distcp /source /target

http://git-wip-us.apache.org/repos/asf/hadoop/blob/f7a724ca/hadoop-tools/hadoop-distcp/src/site/resources/css/site.css
----------------------------------------------------------------------
diff --git a/hadoop-tools/hadoop-distcp/src/site/resources/css/site.css b/hadoop-tools/hadoop-distcp/src/site/resources/css/site.css
new file mode 100644
index 0000000..f830baa
--- /dev/null
+++ b/hadoop-tools/hadoop-distcp/src/site/resources/css/site.css
@@ -0,0 +1,30 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+#banner {
+  height: 93px;
+  background: none;
+}
+
+#bannerLeft img {
+  margin-left: 30px;
+  margin-top: 10px;
+}
+
+#bannerRight img {
+  margin: 17px;
+}
+

http://git-wip-us.apache.org/repos/asf/hadoop/blob/f7a724ca/hadoop-tools/hadoop-streaming/src/site/apt/HadoopStreaming.apt.vm
----------------------------------------------------------------------
diff --git a/hadoop-tools/hadoop-streaming/src/site/apt/HadoopStreaming.apt.vm b/hadoop-tools/hadoop-streaming/src/site/apt/HadoopStreaming.apt.vm
new file mode 100644
index 0000000..8be92b5
--- /dev/null
+++ b/hadoop-tools/hadoop-streaming/src/site/apt/HadoopStreaming.apt.vm
@@ -0,0 +1,792 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~   http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+  ---
+  Hadoop Streaming
+  ---
+  ---
+  ${maven.build.timestamp}
+
+Hadoop Streaming
+
+%{toc|section=1|fromDepth=0|toDepth=4}
+
+* Hadoop Streaming
+
+  Hadoop streaming is a utility that comes with the Hadoop distribution. The
+  utility allows you to create and run Map/Reduce jobs with any executable or
+  script as the mapper and/or the reducer. For example:
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper /bin/cat \
+    -reducer /usr/bin/wc
++---+
+
+* How Streaming Works
+
+  In the above example, both the mapper and the reducer are executables that
+  read the input from stdin (line by line) and emit the output to stdout. The
+  utility will create a Map/Reduce job, submit the job to an appropriate
+  cluster, and monitor the progress of the job until it completes.
+
+  When an executable is specified for mappers, each mapper task will launch the
+  executable as a separate process when the mapper is initialized. As the
+  mapper task runs, it converts its inputs into lines and feeds the lines to the
+  stdin of the process. In the meantime, the mapper collects the line-oriented
+  outputs from the stdout of the process and converts each line into a
+  key/value pair, which is collected as the output of the mapper. By default,
+  the <prefix of a line up to the first tab character> is the <<<key>>> and the
+  rest of the line (excluding the tab character) will be the <<<value>>>. If
+  there is no tab character in the line, then the entire line is considered the
+  key and the value is null. However, this can be customized by setting the
+  <<<-inputformat>>> command option, as discussed later.
+
+  When an executable is specified for reducers, each reducer task will launch
+  the executable as a separate process when the reducer is initialized. As the
+  reducer task runs, it converts its input key/value pairs into lines and
+  feeds the lines to the stdin of the process. In the meantime, the reducer
+  collects the line-oriented outputs from the stdout of the process and converts
+  each line into a key/value pair, which is collected as the output of the
+  reducer. By default, the prefix of a line up to the first tab character is
+  the key and the rest of the line (excluding the tab character) is the value.
+  However, this can be customized by setting <<<-outputformat>>> command
+  option, as discussed later.
+
+  This is the basis for the communication protocol between the Map/Reduce
+  framework and the streaming mapper/reducer.
+
+  A user can specify <<<stream.non.zero.exit.is.failure>>> as <<<true>>> or
+  <<<false>>> to make a streaming task that exits with a non-zero status be
+  treated as <<<Failure>>> or <<<Success>>> respectively. By default, streaming
+  tasks exiting with non-zero status are considered to be failed tasks.
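+
+  For example, to treat non-zero exit codes as success, the property can be
+  passed as a generic <<<-D>>> option (a sketch; the rest of the command is
+  elided):
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -D stream.non.zero.exit.is.failure=false \
+    (rest of the command)
++---+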
+
+* Streaming Command Options
+
+  Streaming supports streaming command options as well as
+  {{{Generic_Command_Options}generic command options}}. The general command
+  line syntax is shown below.
+
+  <<Note:>> Be sure to place the generic options before the streaming options,
+  otherwise the command will fail. For an example, see
+  {{{Making_Archives_Available_to_Tasks}Making Archives Available to Tasks}}.
+
++---+
+hadoop command [genericOptions] [streamingOptions]
++---+
+
+  The Hadoop streaming command options are listed here:
+
+*-------------*--------------------*------------------------------------------*
+|| Parameter  || Optional/Required || Description                             |
+*-------------+--------------------+------------------------------------------+
+| -input directoryname or filename | Required | Input location for mapper
+*-------------+--------------------+------------------------------------------+
+| -output directoryname | Required | Output location for reducer
+*-------------+--------------------+------------------------------------------+
+| -mapper executable or JavaClassName | Required | Mapper executable
+*-------------+--------------------+------------------------------------------+
+| -reducer executable or JavaClassName | Required | Reducer executable
+*-------------+--------------------+------------------------------------------+
+| -file filename | Optional | Make the mapper, reducer, or combiner executable
+|                |          | available locally on the compute nodes
+*-------------+--------------------+------------------------------------------+
+| -inputformat JavaClassName | Optional | Class you supply should return
+|                            |          | key/value pairs of Text class. If not
+|                            |          | specified, TextInputFormat is used as
+|                            |          | the default
+*-------------+--------------------+------------------------------------------+
+| -outputformat JavaClassName | Optional | Class you supply should take
+|                             |          | key/value pairs of Text class. If
+|                             |          | not specified, TextOutputFormat is
+|                             |          | used as the default
+*-------------+--------------------+------------------------------------------+
+| -partitioner JavaClassName | Optional | Class that determines which reduce a
+|                            |          | key is sent to
+*-------------+--------------------+------------------------------------------+
+| -combiner streamingCommand | Optional | Combiner executable for map output
+| or JavaClassName           |          |
+*-------------+--------------------+------------------------------------------+
+| -cmdenv name=value | Optional | Pass environment variable to streaming
+|                    |          | commands
+*-------------+--------------------+------------------------------------------+
+| -inputreader | Optional | For backwards-compatibility: specifies a record
+|              |          | reader class (instead of an input format class)
+*-------------+--------------------+------------------------------------------+
+| -verbose | Optional | Verbose output
+*-------------+--------------------+------------------------------------------+
+| -lazyOutput | Optional | Create output lazily. For example, if the output
+|             |          | format is based on FileOutputFormat, the output file
+|             |          | is created only on the first call to Context.write
+*-------------+--------------------+------------------------------------------+
+| -numReduceTasks | Optional | Specify the number of reducers
+*-------------+--------------------+------------------------------------------+
+| -mapdebug | Optional | Script to call when map task fails
+*-------------+--------------------+------------------------------------------+
+| -reducedebug | Optional | Script to call when reduce task fails
+*-------------+--------------------+------------------------------------------+
+
+** Specifying a Java Class as the Mapper/Reducer
+
+  You can supply a Java class as the mapper and/or the reducer.
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -input myInputDirs \
+    -output myOutputDir \
+    -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
+    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
+    -reducer /usr/bin/wc
++---+
+
+** Packaging Files With Job Submissions
+
+  You can specify any executable as the mapper and/or the reducer. The
+  executables do not need to pre-exist on the machines in the cluster; however,
+  if they don't, you will need to use "-file" option to tell the framework to
+  pack your executable files as a part of job submission. For example:
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper myPythonScript.py \
+    -reducer /usr/bin/wc \
+    -file myPythonScript.py
++---+
+
+  The above example specifies a user defined Python executable as the mapper.
+  The option "-file myPythonScript.py" causes the python executable to be
+  shipped to the cluster machines as a part of job submission.
+
+  In addition to executable files, you can also package other auxiliary files
+  (such as dictionaries, configuration files, etc) that may be used by the
+  mapper and/or the reducer. For example:
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper myPythonScript.py \
+    -reducer /usr/bin/wc \
+    -file myPythonScript.py \
+    -file myDictionary.txt
++---+
+
+** Specifying Other Plugins for Jobs
+
+  Just as with a normal Map/Reduce job, you can specify other plugins for a
+  streaming job:
+
++---+
+   -inputformat JavaClassName
+   -outputformat JavaClassName
+   -partitioner JavaClassName
+   -combiner streamingCommand or JavaClassName
++---+
+
+  The class you supply for the input format should return key/value pairs of
+  Text class. If you do not specify an input format class, the TextInputFormat
+  is used as the default. Since the TextInputFormat returns keys of
+  LongWritable class, which are actually not part of the input data, the keys
+  will be discarded; only the values will be piped to the streaming mapper.
+
+  The class you supply for the output format is expected to take key/value
+  pairs of Text class. If you do not specify an output format class, the
+  TextOutputFormat is used as the default.
+
+** Setting Environment Variables
+
+  To set an environment variable in a streaming command use:
+
++---+
+   -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
++---+
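+
+  For example, a full job that passes the variable to a packaged mapper script
+  might look like this (a sketch; the directory value, input/output paths and
+  script name are placeholders, and the script would read the variable from
+  its environment, e.g. <<<os.environ["EXAMPLE_DIR"]>>> in Python):
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -cmdenv EXAMPLE_DIR=/home/example/dictionaries/ \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper myPythonScript.py \
+    -reducer /usr/bin/wc \
+    -file myPythonScript.py
++---+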
+
+* Generic Command Options
+
+  Streaming supports {{{Streaming_Command_Options}streaming command options}}
+  as well as generic command options. The general command line syntax is shown
+  below.
+
+  <<Note:>> Be sure to place the generic options before the streaming options,
+  otherwise the command will fail. For an example, see
+  {{{Making_Archives_Available_to_Tasks}Making Archives Available to Tasks}}.
+
++---+
+hadoop command [genericOptions] [streamingOptions]
++---+
+
+  The Hadoop generic command options you can use with streaming are listed
+  here:
+
+*-------------*--------------------*------------------------------------------*
+|| Parameter  || Optional/Required || Description                             |
+*-------------+--------------------+------------------------------------------+
+| -conf configuration_file | Optional | Specify an application configuration
+|                          |          | file
+*-------------+--------------------+------------------------------------------+
+| -D property=value | Optional | Use value for given property
+*-------------+--------------------+------------------------------------------+
+| -fs host:port or local | Optional | Specify a namenode
+*-------------+--------------------+------------------------------------------+
+| -files | Optional | Specify comma-separated files to be copied to the
+|        |          | Map/Reduce cluster
+*-------------+--------------------+------------------------------------------+
+| -libjars | Optional | Specify comma-separated jar files to include in the
+|          |          | classpath
+*-------------+--------------------+------------------------------------------+
+| -archives | Optional | Specify comma-separated archives to be unarchived on
+|           |          | the compute machines
+*-------------+--------------------+------------------------------------------+
+
+** Specifying Configuration Variables with the -D Option
+
+  You can specify additional configuration variables by using
+  "-D \<property\>=\<value\>".
+
+*** Specifying Directories
+
+  To change the local temp directory use:
+
++---+
+   -D dfs.data.dir=/tmp
++---+
+
+  To specify additional local temp directories use:
+
++---+
+   -D mapred.local.dir=/tmp/local
+   -D mapred.system.dir=/tmp/system
+   -D mapred.temp.dir=/tmp/temp
++---+
+
+  <<Note:>> For more details on job configuration parameters see:
+  {{{./mapred-default.xml}mapred-default.xml}}
+
+*** Specifying Map-Only Jobs
+
+  Often, you may want to process input data using a map function only. To do
+  this, simply set <<<mapreduce.job.reduces>>> to zero. The Map/Reduce
+  framework will not create any reducer tasks. Rather, the outputs of the
+  mapper tasks will be the final output of the job.
+
++---+
+   -D mapreduce.job.reduces=0
++---+
+
+  To be backward compatible, Hadoop Streaming also supports the "-reducer NONE"
+  option, which is equivalent to "-D mapreduce.job.reduces=0".
+
+*** Specifying the Number of Reducers
+
+  To specify the number of reducers, for example two, use:
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -D mapreduce.job.reduces=2 \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper /bin/cat \
+    -reducer /usr/bin/wc
++---+
+
+*** Customizing How Lines are Split into Key/Value Pairs
+
+  As noted earlier, when the Map/Reduce framework reads a line from the stdout
+  of the mapper, it splits the line into a key/value pair. By default, the
+  prefix of the line up to the first tab character is the key and the rest of
+  the line (excluding the tab character) is the value.
+
+  However, you can customize this default. You can specify a field separator
+  other than the tab character (the default), and you can specify the nth
+  (n >= 1) character rather than the first character in a line (the default) as
+  the separator between the key and value. For example:
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -D stream.map.output.field.separator=. \
+    -D stream.num.map.output.key.fields=4 \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper /bin/cat \
+    -reducer /bin/cat
++---+
+
+  In the above example, "-D stream.map.output.field.separator=." specifies "."
+  as the field separator for the map outputs, and the prefix up to the fourth
+  "." in a line will be the key and the rest of the line (excluding the fourth
+  ".") will be the value. If a line has less than four "."s, then the whole
+  line will be the key and the value will be an empty Text object (like the one
+  created by new Text("")).
+
+  Similarly, you can use "-D stream.reduce.output.field.separator=SEP" and
+  "-D stream.num.reduce.output.fields=NUM" to specify the nth field separator
+  in a line of the reduce outputs as the separator between the key and the
+  value.
+
+  Similarly, you can specify "stream.map.input.field.separator" and
+  "stream.reduce.input.field.separator" as the input separator for Map/Reduce
+  inputs. By default the separator is the tab character.
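+
+  For example, to split the reduce output on "." at the fourth separator,
+  reusing the property names quoted above (a sketch; input/output paths are
+  placeholders):
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -D stream.reduce.output.field.separator=. \
+    -D stream.num.reduce.output.fields=4 \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper /bin/cat \
+    -reducer /bin/cat
++---+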
+
+** Working with Large Files and Archives
+
+  The -files and -archives options allow you to make files and archives
+  available to the tasks. The argument is a URI to the file or archive that you
+  have already uploaded to HDFS. These files and archives are cached across
+  jobs. You can retrieve the host and fs_port values from the fs.default.name
+  config variable.
+
+  <<Note:>> The -files and -archives options are generic options. Be sure to
+  place the generic options before the command options, otherwise the command
+  will fail.
+
+*** Making Files Available to Tasks
+
+  The -files option creates a symlink in the current working directory of the
+  tasks that points to the local copy of the file.
+
+  In this example, Hadoop automatically creates a symlink named testfile.txt in
+  the current working directory of the tasks. This symlink points to the local
+  copy of testfile.txt.
+
++---+
+-files hdfs://host:fs_port/user/testfile.txt
++---+
+
+  A user can specify a different symlink name for -files using #.
+
++---+
+-files hdfs://host:fs_port/user/testfile.txt#testfile
++---+
+
+  Multiple entries can be specified like this:
+
++---+
+-files hdfs://host:fs_port/user/testfile1.txt,hdfs://host:fs_port/user/testfile2.txt
++---+
+
+*** Making Archives Available to Tasks
+
+  The -archives option allows you to copy jars locally to the current working
+  directory of tasks and automatically unjar the files.
+
+  In this example, Hadoop automatically creates a symlink named testfile.jar in
+  the current working directory of tasks. This symlink points to the directory
+  that stores the unjarred contents of the uploaded jar file.
+
++---+
+-archives hdfs://host:fs_port/user/testfile.jar
++---+
+
+  A user can specify a different symlink name for -archives using #.
+
++---+
+-archives hdfs://host:fs_port/user/testfile.tgz#tgzdir
++---+
+
+  In this example, the input.txt file has two lines specifying the names of the
+  two files: cachedir.jar/cache.txt and cachedir.jar/cache2.txt. "cachedir.jar"
+  is a symlink to the archived directory, which has the files "cache.txt" and
+  "cache2.txt".
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+                  -archives 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar' \
+                  -D mapreduce.job.maps=1 \
+                  -D mapreduce.job.reduces=1 \
+                  -D mapreduce.job.name="Experiment" \
+                  -input "/user/me/samples/cachefile/input.txt" \
+                  -output "/user/me/samples/cachefile/out" \
+                  -mapper "xargs cat" \
+                  -reducer "cat"
+
+$ ls test_jar/
+cache.txt  cache2.txt
+
+$ jar cvf cachedir.jar -C test_jar/ .
+added manifest
+adding: cache.txt(in = 30) (out= 29)(deflated 3%)
+adding: cache2.txt(in = 37) (out= 35)(deflated 5%)
+
+$ hdfs dfs -put cachedir.jar samples/cachefile
+
+$ hdfs dfs -cat /user/me/samples/cachefile/input.txt
+cachedir.jar/cache.txt
+cachedir.jar/cache2.txt
+
+$ cat test_jar/cache.txt
+This is just the cache string
+
+$ cat test_jar/cache2.txt
+This is just the second cache string
+
+$ hdfs dfs -ls /user/me/samples/cachefile/out
+Found 2 items
+-rw-r--r--   1 me supergroup        0 2013-11-14 17:00 /user/me/samples/cachefile/out/_SUCCESS
+-rw-r--r--   1 me supergroup       69 2013-11-14 17:00 /user/me/samples/cachefile/out/part-00000
+
+$ hdfs dfs -cat /user/me/samples/cachefile/out/part-00000
+This is just the cache string
+This is just the second cache string
++---+
+
+* More Usage Examples
+
+** Hadoop Partitioner Class
+
+  Hadoop has a library class,
+  {{{../../api/org/apache/hadoop/mapred/lib/KeyFieldBasedPartitioner.html}
+  KeyFieldBasedPartitioner}}, that is useful for many applications. This class
+  allows the Map/Reduce framework to partition the map outputs based on certain
+  key fields, not the whole keys. For example:
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -D stream.map.output.field.separator=. \
+    -D stream.num.map.output.key.fields=4 \
+    -D map.output.key.field.separator=. \
+    -D mapreduce.partition.keypartitioner.options=-k1,2 \
+    -D mapreduce.job.reduces=12 \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper /bin/cat \
+    -reducer /bin/cat \
+    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
++---+
+
+  Here, <-D stream.map.output.field.separator=.> and
+  <-D stream.num.map.output.key.fields=4> are as explained in the previous
+  example. The two variables are used by streaming to identify the key/value
+  pair of the mapper.
+
+  The map output keys of the above Map/Reduce job normally have four fields
+  separated by ".". However, the Map/Reduce framework will partition the map
+  outputs by the first two fields of the keys using the
+  <-D mapreduce.partition.keypartitioner.options=-k1,2> option. Here,
+  <-D map.output.key.field.separator=.> specifies the separator for the
+  partition. This guarantees that all the key/value pairs with the same first
+  two fields in the keys will be partitioned into the same reducer.
+
+  <This is effectively equivalent to specifying the first two fields as the
+  primary key and the next two fields as the secondary. The primary key is used
+  for partitioning, and the combination of the primary and secondary keys is
+  used for sorting.> A simple illustration is shown here:
+
+  Output of map (the keys)
+
++---+
+11.12.1.2
+11.14.2.3
+11.11.4.1
+11.12.1.1
+11.14.2.2
++---+
+
+  Partition into 3 reducers (the first 2 fields are used as keys for partition)
+
++---+
+11.11.4.1
+-----------
+11.12.1.2
+11.12.1.1
+-----------
+11.14.2.3
+11.14.2.2
++---+
+
+  Sorting within each partition for the reducer (all 4 fields used for sorting)
+
++---+
+11.11.4.1
+-----------
+11.12.1.1
+11.12.1.2
+-----------
+11.14.2.2
+11.14.2.3
++---+
+
+** Hadoop Comparator Class
+
+  Hadoop has a library class,
+  {{{../../api/org/apache/hadoop/mapreduce/lib/partition/KeyFieldBasedComparator.html}
+  KeyFieldBasedComparator}}, that is useful for many applications. This class
+  provides a subset of features provided by the Unix/GNU Sort. For example:
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
+    -D stream.map.output.field.separator=. \
+    -D stream.num.map.output.key.fields=4 \
+    -D mapreduce.map.output.key.field.separator=. \
+    -D mapreduce.partition.keycomparator.options=-k2,2nr \
+    -D mapreduce.job.reduces=1 \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper /bin/cat \
+    -reducer /bin/cat
++---+
+
+  The map output keys of the above Map/Reduce job normally have four fields
+  separated by ".". However, the Map/Reduce framework will sort the outputs by
+  the second field of the keys using the
+  <-D mapreduce.partition.keycomparator.options=-k2,2nr> option. Here, <-n>
+  specifies that the sorting is numerical sorting and <-r> specifies that the
+  result should be reversed. A simple illustration is shown below:
+
+  Output of map (the keys)
+
++---+
+11.12.1.2
+11.14.2.3
+11.11.4.1
+11.12.1.1
+11.14.2.2
++---+
+
+  Sorting output for the reducer (where the second field is used for sorting)
+
++---+
+11.14.2.3
+11.14.2.2
+11.12.1.2
+11.12.1.1
+11.11.4.1
++---+
+
+** Hadoop Aggregate Package
+
+  Hadoop has a library package called
+  {{{../../org/apache/hadoop/mapred/lib/aggregate/package-summary.html}
+  Aggregate}}. Aggregate provides a special reducer class and a special
+  combiner class, and a list of simple aggregators that perform aggregations
+  such as "sum", "max", "min" and so on over a sequence of values. Aggregate
+  allows you to define a mapper plugin class that is expected to generate
+  "aggregatable items" for each input key/value pair of the mappers. The
+  combiner/reducer will aggregate those aggregatable items by invoking the
+  appropriate aggregators.
+
+  To use Aggregate, simply specify "-reducer aggregate":
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper myAggregatorForKeyCount.py \
+    -reducer aggregate \
+    -file myAggregatorForKeyCount.py
++---+
+
+  The python program myAggregatorForKeyCount.py looks like:
+
++---+
+#!/usr/bin/python
+
+import sys
+
+def generateLongCountToken(id):
+    # Emit an "aggregatable item" understood by the aggregate reducer.
+    return "LongValueSum:" + id + "\t" + "1"
+
+def main(argv):
+    line = sys.stdin.readline()
+    try:
+        while line:
+            line = line[:-1]                       # strip the trailing newline
+            fields = line.split("\t")
+            print generateLongCountToken(fields[0])
+            line = sys.stdin.readline()
+    except EOFError:
+        return None
+
+if __name__ == "__main__":
+    main(sys.argv)
++---+
+
+** Hadoop Field Selection Class
+
+  Hadoop has a library class,
+  {{{../../api/org/apache/hadoop/mapred/lib/FieldSelectionMapReduce.html}
+  FieldSelectionMapReduce}}, that effectively allows you to process text data
+  like the unix "cut" utility. The map function defined in the class treats
+  each input key/value pair as a list of fields. You can specify the field
+  separator (the default is the tab character). You can select an arbitrary
+  list of fields as the map output key, and an arbitrary list of fields as the
+  map output value. Similarly, the reduce function defined in the class treats
+  each input key/value pair as a list of fields. You can select an arbitrary
+  list of fields as the reduce output key, and an arbitrary list of fields as
+  the reduce output value. For example:
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -D mapreduce.map.output.key.field.separator=. \
+    -D mapreduce.partition.keypartitioner.options=-k1,2 \
+    -D mapreduce.fieldsel.data.field.separator=. \
+    -D mapreduce.fieldsel.map.output.key.value.fields.spec=6,5,1-3:0- \
+    -D mapreduce.fieldsel.reduce.output.key.value.fields.spec=0-2:5- \
+    -D mapreduce.map.output.key.class=org.apache.hadoop.io.Text \
+    -D mapreduce.job.reduces=12 \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
+    -reducer org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
+    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
++---+
+
+  The option "-D
+  mapreduce.fieldsel.map.output.key.value.fields.spec=6,5,1-3:0-" specifies
+  key/value selection for the map outputs. Key selection spec and value
+  selection spec are separated by ":". In this case, the map output key will
+  consist of fields 6, 5, 1, 2, and 3. The map output value will consist of all
+  fields (0- means field 0 and all the subsequent fields).
+
+  The option "-D mapreduce.fieldsel.reduce.output.key.value.fields.spec=0-2:5-"
+  specifies key/value selection for the reduce outputs. In this case, the
+  reduce output key will consist of fields 0, 1, 2 (corresponding to the
+  original fields 6, 5, 1). The reduce output value will consist of all fields
+  starting from field 5 (corresponding to all the original fields).
+
+* Frequently Asked Questions
+
+** How do I use Hadoop Streaming to run an arbitrary set of (semi) independent
+   tasks?
+
+  Often you do not need the full power of Map Reduce, but only need to run
+  multiple instances of the same program - either on different parts of the
+  data, or on the same data, but with different parameters. You can use Hadoop
+  Streaming to do this.
+
+** How do I process files, one per map?
+
+  As an example, consider the problem of zipping (compressing) a set of files
+  across the hadoop cluster. You can achieve this by using Hadoop Streaming
+  and custom mapper script:
+
+   * Generate a file containing the full HDFS path of the input files. Each map
+     task would get one file name as input.
+
+   * Create a mapper script which, given a filename, will get the file to local
+     disk, gzip the file and put it back in the desired output directory (a
+     sketch of such a script is shown below).
+
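+  A minimal sketch of such a mapper script follows. It assumes each input
+  record is an HDFS path (one per line) and that an output directory such as
+  <<</user/me/gzipped>>> already exists; both are placeholders.
+
++---+
+#!/bin/bash
+# Read one HDFS path per line from stdin; gzip each file and store the result.
+while read hdfs_path; do
+  name=$(basename "$hdfs_path")
+  hdfs dfs -get "$hdfs_path" "$name"           # copy to local disk
+  gzip "$name"                                 # compress locally
+  hdfs dfs -put "$name.gz" /user/me/gzipped/   # put the result back into HDFS
+  echo "done $hdfs_path"                       # emit something as map output
+done
++---+
+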
+** How many reducers should I use?
+
+  See MapReduce Tutorial for details: {{{./MapReduceTutorial.html#Reducer}
+  Reducer}}
+
+** If I set up an alias in my shell script, will that work after -mapper?
+
+  For example, say I do: alias c1='cut -f1'. Will -mapper "c1" work?
+
+  Using an alias will not work, but variable substitution is allowed as shown
+  in this example:
+
++---+
+$ hdfs dfs -cat /user/me/samples/student_marks
+alice   50
+bruce   70
+charlie 80
+dan     75
+
+$ c2='cut -f2'; hadoop jar hadoop-streaming-${project.version}.jar \
+    -D mapreduce.job.name='Experiment' \
+    -input /user/me/samples/student_marks \
+    -output /user/me/samples/student_out \
+    -mapper "$c2" -reducer 'cat'
+
+$ hdfs dfs -cat /user/me/samples/student_out/part-00000
+50
+70
+75
+80
++---+
+
+** Can I use UNIX pipes?
+
+  For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?
+
+  Currently this does not work and gives a "java.io.IOException: Broken pipe"
+  error. This is probably a bug that needs to be investigated.
+
+** What do I do if I get the "No space left on device" error?
+
+  For example, when I run a streaming job by distributing large executables
+  (for example, 3.6G) through the -file option, I get a "No space left on
+  device" error.
+
+  The jar packaging happens in a directory pointed to by the configuration
+  variable stream.tmpdir. The default value of stream.tmpdir is /tmp. Set the
+  value to a directory with more space:
+
++---+
+-D stream.tmpdir=/export/bigspace/...
++---+
+
+** How do I specify multiple input directories?
+
+  You can specify multiple input directories with multiple '-input' options:
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -input '/user/foo/dir1' -input '/user/foo/dir2' \
+    (rest of the command)
++---+
+
+** How do I generate output files with gzip format?
+
+  Instead of plain text files, you can generate gzip files as your output.
+  Pass '-D mapreduce.output.fileoutputformat.compress=true -D
+  mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec'
+  as options to your streaming job.
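+
+  For example, a complete invocation might look like the following sketch
+  (myInputDirs and myOutputDir are placeholders, as elsewhere in this
+  document):
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -D mapreduce.output.fileoutputformat.compress=true \
+    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper /bin/cat \
+    -reducer /usr/bin/wc
++---+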
+
+** How do I provide my own input/output format with streaming?
+
+  You can specify your own custom classes by packaging them into a jar and
+  adding that jar to \$\{HADOOP_CLASSPATH\}.
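+
+  For example, the following sketch assumes a hypothetical jar
+  /path/to/myformats.jar containing a class org.example.MyInputFormat; the
+  generic -libjars option is used here to also ship the jar to the tasks:
+
++---+
+export HADOOP_CLASSPATH=/path/to/myformats.jar
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -libjars /path/to/myformats.jar \
+    -inputformat org.example.MyInputFormat \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper /bin/cat \
+    -reducer /usr/bin/wc
++---+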
+
+** How do I parse XML documents using streaming?
+
+  You can use the record reader StreamXmlRecordReader to process XML documents.
+
++---+
+hadoop jar hadoop-streaming-${project.version}.jar \
+    -inputreader "StreamXmlRecord,begin=BEGIN_STRING,end=END_STRING" \
+    (rest of the command)
++---+
+
+  Anything found between BEGIN_STRING and END_STRING would be treated as one
+  record for map tasks.
+
+** How do I update counters in streaming applications?
+
+  A streaming process can use stderr to emit counter information.
+  <<<reporter:counter:\<group\>,\<counter\>,\<amount\>>>> should be sent to
+  stderr to update the counter.
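+
+  For example, a shell-script mapper could emit the following line (the group
+  and counter names are hypothetical):
+
++---+
+echo "reporter:counter:MyGroup,RecordsSeen,1" >&2
++---+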
+
+** How do I update status in streaming applications?
+
+  A streaming process can use stderr to emit status information. To set a
+  status, <<<reporter:status:\<message\>>>> should be sent to stderr.
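+
+  For example, a shell-script mapper could emit:
+
++---+
+echo "reporter:status:Processed 10000 records" >&2
++---+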
+
+** How do I get the Job variables in a streaming job's mapper/reducer?
+
+  See {{{./MapReduceTutorial.html#Configured_Parameters}
+  Configured Parameters}}. During the execution of a streaming job, the names
+  of the "mapred" parameters are transformed. The dots ( . ) become underscores
+  ( _ ). For example, mapreduce.job.id becomes mapreduce_job_id and
+  mapreduce.job.jar becomes mapreduce_job_jar. In your code, use the parameter
+  names with the underscores.
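+
+  The transformed names are made available to the streaming process as
+  environment variables, so, for example, a shell-script mapper could log the
+  job id with a line like this (a sketch):
+
++---+
+echo "running in job $mapreduce_job_id" >&2
++---+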

http://git-wip-us.apache.org/repos/asf/hadoop/blob/f7a724ca/hadoop-tools/hadoop-streaming/src/site/resources/css/site.css
----------------------------------------------------------------------
diff --git a/hadoop-tools/hadoop-streaming/src/site/resources/css/site.css b/hadoop-tools/hadoop-streaming/src/site/resources/css/site.css
new file mode 100644
index 0000000..f830baa
--- /dev/null
+++ b/hadoop-tools/hadoop-streaming/src/site/resources/css/site.css
@@ -0,0 +1,30 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+#banner {
+  height: 93px;
+  background: none;
+}
+
+#bannerLeft img {
+  margin-left: 30px;
+  margin-top: 10px;
+}
+
+#bannerRight img {
+  margin: 17px;
+}
+


[2/2] hadoop git commit: HADOOP-10976. Backport "moving the source code of hadoop-tools docs to the directory under hadoop-tools" to branch-2. Contributed by Masatake Iwasaki.

Posted by aa...@apache.org.
HADOOP-10976. Backport "moving the source code of hadoop-tools docs to the directory under hadoop-tools" to branch-2. Contributed by Masatake Iwasaki.

(cherry picked from commit 9112f093cde0b8242324aa5267c9beb21c38bf6b)

Conflicts:
	hadoop-common-project/hadoop-common/CHANGES.txt


Project: http://git-wip-us.apache.org/repos/asf/hadoop/repo
Commit: http://git-wip-us.apache.org/repos/asf/hadoop/commit/f7a724ca
Tree: http://git-wip-us.apache.org/repos/asf/hadoop/tree/f7a724ca
Diff: http://git-wip-us.apache.org/repos/asf/hadoop/diff/f7a724ca

Branch: refs/heads/branch-2
Commit: f7a724ca9e5b1c2ea52415cc3ac0dc66a3df2f61
Parents: 0b0be00
Author: Akira Ajisaka <aa...@apache.org>
Authored: Wed Feb 4 17:57:34 2015 -0800
Committer: Akira Ajisaka <aa...@apache.org>
Committed: Sat Feb 28 17:06:58 2015 -0800

----------------------------------------------------------------------
 hadoop-common-project/hadoop-common/CHANGES.txt |   3 +
 .../src/site/apt/HadoopStreaming.apt.vm         | 792 -------------------
 .../src/site/markdown/DistCp.md.vm              | 512 ------------
 .../src/site/markdown/HadoopArchives.md.vm      | 162 ----
 hadoop-project/src/site/site.xml                |  17 +-
 .../src/site/markdown/HadoopArchives.md.vm      | 162 ++++
 .../src/site/resources/css/site.css             |  30 +
 .../src/site/markdown/DistCp.md.vm              | 512 ++++++++++++
 .../src/site/resources/css/site.css             |  30 +
 .../src/site/apt/HadoopStreaming.apt.vm         | 792 +++++++++++++++++++
 .../src/site/resources/css/site.css             |  30 +
 11 files changed, 1569 insertions(+), 1473 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hadoop/blob/f7a724ca/hadoop-common-project/hadoop-common/CHANGES.txt
----------------------------------------------------------------------
diff --git a/hadoop-common-project/hadoop-common/CHANGES.txt b/hadoop-common-project/hadoop-common/CHANGES.txt
index c3a54df..881dc02 100644
--- a/hadoop-common-project/hadoop-common/CHANGES.txt
+++ b/hadoop-common-project/hadoop-common/CHANGES.txt
@@ -243,6 +243,9 @@ Release 2.7.0 - UNRELEASED
     HADOOP-11620. Add support for load balancing across a group of KMS for HA.
     (Arun Suresh via wang)
 
+    HADOOP-10976. moving the source code of hadoop-tools docs to the
+    directory under hadoop-tools (Masatake Iwasaki via aw)
+
   BUG FIXES
 
     HADOOP-11512. Use getTrimmedStrings when reading serialization keys

http://git-wip-us.apache.org/repos/asf/hadoop/blob/f7a724ca/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/HadoopStreaming.apt.vm
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/HadoopStreaming.apt.vm b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/HadoopStreaming.apt.vm
deleted file mode 100644
index 8be92b5..0000000
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/HadoopStreaming.apt.vm
+++ /dev/null
@@ -1,792 +0,0 @@
-~~ Licensed under the Apache License, Version 2.0 (the "License");
-~~ you may not use this file except in compliance with the License.
-~~ You may obtain a copy of the License at
-~~
-~~   http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License. See accompanying LICENSE file.
-
-  ---
-  Hadoop Streaming
-  ---
-  ---
-  ${maven.build.timestamp}
-
-Hadoop Streaming
-
-%{toc|section=1|fromDepth=0|toDepth=4}
-
-* Hadoop Streaming
-
-  Hadoop streaming is a utility that comes with the Hadoop distribution. The
-  utility allows you to create and run Map/Reduce jobs with any executable or
-  script as the mapper and/or the reducer. For example:
-
-+---+
-hadoop jar hadoop-streaming-${project.version}.jar \
-    -input myInputDirs \
-    -output myOutputDir \
-    -mapper /bin/cat \
-    -reducer /usr/bin/wc
-+---+
-
-* How Streaming Works
-
-  In the above example, both the mapper and the reducer are executables that
-  read the input from stdin (line by line) and emit the output to stdout. The
-  utility will create a Map/Reduce job, submit the job to an appropriate
-  cluster, and monitor the progress of the job until it completes.
-
-  When an executable is specified for mappers, each mapper task will launch the
-  executable as a separate process when the mapper is initialized. As the
-  mapper task runs, it converts its inputs into lines and feeds the lines to
-  the stdin of the process. In the meantime, the mapper collects the line
-  oriented outputs from the stdout of the process and converts each line into
-  a key/value pair, which is collected as the output of the mapper. By default,
-  the <prefix of a line up to the first tab character> is the <<<key>>> and the
-  rest of the line (excluding the tab character) will be the <<<value>>>. If
-  there is no tab character in the line, then the entire line is considered as
-  the key and the value is null. However, this can be customized by setting
-  the <<<-inputformat>>> command option, as discussed later.
-
-  When an executable is specified for reducers, each reducer task will launch
-  the executable as a separate process when the reducer is initialized. As the
-  reducer task runs, it converts its input key/value pairs into lines and
-  feeds the lines to the stdin of the process. In the meantime, the reducer
-  collects the line oriented outputs from the stdout of the process and
-  converts each line into a key/value pair, which is collected as the output
-  of the reducer. By default, the prefix of a line up to the first tab
-  character is the key and the rest of the line (excluding the tab character)
-  is the value. However, this can be customized by setting the
-  <<<-outputformat>>> command option, as discussed later.
-
-  This is the basis for the communication protocol between the Map/Reduce
-  framework and the streaming mapper/reducer.
-
-  Users can set <<<stream.non.zero.exit.is.failure>>> to <<<true>>> or
-  <<<false>>> so that a streaming task exiting with a non-zero status is
-  treated as <<<Failure>>> or <<<Success>>> respectively. By default, streaming
-  tasks exiting with non-zero status are considered to be failed tasks.
-
-* Streaming Command Options
-
-  Streaming supports streaming command options as well as
-  {{{Generic_Command_Options}generic command options}}. The general command
-  line syntax is shown below.
-
-  <<Note:>> Be sure to place the generic options before the streaming options,
-  otherwise the command will fail. For an example, see
-  {{{Making_Archives_Available_to_Tasks}Making Archives Available to Tasks}}.
-
-+---+
-hadoop command [genericOptions] [streamingOptions]
-+---+
-
-  The Hadoop streaming command options are listed here:
-
-*-------------*--------------------*------------------------------------------*
-|| Parameter  || Optional/Required || Description                             |
-*-------------+--------------------+------------------------------------------+
-| -input directoryname or filename | Required | Input location for mapper
-*-------------+--------------------+------------------------------------------+
-| -output directoryname | Required | Output location for reducer
-*-------------+--------------------+------------------------------------------+
-| -mapper executable or JavaClassName | Required | Mapper executable
-*-------------+--------------------+------------------------------------------+
-| -reducer executable or JavaClassName | Required | Reducer executable
-*-------------+--------------------+------------------------------------------+
-| -file filename | Optional | Make the mapper, reducer, or combiner executable
-|                |          | available locally on the compute nodes
-*-------------+--------------------+------------------------------------------+
-| -inputformat JavaClassName | Optional | Class you supply should return
-|                            |          | key/value pairs of Text class. If not
-|                            |          | specified, TextInputFormat is used as
-|                            |          | the default
-*-------------+--------------------+------------------------------------------+
-| -outputformat JavaClassName | Optional | Class you supply should take
-|                             |          | key/value pairs of Text class. If
-|                             |          | not specified, TextOutputformat is
-|                             |          | used as the default
-*-------------+--------------------+------------------------------------------+
-| -partitioner JavaClassName | Optional | Class that determines which reduce a
-|                            |          | key is sent to
-*-------------+--------------------+------------------------------------------+
-| -combiner streamingCommand | Optional | Combiner executable for map output
-| or JavaClassName           |          |
-*-------------+--------------------+------------------------------------------+
-| -cmdenv name=value | Optional | Pass environment variable to streaming
-|                    |          | commands
-*-------------+--------------------+------------------------------------------+
-| -inputreader | Optional | For backwards-compatibility: specifies a record
-|              |          | reader class (instead of an input format class)
-*-------------+--------------------+------------------------------------------+
-| -verbose | Optional | Verbose output
-*-------------+--------------------+------------------------------------------+
-| -lazyOutput | Optional | Create output lazily. For example, if the output
-|             |          | format is based on FileOutputFormat, the output file
-|             |          | is created only on the first call to Context.write
-*-------------+--------------------+------------------------------------------+
-| -numReduceTasks | Optional | Specify the number of reducers
-*-------------+--------------------+------------------------------------------+
-| -mapdebug | Optional | Script to call when map task fails
-*-------------+--------------------+------------------------------------------+
-| -reducedebug | Optional | Script to call when reduce task fails
-*-------------+--------------------+------------------------------------------+
-
-** Specifying a Java Class as the Mapper/Reducer
-
-  You can supply a Java class as the mapper and/or the reducer.
-
-+---+
-hadoop jar hadoop-streaming-${project.version}.jar \
-    -input myInputDirs \
-    -output myOutputDir \
-    -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
-    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-    -reducer /usr/bin/wc
-+---+
-
-  You can set <<<stream.non.zero.exit.is.failure>>> to <<<true>>> or
-  <<<false>>> so that a streaming task exiting with a non-zero status is
-  treated as <<<Failure>>> or <<<Success>>> respectively. By default, streaming
-  tasks exiting with non-zero status are considered to be failed tasks.
-
-** Packaging Files With Job Submissions
-
-  You can specify any executable as the mapper and/or the reducer. The
-  executables do not need to pre-exist on the machines in the cluster; however,
-  if they don't, you will need to use "-file" option to tell the framework to
-  pack your executable files as a part of job submission. For example:
-
-+---+
-hadoop jar hadoop-streaming-${project.version}.jar \
-    -input myInputDirs \
-    -output myOutputDir \
-    -mapper myPythonScript.py \
-    -reducer /usr/bin/wc \
-    -file myPythonScript.py
-+---+
-
-  The above example specifies a user defined Python executable as the mapper.
-  The option "-file myPythonScript.py" causes the python executable to be
-  shipped to the cluster machines as a part of job submission.
-
-  In addition to executable files, you can also package other auxiliary files
-  (such as dictionaries, configuration files, etc) that may be used by the
-  mapper and/or the reducer. For example:
-
-+---+
-hadoop jar hadoop-streaming-${project.version}.jar \
-    -input myInputDirs \
-    -output myOutputDir \
-    -mapper myPythonScript.py \
-    -reducer /usr/bin/wc \
-    -file myPythonScript.py \
-    -file myDictionary.txt
-+---+
-
-** Specifying Other Plugins for Jobs
-
-  Just as with a normal Map/Reduce job, you can specify other plugins for a
-  streaming job:
-
-+---+
-   -inputformat JavaClassName
-   -outputformat JavaClassName
-   -partitioner JavaClassName
-   -combiner streamingCommand or JavaClassName
-+---+
-
-  The class you supply for the input format should return key/value pairs of
-  Text class. If you do not specify an input format class, the TextInputFormat
-  is used as the default. Since the TextInputFormat returns keys of
-  LongWritable class, which are actually not part of the input data, the keys
-  will be discarded; only the values will be piped to the streaming mapper.
-
-  The class you supply for the output format is expected to take key/value
-  pairs of Text class. If you do not specify an output format class, the
-  TextOutputFormat is used as the default.
-
-** Setting Environment Variables
-
-  To set an environment variable in a streaming command use:
-
-+---+
-   -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
-+---+
-
-* Generic Command Options
-
-  Streaming supports {{{Streaming_Command_Options}streaming command options}}
-  as well as generic command options. The general command line syntax is shown
-  below.
-
-  <<Note:>> Be sure to place the generic options before the streaming options,
-  otherwise the command will fail. For an example, see
-  {{{Making_Archives_Available_to_Tasks}Making Archives Available to Tasks}}.
-
-+---+
-hadoop command [genericOptions] [streamingOptions]
-+---+
-
-  The Hadoop generic command options you can use with streaming are listed
-  here:
-
-*-------------*--------------------*------------------------------------------*
-|| Parameter  || Optional/Required || Description                             |
-*-------------+--------------------+------------------------------------------+
-| -conf configuration_file | Optional | Specify an application configuration
-|                          |          | file
-*-------------+--------------------+------------------------------------------+
-| -D property=value | Optional | Use value for given property
-*-------------+--------------------+------------------------------------------+
-| -fs host:port or local | Optional | Specify a namenode
-*-------------+--------------------+------------------------------------------+
-| -files | Optional | Specify comma-separated files to be copied to the
-|        |          | Map/Reduce cluster
-*-------------+--------------------+------------------------------------------+
-| -libjars | Optional | Specify comma-separated jar files to include in the
-|          |          | classpath
-*-------------+--------------------+------------------------------------------+
-| -archives | Optional | Specify comma-separated archives to be unarchived on
-|           |          | the compute machines
-*-------------+--------------------+------------------------------------------+
-
-** Specifying Configuration Variables with the -D Option
-
-  You can specify additional configuration variables by using
-  "-D \<property\>=\<value\>".
-
-*** Specifying Directories
-
-  To change the local temp directory use:
-
-+---+
-   -D dfs.data.dir=/tmp
-+---+
-
-  To specify additional local temp directories use:
-
-+---+
-   -D mapred.local.dir=/tmp/local
-   -D mapred.system.dir=/tmp/system
-   -D mapred.temp.dir=/tmp/temp
-+---+
-
-  <<Note:>> For more details on job configuration parameters see:
-  {{{./mapred-default.xml}mapred-default.xml}}
-
-*** Specifying Map-Only Jobs
-
-  Often, you may want to process input data using a map function only. To do
-  this, simply set <<<mapreduce.job.reduces>>> to zero. The Map/Reduce
-  framework will not create any reducer tasks. Rather, the outputs of the
-  mapper tasks will be the final output of the job.
-
-+---+
-   -D mapreduce.job.reduces=0
-+---+
-
-  To be backward compatible, Hadoop Streaming also supports the "-reducer NONE"
-  option, which is equivalent to "-D mapreduce.job.reduces=0".
-
-*** Specifying the Number of Reducers
-
-  To specify the number of reducers, for example two, use:
-
-+---+
-hadoop jar hadoop-streaming-${project.version}.jar \
-    -D mapreduce.job.reduces=2 \
-    -input myInputDirs \
-    -output myOutputDir \
-    -mapper /bin/cat \
-    -reducer /usr/bin/wc
-+---+
-
-*** Customizing How Lines are Split into Key/Value Pairs
-
-  As noted earlier, when the Map/Reduce framework reads a line from the stdout
-  of the mapper, it splits the line into a key/value pair. By default, the
-  prefix of the line up to the first tab character is the key and the rest of
-  the line (excluding the tab character) is the value.
-
-  However, you can customize this default. You can specify a field separator
-  other than the tab character (the default), and you can specify the nth
-  (n >= 1) occurrence of the separator, rather than the first (the default),
-  as the split point between the key and value. For example:
-
-+---+
-hadoop jar hadoop-streaming-${project.version}.jar \
-    -D stream.map.output.field.separator=. \
-    -D stream.num.map.output.key.fields=4 \
-    -input myInputDirs \
-    -output myOutputDir \
-    -mapper /bin/cat \
-    -reducer /bin/cat
-+---+
-
-  In the above example, "-D stream.map.output.field.separator=." specifies "."
-  as the field separator for the map outputs, and the prefix up to the fourth
-  "." in a line will be the key and the rest of the line (excluding the fourth
-  ".") will be the value. If a line has fewer than four "."s, then the whole
-  line will be the key and the value will be an empty Text object (like the one
-  created by new Text("")).
-
-  Similarly, you can use "-D stream.reduce.output.field.separator=SEP" and
-  "-D stream.num.reduce.output.fields=NUM" to specify the nth field separator
-  in a line of the reduce outputs as the separator between the key and the
-  value.
-
-  Similarly, you can specify "stream.map.input.field.separator" and
-  "stream.reduce.input.field.separator" as the input separator for Map/Reduce
-  inputs. By default the separator is the tab character.
-
-** Working with Large Files and Archives
-
-  The -files and -archives options allow you to make files and archives
-  available to the tasks. The argument is a URI to the file or archive that you
-  have already uploaded to HDFS. These files and archives are cached across
-  jobs. You can retrieve the host and fs_port values from the fs.default.name
-  config variable.
-
-  <<Note:>> The -files and -archives options are generic options. Be sure to
-  place the generic options before the command options, otherwise the command
-  will fail.
-
-*** Making Files Available to Tasks
-
-  The -files option creates a symlink in the current working directory of the
-  tasks that points to the local copy of the file.
-
-  In this example, Hadoop automatically creates a symlink named testfile.txt in
-  the current working directory of the tasks. This symlink points to the local
-  copy of testfile.txt.
-
-+---+
--files hdfs://host:fs_port/user/testfile.txt
-+---+
-
-  Users can specify a different symlink name for -files using #.
-
-+---+
--files hdfs://host:fs_port/user/testfile.txt#testfile
-+---+
-
-  Multiple entries can be specified like this:
-
-+---+
--files hdfs://host:fs_port/user/testfile1.txt,hdfs://host:fs_port/user/testfile2.txt
-+---+
-
-*** Making Archives Available to Tasks
-
-  The -archives option allows you to copy jars locally to the current working
-  directory of tasks and automatically unjar the files.
-
-  In this example, Hadoop automatically creates a symlink named testfile.jar in
-  the current working directory of tasks. This symlink points to the directory
-  that stores the unjarred contents of the uploaded jar file.
-
-+---+
--archives hdfs://host:fs_port/user/testfile.jar
-+---+
-
-  Users can specify a different symlink name for -archives using #.
-
-+---+
--archives hdfs://host:fs_port/user/testfile.tgz#tgzdir
-+---+
-
-  In this example, the input.txt file has two lines specifying the names of the
-  two files: cachedir.jar/cache.txt and cachedir.jar/cache2.txt. "cachedir.jar"
-  is a symlink to the archived directory, which has the files "cache.txt" and
-  "cache2.txt".
-
-+---+
-hadoop jar hadoop-streaming-${project.version}.jar \
-                  -archives 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar' \
-                  -D mapreduce.job.maps=1 \
-                  -D mapreduce.job.reduces=1 \
-                  -D mapreduce.job.name="Experiment" \
-                  -input "/user/me/samples/cachefile/input.txt" \
-                  -output "/user/me/samples/cachefile/out" \
-                  -mapper "xargs cat" \
-                  -reducer "cat"
-
-$ ls test_jar/
-cache.txt  cache2.txt
-
-$ jar cvf cachedir.jar -C test_jar/ .
-added manifest
-adding: cache.txt(in = 30) (out= 29)(deflated 3%)
-adding: cache2.txt(in = 37) (out= 35)(deflated 5%)
-
-$ hdfs dfs -put cachedir.jar samples/cachefile
-
-$ hdfs dfs -cat /user/me/samples/cachefile/input.txt
-cachedir.jar/cache.txt
-cachedir.jar/cache2.txt
-
-$ cat test_jar/cache.txt
-This is just the cache string
-
-$ cat test_jar/cache2.txt
-This is just the second cache string
-
-$ hdfs dfs -ls /user/me/samples/cachefile/out
-Found 2 items
--rw-r--r--   1 me supergroup        0 2013-11-14 17:00 /user/me/samples/cachefile/out/_SUCCESS
--rw-r--r--   1 me supergroup       69 2013-11-14 17:00 /user/me/samples/cachefile/out/part-00000
-
-$ hdfs dfs -cat /user/me/samples/cachefile/out/part-00000
-This is just the cache string
-This is just the second cache string
-+---+
-
-* More Usage Examples
-
-** Hadoop Partitioner Class
-
-  Hadoop has a library class,
-  {{{../../api/org/apache/hadoop/mapred/lib/KeyFieldBasedPartitioner.html}
-  KeyFieldBasedPartitioner}}, that is useful for many applications. This class
-  allows the Map/Reduce framework to partition the map outputs based on certain
-  key fields, not the whole keys. For example:
-
-+---+
-hadoop jar hadoop-streaming-${project.version}.jar \
-    -D stream.map.output.field.separator=. \
-    -D stream.num.map.output.key.fields=4 \
-    -D map.output.key.field.separator=. \
-    -D mapreduce.partition.keypartitioner.options=-k1,2 \
-    -D mapreduce.job.reduces=12 \
-    -input myInputDirs \
-    -output myOutputDir \
-    -mapper /bin/cat \
-    -reducer /bin/cat \
-    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-+---+
-
-  Here, <-D stream.map.output.field.separator=.> and
-  <-D stream.num.map.output.key.fields=4> are as explained in the previous
-  example. The two variables are used by streaming to identify the key/value
-  pair of the mapper.
-
-  The map output keys of the above Map/Reduce job normally have four fields
-  separated by ".". However, the Map/Reduce framework will partition the map
-  outputs by the first two fields of the keys using the
-  <-D mapreduce.partition.keypartitioner.options=-k1,2> option. Here,
-  <-D map.output.key.field.separator=.> specifies the separator for the
-  partition. This guarantees that all the key/value pairs with the same first
-  two fields in the keys will be partitioned into the same reducer.
-
-  <This is effectively equivalent to specifying the first two fields as the
-  primary key and the next two fields as the secondary. The primary key is used
-  for partitioning, and the combination of the primary and secondary keys is
-  used for sorting.> A simple illustration is shown here:
-
-  Output of map (the keys)
-
-+---+
-11.12.1.2
-11.14.2.3
-11.11.4.1
-11.12.1.1
-11.14.2.2
-+---+
-
-  Partition into 3 reducers (the first 2 fields are used as keys for partition)
-
-+---+
-11.11.4.1
------------
-11.12.1.2
-11.12.1.1
------------
-11.14.2.3
-11.14.2.2
-+---+
-
-  Sorting within each partition for the reducer (all 4 fields used for sorting)
-
-+---+
-11.11.4.1
------------
-11.12.1.1
-11.12.1.2
------------
-11.14.2.2
-11.14.2.3
-+---+
-
-** Hadoop Comparator Class
-
-  Hadoop has a library class,
-  {{{../../api/org/apache/hadoop/mapreduce/lib/partition/KeyFieldBasedComparator.html}
-  KeyFieldBasedComparator}}, that is useful for many applications. This class
-  provides a subset of features provided by the Unix/GNU Sort. For example:
-
-+---+
-hadoop jar hadoop-streaming-${project.version}.jar \
-    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
-    -D stream.map.output.field.separator=. \
-    -D stream.num.map.output.key.fields=4 \
-    -D mapreduce.map.output.key.field.separator=. \
-    -D mapreduce.partition.keycomparator.options=-k2,2nr \
-    -D mapreduce.job.reduces=1 \
-    -input myInputDirs \
-    -output myOutputDir \
-    -mapper /bin/cat \
-    -reducer /bin/cat
-+---+
-
-  The map output keys of the above Map/Reduce job normally have four fields
-  separated by ".". However, the Map/Reduce framework will sort the outputs by
-  the second field of the keys using the
-  <-D mapreduce.partition.keycomparator.options=-k2,2nr> option. Here, <-n>
-  specifies that the sorting is numerical sorting and <-r> specifies that the
-  result should be reversed. A simple illustration is shown below:
-
-  Output of map (the keys)
-
-+---+
-11.12.1.2
-11.14.2.3
-11.11.4.1
-11.12.1.1
-11.14.2.2
-+---+
-
-  Sorting output for the reducer (where the second field is used for sorting)
-
-+---+
-11.14.2.3
-11.14.2.2
-11.12.1.2
-11.12.1.1
-11.11.4.1
-+---+
-
-** Hadoop Aggregate Package
-
-  Hadoop has a library package called
-  {{{../../org/apache/hadoop/mapred/lib/aggregate/package-summary.html}
-  Aggregate}}. Aggregate provides a special reducer class and a special
-  combiner class, and a list of simple aggregators that perform aggregations
-  such as "sum", "max", "min" and so on over a sequence of values. Aggregate
-  allows you to define a mapper plugin class that is expected to generate
-  "aggregatable items" for each input key/value pair of the mappers. The
-  combiner/reducer will aggregate those aggregatable items by invoking the
-  appropriate aggregators.
-
-  To use Aggregate, simply specify "-reducer aggregate":
-
-+---+
-hadoop jar hadoop-streaming-${project.version}.jar \
-    -input myInputDirs \
-    -output myOutputDir \
-    -mapper myAggregatorForKeyCount.py \
-    -reducer aggregate \
-    -file myAggregatorForKeyCount.py
-+---+
-
-  The python program myAggregatorForKeyCount.py looks like:
-
-+---+
-#!/usr/bin/python
-
-import sys;
-
-def generateLongCountToken(id):
-    return "LongValueSum:" + id + "\t" + "1"
-
-def main(argv):
-    line = sys.stdin.readline();
-    try:
-        while line:
-            line = line[:-1];
-            fields = line.split("\t");
-            print generateLongCountToken(fields[0]);
-            line = sys.stdin.readline();
-    except "end of file":
-        return None
-if __name__ == "__main__":
-     main(sys.argv)
-+---+
-
-** Hadoop Field Selection Class
-
-  Hadoop has a library class,
-  {{{../../api/org/apache/hadoop/mapred/lib/FieldSelectionMapReduce.html}
-  FieldSelectionMapReduce}}, that effectively allows you to process text data
-  like the unix "cut" utility. The map function defined in the class treats
-  each input key/value pair as a list of fields. You can specify the field
-  separator (the default is the tab character). You can select an arbitrary
-  list of fields as the map output key, and an arbitrary list of fields as the
-  map output value. Similarly, the reduce function defined in the class treats
-  each input key/value pair as a list of fields. You can select an arbitrary
-  list of fields as the reduce output key, and an arbitrary list of fields as
-  the reduce output value. For example:
-
-+---+
-hadoop jar hadoop-streaming-${project.version}.jar \
-    -D mapreduce.map.output.key.field.separator=. \
-    -D mapreduce.partition.keypartitioner.options=-k1,2 \
-    -D mapreduce.fieldsel.data.field.separator=. \
-    -D mapreduce.fieldsel.map.output.key.value.fields.spec=6,5,1-3:0- \
-    -D mapreduce.fieldsel.reduce.output.key.value.fields.spec=0-2:5- \
-    -D mapreduce.map.output.key.class=org.apache.hadoop.io.Text \
-    -D mapreduce.job.reduces=12 \
-    -input myInputDirs \
-    -output myOutputDir \
-    -mapper org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
-    -reducer org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
-    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-+---+
-
-  The option "-D
-  mapreduce.fieldsel.map.output.key.value.fields.spec=6,5,1-3:0-" specifies
-  key/value selection for the map outputs. Key selection spec and value
-  selection spec are separated by ":". In this case, the map output key will
-  consist of fields 6, 5, 1, 2, and 3. The map output value will consist of all
-  fields (0- means field 0 and all the subsequent fields).
-
-  The option "-D mapreduce.fieldsel.reduce.output.key.value.fields.spec=0-2:5-"
-  specifies key/value selection for the reduce outputs. In this case, the
-  reduce output key will consist of fields 0, 1, 2 (corresponding to the
-  original fields 6, 5, 1). The reduce output value will consist of all fields
-  starting from field 5 (corresponding to all the original fields).
-
-* Frequently Asked Questions
-
-** How do I use Hadoop Streaming to run an arbitrary set of (semi) independent
-   tasks?
-
-  Often you do not need the full power of Map Reduce, but only need to run
-  multiple instances of the same program - either on different parts of the
-  data, or on the same data, but with different parameters. You can use Hadoop
-  Streaming to do this.
-
-** How do I process files, one per map?
-
-  As an example, consider the problem of zipping (compressing) a set of files
-  across the hadoop cluster. You can achieve this by using Hadoop Streaming
-  and a custom mapper script:
-
-   * Generate a file containing the full HDFS path of the input files. Each map
-     task would get one file name as input.
-
-   * Create a mapper script which, given a filename, will get the file to local
-     disk, gzip the file and put it back in the desired output directory.
-
-** How many reducers should I use?
-
-  See MapReduce Tutorial for details: {{{./MapReduceTutorial.html#Reducer}
-  Reducer}}
-
-** If I set up an alias in my shell script, will that work after -mapper?
-
-  For example, say I do: alias c1='cut -f1'. Will -mapper "c1" work?
-
-  Using an alias will not work, but variable substitution is allowed as shown
-  in this example:
-
-+---+
-$ hdfs dfs -cat /user/me/samples/student_marks
-alice   50
-bruce   70
-charlie 80
-dan     75
-
-$ c2='cut -f2'; hadoop jar hadoop-streaming-${project.version}.jar \
-    -D mapreduce.job.name='Experiment' \
-    -input /user/me/samples/student_marks \
-    -output /user/me/samples/student_out \
-    -mapper "$c2" -reducer 'cat'
-
-$ hdfs dfs -cat /user/me/samples/student_out/part-00000
-50
-70
-75
-80
-+---+
-
-** Can I use UNIX pipes?
-
-  For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?
-
-  Currently this does not work and gives a "java.io.IOException: Broken pipe"
-  error. This is probably a bug that needs to be investigated.
-
-** What do I do if I get the "No space left on device" error?
-
-  For example, when I run a streaming job by distributing large executables
-  (for example, 3.6G) through the -file option, I get a "No space left on
-  device" error.
-
-  The jar packaging happens in a directory pointed to by the configuration
-  variable stream.tmpdir. The default value of stream.tmpdir is /tmp. Set the
-  value to a directory with more space:
-
-+---+
--D stream.tmpdir=/export/bigspace/...
-+---+
-
-** How do I specify multiple input directories?
-
-  You can specify multiple input directories with multiple '-input' options:
-
-+---+
-hadoop jar hadoop-streaming-${project.version}.jar \
-    -input '/user/foo/dir1' -input '/user/foo/dir2' \
-    (rest of the command)
-+---+
-
-** How do I generate output files with gzip format?
-
-  Instead of plain text files, you can generate gzip files as your output.
-  Pass '-D mapreduce.output.fileoutputformat.compress=true -D
-  mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec'
-  as options to your streaming job.
-
-** How do I provide my own input/output format with streaming?
-
-  You can specify your own custom classes by packaging them into a jar and
-  adding that jar to \$\{HADOOP_CLASSPATH\}.
-
-** How do I parse XML documents using streaming?
-
-  You can use the record reader StreamXmlRecordReader to process XML documents.
-
-+---+
-hadoop jar hadoop-streaming-${project.version}.jar \
-    -inputreader "StreamXmlRecord,begin=BEGIN_STRING,end=END_STRING" \
-    (rest of the command)
-+---+
-
-  Anything found between BEGIN_STRING and END_STRING would be treated as one
-  record for map tasks.
-
-** How do I update counters in streaming applications?
-
-  A streaming process can use stderr to emit counter information.
-  <<<reporter:counter:\<group\>,\<counter\>,\<amount\>>>> should be sent to
-  stderr to update the counter.
-
-** How do I update status in streaming applications?
-
-  A streaming process can use stderr to emit status information. To set a
-  status, <<<reporter:status:\<message\>>>> should be sent to stderr.
-
-** How do I get the Job variables in a streaming job's mapper/reducer?
-
-  See {{{./MapReduceTutorial.html#Configured_Parameters}
-  Configured Parameters}}. During the execution of a streaming job, the names
-  of the "mapred" parameters are transformed. The dots ( . ) become underscores
-  ( _ ). For example, mapreduce.job.id becomes mapreduce_job_id and
-  mapreduce.job.jar becomes mapreduce_job_jar. In your code, use the parameter
-  names with the underscores.

http://git-wip-us.apache.org/repos/asf/hadoop/blob/f7a724ca/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistCp.md.vm
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistCp.md.vm b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistCp.md.vm
deleted file mode 100644
index 447e515..0000000
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistCp.md.vm
+++ /dev/null
@@ -1,512 +0,0 @@
-<!---
-  Licensed under the Apache License, Version 2.0 (the "License");
-  you may not use this file except in compliance with the License.
-  You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing, software
-  distributed under the License is distributed on an "AS IS" BASIS,
-  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  See the License for the specific language governing permissions and
-  limitations under the License. See accompanying LICENSE file.
--->
-
-#set ( $H3 = '###' )
-
-DistCp Version2 Guide
-=====================
-
----
-
- - [Overview](#Overview)
- - [Usage](#Usage)
-     - [Basic Usage](#Basic_Usage)
-     - [Update and Overwrite](#Update_and_Overwrite)
- - [Command Line Options](#Command_Line_Options)
- - [Architecture of DistCp](#Architecture_of_DistCp)
-     - [DistCp Driver](#DistCp_Driver)
-     - [Copy-listing Generator](#Copy-listing_Generator)
-     - [InputFormats and MapReduce Components](#InputFormats_and_MapReduce_Components)
- - [Appendix](#Appendix)
-     - [Map sizing](#Map_sizing)
-     - [Copying Between Versions of HDFS](#Copying_Between_Versions_of_HDFS)
-     - [MapReduce and other side-effects](#MapReduce_and_other_side-effects)
-     - [SSL Configurations for HSFTP sources](#SSL_Configurations_for_HSFTP_sources)
- - [Frequently Asked Questions](#Frequently_Asked_Questions)
-
----
-
-Overview
---------
-
-  DistCp Version 2 (distributed copy) is a tool used for large
-  inter/intra-cluster copying. It uses MapReduce to effect its distribution,
-  error handling and recovery, and reporting. It expands a list of files and
-  directories into input to map tasks, each of which will copy a partition of
-  the files specified in the source list.
-
-  [The erstwhile implementation of DistCp]
-  (http://hadoop.apache.org/docs/r1.2.1/distcp.html) has its share of quirks
-  and drawbacks, both in its usage, as well as its extensibility and
-  performance. The purpose of the DistCp refactor was to fix these
-  shortcomings, enabling it to be used and extended programmatically. New
-  paradigms have been introduced to improve runtime and setup performance,
-  while simultaneously retaining the legacy behaviour as default.
-
-  This document aims to describe the design of the new DistCp, its spanking new
-  features, their optimal use, and any deviance from the legacy implementation.
-
-Usage
------
-
-$H3 Basic Usage
-
-  The most common invocation of DistCp is an inter-cluster copy:
-
-    bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
-    hdfs://nn2:8020/bar/foo
-
-  This will expand the namespace under `/foo/bar` on nn1 into a temporary file,
-  partition its contents among a set of map tasks, and start a copy on each
-  NodeManager from nn1 to nn2.
-
-  One can also specify multiple source directories on the command line:
-
-    bash$ hadoop distcp hdfs://nn1:8020/foo/a \
-    hdfs://nn1:8020/foo/b \
-    hdfs://nn2:8020/bar/foo
-
-  Or, equivalently, from a file using the -f option:
-
-    bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
-    hdfs://nn2:8020/bar/foo
-
-  Where `srclist` contains
-
-    hdfs://nn1:8020/foo/a
-    hdfs://nn1:8020/foo/b
-
-  When copying from multiple sources, DistCp will abort the copy with an error
-  message if two sources collide, but collisions at the destination are
-  resolved per the [options](#Command_Line_Options) specified. By default,
-  files already existing at the destination are skipped (i.e. not replaced by
-  the source file). A count of skipped files is reported at the end of each
-  job, but it may be inaccurate if a copier failed for some subset of its
-  files, but succeeded on a later attempt.
-
-  It is important that each NodeManager can reach and communicate with both the
-  source and destination file systems. For HDFS, both the source and
-  destination must be running the same version of the protocol or use a
-  backwards-compatible protocol; see [Copying Between Versions]
-  (#Copying_Between_Versions_of_HDFS).
-
-  After a copy, it is recommended that one generates and cross-checks a listing
-  of the source and destination to verify that the copy was truly successful.
-  Since DistCp employs both Map/Reduce and the FileSystem API, issues in or
-  between any of the three could adversely and silently affect the copy. Some
-  have had success running with `-update` enabled to perform a second pass, but
-  users should be acquainted with its semantics before attempting this.
-
-  It's also worth noting that if another client is still writing to a source
-  file, the copy will likely fail. Attempting to overwrite a file being written
-  at the destination should also fail on HDFS. If a source file is (re)moved
-  before it is copied, the copy will fail with a FileNotFoundException.
-
-  Please refer to the detailed Command Line Reference for information on all
-  the options available in DistCp.
-
-$H3 Update and Overwrite
-
-  `-update` is used to copy files from source that don't exist at the target
-  or differ from the target version. `-overwrite` overwrites target-files that
-  exist at the target.
-
-  The Update and Overwrite options warrant special attention since their
-  handling of source-paths varies from the defaults in a very subtle manner.
-  Consider a copy from `/source/first/` and `/source/second/` to `/target/`,
-  where the source paths have the following contents:
-
-    hdfs://nn1:8020/source/first/1
-    hdfs://nn1:8020/source/first/2
-    hdfs://nn1:8020/source/second/10
-    hdfs://nn1:8020/source/second/20
-
-  When DistCp is invoked without `-update` or `-overwrite`, the DistCp defaults
-  would create directories `first/` and `second/`, under `/target`. Thus:
-
-    distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
-
-  would yield the following contents in `/target`:
-
-    hdfs://nn2:8020/target/first/1
-    hdfs://nn2:8020/target/first/2
-    hdfs://nn2:8020/target/second/10
-    hdfs://nn2:8020/target/second/20
-
-  When either `-update` or `-overwrite` is specified, the **contents** of the
-  source-directories are copied to target, and not the source directories
-  themselves. Thus:
-
-    distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
-
-  would yield the following contents in `/target`:
-
-    hdfs://nn2:8020/target/1
-    hdfs://nn2:8020/target/2
-    hdfs://nn2:8020/target/10
-    hdfs://nn2:8020/target/20
-
-  By extension, if both source folders contained a file with the same name
-  (say, `0`), then both sources would map an entry to `/target/0` at the
-  destination. Rather than permit this conflict, DistCp will abort.
-
-  Now, consider the following copy operation:
-
-    distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
-
-  With sources/sizes:
-
-    hdfs://nn1:8020/source/first/1 32
-    hdfs://nn1:8020/source/first/2 32
-    hdfs://nn1:8020/source/second/10 64
-    hdfs://nn1:8020/source/second/20 32
-
-  And destination/sizes:
-
-    hdfs://nn2:8020/target/1 32
-    hdfs://nn2:8020/target/10 32
-    hdfs://nn2:8020/target/20 64
-
-  Will effect:
-
-    hdfs://nn2:8020/target/1 32
-    hdfs://nn2:8020/target/2 32
-    hdfs://nn2:8020/target/10 64
-    hdfs://nn2:8020/target/20 32
-
-  `1` is skipped because the file-length and contents match. `2` is copied
-  because it doesn't exist at the target. `10` and `20` are overwritten since
-  the contents don't match the source.
-
-  If `-overwrite` is used, `1` is overwritten as well.
-
-$H3 raw Namespace Extended Attribute Preservation
-
-  This section only applies to HDFS.
-
-  If the target and all of the source pathnames are in the /.reserved/raw
-  hierarchy, then 'raw' namespace extended attributes will be preserved.
-  'raw' xattrs are used by the system for internal functions such as encryption
-  meta data. They are only visible to users when accessed through the
-  /.reserved/raw hierarchy.
-
-  raw xattrs are preserved based solely on whether /.reserved/raw prefixes are
-  supplied. The -p (preserve, see below) flag does not impact preservation of
-  raw xattrs.
-
-  To prevent raw xattrs from being preserved, simply do not use the
-  /.reserved/raw prefix on any of the source and target paths.
-
-  If the /.reserved/raw prefix is specified on only a subset of the source and
-  target paths, an error will be displayed and a non-0 exit code returned.
-
-Command Line Options
---------------------
-
-Flag              | Description                          | Notes
------------------ | ------------------------------------ | --------
-`-p[rbugpcax]` | Preserve r: replication number b: block size u: user g: group p: permission c: checksum-type a: ACL x: XAttr | Modification times are not preserved. Also, when `-update` is specified, status updates will **not** be synchronized unless the file sizes also differ (i.e. unless the file is re-created). If -pa is specified, DistCp preserves the permissions also because ACLs are a super-set of permissions.
-`-i` | Ignore failures | As explained in the Appendix, this option will keep more accurate statistics about the copy than the default case. It also preserves logs from failed copies, which can be valuable for debugging. Finally, a failing map will not cause the job to fail before all splits are attempted.
-`-log <logdir>` | Write logs to \<logdir\> | DistCp keeps logs of each file it attempts to copy as map output. If a map fails, the log output will not be retained if it is re-executed.
-`-m <num_maps>` | Maximum number of simultaneous copies | Specify the number of maps to copy data. Note that more maps may not necessarily improve throughput.
-`-overwrite` | Overwrite destination | If a map fails and `-i` is not specified, all the files in the split, not only those that failed, will be recopied. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully.
-`-update` | Overwrite if source and destination differ in size, blocksize, or checksum | As noted in the preceding, this is not a "sync" operation. The criteria examined are the source and destination file sizes, blocksizes, and checksums; if they differ, the source file replaces the destination file. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully.
-`-f <urilist_uri>` | Use list at \<urilist_uri\> as src list | This is equivalent to listing each source on the command line. The `urilist_uri` list should be a fully qualified URI.
-`-filelimit <n>` | Limit the total number of files to be <= n | **Deprecated!** Ignored in the new DistCp.
-`-sizelimit <n>` | Limit the total size to be <= n bytes | **Deprecated!** Ignored in the new DistCp.
-`-delete` | Delete the files existing in the dst but not in src | The deletion is done by FS Shell, so the trash will be used, if it is enabled.
-`-strategy {dynamic|uniformsize}` | Choose the copy-strategy to be used in DistCp. | By default, uniformsize is used. (i.e. Maps are balanced on the total size of files copied by each map. Similar to legacy.) If "dynamic" is specified, `DynamicInputFormat` is used instead. (This is described in the Architecture section, under InputFormats.)
-`-bandwidth` | Specify bandwidth per map, in MB/second. | Each map will be restricted to consume only the specified bandwidth. This is not always exact. The map throttles back its bandwidth consumption during a copy, such that the **net** bandwidth used tends towards the specified value.
-`-atomic {-tmp <tmp_dir>}` | Specify atomic commit, with optional tmp directory. | `-atomic` instructs DistCp to copy the source data to a temporary target location, and then move the temporary target to the final-location atomically. Data will either be available at final target in a complete and consistent form, or not at all. Optionally, `-tmp` may be used to specify the location of the tmp-target. If not specified, a default is chosen. **Note:** tmp_dir must be on the final target cluster.
-`-mapredSslConf <ssl_conf_file>` | Specify SSL Config file, to be used with HSFTP source | When using the hsftp protocol with a source, the security-related properties may be specified in a config-file and passed to DistCp. \<ssl_conf_file\> needs to be in the classpath.
-`-async` | Run DistCp asynchronously. Quits as soon as the Hadoop Job is launched. | The Hadoop Job-id is logged, for tracking.
-
-Architecture of DistCp
-----------------------
-
-  The components of the new DistCp may be classified into the following
-  categories:
-
-  * DistCp Driver
-  * Copy-listing generator
-  * Input-formats and Map-Reduce components
-
-$H3 DistCp Driver
-
-  The DistCp Driver components are responsible for:
-
-  * Parsing the arguments passed to the DistCp command on the command-line,
-    via:
-
-     * OptionsParser, and
-     * DistCpOptionsSwitch
-
-  * Assembling the command arguments into an appropriate DistCpOptions object,
-    and initializing DistCp. These arguments include:
-
-     * Source-paths
-     * Target location
-     * Copy options (e.g. whether to update-copy, overwrite, which
-       file-attributes to preserve, etc.)
-
-  * Orchestrating the copy operation by:
-
-     * Invoking the copy-listing-generator to create the list of files to be
-       copied.
-     * Setting up and launching the Hadoop Map-Reduce Job to carry out the
-       copy.
-     * Based on the options, either returning a handle to the Hadoop MR Job
-       immediately, or waiting till completion.
-
-  The parser-elements are exercised only from the command-line (or if
-  DistCp::run() is invoked). The DistCp class may also be used
-  programmatically, by constructing the DistCpOptions object, and initializing
-  a DistCp object appropriately.
-
-$H3 Copy-listing Generator
-
-  The copy-listing-generator classes are responsible for creating the list of
-  files/directories to be copied from source. They examine the contents of the
-  source-paths (files/directories, including wild-cards), and record all paths
-  that need copy into a SequenceFile, for consumption by the DistCp Hadoop
-  Job. The main classes in this module include:
-
-  1. CopyListing: The interface that should be implemented by any
-     copy-listing-generator implementation. Also provides the factory method by
-     which the concrete CopyListing implementation is chosen.
-  2. SimpleCopyListing: An implementation of CopyListing that accepts multiple
-     source paths (files/directories), and recursively lists all the individual
-     files and directories under each, for copy.
-  3. GlobbedCopyListing: Another implementation of CopyListing that expands
-     wild-cards in the source paths.
-  4. FileBasedCopyListing: An implementation of CopyListing that reads the
-     source-path list from a specified file.
-
-  Based on whether a source-file-list is specified in the DistCpOptions, the
-  source-listing is generated in one of the following ways:
-
-  1. If there's no source-file-list, the GlobbedCopyListing is used. All
-     wild-cards are expanded, and all the expansions are forwarded to the
-     SimpleCopyListing, which in turn constructs the listing (via recursive
-     descent of each path).
-  2. If a source-file-list is specified, the FileBasedCopyListing is used.
-     Source-paths are read from the specified file, and then forwarded to the
-     GlobbedCopyListing. The listing is then constructed as described above.
-
-  One may customize the method by which the copy-listing is constructed by
-  providing a custom implementation of the CopyListing interface. The behaviour
-  of DistCp differs here from the legacy DistCp, in how paths are considered
-  for copy.
-
-  The legacy implementation only lists those paths that must definitely be
-  copied on to target. E.g. if a file already exists at the target (and
-  `-overwrite` isn't specified), the file isn't even considered in the
-  MapReduce Copy Job. Determining this during setup (i.e. before the MapReduce
-  Job) involves file-size and checksum-comparisons that are potentially
-  time-consuming.
-
-  The new DistCp postpones such checks until the MapReduce Job, thus reducing
-  setup time. Performance is enhanced further since these checks are
-  parallelized across multiple maps.
-
-$H3 InputFormats and MapReduce Components
-
-  The InputFormats and MapReduce components are responsible for the actual copy
-  of files and directories from the source to the destination path. The
-  listing-file created during copy-listing generation is consumed at this
-  point, when the copy is carried out. The classes of interest here include:
-
-  * **UniformSizeInputFormat:**
-    This implementation of org.apache.hadoop.mapreduce.InputFormat provides
-    equivalence with Legacy DistCp in balancing load across maps. The aim of
-    the UniformSizeInputFormat is to make each map copy roughly the same number
-    of bytes. Accordingly, the listing file is split into groups of paths,
-    such that the sum of the file-sizes in each InputSplit is nearly equal to
-    that of every other split. The splitting isn't always perfect, but its
-    trivial implementation keeps the setup-time low.
-
-  * **DynamicInputFormat and DynamicRecordReader:**
-    The DynamicInputFormat implements org.apache.hadoop.mapreduce.InputFormat,
-    and is new to DistCp. The listing-file is split into several "chunk-files",
-    the exact number of chunk-files being a multiple of the number of maps
-    requested for in the Hadoop Job. Each map task is "assigned" one of the
-    chunk-files (by renaming the chunk to the task's id), before the Job is
-    launched.
-    Paths are read from each chunk using the DynamicRecordReader, and
-    processed in the CopyMapper. After all the paths in a chunk are processed,
-    the current chunk is deleted and a new chunk is acquired. The process
-    continues until no more chunks are available.
-    This "dynamic" approach allows faster map-tasks to consume more paths than
-    slower ones, thus speeding up the DistCp job overall.
-
-  * **CopyMapper:**
-    This class implements the physical file-copy. The input-paths are checked
-    against the input-options (specified in the Job's Configuration), to
-    determine whether a file needs copy. A file will be copied only if at least
-    one of the following is true:
-
-     * A file with the same name doesn't exist at target.
-     * A file with the same name exists at target, but has a different file
-       size.
-     * A file with the same name exists at target, but has a different
-       checksum, and `-skipcrccheck` isn't mentioned.
-     * A file with the same name exists at target, but `-overwrite` is
-       specified.
-     * A file with the same name exists at target, but differs in block-size
-       (and block-size needs to be preserved).
-
-  * **CopyCommitter:** This class is responsible for the commit-phase of the
-    DistCp job, including:
-
-     * Preservation of directory-permissions (if specified in the options)
-     * Clean-up of temporary-files, work-directories, etc.
-
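-  The InputFormat to use is selected via the `-strategy` option described in
-  the Command Line Options section; without it, the UniformSizeInputFormat is
-  used by default. A sketch, with illustrative paths:
-
-    bash$ hadoop distcp -strategy dynamic hdfs://nn1:8020/source \
-    hdfs://nn2:8020/target
-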
-Appendix
---------
-
-$H3 Map sizing
-
-  By default, DistCp makes an attempt to size each map comparably so that each
-  copies roughly the same number of bytes. Note that files are the finest level
-  of granularity, so increasing the number of simultaneous copiers (i.e. maps)
-  may not always increase the number of simultaneous copies nor the overall
-  throughput.
-
-  The new DistCp also provides a strategy to "dynamically" size maps, allowing
-  faster data-nodes to copy more bytes than slower nodes. Using `-strategy
-  dynamic` (explained in the Architecture), rather than assigning a fixed set
-  of source-files to each map-task, files are instead split into several sets.
-  The number of sets exceeds the number of maps, usually by a factor of 2-3.
-  Each map picks up and copies all files listed in a chunk. When a chunk is
-  exhausted, a new chunk is acquired and processed, until no more chunks
-  remain.
-
-  By not assigning a source-path to a fixed map, faster map-tasks (i.e.
-  data-nodes) are able to consume more chunks, and thus copy more data, than
-  slower nodes. While this distribution isn't uniform, it is fair with regard
-  to each mapper's capacity.
-
-  The dynamic-strategy is implemented by the DynamicInputFormat. It provides
-  superior performance under most conditions.
-
-  Tuning the number of maps to the size of the source and destination clusters,
-  the size of the copy, and the available bandwidth is recommended for
-  long-running and regularly run jobs.
-
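-  For example, a regularly run job might cap both the number of maps and the
-  per-map bandwidth (the values and paths below are illustrative; `-m` and the
-  `-bandwidth` option, if present in this build, are described in the Command
-  Line Options section):
-
-    bash$ hadoop distcp -m 20 -bandwidth 10 hdfs://nn1:8020/source \
-    hdfs://nn2:8020/target
-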
-$H3 Copying Between Versions of HDFS
-
-  For copying between two different versions of Hadoop, one will usually use
-  HftpFileSystem. This is a read-only FileSystem, so DistCp must be run on the
-  destination cluster (more specifically, on NodeManagers that can write to the
-  destination cluster). Each source is specified as
-  `hftp://<dfs.http.address>/<path>` (the default `dfs.http.address` is
-  `<namenode>:50070`).
-
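-  For example, to pull data from an older cluster whose NameNode web address
-  is nn1:50070 into a newer cluster (paths illustrative):
-
-    bash$ hadoop distcp hftp://nn1:50070/source hdfs://nn2:8020/target
-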
-$H3 MapReduce and other side-effects
-
-  As mentioned in the preceding sections, should a map fail to copy one of its
-  inputs, there are several side-effects worth noting.
-
-  * Unless `-overwrite` is specified, files successfully copied by a previous
-    map on a re-execution will be marked as "skipped".
-  * If a map fails `mapreduce.map.maxattempts` times, the remaining map tasks
-    will be killed (unless `-i` is set).
-  * If `mapreduce.map.speculative` is set final and true, the result of the
-    copy is undefined.
-
-$H3 SSL Configurations for HSFTP sources
-
-  To use an HSFTP source (i.e. using the hsftp protocol), an SSL configuration
-  file needs to be specified (via the `-mapredSslConf` option). This must
-  specify 3 parameters:
-
-  * `ssl.client.truststore.location`: The local-filesystem location of the
-    trust-store file, containing the certificate for the NameNode.
-  * `ssl.client.truststore.type`: (Optional) The format of the trust-store
-    file.
-  * `ssl.client.truststore.password`: (Optional) Password for the trust-store
-    file.
-
-  The following is an example of the contents of an SSL configuration file:
-
-    <configuration>
-      <property>
-        <name>ssl.client.truststore.location</name>
-        <value>/work/keystore.jks</value>
-        <description>Truststore to be used by clients like distcp. Must be specified.</description>
-      </property>
-
-      <property>
-        <name>ssl.client.truststore.password</name>
-        <value>changeme</value>
-        <description>Optional. Default value is "".</description>
-      </property>
-
-      <property>
-        <name>ssl.client.truststore.type</name>
-        <value>jks</value>
-        <description>Optional. Default value is "jks".</description>
-      </property>
-    </configuration>
-
-  The SSL configuration file must be in the class-path of the DistCp program.
-
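-  A copy from an HSFTP source might then look roughly as follows (the
-  ssl-client.xml file name, host, and port are illustrative; 50470 assumes the
-  default HTTPS port of the source NameNode):
-
-    bash$ hadoop distcp -mapredSslConf ssl-client.xml \
-    hsftp://nn1:50470/source hdfs://nn2:8020/target
-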
-Frequently Asked Questions
---------------------------
-
-  1. **Why does -update not create the parent source-directory under a pre-existing target directory?**
-     The behaviour of `-update` and `-overwrite` is described in detail in the
-     Usage section of this document. In short, if either option is used with a
-     pre-existing destination directory, the **contents** of each source
-     directory are copied over, rather than the source-directory itself. This
-     behaviour is consistent with the legacy DistCp implementation as well.
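-
-     For instance, with a pre-existing /target directory, the following copies
-     the files under /source/first directly into /target (e.g. as /target/1),
-     rather than creating /target/first (paths are illustrative):
-
-         bash$ hadoop distcp -update hdfs://nn1:8020/source/first \
-         hdfs://nn2:8020/target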
-
-  2. **How does the new DistCp differ in semantics from the Legacy DistCp?**
-
-     * With Legacy DistCp, files that were skipped during copy also had their
-       file-attributes (permissions, owner/group info, etc.) left unchanged.
-       These attributes are now updated, even if the file-copy itself is
-       skipped.
-     * Empty root directories among the source-path inputs were not created at
-       the target, in Legacy DistCp. These are now created.
-
-  3. **Why does the new DistCp use more maps than legacy DistCp?**
-     Legacy DistCp works by figuring out what files actually need to be copied
-     to target before the copy-job is launched, and then launching as many maps
-     as required for copy. So if a majority of the files need to be skipped
-     (because they already exist, for example), fewer maps will be needed. As a
-     consequence, the time spent in setup (i.e. before the M/R job) is higher.
-     The new DistCp only lists the contents of the source-paths. It
-     doesn't try to filter out what files can be skipped. That decision is put
-     off till the M/R job runs. This is much faster (vis-a-vis execution-time),
-     but the number of maps launched will be as specified in the `-m` option,
-     or 20 (default) if unspecified.
-
-  4. **Why does DistCp not run faster when more maps are specified?**
-     At present, the smallest unit of work for DistCp is a file. i.e., a file
-     is processed by only one map. Increasing the number of maps to a value
-     exceeding the number of files would yield no performance benefit. The
-     number of maps launched would equal the number of files.
-
-  5. **Why does DistCp run out of memory?**
-     If the number of individual files/directories being copied from the source
-     path(s) is extremely large (e.g. 1,000,000 paths), DistCp might run out of
-     memory while determining the list of paths for copy. This is not unique to
-     the new DistCp implementation.
-     To get around this, consider changing the `-Xmx` JVM heap-size parameters,
-     as follows:
-
-         bash$ export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m"
-         bash$ hadoop distcp /source /target

http://git-wip-us.apache.org/repos/asf/hadoop/blob/f7a724ca/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
deleted file mode 100644
index be557a7..0000000
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
+++ /dev/null
@@ -1,162 +0,0 @@
-<!---
-  Licensed under the Apache License, Version 2.0 (the "License");
-  you may not use this file except in compliance with the License.
-  You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing, software
-  distributed under the License is distributed on an "AS IS" BASIS,
-  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  See the License for the specific language governing permissions and
-  limitations under the License. See accompanying LICENSE file.
--->
-
-#set ( $H3 = '###' )
-
-Hadoop Archives Guide
-=====================
-
- - [Overview](#Overview)
- - [How to Create an Archive](#How_to_Create_an_Archive)
- - [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
- - [How to Unarchive an Archive](#How_to_Unarchive_an_Archive)
- - [Archives Examples](#Archives_Examples)
-     - [Creating an Archive](#Creating_an_Archive)
-     - [Looking Up Files](#Looking_Up_Files)
- - [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)
-
-Overview
---------
-
-  Hadoop archives are special format archives. A Hadoop archive maps to a file
-  system directory. A Hadoop archive always has a \*.har extension. A Hadoop
-  archive directory contains metadata (in the form of _index and _masterindex)
-  and data (part-\*) files. The _index file contains the name of the files that
-  are part of the archive and the location within the part files.
-
-How to Create an Archive
-------------------------
-
-  `Usage: hadoop archive -archiveName name -p <parent> [-r <replication factor>] <src>* <dest>`
-
-  -archiveName is the name of the archive you would like to create. An example
-  would be foo.har. The name should have a \*.har extension. The parent
-  argument specifies the relative path to which the files should be archived.
-  For example:
-
-  `-p /foo/bar a/b/c e/f/g`
-
-  Here /foo/bar is the parent path and a/b/c, e/f/g are relative paths to
-  the parent. Note that it is a Map/Reduce job that creates the archives, so
-  you need a Map/Reduce cluster to run this command. For a detailed example,
-  see the later sections.
-
-  -r indicates the desired replication factor; if this optional argument is
-  not specified, a replication factor of 10 will be used.
-
-  If you just want to archive a single directory /foo/bar then you can just use
-
-  `hadoop archive -archiveName zoo.har -p /foo/bar -r 3 /outputdir`
-
-  If you specify source files that are in an encryption zone, they will be
-  decrypted and written into the archive. If the har file is not located in an
-  encryption zone, then they will be stored in clear (decrypted) form. If the
-  har file is located in an encryption zone, they will be stored in encrypted
-  form.
-
-How to Look Up Files in Archives
---------------------------------
-
-  The archive exposes itself as a file system layer, so all of the fs shell
-  commands work on archives, but with a different URI. Also note that archives
-  are immutable, so renames, deletes, and creates return an error. The URI for
-  Hadoop Archives is
-
-  `har://scheme-hostname:port/archivepath/fileinarchive`
-
-  If no scheme is provided, the underlying filesystem is assumed. In that case
-  the URI would look like
-
-  `har:///archivepath/fileinarchive`
-
-How to Unarchive an Archive
----------------------------
-
-  Since all the fs shell commands in the archives work transparently,
-  unarchiving is just a matter of copying.
-
-  To unarchive sequentially:
-
-  `hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir`
-
-  To unarchive in parallel, use DistCp:
-
-  `hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir`
-
-Archives Examples
------------------
-
-$H3 Creating an Archive
-
-  `hadoop archive -archiveName foo.har -p /user/hadoop -r 3 dir1 dir2 /user/zoo`
-
-  The above example creates an archive using /user/hadoop as the relative
-  archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2
-  will be archived in the following file system directory -- /user/zoo/foo.har.
-  Archiving does not delete the input files. If you want to delete the input
-  files after creating the archives (to reduce namespace), you will have to do
-  it on your own. In this example, because `-r 3` is specified, a replication
-  factor of 3 will be used.
-
-$H3 Looking Up Files
-
-  Looking up files in hadoop archives is as easy as doing an ls on the
-  filesystem. After you have archived the directories /user/hadoop/dir1 and
-  /user/hadoop/dir2 as in the example above, to see all the files in the
-  archives you can just run:
-
-  `hdfs dfs -ls -R har:///user/zoo/foo.har/`
-
-  To understand the significance of the -p argument, let's go through the
-  above example again. If you do a plain ls (not a recursive -ls -R) on the
-  hadoop archive using
-
-  `hdfs dfs -ls har:///user/zoo/foo.har`
-
-  The output should be:
-
-```
-har:///user/zoo/foo.har/dir1
-har:///user/zoo/foo.har/dir2
-```
-
-  As you may recall, the archive was created with the following command:
-
-  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
-
-  If we were to change the command to:
-
-  `hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`
-
-  then an ls on the hadoop archive using
-
-  `hdfs dfs -ls har:///user/zoo/foo.har`
-
-  would give you
-
-```
-har:///user/zoo/foo.har/hadoop/dir1
-har:///user/zoo/foo.har/hadoop/dir2
-```
-
-  Notice that the archived files have been archived relative to /user/ rather
-  than /user/hadoop.
-
-Hadoop Archives and MapReduce
------------------------------
-
-  Using Hadoop Archives in MapReduce is as easy as specifying a different input
-  filesystem than the default file system. If you have a hadoop archive stored
-  in HDFS at /user/zoo/foo.har, then to use this archive as MapReduce input,
-  all you need to do is specify the input directory as har:///user/zoo/foo.har.
-  Since a Hadoop Archive is exposed as a file system, MapReduce is able to use
-  all the logical input files in the archive as input.

http://git-wip-us.apache.org/repos/asf/hadoop/blob/f7a724ca/hadoop-project/src/site/site.xml
----------------------------------------------------------------------
diff --git a/hadoop-project/src/site/site.xml b/hadoop-project/src/site/site.xml
index 6aa226c..68c0cff 100644
--- a/hadoop-project/src/site/site.xml
+++ b/hadoop-project/src/site/site.xml
@@ -105,11 +105,6 @@
       <item name="Encrypted Shuffle" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html"/>
       <item name="Pluggable Shuffle/Sort" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html"/>
       <item name="Distributed Cache Deploy" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html"/>
-      <item name="Hadoop Streaming" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html"/>
-      <item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
-      <item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
-      <item name="GridMix" href="hadoop-gridmix/GridMix.html"/>
-      <item name="Rumen" href="hadoop-rumen/Rumen.html"/>
     </menu>
 
     <menu name="MapReduce REST APIs" inherit="top">
@@ -128,7 +123,6 @@
       <item name="YARN Timeline Server" href="hadoop-yarn/hadoop-yarn-site/TimelineServer.html"/>
       <item name="Writing YARN Applications" href="hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html"/>
       <item name="YARN Commands" href="hadoop-yarn/hadoop-yarn-site/YarnCommands.html"/>
-      <item name="Scheduler Load Simulator" href="hadoop-sls/SchedulerLoadSimulator.html"/>
       <item name="NodeManager Restart" href="hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html"/>
       <item name="DockerContainerExecutor" href="hadoop-yarn/hadoop-yarn-site/DockerContainerExecutor.html"/>
       <item name="Using CGroups" href="hadoop-yarn/hadoop-yarn-site/NodeManagerCGroups.html"/>
@@ -154,7 +148,16 @@
       <item name="Configuration" href="hadoop-auth/Configuration.html"/>
       <item name="Building" href="hadoop-auth/BuildingIt.html"/>
     </menu>
-    
+
+    <menu name="Tools" inherit="top">
+      <item name="Hadoop Streaming" href="hadoop-streaming/HadoopStreaming.html"/>
+      <item name="Hadoop Archives" href="hadoop-archives/HadoopArchives.html"/>
+      <item name="DistCp" href="hadoop-distcp/DistCp.html"/>
+      <item name="GridMix" href="hadoop-gridmix/GridMix.html"/>
+      <item name="Rumen" href="hadoop-rumen/Rumen.html"/>
+      <item name="Scheduler Load Simulator" href="hadoop-sls/SchedulerLoadSimulator.html"/>
+    </menu>
+
     <menu name="Reference" inherit="top">
       <item name="Release Notes" href="hadoop-project-dist/hadoop-common/releasenotes.html"/>
       <item name="API docs" href="api/index.html"/>

http://git-wip-us.apache.org/repos/asf/hadoop/blob/f7a724ca/hadoop-tools/hadoop-archives/src/site/markdown/HadoopArchives.md.vm
----------------------------------------------------------------------
diff --git a/hadoop-tools/hadoop-archives/src/site/markdown/HadoopArchives.md.vm b/hadoop-tools/hadoop-archives/src/site/markdown/HadoopArchives.md.vm
new file mode 100644
index 0000000..be557a7
--- /dev/null
+++ b/hadoop-tools/hadoop-archives/src/site/markdown/HadoopArchives.md.vm
@@ -0,0 +1,162 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+#set ( $H3 = '###' )
+
+Hadoop Archives Guide
+=====================
+
+ - [Overview](#Overview)
+ - [How to Create an Archive](#How_to_Create_an_Archive)
+ - [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
+ - [How to Unarchive an Archive](#How_to_Unarchive_an_Archive)
+ - [Archives Examples](#Archives_Examples)
+     - [Creating an Archive](#Creating_an_Archive)
+     - [Looking Up Files](#Looking_Up_Files)
+ - [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)
+
+Overview
+--------
+
+  Hadoop archives are special format archives. A Hadoop archive maps to a file
+  system directory. A Hadoop archive always has a \*.har extension. A Hadoop
+  archive directory contains metadata (in the form of _index and _masterindex)
+  and data (part-\*) files. The _index file contains the name of the files that
+  are part of the archive and the location within the part files.
+
+How to Create an Archive
+------------------------
+
+  `Usage: hadoop archive -archiveName name -p <parent> [-r <replication factor>] <src>* <dest>`
+
+  -archiveName is the name of the archive you would like to create. An example
+  would be foo.har. The name should have a \*.har extension. The parent
+  argument specifies the relative path to which the files should be archived.
+  For example:
+
+  `-p /foo/bar a/b/c e/f/g`
+
+  Here /foo/bar is the parent path and a/b/c, e/f/g are relative paths to
+  the parent. Note that it is a Map/Reduce job that creates the archives, so
+  you need a Map/Reduce cluster to run this command. For a detailed example,
+  see the later sections.
+
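+  Putting the pieces above together, a full (illustrative) command that
+  archives the two relative paths under /foo/bar into /outputdir would be:
+
+  `hadoop archive -archiveName foo.har -p /foo/bar a/b/c e/f/g /outputdir`
+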
+  -r indicates the desired replication factor; if this optional argument is
+  not specified, a replication factor of 10 will be used.
+
+  If you just want to archive a single directory /foo/bar then you can just use
+
+  `hadoop archive -archiveName zoo.har -p /foo/bar -r 3 /outputdir`
+
+  If you specify source files that are in an encryption zone, they will be
+  decrypted and written into the archive. If the har file is not located in an
+  encryption zone, then they will be stored in clear (decrypted) form. If the
+  har file is located in an encryption zone, they will be stored in encrypted
+  form.
+
+How to Look Up Files in Archives
+--------------------------------
+
+  The archive exposes itself as a file system layer, so all of the fs shell
+  commands work on archives, but with a different URI. Also note that archives
+  are immutable, so renames, deletes, and creates return an error. The URI for
+  Hadoop Archives is
+
+  `har://scheme-hostname:port/archivepath/fileinarchive`
+
+  If no scheme is provided, the underlying filesystem is assumed. In that case
+  the URI would look like
+
+  `har:///archivepath/fileinarchive`
+
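+  For example, to list a directory dir1 inside the archive /user/zoo/foo.har
+  stored on HDFS, either of the following forms could be used (namenode below
+  is an illustrative NameNode hostname, and hdfs is the underlying scheme):
+
+  `hdfs dfs -ls har:///user/zoo/foo.har/dir1`
+
+  `hdfs dfs -ls har://hdfs-namenode:8020/user/zoo/foo.har/dir1`
+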
+How to Unarchive an Archive
+---------------------------
+
+  Since all the fs shell commands in the archives work transparently,
+  unarchiving is just a matter of copying.
+
+  To unarchive sequentially:
+
+  `hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir`
+
+  To unarchive in parallel, use DistCp:
+
+  `hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir`
+
+Archives Examples
+-----------------
+
+$H3 Creating an Archive
+
+  `hadoop archive -archiveName foo.har -p /user/hadoop -r 3 dir1 dir2 /user/zoo`
+
+  The above example creates an archive using /user/hadoop as the relative
+  archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2
+  will be archived in the following file system directory -- /user/zoo/foo.har.
+  Archiving does not delete the input files. If you want to delete the input
+  files after creating the archives (to reduce namespace), you will have to do
+  it on your own. In this example, because `-r 3` is specified, a replication
+  factor of 3 will be used.
+
+$H3 Looking Up Files
+
+  Looking up files in hadoop archives is as easy as doing an ls on the
+  filesystem. After you have archived the directories /user/hadoop/dir1 and
+  /user/hadoop/dir2 as in the example above, to see all the files in the
+  archives you can just run:
+
+  `hdfs dfs -ls -R har:///user/zoo/foo.har/`
+
+  To understand the significance of the -p argument, let's go through the
+  above example again. If you do a plain ls (not a recursive -ls -R) on the
+  hadoop archive using
+
+  `hdfs dfs -ls har:///user/zoo/foo.har`
+
+  The output should be:
+
+```
+har:///user/zoo/foo.har/dir1
+har:///user/zoo/foo.har/dir2
+```
+
+  As you may recall, the archive was created with the following command:
+
+  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
+
+  If we were to change the command to:
+
+  `hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`
+
+  then an ls on the hadoop archive using
+
+  `hdfs dfs -ls har:///user/zoo/foo.har`
+
+  would give you
+
+```
+har:///user/zoo/foo.har/hadoop/dir1
+har:///user/zoo/foo.har/hadoop/dir2
+```
+
+  Notice that the archived files have been archived relative to /user/ rather
+  than /user/hadoop.
+
+Hadoop Archives and MapReduce
+-----------------------------
+
+  Using Hadoop Archives in MapReduce is as easy as specifying a different input
+  filesystem than the default file system. If you have a hadoop archive stored
+  in HDFS at /user/zoo/foo.har, then to use this archive as MapReduce input,
+  all you need to do is specify the input directory as har:///user/zoo/foo.har.
+  Since a Hadoop Archive is exposed as a file system, MapReduce is able to use
+  all the logical input files in the archive as input.
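+
+  As a sketch, running the wordcount example against a directory inside the
+  archive could look like the following (the examples jar name and the output
+  path are illustrative):
+
+  `hadoop jar hadoop-mapreduce-examples.jar wordcount har:///user/zoo/foo.har/dir1 /user/zoo/wc-out`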

http://git-wip-us.apache.org/repos/asf/hadoop/blob/f7a724ca/hadoop-tools/hadoop-archives/src/site/resources/css/site.css
----------------------------------------------------------------------
diff --git a/hadoop-tools/hadoop-archives/src/site/resources/css/site.css b/hadoop-tools/hadoop-archives/src/site/resources/css/site.css
new file mode 100644
index 0000000..f830baa
--- /dev/null
+++ b/hadoop-tools/hadoop-archives/src/site/resources/css/site.css
@@ -0,0 +1,30 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+#banner {
+  height: 93px;
+  background: none;
+}
+
+#bannerLeft img {
+  margin-left: 30px;
+  margin-top: 10px;
+}
+
+#bannerRight img {
+  margin: 17px;
+}
+