Posted to common-commits@hadoop.apache.org by st...@apache.org on 2019/03/26 18:43:59 UTC

[hadoop] branch trunk updated: HADOOP-16037. DistCp: Document usage of Sync (-diff option) in detail.

This is an automated email from the ASF dual-hosted git repository.

stevel pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/hadoop.git


The following commit(s) were added to refs/heads/trunk by this push:
     new ce4bafd  HADOOP-16037. DistCp: Document usage of Sync (-diff option) in detail.
ce4bafd is described below

commit ce4bafdf442c004b6deb25eaa2fa7e947b8ad269
Author: Siyao Meng <sm...@cloudera.com>
AuthorDate: Tue Mar 26 18:42:54 2019 +0000

    HADOOP-16037. DistCp: Document usage of Sync (-diff option) in detail.
    
    Contributed by Siyao Meng
---
 .../hadoop-distcp/src/site/markdown/DistCp.md.vm   | 120 +++++++++++++++++++++
 1 file changed, 120 insertions(+)

diff --git a/hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm b/hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm
index 25ea7e2..3b7737b 100644
--- a/hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm
+++ b/hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm
@@ -13,6 +13,7 @@
 -->
 
 #set ( $H3 = '###' )
+#set ( $H4 = '####' )
 
 DistCp Guide
 =====================
@@ -23,6 +24,7 @@ DistCp Guide
  - [Usage](#Usage)
      - [Basic Usage](#Basic_Usage)
      - [Update and Overwrite](#Update_and_Overwrite)
+     - [Sync](#Sync)
  - [Command Line Options](#Command_Line_Options)
  - [Architecture of DistCp](#Architecture_of_DistCp)
      - [DistCp Driver](#DistCp_Driver)
@@ -192,6 +194,124 @@ $H3 Update and Overwrite
 
   If `-overwrite` is used, `1` is overwritten as well.
 
+$H3 Sync
+
+  The `-diff` option syncs files from a source cluster to a target cluster
+  using a snapshot diff. It copies, renames and removes files in the diff list.
+
+  `-update` option must be included when `-diff` option is in use.
+
+  Most cloud providers don't work well with sync at the moment.
+
+  Usage:
+
+    hadoop distcp -update -diff <from_snapshot> <to_snapshot> <source> <destination>
+
+  Example:
+
+    hadoop distcp -update -diff snap1 snap2 /src/ /dst/
+
+  The command above applies changes from snapshot `snap1` to `snap2`
+  (i.e. the snapshot diff from `snap1` to `snap2`) in `/src/` to `/dst/`.
+  This requires `/src/` to have both snapshots `snap1` and `snap2`.
+  The destination `/dst/` must also have a snapshot with the same
+  name as `<from_snapshot>`, in this case `snap1`. The destination `/dst/`
+  must not have had any file operations (create, rename, delete) since `snap1`.
+  Note that when this command finishes, a new snapshot `snap2` will NOT be
+  created at `/dst/`.
+
+  `-update` is required to use `-diff` option.
+
+  For instance, in `/src/`, if `1.txt` is added and `2.txt` is deleted after
+  the creation of `snap1` and before the creation of `snap2`, the command above
+  will copy `1.txt` from `/src/` to `/dst/` and delete `2.txt` from `/dst/`.
+
+  The experiments below illustrate sync behavior in more detail.
+
+$H4 Experiment 1: Syncing diff of two adjacent snapshots
+
+  First, set up the source and destination directories and their snapshots.
+
+    # Create source and destination directories
+    hdfs dfs -mkdir /src/ /dst/
+    # Allow snapshot on source
+    hdfs dfsadmin -allowSnapshot /src/
+    # Create a snapshot (empty one)
+    hdfs dfs -createSnapshot /src/ snap1
+    # Allow snapshot on destination
+    hdfs dfsadmin -allowSnapshot /dst/
+    # Create a from_snapshot with the same name
+    hdfs dfs -createSnapshot /dst/ snap1
+
+    # Put one text file under /src/
+    echo "This is the 1st text file." > 1.txt
+    hdfs dfs -put 1.txt /src/
+    # Create the second snapshot
+    hdfs dfs -createSnapshot /src/ snap2
+
+    # Put another text file under /src/
+    echo "This is the 2nd text file." > 2.txt
+    hdfs dfs -put 2.txt /src/
+    # Create the third snapshot
+    hdfs dfs -createSnapshot /src/ snap3
+
+  Then we run distcp sync:
+
+    hadoop distcp -update -diff snap1 snap2 /src/ /dst/
+
+  The command above should succeed. `1.txt` will be copied from `/src/` to
+  `/dst/`. Again, the `-update` option is required.
+
+  If we run the same command again, we will get a `DistCp sync failed` exception
+  because the destination has added a new file `1.txt` since `snap1`. However,
+  if we remove `1.txt` manually from `/dst/` and run the sync again, the
+  command will succeed.
+
+$H4 Experiment 2: Syncing diff of two non-adjacent snapshots
+
+  First, clean up after Experiment 1.
+
+    hdfs dfs -rm -skipTrash /dst/1.txt
+
+  Run the sync command. Note that `<to_snapshot>` has been changed from `snap2`
+  in Experiment 1 to `snap3`.
+
+    hadoop distcp -update -diff snap1 snap3 /src/ /dst/
+
+  Both `1.txt` and `2.txt` will be copied to `/dst/`.
+
+$H4 Experiment 3: Syncing file delete operation
+
+  Continuing from the end of Experiment 2:
+
+    hdfs dfs -rm -skipTrash /dst/2.txt
+    # Create snap2 at destination, it contains 1.txt
+    hdfs dfs -createSnapshot /dst/ snap2
+
+    # Delete 1.txt from source
+    hdfs dfs -rm -skipTrash /src/1.txt
+    # Create snap4 at source, it only contains 2.txt
+    hdfs dfs -createSnapshot /src/ snap4
+
+  Now run the sync command:
+
+    hadoop distcp -update -diff snap2 snap4 /src/ /dst/
+
+  `2.txt` is copied and `1.txt` is deleted under `/dst/`.
+
+  Note that, though both `/src/` and `/dst/` have a snapshot with the same name
+  `snap2`, the snapshots don't need to have the same content.
+  That means, if `1.txt` in `/dst/`'s `snap2` has different contents from
+  `1.txt` in `/src/`'s `snap2`, `1.txt` will still be removed from `/dst/`.
+  The sync command doesn't check the contents of the files that are going to
+  be deleted. It simply follows the snapshot diff list between `<from_snapshot>`
+  and `<to_snapshot>`.
+
+  Also, if we delete `1.txt` from `/dst/` before creating `snap2` on `/dst/`
+  in the steps above, so that `/dst/`'s `snap2` doesn't have `1.txt` before
+  running the sync command, the command will still succeed. It won't throw an
+  exception while trying to delete `1.txt`, which doesn't exist in `/dst/`.
+
 $H3 raw Namespace Extended Attribute Preservation
 
   This section only applies to HDFS.
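
The snapshot preconditions described in the Sync section of this patch (both snapshots on the source, and a snapshot named like `<from_snapshot>` on the destination) could be checked before running the sync. The following is only a rough sketch: `distcp_sync` is a hypothetical helper name, not part of Hadoop, and it assumes the usual HDFS layout in which snapshots appear under `<dir>/.snapshot/<name>`. It does not check whether the destination has changed since `<from_snapshot>`.

```shell
#!/bin/sh
# Sketch only: distcp_sync is a hypothetical wrapper, not a Hadoop command.
# It verifies that the snapshots the Sync section requires exist, then runs
# the documented "hadoop distcp -update -diff" command.
distcp_sync() {
  if [ "$#" -ne 4 ]; then
    echo "usage: distcp_sync <from_snapshot> <to_snapshot> <source> <destination>" >&2
    return 2
  fi
  # Strip one trailing slash so the snapshot paths are built cleanly.
  from=$1 to=$2 src=${3%/} dst=${4%/}

  # Assumption: snapshots are visible as directories under <dir>/.snapshot/<name>.
  for path in "$src/.snapshot/$from" "$src/.snapshot/$to" "$dst/.snapshot/$from"; do
    if ! hdfs dfs -test -d "$path"; then
      echo "missing snapshot: $path" >&2
      return 1
    fi
  done

  # -update is required whenever -diff is used.
  hadoop distcp -update -diff "$from" "$to" "$3" "$4"
}
```

With the setup from Experiment 1, `distcp_sync snap1 snap2 /src/ /dst/` would run the same command as the documented example, but stop early with a message if one of the three snapshots is missing.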

