Posted to hdfs-dev@hadoop.apache.org by rahul gidwani <ra...@gmail.com> on 2016/12/16 18:59:24 UTC

distcp filters and snapshots

I had a quick question for the group. I want to back up data to a DR HDFS
cluster.

With "HDFS-8828 - Utilize Snapshot diff report to build diff copy list in
distcp", it is much easier to build the copyFileListing and keep the
DR side in sync.

I know there are filters that can be applied to DistCp to exclude certain
files from the copy. My question is this: suppose HBase is also running on
the cluster. HBase handles its replication separately, so I still want to
snapshot the root directory but exclude /hbase from the DistCp.


Suppose we do the following:

   - Take a snapshot on the source side (s0) and then distcp that data to
   the destination cluster with the exclude filter on /hbase.
   - Then create a snapshot on the target cluster after shipping the data
   (also named s0).
   - Make changes on the source side, both in /hbase and elsewhere.
   - Take another snapshot on the source side (s1) and distcp that over
   with the exclude filter on /hbase.
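
To make the steps above concrete, here is a rough dry-run sketch that only
prints the commands in order. The NameNode URIs, the filters file path, and
the regex are hypothetical placeholders, and the root directory is assumed to
already be snapshottable on both clusters (hdfs dfsadmin -allowSnapshot). It
does not settle whether -filters is honored together with -diff; it only
spells out the sequence being asked about.

```shell
#!/bin/sh
# Dry-run sketch of the s0/s1 workflow. All hosts and paths are hypothetical.
SRC=hdfs://source-nn:8020      # assumed source NameNode URI
DST=hdfs://dr-nn:8020          # assumed DR NameNode URI
FILTERS="${TMPDIR:-/tmp}/distcp-filters.txt"

# One regex per line; paths matching a pattern are excluded by distcp -filters.
# This pattern is an assumption and should be checked against real paths.
printf '%s\n' '.*/hbase(/.*)?' > "$FILTERS"

# Prints the commands in order; nothing is executed against a cluster.
print_plan() {
  echo "hdfs dfs -createSnapshot $SRC/ s0"
  echo "hadoop distcp -update -filters $FILTERS $SRC/.snapshot/s0 $DST/"
  echo "hdfs dfs -createSnapshot $DST/ s0"
  echo "hdfs dfs -createSnapshot $SRC/ s1"
  # distcp -diff expects the target to be unchanged since s0;
  # snapshotDiff with "." compares s0 against the current state.
  echo "hdfs snapshotDiff $DST/ s0 ."
  echo "hadoop distcp -update -diff s0 s1 -filters $FILTERS $SRC/ $DST/"
}

print_plan
```

The snapshotDiff line is there only as a manual way to see what the target
looks like relative to its own s0 before attempting the incremental copy.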


I know DistCp checks the snapshot on the target cluster to see if anything
has changed. It seems this would not work: before doing any copy, DistCp
checks whether anything has changed between snapshots, and it would look as
if things had changed on the target side.

Is this correct, or am I missing something?

Thank you
rahul