Posted to hdfs-user@hadoop.apache.org by Dhaval Shah <pr...@yahoo.co.in> on 2013/01/29 17:40:53 UTC

Using distcp with Hadoop HA

Hello everyone. I am trying to use distcp with a Hadoop HA configuration (CDH4.0.0 at the moment). Here is my problem:
- I am trying to run a distcp from cluster A to cluster B. Since no client operations are supported on the standby namenode, I need to either specify the active namenode explicitly when invoking distcp, or use the failover proxy provider (dfs.client.failover.proxy.provider.clusterB), where I can list the two namenodes for cluster B and let the failover code inside HDFS figure out which one is active.
- If I use the failover proxy provider, some of my datanodes on cluster A connect to the namenodes on cluster B and vice versa. I assume that is because I have configured both nameservices in my hdfs-site.xml to make distcp work. I have set dfs.nameservice.id to the correct value, but the datanodes do not seem to respect it.

What is the best way to use distcp with a Hadoop HA configuration without having the datanodes connect to the remote namenode? Thanks
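
For concreteness, a minimal sketch of the two invocation styles described above; the paths are illustrative placeholders, the hostnames and port come from the configuration posted later in this thread, and Option 1 assumes clusterBnn1 happens to be the active namenode:

# Option 1: point distcp directly at the currently active namenodes
hadoop distcp hdfs://clusterAnn1:8000/src hdfs://clusterBnn1:8000/dest

# Option 2: use logical nameservice URIs and let the configured
# failover proxy provider resolve whichever namenode is active
hadoop distcp hdfs://clusterA/src hdfs://clusterB/dest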
 
Regards,
Dhaval

Re: Using distcp with Hadoop HA

Posted by Dhaval Shah <pr...@yahoo.co.in>.
No, the datanodes are running on different sets of machines. The problem is that datanodes in clusterA are trying to connect to namenodes in clusterB (and this seems random, as if they are picking randomly among the 4 namenodes). The configuration looks like this:

<property>
  <name>dfs.nameservices</name>
  <value>clusterA,clusterB</value>
  <description>
    Comma-separated list of nameservices.
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.nameservice.id</name>
  <value>clusterA</value>
  <description>
    The ID of this nameservice. If the nameservice ID is not
    configured, or more than one nameservice is configured for
    dfs.nameservices, it is determined automatically by
    matching the local node's address with the configured address.
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.ha.namenodes.clusterA</name>
  <value>clusterAnn1,clusterAnn2</value>
  <description>
    The prefix for a given nameservice contains a comma-separated
    list of namenodes for that nameservice (e.g. EXAMPLENAMESERVICE).
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterA.clusterAnn1</name>
  <value>clusterAnn1:8000</value>
  <description>
    Set the full address and IPC port of the NameNode process.
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterA.clusterAnn2</name>
  <value>clusterAnn2:8000</value>
  <description>
    Set the full address and IPC port of the NameNode process.
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.ha.namenodes.clusterB</name>
  <value>clusterBnn1,clusterBnn2</value>
  <description>
    The prefix for a given nameservice contains a comma-separated
    list of namenodes for that nameservice (e.g. EXAMPLENAMESERVICE).
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterB.clusterBnn1</name>
  <value>clusterBnn1:8000</value>
  <description>
    Set the full address and IPC port of the NameNode process.
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterB.clusterBnn2</name>
  <value>clusterBnn2:8000</value>
  <description>
    Set the full address and IPC port of the NameNode process.
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.clusterA</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  <description>
    Configure the name of the Java class which the DFS client will
    use to determine which NameNode is currently active,
    and therefore which NameNode is currently serving client requests.
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.clusterB</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  <description>
    Configure the name of the Java class which the DFS client will
    use to determine which NameNode is currently active,
    and therefore which NameNode is currently serving client requests.
  </description>
  <final>true</final>
</property>
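
One commonly suggested workaround for exactly this symptom: datanodes register with every nameservice listed in dfs.nameservices of the configuration they read (the same mechanism federation uses), so keep the datanode-side hdfs-site.xml scoped to the local nameservice only, and hand the dual-nameservice configuration just to the distcp client through a separate configuration directory. A sketch, with a hypothetical directory path:

# Datanodes keep an hdfs-site.xml where dfs.nameservices=clusterA only.
# A client-only config directory (path is hypothetical) defines both
# nameservices plus the failover proxy providers shown above.
hadoop --config /etc/hadoop/conf.distcp distcp \
    hdfs://clusterA/src hdfs://clusterB/dest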
 
Regards,
Dhaval


________________________________
 From: Suresh Srinivas <su...@hortonworks.com>
To: "hdfs-user@hadoop.apache.org" <us...@hadoop.apache.org>; Dhaval Shah <pr...@yahoo.co.in> 
Sent: Tuesday, 29 January 2013 6:03 PM
Subject: Re: Using distcp with Hadoop HA
 

Currently, as you have pointed out, client-side configuration-based failover is used in an HA setup. The configuration must define namenode addresses for the nameservices of both clusters. Are the datanodes belonging to the two clusters running on the same set of nodes? Can you share the configuration you are using, so we can diagnose the problem?

- I am trying to run a distcp from cluster A to cluster B. Since no client operations are supported on the standby namenode, I need to either specify the active namenode explicitly when invoking distcp, or use the failover proxy provider (dfs.client.failover.proxy.provider.clusterB), where I can list the two namenodes for cluster B and let the failover code inside HDFS figure out which one is active.
 
- If I use the failover proxy provider, some of my datanodes on cluster A connect to the namenodes on cluster B and vice versa. I assume that is because I have configured both nameservices in my hdfs-site.xml to make distcp work. I have set dfs.nameservice.id to the correct value, but the datanodes do not seem to respect it.
>
>
>What is the best way to use distcp with a Hadoop HA configuration without having the datanodes connect to the remote namenode? Thanks
> 
>Regards,
>Dhaval


-- 
http://hortonworks.com/download/


Re: Using distcp with Hadoop HA

Posted by Suresh Srinivas <su...@hortonworks.com>.
Currently, as you have pointed out, client-side configuration-based failover
is used in an HA setup. The configuration must define namenode addresses
for the nameservices of both clusters. Are the datanodes belonging to the
two clusters running on the same set of nodes? Can you share the
configuration you are using, so we can diagnose the problem?
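
As a related aside: since distcp accepts the standard Hadoop generic options, the remote nameservice definition can also be supplied on the command line only, so it never has to appear in the configuration files the datanodes read. A sketch using the values from the configuration posted earlier in this thread (paths are placeholders):

hadoop distcp \
    -Ddfs.nameservices=clusterA,clusterB \
    -Ddfs.ha.namenodes.clusterB=clusterBnn1,clusterBnn2 \
    -Ddfs.namenode.rpc-address.clusterB.clusterBnn1=clusterBnn1:8000 \
    -Ddfs.namenode.rpc-address.clusterB.clusterBnn2=clusterBnn2:8000 \
    -Ddfs.client.failover.proxy.provider.clusterB=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider \
    hdfs://clusterA/src hdfs://clusterB/dest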

- I am trying to run a distcp from cluster A to cluster B. Since no client
> operations are supported on the standby namenode, I need to either specify
> the active namenode explicitly when invoking distcp, or use the failover
> proxy provider (dfs.client.failover.proxy.provider.clusterB), where I can
> list the two namenodes for cluster B and let the failover code inside HDFS
> figure out which one is active.
>


> - If I use the failover proxy provider, some of my datanodes on cluster A
> connect to the namenodes on cluster B and vice versa. I assume that is
> because I have configured both nameservices in my hdfs-site.xml to make
> distcp work. I have set dfs.nameservice.id to the correct value, but the
> datanodes do not seem to respect it.
>
> What is the best way to use distcp with a Hadoop HA configuration without
> having the datanodes connect to the remote namenode? Thanks
>
> Regards,
> Dhaval
>



-- 
http://hortonworks.com/download/
