Posted to hdfs-user@hadoop.apache.org by david marion <dl...@hotmail.com> on 2014/04/18 01:42:53 UTC

Client usage with multiple clusters

I'm having an issue in client code where multiple clusters with HA namenodes are involved. Example setup using Hadoop 2.3.0:

Cluster A has the following properties defined in core-site.xml, hdfs-site.xml, etc.:

dfs.nameservices=clusterA
dfs.ha.namenodes.clusterA=nn1,nn2
dfs.namenode.rpc-address.clusterA.nn1=
dfs.namenode.http-address.clusterA.nn1=
dfs.namenode.rpc-address.clusterA.nn2=
dfs.namenode.http-address.clusterA.nn2=
dfs.client.failover.proxy.provider.clusterA=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

Cluster B has similar properties defined in its core-site.xml, hdfs-site.xml, etc.

Now, I want to be able to distcp from clusterA to clusterB. Regardless of which cluster I execute this from, neither cluster's configuration has all of the information. Looking at DFSClient and DataNode:

  - if I put both clusterA and clusterB into dfs.nameservices, then the datanodes will try to federate the blocks from both nameservices.
  - if I don't put both clusterA and clusterB into dfs.nameservices, then the client won't know how to resolve the namenodes for the nameservices referenced in the distcp command.

I'm wondering if I am missing a property or something that will allow me to define both nameservices on both clusters and have the datanodes for the cluster *not* try to federate. Looking at DataNode, it appears that it tries to connect to all namenodes defined, and the first one that sets the clusterid wins. It seems that there should be a dfs.datanode.clusterid property that the datanode uses. This seems to line up with the 'namenode -format -clusterid <cluster>' command when you have multiple nameservices. Am I missing something in the configuration that will allow me to do what I want? To get distcp to work, I had to create a third set of configuration files just for the client to use.
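For illustration, the merged client-only configuration I ended up with looks roughly like this (the hostnames and ports below are placeholders, not my real values):

dfs.nameservices=clusterA,clusterB
dfs.ha.namenodes.clusterA=nn1,nn2
dfs.namenode.rpc-address.clusterA.nn1=nn1.clusterA.example.com:8020
dfs.namenode.rpc-address.clusterA.nn2=nn2.clusterA.example.com:8020
dfs.client.failover.proxy.provider.clusterA=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.ha.namenodes.clusterB=nn1,nn2
dfs.namenode.rpc-address.clusterB.nn1=nn1.clusterB.example.com:8020
dfs.namenode.rpc-address.clusterB.nn2=nn2.clusterB.example.com:8020
dfs.client.failover.proxy.provider.clusterB=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

Because only the client process reads these files, the datanodes never see the second nameservice and don't attempt to federate.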

Re: Client usage with multiple clusters

Posted by Stanley Shi <ss...@gopivotal.com>.
My guess is to put two sets of these properties
dfs.ha.namenodes.clusterA=nn1,nn2
dfs.namenode.rpc-address.clusterA.nn1=
dfs.namenode.http-address.clusterA.nn1=
dfs.namenode.rpc-address.clusterA.nn2=
dfs.namenode.http-address.clusterA.nn2=

into the client settings, and then access each cluster like hdfs://clusterA/tmp ...
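If both nameservices are defined in the client configuration, the distcp invocation can then use the logical names directly, for example (the paths here are illustrative):

hadoop distcp hdfs://clusterA/src/path hdfs://clusterB/dst/path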

Regards,
*Stanley Shi,*



On Fri, Apr 18, 2014 at 7:42 AM, david marion <dl...@hotmail.com> wrote:

> I'm having an issue in client code where multiple clusters with HA
> namenodes are involved. Example setup using Hadoop 2.3.0:
>
> Cluster A has the following properties defined in core-site.xml,
> hdfs-site.xml, etc.:
>
> dfs.nameservices=clusterA
> dfs.ha.namenodes.clusterA=nn1,nn2
> dfs.namenode.rpc-address.clusterA.nn1=
> dfs.namenode.http-address.clusterA.nn1=
> dfs.namenode.rpc-address.clusterA.nn2=
> dfs.namenode.http-address.clusterA.nn2=
>
> dfs.client.failover.proxy.provider.clusterA=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
>
> Cluster B has similar properties defined in its core-site.xml,
> hdfs-site.xml, etc.
>
> Now, I want to be able to distcp from clusterA to clusterB. Regardless of
> which cluster I execute this from, neither cluster's configuration has all
> of the information. Looking at DFSClient and DataNode:
>
>   - if I put both clusterA and clusterB into dfs.nameservices, then the
> datanodes will try to federate the blocks from both nameservices.
>   - if I don't put both clusterA and clusterB into dfs.nameservices, then
> the client won't know how to resolve the namenodes for the nameservices
> referenced in the distcp command.
>
> I'm wondering if I am missing a property or something that will allow me
> to define both nameservices on both clusters and have the datanodes for the
> cluster *not* try to federate. Looking at DataNode, it appears that it
> tries to connect to all namenodes defined, and the first one that sets the
> clusterid wins. It seems that there should be a dfs.datanode.clusterid
> property that the datanode uses. This seems to line up with the 'namenode
> -format -clusterid <cluster>' command when you have multiple nameservices.
> Am I missing something in the configuration that will allow me to do what I
> want? To get distcp to work, I had to create a third set of configuration
> files just for the client to use.
>
