Posted to hdfs-user@hadoop.apache.org by Jameel Al-Aziz <ja...@6sense.com> on 2014/09/20 02:45:56 UTC

Unable to transfer data using distcp between EC2-Classic cluster and VPC cluster

Hi all,

We’re in the process of migrating from EC2-Classic to VPC and needed to transfer our HDFS data. We set up a new cluster inside the VPC and assigned the name node and data nodes temporary public IPs. Initially, we had a lot of trouble getting the name node to redirect to public hostnames instead of private IPs. After some fiddling around, we finally got webhdfs and dfs -cp to work using public hostnames. However, distcp simply refuses to use the public hostnames when connecting to the data nodes.

We’re running distcp on the old cluster, copying data into the new cluster.
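
For reference, the invocation from the old cluster is essentially the following sketch (the namenode hostnames, port, and paths here are placeholders rather than our real ones):

hadoop distcp hdfs://old-namenode:8020/data hdfs://new-namenode-public-dns:8020/data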

The old hadoop cluster is running 1.0.4 and the new one is running 1.2.1.

So far, on the new cluster, we’ve tried:
- Using public DNS hostnames in the master and slaves files (on both the name node and data nodes)
- Setting the hostname of all the boxes to their public DNS name
- Setting “fs.default.name” to the public DNS name of the new name node.

And on both clusters:
- Setting “dfs.datanode.use.datanode.hostname” and “dfs.client.use.datanode.hostname” to “true” (sketched below)
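
A minimal hdfs-site.xml sketch of what we set (applied on every node, with HDFS restarted afterwards):

<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
<property>
  <name>dfs.datanode.use.datanode.hostname</name>
  <value>true</value>
</property>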

Even though webhdfs is finally redirecting to data nodes using the public hostname, we keep seeing errors when running distcp. The errors are all similar to: http://pastebin.com/ZYR07Fvm

What do we need to do to get distcp to use the public hostname of the new machines? I haven’t tried running distcp in the other direction (I’m about to), but I suspect I’ll run into the same problem.

Thanks!
Jameel

Re: Unable to transfer data using distcp between EC2-Classic cluster and VPC cluster

Posted by Jameel Al-Aziz <ja...@6sense.com>.
Hi Ankit,

Thanks for the info.

However, we are not using EMR. We are using our own cluster. We have tried everything listed before in an attempt to get the nodes to register with their public DNS names. The frustration comes from the fact that Hadoop/HDFS seemingly ignores our attempts and still sends out the private hostname. I’m not even sure where it’s getting the private hostname from anymore, unless it’s looking at the instance’s metadata or doing a reverse DNS lookup within the cluster. In either case, the behavior seems incorrect (my settings should be honored without needing to set up /etc/hosts). Mesos handles this with a very simple and straightforward “--hostname” flag or config file setting, and it works just fine.
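
(For what it's worth, the reverse-DNS theory is easy to poke at: from a box inside the VPC, a lookup along these lines — the address is just a placeholder — comes back with the internal ip-10-... name, since that's what the VPC's default resolver serves:)

dig -x 10.0.0.15 +short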

That being said, we have decided to go through S3 and have worked out the issues we were having with the directory layout.


jameel al-aziz

software engineer |  6Sense

p 818.458.4846 | jameel@6sense.com


On September 25, 2014 at 9:10:59 AM, Ankit Singhal (ankitsinghal59@gmail.com) wrote:

Hi Jameel,

Datanodes on EMR register with the namenode using their private DNS names, and those private hostnames can only be resolved to IPs from within the VPC (through the default local DNS server attached to an AWS VPC).
A client connecting to the namenode will always be handed the internal_hostname:port (socket) of the datanodes for data transfer, and the client resolves that hostname to an IP using local DNS only. That is why you need to update your local /etc/hosts to map it to the public IP, so that you can transfer data over the internet.

The logic behind this: AWS charges for any data going in or out over the internet, so AWS always uses internal hostnames and IPs for data transfer and communication within an EMR cluster.

Let me know if you need more info around this.

Regards,
Ankit Singhal

On Sun, Sep 21, 2014 at 1:48 AM, Jameel Al-Aziz <ja...@6sense.com> wrote:
Also, out of curiosity. Why would we need to update the /etc/hosts if the nodes appear to be registered with their public hostname? Where would they be getting the private hostname from?

The only thing I can think of would be if the namenode resolved the data node hostname to an ip, then did a reverse DNS lookup, and then reported that. However, that seems completely illogical.

Jameel Al-Aziz

From: Jameel Al-Aziz <ja...@6sense.com>
Sent: Sep 20, 2014 1:12 PM

To: user@hadoop.apache.org
Subject: Re: Unable to use transfer data using distcp between EC2-classic cluster and VPC cluster

Hi Ankit,

We originally tried to copy to S3 and back. In fact, it is actually our fallback plan. We were having issues with the copy to S3 not maintaining the directory layout, so we decided to try and do a direct copy.

I'll give it another shot though!

Jameel Al-Aziz

From: Ankit Singhal <an...@gmail.com>
Sent: Sep 20, 2014 8:25 AM
To: user@hadoop.apache.org
Subject: Re: Unable to use transfer data using distcp between EC2-classic cluster and VPC cluster

Hi Jameel,

As Peyman said, the best approach is to distcp from your old cluster to S3 and have the MR jobs on the new cluster read directly from S3.

But if you still need to do a distcp from HDFS to HDFS, then update /etc/hosts (or DNS) on all the nodes of your old cluster with "publicIp   internalAWSDNSName" entries for all the nodes of the new cluster.
For example:
/etc/hosts on all nodes of the old cluster should have an entry for each node of the new cluster in the format below.
54.xxx.xxx.xx1   ip-10-xxx-xxx-xx1.ec2.internal
54.xxx.xxx.xx2   ip-10-xxx-xxx-xx2.ec2.internal
54.xxx.xxx.xx3   ip-10-xxx-xxx-xx3.ec2.internal

Regards,
Ankit Singhal

On Sat, Sep 20, 2014 at 8:36 PM, Peyman Mohajerian <mo...@gmail.com> wrote:
It may be easier to copy the data to S3 and then from S3 to the new cluster.

On Fri, Sep 19, 2014 at 8:45 PM, Jameel Al-Aziz <ja...@6sense.com> wrote:
Hi all,

We’re in the process of migrating from EC2-Classic to VPC and needed to transfer our HDFS data. We setup a new cluster inside the VPC, and assigned the name node and data node temporary public IPs. Initially, we had a lot of trouble getting the name node to redirect to the public hostname instead of private IPs. After some fiddling around, we finally got webhdfs and dfs -cp to work using public hostnames. However, distcp simply refuses to use the public hostnames when connecting to the data nodes.

We’re running distcp on the old cluster, copying data into the new cluster.

The old hadoop cluster is running 1.0.4 and the new one is running 1.2.1.

So far, on the new cluster, we’ve tried:
- Using public DNS hostnames in the master and slaves files (on both the name node and data nodes)
- Setting the hostname of all the boxes to their public DNS name
- Setting “fs.default.name” to the public DNS name of the new name node.

And on both clusters:
- Setting the “dfs.datanode.use.datanode.hostname” and “dfs.client.use.datanode.hostname” to “true" on both the old and new cluster.

Even though webhdfs is finally redirecting to data nodes using the public hostname, we keep seeing errors when running distcp. The errors are all similar to: http://pastebin.com/ZYR07Fvm

What do we need to do to get distcp to use the public hostname of the new machines? I haven’t tried running distcp in the other direction (I’m about to), but I suspect I’ll run into the same problem.

Thanks!
Jameel




Re: Unable to transfer data using distcp between EC2-Classic cluster and VPC cluster

Posted by Ankit Singhal <an...@gmail.com>.
Hi Jameel,

Datanodes on EMR register with the namenode using their private DNS names, and those private hostnames can only be resolved to IPs from within the VPC (through the default local DNS server attached to an AWS VPC).
A client connecting to the namenode will always be handed the internal_hostname:port (socket) of the datanodes for data transfer, and the client resolves that hostname to an IP using local DNS only. That is why you need to update your local /etc/hosts to map it to the public IP, so that you can transfer data over the internet.

The logic behind this: AWS charges for any data going in or out over the internet, so AWS always uses internal hostnames and IPs for data transfer and communication within an EMR cluster.

Let me know if you need more info around this.

Regards,
Ankit Singhal

On Sun, Sep 21, 2014 at 1:48 AM, Jameel Al-Aziz <ja...@6sense.com> wrote:

>  Also, out of curiosity. Why would we need to update the /etc/hosts if
> the nodes appear to be registered with their public hostname? Where would
> they be getting the private hostname from?
>
>  The only thing I can think of would be if the namenode resolved the data
> node hostname to an ip, then did a reverse DNS lookup, and then reported
> that. However, that seems completely illogical.
>
>  Jameel Al-Aziz
>
>  *From:* Jameel Al-Aziz <ja...@6sense.com>
> *Sent:* Sep 20, 2014 1:12 PM
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Unable to use transfer data using distcp between
> EC2-classic cluster and VPC cluster
>
>  Hi Ankit,
>
>  We originally tried to copy to S3 and back. In fact, it is actually our
> fallback plan. We were having issues with the copy to S3 not maintaining
> the directory layout, so we decided to try and do a direct copy.
>
>  I'll give it another shot though!
>
>  Jameel Al-Aziz
>
>  *From:* Ankit Singhal <an...@gmail.com>
> *Sent:* Sep 20, 2014 8:25 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Unable to use transfer data using distcp between
> EC2-classic cluster and VPC cluster
>
>  Hi Jameel,
>
>  As Peyman said, best approach is to do distcp from your old cluster to
> s3 and making MR job reading directly from s3 on new cluster.
>
>  but If you still need to do distcp from hdfs to hdfs then update
> /etc/hosts or DNS of all the nodes of your old cluster with "publicIp
> internalAWSDNSName" of all nodes of new cluster.
> for eq:-
> /etc/hosts of all nodes of old cluster should have entry of all the nodes
> of new cluster in below format.
> 54.xxx.xxx.xx1   ip-10-xxx-xxx-xx1.ec2.internal
> 54.xxx.xxx.xx2   ip-10-xxx-xxx-xx2.ec2.internal
>  54.xxx.xxx.xx3   ip-10-xxx-xxx-xx3.ec2.internal
>
>  Regards,
> Ankit Singhal
>
> On Sat, Sep 20, 2014 at 8:36 PM, Peyman Mohajerian <mo...@gmail.com>
> wrote:
>
>> It maybe easier to copy the data to s3 and then from s3 to the new
>> cluster.
>>
>> On Fri, Sep 19, 2014 at 8:45 PM, Jameel Al-Aziz <ja...@6sense.com>
>> wrote:
>>
>>>  Hi all,
>>>
>>>  We’re in the process of migrating from EC2-Classic to VPC and needed
>>> to transfer our HDFS data. We setup a new cluster inside the VPC, and
>>> assigned the name node and data node temporary public IPs. Initially, we
>>> had a lot of trouble getting the name node to redirect to the public
>>> hostname instead of private IPs. After some fiddling around, we finally got
>>> webhdfs and dfs -cp to work using public hostnames. However, distcp simply
>>> refuses to use the public hostnames when connecting to the data nodes.
>>>
>>>  We’re running distcp on the old cluster, copying data into the new
>>> cluster.
>>>
>>>  The old hadoop cluster is running 1.0.4 and the new one is running
>>> 1.2.1.
>>>
>>>  So far, on the new cluster, we’ve tried:
>>>  - Using public DNS hostnames in the master and slaves files (on both
>>> the name node and data nodes)
>>>  - Setting the hostname of all the boxes to their public DNS name
>>>  - Setting “fs.default.name” to the public DNS name of the new name
>>> node.
>>>
>>>  And on both clusters:
>>>  - Setting the “dfs.datanode.use.datanode.hostname” and
>>> “dfs.client.use.datanode.hostname” to “true" on both the old and new
>>> cluster.
>>>
>>>  Even though webhdfs is finally redirecting to data nodes using the
>>> public hostname, we keep seeing errors when running distcp. The errors are
>>> all similar to: http://pastebin.com/ZYR07Fvm
>>>
>>>  What do we need to do to get distcp to use the public hostname of the
>>> new machines? I haven’t tried running distcp in the other direction (I’m
>>> about to), but I suspect I’ll run into the same problem.
>>>
>>>  Thanks!
>>>  Jameel
>>>
>>
>>
>

Re: Unable to transfer data using distcp between EC2-Classic cluster and VPC cluster

Posted by Jameel Al-Aziz <ja...@6sense.com>.
Also, out of curiosity: why would we need to update /etc/hosts if the nodes appear to be registered with their public hostnames? Where would they be getting the private hostname from?

The only thing I can think of would be if the namenode resolved the data node hostname to an IP, then did a reverse DNS lookup, and then reported that. However, that seems completely illogical.
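
One way I'm planning to check is to ask the namenode directly what each datanode registered as; assuming a default setup, something along these lines (the hostname below is a placeholder):

hadoop dfsadmin -report
# or browse the Live Nodes list on the namenode web UI, e.g. http://new-namenode-public-dns:50070/

Whatever name shows up there is presumably what gets handed back to clients.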

Jameel Al-Aziz

From: Jameel Al-Aziz <ja...@6sense.com>
Sent: Sep 20, 2014 1:12 PM
To: user@hadoop.apache.org
Subject: Re: Unable to use transfer data using distcp between EC2-classic cluster and VPC cluster

Hi Ankit,

We originally tried to copy to S3 and back. In fact, it is actually our fallback plan. We were having issues with the copy to S3 not maintaining the directory layout, so we decided to try and do a direct copy.

I'll give it another shot though!

Jameel Al-Aziz

From: Ankit Singhal <an...@gmail.com>
Sent: Sep 20, 2014 8:25 AM
To: user@hadoop.apache.org
Subject: Re: Unable to use transfer data using distcp between EC2-classic cluster and VPC cluster

Hi Jameel,

As Peyman said, best approach is to do distcp from your old cluster to s3 and making MR job reading directly from s3 on new cluster.

but If you still need to do distcp from hdfs to hdfs then update /etc/hosts or DNS of all the nodes of your old cluster with "publicIp   internalAWSDNSName" of all nodes of new cluster.
for eq:-
/etc/hosts of all nodes of old cluster should have entry of all the nodes of new cluster in below format.
54.xxx.xxx.xx1   ip-10-xxx-xxx-xx1.ec2.internal
54.xxx.xxx.xx2   ip-10-xxx-xxx-xx2.ec2.internal
54.xxx.xxx.xx3   ip-10-xxx-xxx-xx3.ec2.internal

Regards,
Ankit Singhal

On Sat, Sep 20, 2014 at 8:36 PM, Peyman Mohajerian <mo...@gmail.com> wrote:
It maybe easier to copy the data to s3 and then from s3 to the new cluster.

On Fri, Sep 19, 2014 at 8:45 PM, Jameel Al-Aziz <ja...@6sense.com> wrote:
Hi all,

We're in the process of migrating from EC2-Classic to VPC and needed to transfer our HDFS data. We setup a new cluster inside the VPC, and assigned the name node and data node temporary public IPs. Initially, we had a lot of trouble getting the name node to redirect to the public hostname instead of private IPs. After some fiddling around, we finally got webhdfs and dfs -cp to work using public hostnames. However, distcp simply refuses to use the public hostnames when connecting to the data nodes.

We're running distcp on the old cluster, copying data into the new cluster.

The old hadoop cluster is running 1.0.4 and the new one is running 1.2.1.

So far, on the new cluster, we've tried:
- Using public DNS hostnames in the master and slaves files (on both the name node and data nodes)
- Setting the hostname of all the boxes to their public DNS name
- Setting "fs.default.name<http://fs.default.name>" to the public DNS name of the new name node.

And on both clusters:
- Setting the "dfs.datanode.use.datanode.hostname" and "dfs.client.use.datanode.hostname" to "true" on both the old and new cluster.

Even though webhdfs is finally redirecting to data nodes using the public hostname, we keep seeing errors when running distcp. The errors are all similar to: http://pastebin.com/ZYR07Fvm

What do we need to do to get distcp to use the public hostname of the new machines? I haven't tried running distcp in the other direction (I'm about to), but I suspect I'll run into the same problem.

Thanks!
Jameel



Re: Unable to use transfer data using distcp between EC2-classic cluster and VPC cluster

Posted by Jameel Al-Aziz <ja...@6sense.com>.
Hi Ankit,

We originally tried to copy to S3 and back. In fact, it is actually our fallback plan. We were having issues with the copy to S3 not maintaining the directory layout, so we decided to try and do a direct copy.

I'll give it another shot though!

Jameel Al-Aziz

From: Ankit Singhal <an...@gmail.com>
Sent: Sep 20, 2014 8:25 AM
To: user@hadoop.apache.org
Subject: Re: Unable to use transfer data using distcp between EC2-classic cluster and VPC cluster

Hi Jameel,

As Peyman said, the best approach is to distcp from your old cluster to S3 and have the MR jobs on the new cluster read directly from S3.

But if you still need to distcp from HDFS to HDFS, then update /etc/hosts or DNS on every node of your old cluster with a "publicIp   internalAWSDNSName" entry for each node of the new cluster.
For example:
/etc/hosts on every node of the old cluster should contain an entry for each node of the new cluster, in the format below.
54.xxx.xxx.xx1   ip-10-xxx-xxx-xx1.ec2.internal
54.xxx.xxx.xx2   ip-10-xxx-xxx-xx2.ec2.internal
54.xxx.xxx.xx3   ip-10-xxx-xxx-xx3.ec2.internal

Regards,
Ankit Singhal

On Sat, Sep 20, 2014 at 8:36 PM, Peyman Mohajerian <mo...@gmail.com> wrote:
It may be easier to copy the data to S3 and then from S3 to the new cluster.

On Fri, Sep 19, 2014 at 8:45 PM, Jameel Al-Aziz <ja...@6sense.com> wrote:
Hi all,

We're in the process of migrating from EC2-Classic to VPC and needed to transfer our HDFS data. We set up a new cluster inside the VPC, and assigned the name node and data node temporary public IPs. Initially, we had a lot of trouble getting the name node to redirect to the public hostname instead of private IPs. After some fiddling around, we finally got webhdfs and dfs -cp to work using public hostnames. However, distcp simply refuses to use the public hostnames when connecting to the data nodes.

We're running distcp on the old cluster, copying data into the new cluster.

The old hadoop cluster is running 1.0.4 and the new one is running 1.2.1.

So far, on the new cluster, we've tried:
- Using public DNS hostnames in the master and slaves files (on both the name node and data nodes)
- Setting the hostname of all the boxes to their public DNS name
- Setting "fs.default.name<http://fs.default.name>" to the public DNS name of the new name node.

And on both clusters:
- Setting the "dfs.datanode.use.datanode.hostname" and "dfs.client.use.datanode.hostname" to "true" on both the old and new cluster.

Even though webhdfs is finally redirecting to data nodes using the public hostname, we keep seeing errors when running distcp. The errors are all similar to: http://pastebin.com/ZYR07Fvm

What do we need to do to get distcp to use the public hostname of the new machines? I haven't tried running distcp in the other direction (I'm about to), but I suspect I'll run into the same problem.

Thanks!
Jameel



Re: Unable to use transfer data using distcp between EC2-classic cluster and VPC cluster

Posted by Ankit Singhal <an...@gmail.com>.
Hi Jameel,

As Peyman said, the best approach is to distcp from your old cluster to S3
and have the MR jobs on the new cluster read directly from S3.

But if you still need to distcp from HDFS to HDFS, then update /etc/hosts
or DNS on every node of your old cluster with a "publicIp
internalAWSDNSName" entry for each node of the new cluster.
For example:
/etc/hosts on every node of the old cluster should contain an entry for
each node of the new cluster, in the format below (a quick resolution
check is sketched after the list).
54.xxx.xxx.xx1   ip-10-xxx-xxx-xx1.ec2.internal
54.xxx.xxx.xx2   ip-10-xxx-xxx-xx2.ec2.internal
54.xxx.xxx.xx3   ip-10-xxx-xxx-xx3.ec2.internal
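
(A quick way to sanity-check the override from an old-cluster node -- just a sketch, reusing the placeholder name above:)

import java.net.InetAddress;

public class HostsOverrideCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder internal DNS name of one new-cluster node.
        String internalName = "ip-10-xxx-xxx-xx1.ec2.internal";
        for (InetAddress a : InetAddress.getAllByName(internalName)) {
            // Once the /etc/hosts entry is in place, this should print the
            // public 54.x.x.x address instead of the private 10.x.x.x one.
            System.out.println(internalName + " -> " + a.getHostAddress());
        }
    }
}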

Regards,
Ankit Singhal

On Sat, Sep 20, 2014 at 8:36 PM, Peyman Mohajerian <mo...@gmail.com>
wrote:

> It may be easier to copy the data to S3 and then from S3 to the new cluster.
>
> On Fri, Sep 19, 2014 at 8:45 PM, Jameel Al-Aziz <ja...@6sense.com> wrote:
>
>>  Hi all,
>>
>>  We’re in the process of migrating from EC2-Classic to VPC and needed to
>> transfer our HDFS data. We set up a new cluster inside the VPC, and assigned
>> the name node and data node temporary public IPs. Initially, we had a lot
>> of trouble getting the name node to redirect to the public hostname instead
>> of private IPs. After some fiddling around, we finally got webhdfs and dfs
>> -cp to work using public hostnames. However, distcp simply refuses to use
>> the public hostnames when connecting to the data nodes.
>>
>>  We’re running distcp on the old cluster, copying data into the new
>> cluster.
>>
>>  The old hadoop cluster is running 1.0.4 and the new one is running
>> 1.2.1.
>>
>>  So far, on the new cluster, we’ve tried:
>>  - Using public DNS hostnames in the master and slaves files (on both the
>> name node and data nodes)
>>  - Setting the hostname of all the boxes to their public DNS name
>>  - Setting “fs.default.name” to the public DNS name of the new name node.
>>
>>  And on both clusters:
>>  - Setting the “dfs.datanode.use.datanode.hostname” and
>> “dfs.client.use.datanode.hostname” to “true" on both the old and new
>> cluster.
>>
>>  Even though webhdfs is finally redirecting to data nodes using the
>> public hostname, we keep seeing errors when running distcp. The errors are
>> all similar to: http://pastebin.com/ZYR07Fvm
>>
>>  What do we need to do to get distcp to use the public hostname of the
>> new machines? I haven’t tried running distcp in the other direction (I’m
>> about to), but I suspect I’ll run into the same problem.
>>
>>  Thanks!
>>  Jameel
>>
>
>

Re: Unable to use transfer data using distcp between EC2-classic cluster and VPC cluster

Posted by Peyman Mohajerian <mo...@gmail.com>.
It may be easier to copy the data to S3 and then from S3 to the new cluster.
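
(Roughly, as a sketch only -- the bucket, paths, namenode addresses and credential handling below are placeholders, not a tested recipe:)

hadoop distcp hdfs://old-namenode:8020/data s3n://migration-bucket/data
hadoop distcp s3n://migration-bucket/data hdfs://new-namenode:8020/data

(with fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey set in core-site.xml on whichever cluster runs each copy)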

On Fri, Sep 19, 2014 at 8:45 PM, Jameel Al-Aziz <ja...@6sense.com> wrote:

>  Hi all,
>
>  We’re in the process of migrating from EC2-Classic to VPC and needed to
> transfer our HDFS data. We set up a new cluster inside the VPC, and assigned
> the name node and data node temporary public IPs. Initially, we had a lot
> of trouble getting the name node to redirect to the public hostname instead
> of private IPs. After some fiddling around, we finally got webhdfs and dfs
> -cp to work using public hostnames. However, distcp simply refuses to use
> the public hostnames when connecting to the data nodes.
>
>  We’re running distcp on the old cluster, copying data into the new
> cluster.
>
>  The old hadoop cluster is running 1.0.4 and the new one is running 1.2.1.
>
>  So far, on the new cluster, we’ve tried:
>  - Using public DNS hostnames in the master and slaves files (on both the
> name node and data nodes)
>  - Setting the hostname of all the boxes to their public DNS name
>  - Setting “fs.default.name” to the public DNS name of the new name node.
>
>  And on both clusters:
>  - Setting the “dfs.datanode.use.datanode.hostname” and
> “dfs.client.use.datanode.hostname” to “true" on both the old and new
> cluster.
>
>  Even though webhdfs is finally redirecting to data nodes using the
> public hostname, we keep seeing errors when running distcp. The errors are
> all similar to: http://pastebin.com/ZYR07Fvm
>
>  What do we need to do to get distcp to use the public hostname of the
> new machines? I haven’t tried running distcp in the other direction (I’m
> about to), but I suspect I’ll run into the same problem.
>
>  Thanks!
>  Jameel
>
