Posted to user@hadoop.apache.org by Himanish Kushary <hi...@gmail.com> on 2013/03/28 04:54:07 UTC

Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

Hello,

I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
the distcp utility. There are around 2200 files distributed over 15
directories. The max individual file size is approx 50 MB.

The distcp mapreduce job keeps on failing with this error

"Task attempt_201303211242_0260_m_000005_0 failed to report status for 600
seconds. Killing!"

and in the task attempt logs I can see a lot of INFO messages like

"INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
(java.io.IOException) caught when processing request: Resetting to invalid
mark"

As a workaround, I am thinking of either transferring individual folders instead
of the entire 70 GB folder, or increasing the "mapred.task.timeout" parameter to
something like 6-7 hours (as the average rate of transfer to S3 seems to be
5 MB/s). Is there any better option to increase the throughput for transferring
bulk data from HDFS to S3?
 Looking forward to suggestions.
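
For illustration, the timeout option might look something like this (the HDFS
path and bucket name are placeholders, and it assumes fs.s3n.awsAccessKeyId /
fs.s3n.awsSecretAccessKey are already configured for the s3n:// scheme):

       # raise the per-task timeout to 6 hours (21,600,000 ms) for this copy only
       hadoop distcp -Dmapred.task.timeout=21600000 \
           hdfs:///user/himanish/data/dir01 \
           s3n://my-bucket/data/dir01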


-- 
Thanks & Regards
Himanish

Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

Posted by Ted Dunning <td...@maprtech.com>.
The EMR distributions have special versions of the s3 file system.  They
might be helpful here.

Of course, you likely aren't running those if you are seeing 5MB/s.

An extreme alternative would be to light up an EMR cluster, copy to it,
then to S3.
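
A rough sketch of that two-hop route (hostnames, ports, and bucket name are
placeholders; the second command would run on the EMR master, and if the Hadoop
versions differ the first hop may need to read the source over hftp:// instead):

       # hop 1: on-premises HDFS -> HDFS on the EMR cluster
       hadoop distcp hdfs://onprem-nn:8020/data hdfs://emr-master:9000/staging/data

       # hop 2: run from the EMR cluster, HDFS -> S3
       hadoop distcp /staging/data s3n://my-bucket/data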


On Thu, Mar 28, 2013 at 4:54 AM, Himanish Kushary <hi...@gmail.com> wrote:

> I am thinking of either transferring individual folders instead of the entire
> 70 GB folder as a workaround, or, as another option, increasing the
> "mapred.task.timeout" parameter to something like 6-7 hours (as the avg
> rate of transfer to S3 seems to be 5 MB/s). Is there any other better
> option to increase the throughput for transferring bulk data from HDFS to
> S3?  Looking forward to suggestions.
>

Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

Posted by Himanish Kushary <hi...@gmail.com>.
Thanks Dave.

I had already tried using the s3distcp jar, but got stuck on the error below,
which made me think that this is something specific to the Amazon Hadoop
distribution.

Exception in thread "Thread-28" java.lang.NoClassDefFoundError:
org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream

Also, I noticed that the Amazon EMR hadoop-core.jar has this class, but it
is not present in the CDH4 (my local env) Hadoop jars.
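
As an aside, a quick way to double-check which jars on a node bundle that class
(the lib directory below is only a guess at the local install layout):

       for j in /usr/lib/hadoop/*.jar; do
         unzip -l "$j" | grep -q ProgressableResettableBufferedFileInputStream && echo "$j"
       done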

Could you suggest how I could get around this issue? One option could be
using the Amazon-specific jars, but then I would probably need to get all of
the jars (otherwise it could cause version mismatch errors against HDFS -
NoSuchMethodError, etc.).

Appreciate your help regarding this.

- Himanish



On Fri, Mar 29, 2013 at 1:41 AM, David Parks <da...@yahoo.com> wrote:

> None of that complexity, they distribute the jar publicly (not the source,
> but the jar). You can just add this to your libjars:
> s3n://region.elasticmapreduce/libs/s3distcp/latest/s3distcp.jar
>
> No VPN or anything, if you can access the internet you can get to S3.
>
> Follow their docs here:
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>
> Doesn't matter where your Hadoop instance is running.
>
> Here's an example of the code/parameters I used to run it from within another
> Tool; it's a Tool itself, so it's actually designed to run from the Hadoop
> command line normally.
>
>        ToolRunner.run(getConf(), new S3DistCp(), new String[] {
>               "--src",             "/frugg/image-cache-stage2/",
>               "--srcPattern",      ".*part.*",
>               "--dest",            "s3n://fruggmapreduce/results-" + env + "/" + JobUtils.isoDate + "/output/itemtable/",
>               "--s3Endpoint",      "s3.amazonaws.com"         });
>
> Watch the "srcPattern": make sure you have that leading `.*`; that one
> threw me for a loop once.
>
> Dave
>
> From: Himanish Kushary [mailto:himanish@gmail.com]
> Sent: Thursday, March 28, 2013 5:51 PM
> To: user@hadoop.apache.org
> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hi Dave,
>
> Thanks for your reply. Our Hadoop instance is inside our corporate
> LAN. Could you please provide some details on how I could use the s3distcp
> from Amazon to transfer data from our on-premises Hadoop to Amazon S3?
> Wouldn't some kind of VPN be needed between the Amazon EMR instance and our
> on-premises Hadoop instance? Did you mean use the jar from Amazon on our
> local server?
>
> Thanks
>
> On Thu, Mar 28, 2013 at 3:56 AM, David Parks <da...@yahoo.com> wrote:
>
> Have you tried using s3distcp from Amazon? I used it many times to
> transfer 1.5TB between S3 and Hadoop instances. The process took 45 min,
> well over the 10-min timeout period you're running into a problem on.
>
> Dave
>
> From: Himanish Kushary [mailto:himanish@gmail.com]
> Sent: Thursday, March 28, 2013 10:54 AM
> To: user@hadoop.apache.org
> Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hello,
>
> I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
> the distcp utility. There are around 2200 files distributed over 15
> directories. The max individual file size is approx 50 MB.
>
> The distcp mapreduce job keeps on failing with this error
>
> "Task attempt_201303211242_0260_m_000005_0 failed to report status for
> 600 seconds. Killing!"
>
> and in the task attempt logs I can see a lot of INFO messages like
>
> "INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark"
>
> I am thinking of either transferring individual folders instead of the entire
> 70 GB folder as a workaround, or, as another option, increasing the
> "mapred.task.timeout" parameter to something like 6-7 hours (as the avg
> rate of transfer to S3 seems to be 5 MB/s). Is there any other better
> option to increase the throughput for transferring bulk data from HDFS to
> S3?  Looking forward to suggestions.
>
> --
> Thanks & Regards
> Himanish
>
> --
> Thanks & Regards
> Himanish



-- 
Thanks & Regards
Himanish

RE: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

Posted by David Parks <da...@yahoo.com>.
None of that complexity; they distribute the jar publicly (not the source,
but the jar). You can just add this to your libjars:
s3n://region.elasticmapreduce/libs/s3distcp/latest/s3distcp.jar
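
For a cluster outside EMR, a rough sketch of pulling that jar down and running it
directly might be as follows (region, jar version, paths, and bucket are
placeholders; it assumes s3n:// credentials are configured and that the jar's
manifest names its main class - and, as noted elsewhere in the thread, it may
still depend on EMR-specific classes being present):

       # copy the publicly hosted jar to the local machine
       hadoop fs -get s3n://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar .

       # run it like any other Tool, or pass it to your own job via -libjars
       hadoop jar s3distcp.jar \
           --src hdfs:///user/himanish/data/ \
           --dest s3n://my-bucket/data/ \
           --srcPattern '.*part.*'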

 

No VPN or anything, if you can access the internet you can get to S3. 

 

Follow their docs here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

 

Doesn't matter where your Hadoop instance is running.

 

Here's an example of the code/parameters I used to run it from within another
Tool; it's a Tool itself, so it's actually designed to run from the Hadoop
command line normally.

 

       ToolRunner.run(getConf(), new S3DistCp(), new String[] {

              "--src",             "/frugg/image-cache-stage2/",

              "--srcPattern",      ".*part.*",

              "--dest",            "s3n://fruggmapreduce/results-"+env+"/" +
JobUtils.isoDate + "/output/itemtable/", 

              "--s3Endpoint",      "s3.amazonaws.com"         });

 

Watch the "srcPattern": make sure you have that leading `.*`; that one threw
me for a loop once.

 

Dave

 

 

From: Himanish Kushary [mailto:himanish@gmail.com] 
Sent: Thursday, March 28, 2013 5:51 PM
To: user@hadoop.apache.org
Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

 

Hi Dave,

 

Thanks for your reply. Our Hadoop instance is inside our corporate LAN. Could
you please provide some details on how I could use the s3distcp from Amazon
to transfer data from our on-premises Hadoop to Amazon S3? Wouldn't some
kind of VPN be needed between the Amazon EMR instance and our on-premises
Hadoop instance? Did you mean use the jar from Amazon on our local server?

 

Thanks

On Thu, Mar 28, 2013 at 3:56 AM, David Parks <da...@yahoo.com> wrote:

Have you tried using s3distcp from amazon? I used it many times to transfer
1.5TB between S3 and Hadoop instances. The process took 45 min, well over
the 10min timeout period you're running into a problem on.

 

Dave

 

 

From: Himanish Kushary [mailto:himanish@gmail.com] 
Sent: Thursday, March 28, 2013 10:54 AM
To: user@hadoop.apache.org
Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

 

Hello,

 

I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
the distcp utility. There are around 2200 files distributed over 15
directories. The max individual file size is approx 50 MB.

 

The distcp mapreduce job keeps on failing with this error 

 

"Task attempt_201303211242_0260_m_000005_0 failed to report status for 600
seconds. Killing!"  

 

and in the task attempt logs I can see a lot of INFO messages like

 

"INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
(java.io.IOException) caught when processing request: Resetting to invalid
mark"

 

I am thinking of either transferring individual folders instead of the entire
70 GB folder as a workaround, or, as another option, increasing the
"mapred.task.timeout" parameter to something like 6-7 hours (as the avg rate
of transfer to S3 seems to be 5 MB/s). Is there any other better option to
increase the throughput for transferring bulk data from HDFS to S3?
Looking forward to suggestions.

 

 

-- 
Thanks & Regards
Himanish 





 

-- 
Thanks & Regards
Himanish 


Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

Posted by Himanish Kushary <hi...@gmail.com>.
Hi Dave,

Thanks for your reply. Our Hadoop instance is inside our corporate
LAN. Could you please provide some details on how I could use the s3distcp
from Amazon to transfer data from our on-premises Hadoop to Amazon S3?
Wouldn't some kind of VPN be needed between the Amazon EMR instance and our
on-premises Hadoop instance? Did you mean use the jar from Amazon on our
local server?

Thanks

On Thu, Mar 28, 2013 at 3:56 AM, David Parks <da...@yahoo.com> wrote:

> Have you tried using s3distcp from Amazon? I used it many times to
> transfer 1.5TB between S3 and Hadoop instances. The process took 45 min,
> well over the 10-min timeout period you're running into a problem on.
>
> Dave
>
> From: Himanish Kushary [mailto:himanish@gmail.com]
> Sent: Thursday, March 28, 2013 10:54 AM
> To: user@hadoop.apache.org
> Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hello,
>
> I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
> the distcp utility. There are around 2200 files distributed over 15
> directories. The max individual file size is approx 50 MB.
>
> The distcp mapreduce job keeps on failing with this error
>
> "Task attempt_201303211242_0260_m_000005_0 failed to report status for
> 600 seconds. Killing!"
>
> and in the task attempt logs I can see a lot of INFO messages like
>
> "INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark"
>
> I am thinking of either transferring individual folders instead of the entire
> 70 GB folder as a workaround, or, as another option, increasing the
> "mapred.task.timeout" parameter to something like 6-7 hours (as the avg
> rate of transfer to S3 seems to be 5 MB/s). Is there any other better
> option to increase the throughput for transferring bulk data from HDFS to
> S3?  Looking forward to suggestions.
>
> --
> Thanks & Regards
> Himanish



-- 
Thanks & Regards
Himanish

RE: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

Posted by David Parks <da...@yahoo.com>.
Have you tried using s3distcp from Amazon? I used it many times to transfer
1.5TB between S3 and Hadoop instances. The process took 45 min, well over
the 10-min timeout period you're running into a problem on.

 

Dave

 

 

From: Himanish Kushary [mailto:himanish@gmail.com] 
Sent: Thursday, March 28, 2013 10:54 AM
To: user@hadoop.apache.org
Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

 

Hello,

 

I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
the distcp utility. There are around 2200 files distributed over 15
directories. The max individual file size is approx 50 MB.

 

The distcp mapreduce job keeps on failing with this error 

 

"Task attempt_201303211242_0260_m_000005_0 failed to report status for 600
seconds. Killing!"  

 

and in the task attempt logs I can see a lot of INFO messages like

 

"INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
(java.io.IOException) caught when processing request: Resetting to invalid
mark"

 

I am thinking of either transferring individual folders instead of the entire
70 GB folder as a workaround, or, as another option, increasing the
"mapred.task.timeout" parameter to something like 6-7 hours (as the avg rate
of transfer to S3 seems to be 5 MB/s). Is there any other better option to
increase the throughput for transferring bulk data from HDFS to S3?
Looking forward to suggestions.

 

 

-- 
Thanks & Regards
Himanish 

