Posted to user@pig.apache.org by James Newhaven <ja...@gmail.com> on 2012/06/08 13:40:12 UTC

Copying files to Amazon S3 using Pig is slow

I want to copy 26,000 HDFS files generated by a Pig script to Amazon S3.

I am using the copyToLocal command, but I noticed the copy throughput is
only one file per second, so it is going to take about 7 hours to copy
all the files.

The command I am using is: copyToLocal /tmp/files/ s3://my-bucket/

Does anyone have any ideas how I could speed this up?

Thanks,
James

Re: Copying files to Amazon S3 using Pig is slow

Posted by Mohit Anchlia <mo...@gmail.com>.
Also, use multiple parallel streams to S3 to get better throughput.
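
A minimal sketch of the multiple-streams idea, assuming the files sit
directly under /tmp/files/ and the cluster's S3 credentials are already
configured (the parallelism of 8 and the paths are illustrative, not
from the thread):

    # List the source files, then run up to 8 "hadoop fs -cp" streams
    # in parallel, one file per stream.
    hadoop fs -ls /tmp/files/ | awk '{print $NF}' | grep '^/tmp/files/' \
      | xargs -P 8 -I {} hadoop fs -cp {} s3://my-bucket/

Each stream is a separate JVM, so the per-file overhead stays high;
this mainly hides S3 request latency rather than fixing the
small-files problem itself.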

On Fri, Jun 8, 2012 at 3:24 PM, Aniket Mokashi <an...@gmail.com> wrote:

>
> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>
> On Fri, Jun 8, 2012 at 4:40 AM, James Newhaven <james.newhaven@gmail.com> wrote:
>
> > I want to copy 26,000 HDFS files generated by a pig script to Amazon S3.
> >
> > I am using the copyToLocal command, but I noticed the copy throughput is
> > only one file per second - so it is going to take about 7 hours to copy
> > all the files.
> >
> > The command I am using is: copyToLocal /tmp/files/ s3://my-bucket/
> >
> > Does anyone have any ideas how I could speed this up?
> >
> > Thanks,
> > James
> >
>
>
>
> --
> "...:::Aniket:::... Quetzalco@tl"
>

Re: Copying files to Amazon S3 using Pig is slow

Posted by Aniket Mokashi <an...@gmail.com>.
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
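
For concreteness, a hypothetical s3distcp invocation on an EMR cluster
of that vintage (the jar path and flags follow the linked guide; adjust
them to your environment):

    hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --src hdfs:///tmp/files/ \
      --dest s3://my-bucket/

s3distcp runs as a MapReduce job, so the copy is spread across the
cluster instead of going through a single one-file-at-a-time stream.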

On Fri, Jun 8, 2012 at 4:40 AM, James Newhaven <ja...@gmail.com> wrote:

> I want to copy 26,000 HDFS files generated by a pig script to Amazon S3.
>
> I am using the copyToLocal command, but I noticed the copy throughput is
> only one file per second - so it is going to take about 7 hours to copy all
> the files.
>
> The command I am using is: copyToLocal /tmp/files/ s3://my-bucket/
>
> Does anyone have any ideas how I could speed this up?
>
> Thanks,
> James
>



-- 
"...:::Aniket:::... Quetzalco@tl"

Re: Copying files to Amazon S3 using Pig is slow

Posted by Dan Young <da...@gmail.com>.
Definitely go down the s3distcp route. I use it to copy a large number
of smaller files from S3 into fewer, larger ones in HDFS, and it has
been working great. This also helps the Pig jobs run faster versus
having Pig try to load XXXX files from S3.
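
As a sketch of the combining Dan describes (S3 into fewer, larger HDFS
files), assuming part-named inputs; the regex and target size are
illustrative only:

    hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --src s3://my-bucket/input/ \
      --dest hdfs:///tmp/combined/ \
      --groupBy '.*(part)-.*' \
      --targetSize 128

Files whose --groupBy capture groups match are concatenated into chunks
of roughly --targetSize (in MiB), which cuts the number of map tasks
Pig has to schedule.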

Regards,

Dan


On Fri, Jun 8, 2012 at 5:40 AM, James Newhaven <ja...@gmail.com> wrote:

> I want to copy 26,000 HDFS files generated by a pig script to Amazon S3.
>
> I am using the copyToLocal command, but I noticed the copy throughput is
> only one file per second - so it is going to take about 7 hours to copy all
> the files.
>
> The command I am using is: copyToLocal /tmp/files/ s3://my-bucket/
>
> Does anyone have any ideas how I could speed this up?
>
> Thanks,
> James
>