Posted to user@spark.apache.org by Darin McBeath <dd...@yahoo.com.INVALID> on 2016/01/13 20:42:38 UTC

Best practice for retrieving over 1 million files from S3

I'm looking for some suggestions based on others' experiences.

I have a job that I need to run periodically in which I read on the order of one million files from an S3 bucket.  It is not the entire bucket (nor does the set of files match a pattern).  Instead, I have a list of arbitrary keys that name the files in this S3 bucket.  The bucket itself will contain upwards of 60M files.

My current approach is to take my list of keys, partition on the key, and then map each key through an underlying class that uses the most recent AWS SDK to retrieve the corresponding file from S3 and return its contents.  In the end, I have an RDD<String>.  This works, but I wonder whether it is the best way; I suspect there might be a better/faster approach.
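
In rough outline, the pattern looks something like this (just a sketch of the idea, not my actual code; the bucket name, the "key-list.txt" path, and the partition count are placeholders, and it assumes the v1 AWS Java SDK plus commons-io on the classpath):

import com.amazonaws.services.s3.AmazonS3Client
import org.apache.commons.io.IOUtils
import org.apache.spark.{SparkConf, SparkContext}

object FetchByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fetch-by-key"))

    val bucket = "my-bucket"                    // placeholder bucket name
    // One key per line; "key-list.txt" is a placeholder path (HDFS, S3, ...).
    val keys = sc.textFile("key-list.txt", 500)

    // Build one S3 client per partition (not per key) and pull each object.
    val files = keys.mapPartitions { iter =>
      val s3 = new AmazonS3Client()             // credentials from env / instance profile
      iter.map { key =>
        val obj = s3.getObject(bucket, key.trim)
        try IOUtils.toString(obj.getObjectContent, "UTF-8")
        finally obj.close()
      }
    }

    println(files.count())                      // or whatever action the job needs
  }
}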

One thing I've been considering is passing all of the keys (as s3n: URLs) to sc.textFile or sc.wholeTextFiles (the latter since some of my files can have embedded newlines).  But I wonder how either of these would behave if I passed it literally a million (or more) 'filenames'.
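
Concretely, I'm imagining something along these lines (again only a sketch; both methods accept a comma-separated list of paths, and the bucket name and key-list path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object FetchByPathList {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fetch-by-path-list"))

    val bucket = "my-bucket"                          // placeholder
    // Key list read on the driver, one key per line ("key-list.txt" is a placeholder).
    val keys = Source.fromFile("key-list.txt").getLines().toSeq

    // Both textFile and wholeTextFiles accept a comma-separated list of paths.
    val paths = keys.map(k => s"s3n://$bucket/$k").mkString(",")

    // wholeTextFiles keeps each file as a single (path, contents) record,
    // so embedded newlines are not a problem.
    val files = sc.wholeTextFiles(paths)
    println(files.count())
  }
}

The open question is how the driver-side path handling copes once that comma-separated list holds a million entries.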

Before I spend time exploring, I wanted to seek some input.

Any thoughts would be appreciated.

Darin.


Re: Best practice for retrieving over 1 million files from S3

Posted by Daniel Imberman <da...@gmail.com>.
I guess my big question would be: why do you have so many files? Is there any possibility that you could merge a lot of those files together before processing them?
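
For what it's worth, one way that merge could look is to fetch everything once and rewrite it as a much smaller number of large files (purely a sketch, not something from this thread; the bucket, prefix, partition count, and output path are all placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object MergeSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("merge-small-files"))

    // Stand-in for an RDD[(key, contents)] built by whatever fetch job you
    // already have; the bucket/prefix here are placeholders.
    val keyed = sc.wholeTextFiles("s3n://my-bucket/some-prefix/")

    // Rewrite the ~1M small objects as a few hundred large SequenceFiles.
    keyed
      .coalesce(200)                                  // placeholder partition count
      .saveAsSequenceFile("s3n://my-bucket/merged/")  // placeholder output prefix
  }
}

A SequenceFile of (key, contents) pairs is used here rather than plain text output so that files with embedded newlines survive the round trip; later runs would read the merged output with sc.sequenceFile instead of pulling a million tiny objects.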

On Wed, Jan 13, 2016 at 11:59 AM Darin McBeath <dd...@yahoo.com> wrote:

> Thanks for the tip, as I had not seen this before.  That's pretty much
> what I'm doing already.  Was just thinking there might be a better way.
>
> Darin.
> ------------------------------
> *From:* Daniel Imberman <da...@gmail.com>
> *To:* Darin McBeath <dd...@yahoo.com>; User <us...@spark.apache.org>
> *Sent:* Wednesday, January 13, 2016 2:48 PM
> *Subject:* Re: Best practice for retrieving over 1 million files from S3
>
> Hi Darin,
>
> You should read this article. sc.textFile is very inefficient with S3.
>
> http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
>
> Cheers
>

Re: Best practice for retrieving over 1 million files from S3

Posted by Daniel Imberman <da...@gmail.com>.
Hi Darin,

You should read this article. sc.textFile is very inefficient with S3.

http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219

Cheers


Re: Best practice for retrieving over 1 million files from S3

Posted by Steve Loughran <st...@hortonworks.com>.
Use s3a://, especially on Hadoop 2.7+. It uses the Amazon libraries and is faster for directory lookups than jets3t.
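
For example, something like this (a sketch only; it assumes hadoop-aws 2.7.x and a matching aws-java-sdk on the classpath, and the bucket/key and credential handling shown are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object S3ARead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3a-read"))

    // Credentials could equally come from core-site.xml or an instance
    // profile; this assumes the standard AWS environment variables are set.
    val hc = sc.hadoopConfiguration
    hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Same read as before, just with the s3a:// scheme instead of s3n://.
    val files = sc.wholeTextFiles("s3a://my-bucket/some-key")   // placeholder bucket/key
    println(files.count())
  }
}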


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org