Posted to user@spark.apache.org by Edden Burrow <ed...@gmail.com> on 2016/11/16 22:34:10 UTC

Any with S3 experience with Spark? Having ListBucket issues

Anyone dealing with a lot of files with Spark? We're trying s3a with 2.0.1
because we're seeing intermittent errors in S3 where jobs fail and
saveAsTextFile fails. Using pyspark.

Is there any issue with working in an S3 folder that has too many files?
How about having versioning enabled? Are these things going to be a problem?

We're pre-building the S3 file list, storing it in a file, and passing
that to textFile as a long comma-separated list of paths - so we are not
issuing list requests ourselves.
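The manifest-driven approach described above can be sketched in plain Python. This is a hypothetical illustration, not the poster's actual code: the manifest contents and bucket name are made-up examples, and only the final joined string is what gets handed to textFile.

```python
# Sketch of the approach above: read a pre-built manifest of S3 object
# keys and join them into one comma-separated string of s3a:// paths,
# so Spark never has to list the bucket itself.
# The bucket name and manifest contents are made-up examples.

def build_textfile_arg(manifest_lines, bucket="my-bucket"):
    """Join object keys from a manifest into a single comma-separated
    s3a:// path string suitable for SparkContext.textFile()."""
    keys = [line.strip() for line in manifest_lines if line.strip()]
    return ",".join("s3a://%s/%s" % (bucket, key) for key in keys)

if __name__ == "__main__":
    manifest = ["logs/part-00000", "logs/part-00001", ""]
    paths = build_textfile_arg(manifest)
    print(paths)
    # In the actual pyspark job you would then do:
    # rdd = sc.textFile(paths)
```

SparkContext.textFile accepts a comma-separated list of paths, so one string covers an arbitrary set of objects without a wildcard.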

But we get errors from saveAsTextFile related to ListBucket, even though
we're not using the wildcard '*'.

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
Failed to parse XML document with handler class
org.jets3t.service.impl.rest.XmlResponsesSaxParser$ListBucketHandler


Running spark 2.0.1 with the s3a protocol.

thanks

Re: Any with S3 experience with Spark? Having ListBucket issues

Posted by Steve Loughran <st...@hortonworks.com>.
On 16 Nov 2016, at 22:34, Edden Burrow <ed...@gmail.com> wrote:

Anyone dealing with a lot of files with Spark? We're trying s3a with 2.0.1 because we're seeing intermittent errors in S3 where jobs fail and saveAsTextFile fails. Using pyspark.

How many files? Thousands? Millions?

If you do have a big/complex file structure, I'd really like to know; it not only helps us make sure that Spark/Hive metastore/s3a can handle the layout, it may also improve the advice on what not to do.


Is there any issue with working in an S3 folder that has too many files? How about having versioning enabled? Are these things going to be a problem?

Many, many files shouldn't be a problem, except for slowing down some operations and creating larger in-memory structures to be passed around. Partitioning can get slow.


We're pre-building the S3 file list, storing it in a file, and passing that to textFile as a long comma-separated list of paths - so we are not issuing list requests ourselves.

But we get errors from saveAsTextFile related to ListBucket, even though we're not using the wildcard '*'.

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: Failed to parse XML document with handler class org.jets3t.service.impl.rest.XmlResponsesSaxParser$ListBucketHandler


At a guess, it'll be one of the checks before the write that the parent directory exists and that the destination path isn't already a directory.


Running spark 2.0.1 with the s3a protocol.

Not with a stack trace containing org.jets3t you aren't. That's what you'd expect from s3 and s3n; the key feature of s3a is moving onto the Amazon SDK, where stack traces move to com.amazonaws classes.

Make sure you *are* using s3a, ideally on Hadoop 2.7.x (or, even better, HDP 2.5, where you get all the Hadoop 2.8 read pipeline optimisations). On Hadoop 2.6.x there were still some stabilisation issues that only surfaced in the wild.
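Since a jets3t stack trace means the old s3/s3n connectors are handling some of the paths, one quick way to rule that out is to scan the input list for anything not using the s3a scheme. A minimal sketch (the helper name and example paths are made up for illustration):

```python
# Sanity check: given the list of input paths, return any whose URL
# scheme is not 's3a' (e.g. the legacy 's3' or 's3n' connectors, which
# are the ones backed by jets3t).
from urllib.parse import urlparse

def non_s3a_paths(paths):
    """Return the paths whose scheme is not 's3a'."""
    return [p for p in paths if urlparse(p).scheme != "s3a"]

if __name__ == "__main__":
    paths = [
        "s3a://bucket/data/part-00000",
        "s3n://bucket/data/part-00001",   # legacy scheme - would use jets3t
        "s3://bucket/data/part-00002",    # legacy scheme - would use jets3t
    ]
    print(non_s3a_paths(paths))
```

Also worth checking the write side: the destination passed to saveAsTextFile, and any default filesystem setting, must use s3a:// too, or the old connector gets pulled in for the output path.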

Some related slides http://www.slideshare.net/steve_l/apache-spark-and-object-stores

-Steve