Posted to user@spark.apache.org by Brian Gawalt <bg...@gmail.com> on 2014/07/03 01:17:11 UTC

AWS Credentials for private S3 reads

Hello everyone,

I'm having some difficulty reading from my company's private S3 buckets. 
I've got an S3 access key and secret key, and I can read the files fine from
a non-Spark Scala routine via AWScala <http://github.com/seratch/AWScala>.
But trying to read them with SparkContext.textFile([comma-separated
s3n://bucket/key URIs]) leads to the following stack trace (where I've
changed the object key to use the terms 'subbucket' and 'datafile-' for
privacy reasons):

[error] (run-main-0) org.apache.hadoop.fs.s3.S3Exception:
org.jets3t.service.S3ServiceException: S3 HEAD request failed for
'/subbucket%2F2014%2F01%2Fdatafile-01.gz' - ResponseCode=403,
ResponseMessage=Forbidden
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
S3 HEAD request failed for '/ja_quick_info%2F2014%2F01%2Fapplied-01.gz' -
ResponseCode=403, ResponseMessage=Forbidden
	at
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:122)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryI
[... etc ...]

I'm handing off the credentials themselves via the following method:

def cleanKey(s: String): String = s.replace("/", "%2F")

val sc = new SparkContext("local[8]", "SparkS3Test")

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", 
cleanKey(creds.accessKeyId))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", 
cleanKey(creds.secretAccessKey))

The comma-separated URIs themselves each look like:

s3n://odesk-bucket-name/subbucket/2014/01/datafile-01.gz

The actual string that I've replaced with 'subbucket' includes underscores
but otherwise is straight ASCII; the string that 'datafile' substitutes for
is also straight ASCII.
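
For concreteness, the read itself looks roughly like the sketch below (the
bucket and key names are just the placeholder values from above, and I've
listed two of the files):

val uris = Seq(
  "s3n://odesk-bucket-name/subbucket/2014/01/datafile-01.gz",
  "s3n://odesk-bucket-name/subbucket/2014/01/datafile-03.gz"
).mkString(",")

val lines = sc.textFile(uris)  // one RDD[String] across all the gzipped objects
println(lines.count())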

This is using Spark 1.0.0, via the following sbt library dependency:
"org.apache.spark" % "spark-core_2.10" % "1.0.0"

Any tips appreciated!
Thanks much,
-Brian





Re: AWS Credentials for private S3 reads

Posted by Matei Zaharia <ma...@gmail.com>.
Hmm, yeah, that is weird, but since it’s only happening on some files it might mean those didn’t get fully uploaded.
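
One rough way to check, assuming you can pull a suspect file down locally, is
to decompress it end to end outside Spark and see whether the same trailer
error shows up (a sketch, not tested against your data):

import java.io.{FileInputStream, IOException}
import java.util.zip.GZIPInputStream

// Read the whole stream; a truncated or partial upload typically fails
// with a corrupt-trailer / size-mismatch error before reaching EOF.
def gzipIsIntact(path: String): Boolean = {
  val in = new GZIPInputStream(new FileInputStream(path))
  try {
    val buf = new Array[Byte](64 * 1024)
    while (in.read(buf) != -1) ()
    true
  } catch {
    case _: IOException => false
  } finally {
    in.close()
  }
}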

Matei

On Jul 2, 2014, at 4:50 PM, Brian Gawalt <bg...@gmail.com> wrote:

> HUH; not-scrubbing the slashes fixed it. I would have sworn I tried it, got a
> 403 Forbidden, then remembered the slash prescription. Can confirm I was
> never scrubbing the actual URIs. It looks like it'd all be working now
> except it's smacking its head against:
> 
> 14/07/02 23:37:38 INFO rdd.HadoopRDD: Input split:
> s3n://odesk-bucket/subbucket/2014/01/datafile-01.gz:0+661974299
> 14/07/02 23:37:38 INFO rdd.HadoopRDD: Input split:
> s3n://odesk-bucket/subbucket/2014/01/datafile-03.gz:0+1207089239
> 14/07/02 23:37:38 INFO rdd.HadoopRDD: Input split:
> s3n://odesk-bucket/subbucket/2014/01/datafile-06.gz:0+1155725077
> 14/07/02 23:38:57 ERROR executor.Executor: Exception in task ID 0
> java.io.IOException: stored gzip size doesn't match decompressed size
>        at
> org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.executeTrailerState(BuiltInGzipDecompressor.java:389)
>        at
> org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.decompress(BuiltInGzipDecompressor.java:224)
>        at
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:82)
>        at
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
> 
> but maybe that's just something we need to deal with internally.
> 
> Thanks,
> --Brian
> 
> 
> 


Re: AWS Credentials for private S3 reads

Posted by Brian Gawalt <bg...@gmail.com>.
HUH; not-scrubbing the slashes fixed it. I would have sworn I tried it, got a
403 Forbidden, then remembered the slash prescription. Can confirm I was
never scrubbing the actual URIs. It looks like it'd all be working now
except it's smacking its head against:

14/07/02 23:37:38 INFO rdd.HadoopRDD: Input split:
s3n://odesk-bucket/subbucket/2014/01/datafile-01.gz:0+661974299
14/07/02 23:37:38 INFO rdd.HadoopRDD: Input split:
s3n://odesk-bucket/subbucket/2014/01/datafile-03.gz:0+1207089239
14/07/02 23:37:38 INFO rdd.HadoopRDD: Input split:
s3n://odesk-bucket/subbucket/2014/01/datafile-06.gz:0+1155725077
14/07/02 23:38:57 ERROR executor.Executor: Exception in task ID 0
java.io.IOException: stored gzip size doesn't match decompressed size
        at
org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.executeTrailerState(BuiltInGzipDecompressor.java:389)
        at
org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.decompress(BuiltInGzipDecompressor.java:224)
        at
org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:82)
        at
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)

but maybe that's just something we need to deal with internally.

Thanks,
--Brian




Re: AWS Credentials for private S3 reads

Posted by Matei Zaharia <ma...@gmail.com>.
When you use hadoopConfiguration directly, I don’t think you have to replace the “/” with “%2f”. Have you tried it without that? Also make sure you’re not replacing slashes in the URL itself.
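
In other words, something along these lines (a sketch, reusing the creds
object and SparkContext setup from your snippet):

import org.apache.spark.SparkContext

val sc = new SparkContext("local[8]", "SparkS3Test")

// Pass the key and secret through untouched; no %2F-escaping of slashes.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", creds.accessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", creds.secretAccessKey)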

Matei

On Jul 2, 2014, at 4:17 PM, Brian Gawalt <bg...@gmail.com> wrote:

> Hello everyone,
> 
> I'm having some difficulty reading from my company's private S3 buckets. 
> I've got an S3 access key and secret key, and I can read the files fine from
> a non-Spark Scala routine via AWScala <http://github.com/seratch/AWScala>.
> But trying to read them with SparkContext.textFile([comma-separated
> s3n://bucket/key URIs]) leads to the following stack trace (where I've
> changed the object key to use the terms 'subbucket' and 'datafile-' for
> privacy reasons):
> 
> [error] (run-main-0) org.apache.hadoop.fs.s3.S3Exception:
> org.jets3t.service.S3ServiceException: S3 HEAD request failed for
> '/subbucket%2F2014%2F01%2Fdatafile-01.gz' - ResponseCode=403,
> ResponseMessage=Forbidden
> org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
> S3 HEAD request failed for '/ja_quick_info%2F2014%2F01%2Fapplied-01.gz' -
> ResponseCode=403, ResponseMessage=Forbidden
> 	at
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:122)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:483)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryI
> [... etc ...]
> 
> I'm handing off the credentials themselves via the following method:
> 
> def cleanKey(s: String): String = s.replace("/", "%2F")
> 
> val sc = new SparkContext("local[8]", "SparkS3Test")
> 
> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", 
> cleanKey(creds.accessKeyId))
> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", 
> cleanKey(creds.secretAccessKey))
> 
> The comma-separated URIs themselves each look like:
> 
> s3n://odesk-bucket-name/subbucket/2014/01/datafile-01.gz
> 
> The actual string that I've replaced with 'subbucket' includes underscores
> but otherwise is straight ASCII; the string that 'datafile' substitutes for
> is also straight ASCII.
> 
> This is using Spark 1.0.0, via the following sbt library dependency:
> "org.apache.spark" % "spark-core_2.10" % "1.0.0"
> 
> Any tips appreciated!
> Thanks much,
> -Brian
> 
> 
> 
> 