Posted to user@spark.apache.org by Joshua Buss <jo...@gmail.com> on 2016/02/26 20:29:46 UTC

s3 access through proxy

Hello,

I'm trying to use Spark with Google Cloud Storage, but from a network where
I cannot talk to the outside internet directly.  This means we have a proxy
set up for all requests heading towards GCS.

So far, I've had good luck with projects that talk to S3 through libraries
like boto (for Python) and the AWS SDK (for Node.js), because both
appear compatible with what Google calls "interoperability mode"
<https://cloud.google.com/storage/docs/migrating>.
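
For concreteness, the kind of boto code that works for us through the proxy
looks roughly like this (keys, proxy host/port, bucket, and prefix are all
placeholders):

  import boto
  from boto.s3.connection import OrdinaryCallingFormat

  conn = boto.connect_s3(
      aws_access_key_id="GOOG....",        # GCS interoperability HMAC key (placeholder)
      aws_secret_access_key="....",        # matching HMAC secret (placeholder)
      host="storage.googleapis.com",
      proxy="proxyhost",                   # our proxy, placeholder values
      proxy_port=12345,
      is_secure=True,
      calling_format=OrdinaryCallingFormat(),
  )
  bucket = conn.get_bucket("my-bucket")    # hypothetical bucket name
  for key in bucket.list(prefix="some/prefix/"):
      print(key.name)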

Spark (or whatever it uses for S3 connectivity under the hood, maybe
JetS3t?), on the other hand, doesn't appear to be compatible.
Furthermore, I can't use the Hadoop Google Cloud Storage connector because it
doesn't expose any properties for setting up a proxy host.

When I set the following core-site.xml values for the s3a connector, I get an
AmazonS3Exception:

Caused by: com.cloudera.com.amazonaws.services.s3.model.AmazonS3Exception:
The provided security credentials are not valid. (Service: Amazon S3; Status
Code: 403; Error Code: InvalidSecurity; Request ID: null), S3 Extended
Request ID: null

Here are the relevant core-site.xml properties (credential values elided):
  <property>
    <name>fs.s3a.access.key</name>
    <value>....</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>....</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>https://storage.googleapis.com</value>
  </property>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3a.proxy.host</name>
    <value>proxyhost</value>
  </property>
  <property>
    <name>fs.s3a.proxy.port</name>
    <value>12345</value>
  </property>
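
In case it's easier to reproduce, the same properties can also be set
programmatically; Spark copies any "spark.hadoop."-prefixed conf keys into
the Hadoop configuration. A rough sketch (same placeholder values as above,
plus a hypothetical bucket/path):

  from pyspark import SparkConf, SparkContext

  conf = (SparkConf()
          .setAppName("gcs-via-s3a")
          # spark.hadoop.* keys are copied into the Hadoop Configuration
          .set("spark.hadoop.fs.s3a.endpoint", "https://storage.googleapis.com")
          .set("spark.hadoop.fs.s3a.access.key", "....")
          .set("spark.hadoop.fs.s3a.secret.key", "....")
          .set("spark.hadoop.fs.s3a.proxy.host", "proxyhost")
          .set("spark.hadoop.fs.s3a.proxy.port", "12345"))
  sc = SparkContext(conf=conf)
  rdd = sc.textFile("s3a://my-bucket/some/prefix/")  # hypothetical bucket/path
  print(rdd.count())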

I'd inspect the traffic manually to learn more, but it's all encrypted with
SSL of course.  Any suggestions?
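
One idea I haven't fully tried as a substitute for sniffing: the SDK's HTTP
client can log full requests and responses itself. A sketch for Spark's
log4j.properties, assuming the usual commons-logging/log4j wiring (this is
extremely verbose and will print Authorization headers):

  log4j.logger.org.apache.http.wire=DEBUG
  # If the bundled SDK is shaded (the com.cloudera. prefix in the stack
  # trace suggests it is), the logger name may need the same prefix:
  # log4j.logger.com.cloudera.org.apache.http.wire=DEBUG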

I feel like I should also mention that, since this is critical to get working
ASAP, I'm also going down the route of using a local proxy like this one
<https://github.com/reversefold/python-s3proxy>, written in Python,
because boto handles the interoperability mode correctly and the proxy is
designed to "look like actual S3" to its clients.

Using this approach I can get a basic read to work with Spark, but it has
numerous drawbacks: it's very fragile, it doesn't support writes, and I'm
sure it'll become a huge performance bottleneck once we start running larger
workloads. So I'd like to get the native s3a/s3n connectors working if at
all possible, but I need some help.

Thanks!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/s3-access-through-proxy-tp26347.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: s3 access through proxy

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

Why are you trying to access data in S3 via another network? Doesn't that
cause huge network overhead, data transmission losses (as the data is
transferred over the internet), and other inconsistencies?

Have you tried the AWS CLI? Using the "aws s3 sync" command you can copy
all the files under s3://bucket/ or s3://bucket/key/ to your local system.
You can then point your Spark cluster at the local data store and run the
queries. Of course, that depends on the data volume as well.
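
For example, something like this (bucket and paths are placeholders, and the
local mirror has to be visible at the same path on every node):

  # Mirror the bucket (or a key prefix) locally, outside Spark:
  #   aws s3 sync s3://my-bucket/my-prefix/ /data/s3-mirror/
  # Then read the local copy:
  from pyspark import SparkContext

  sc = SparkContext(appName="read-local-mirror")
  rdd = sc.textFile("file:///data/s3-mirror/")
  print(rdd.count())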

Regards,
Gourav Sengupta
