Posted to issues@spark.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2016/08/30 18:58:20 UTC

[jira] [Commented] (SPARK-17307) Document what all access is needed on S3 bucket when trying to save a model

    [ https://issues.apache.org/jira/browse/SPARK-17307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15449830#comment-15449830 ] 

Steve Loughran commented on SPARK-17307:
----------------------------------------

I think this is a subset of SPARK-7481, where I am doing the docs:

https://github.com/steveloughran/spark/blob/f39018eee40ef463ebfdfb0f6a7ba6384b46c459/docs/cloud-integration.md

I haven't done the bit on authentication setup yet, though; I'm planning to point to the [Hadoop docs there|https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html], because as well as the details on how to configure the latest Hadoop s3x clients, they have a troubleshooting section.

Looking at the code:

# It's dangerous to put AWS secrets in the source file; it's too easy to leak them. Stick them in your Spark configuration file instead, prefixed with {{spark.hadoop.}} (see the sketch just below this list).
# If you are using Hadoop 2.7+, please use s3a:// paths instead of s3n://. Your life will be better.
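
As an illustration (a sketch only, not from the PR: the {{fs.s3a.*}} property names are the standard Hadoop 2.7+ ones, and the values are placeholders), the relevant {{spark-defaults.conf}} entries would look like:

{code}
# spark-defaults.conf -- keep this file out of version control
spark.hadoop.fs.s3a.access.key    YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key    YOUR_SECRET_KEY
{code}

Spark copies every {{spark.hadoop.*}} property into the Hadoop configuration, so the s3a client picks the credentials up without any secrets appearing in application code.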

Anyway, can you have a look at the cloud integration doc I've linked to and comment on the [pull request|https://github.com/apache/spark/pull/12004] about where it could be improved? I'll do my best to address it.
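
For reference, here is roughly what the save from the report looks like once the credentials live in the Spark configuration and the path uses s3a:// (a minimal sketch, not from the PR; the bucket and the existing-model path are placeholders, and loading a previously saved model just stands in for however you obtained your fitted {{PipelineModel}}):

{code}
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.sql.SparkSession;

public class SaveModelToS3 {
  public static void main(String[] args) throws java.io.IOException {
    // Credentials come from spark-defaults.conf (spark.hadoop.fs.s3a.*),
    // so no secrets appear in the source.
    SparkSession spark = SparkSession.builder()
        .appName("my app")
        .master("local")
        .getOrCreate();

    // A fitted PipelineModel; loaded here only so the example is complete.
    PipelineModel pipelineModel =
        PipelineModel.load("s3a://<BUCKET>/dev-qa/existingModel");

    // Saving needs both write and delete access on the bucket/prefix,
    // because temporary files are created and then removed along the way.
    pipelineModel.write().overwrite().save("s3a://<BUCKET>/dev-qa/modelTest");
  }
}
{code}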


> Document what all access is needed on S3 bucket when trying to save a model
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-17307
>                 URL: https://issues.apache.org/jira/browse/SPARK-17307
>             Project: Spark
>          Issue Type: Documentation
>            Reporter: Aseem Bansal
>            Priority: Minor
>
> I faced this lack of documentation when I was trying to save a model to S3. Initially I thought it should need only write access. Then I found it also needs delete access to remove temporary files. I requested delete access, tried again, and now I get the error
> Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/dev-qa_%24folder%24' XML Error Message
> To reproduce this error, the code below can be used:
> {code}
> SparkSession sparkSession = SparkSession
>         .builder()
>         .appName("my app")
>         .master("local")
>         .getOrCreate();
> JavaSparkContext jsc = new JavaSparkContext(sparkSession.sparkContext());
> jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", <ACCESS_KEY>);
> jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", <SECRET ACCESS KEY>);
> // pipelineModel is a fitted PipelineModel created earlier
> pipelineModel.write().overwrite().save("s3n://<BUCKET>/dev-qa/modelTest");
> {code}
> This back and forth could be avoided if it were clearly documented what access Spark needs to write to S3. It would also be great to explain why each kind of access is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org