Posted to user@spark.apache.org by Mayuresh Kunjir <ma...@cs.duke.edu> on 2016/05/29 21:55:00 UTC

Accessing s3a files from Spark

I'm running into permission issues while accessing data in an S3 bucket,
stored using the s3a file system, from a local Spark cluster. Has anyone had
success with this?

My setup is:
- Spark 1.6.1 compiled against Hadoop 2.7.2
- aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.2.jar in the classpath
- Spark's Hadoop configuration is as follows:

sc.hadoopConfiguration.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")

sc.hadoopConfiguration.set("fs.s3a.access.key", <access>)

sc.hadoopConfiguration.set("fs.s3a.secret.key", <secret>)

(The secret key does not contain any '/' characters, which others have
reported to cause issues.)
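
For completeness, a minimal sketch of how the same settings can be supplied
through SparkConf (using Spark's spark.hadoop.* property prefix) before the
context is created, so the executors pick them up as well; the app name and
the environment-variable names below are only placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the same s3a settings supplied via spark.hadoop.* before the
// SparkContext exists. "s3a-test" and the env var names are placeholders.
val conf = new SparkConf()
  .setAppName("s3a-test")
  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
val sc = new SparkContext(conf)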


I have configured my S3 bucket to grant the necessary permissions. (
https://sparkour.urizone.net/recipes/configuring-s3/)


What works: Listing, reading from, and writing to s3a using the hadoop
command line, e.g. hadoop dfs -ls s3a://<bucket name>/<file path>


What doesn't work: Reading from s3a using Spark's textFile API. Each task
throws an exception saying *Forbidden Access (403)*.
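
For concreteness, the failing read is essentially the following (bucket and
path are placeholders, as above):

// Placeholder bucket/path; this is the read whose tasks fail with 403.
val lines = sc.textFile("s3a://<bucket name>/<file path>")
println(lines.count())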


Some online documents suggest using IAM roles to grant permissions for an
AWS cluster, but I would like a solution for my local standalone cluster.


Any help would be appreciated.


Regards,

~Mayuresh

Re: Accessing s3a files from Spark

Posted by Mayuresh Kunjir <ma...@cs.duke.edu>.
On Tue, May 31, 2016 at 7:05 AM, Gourav Sengupta <go...@gmail.com>
wrote:

> Hi,
>
> And on another note, is it required to use s3a? Why not use s3:// only? I
> prefer to use s3a:// only while writing files to S3 from EMR
>

Does Spark support s3://? I am using s3a over s3n because I needed to
access files larger than 5GB.

I am on a local cluster with (most likely) no firewall restrictions, but I
am considering spawning an EMR cluster now. The Spark version is 1.6.




Re: Accessing s3a files from Spark

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

And on another note, is it required to use s3a? Why not use s3:// only? I
prefer to use s3a:// only while writing files to S3 from EMR.

Regards,
Gourav Sengupta


Re: Accessing s3a files from Spark

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

Is your Spark cluster running on EMR, on a self-created Spark cluster on
EC2, or on a local cluster behind a firewall? What Spark version are you
using?

Regards,
Gourav Sengupta


Re: Accessing s3a files from Spark

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

I am sorry, I did read https://wiki.apache.org/hadoop/AmazonS3, which
mentions that s3:// is deprecated. From what I read, s3a is the
preferred way to go.

Of course, I have been using it for writing data from Spark but not for
reading yet. Let me try that and come back.

Regards,
Gourav Sengupta

On Tue, May 31, 2016 at 12:22 PM, Mayuresh Kunjir <ma...@cs.duke.edu>
wrote:

> How do I use it? I'm accessing s3a from Spark's textFile API.
>

Re: Accessing s3a files from Spark

Posted by Mayuresh Kunjir <ma...@cs.duke.edu>.
How do I use it? I'm accessing s3a from Spark's textFile API.

On Tue, May 31, 2016 at 7:16 AM, Deepak Sharma <de...@gmail.com>
wrote:

> Hi Mayuresh
> Instead of s3a , have you tried the https:// uri for the same s3 bucket?
>
> HTH
> Deepak
>

Re: Accessing s3a files from Spark

Posted by Deepak Sharma <de...@gmail.com>.
Hi Mayuresh
Instead of s3a, have you tried an https:// URI for the same S3 bucket?

HTH
Deepak

On Tue, May 31, 2016 at 4:41 PM, Mayuresh Kunjir <ma...@cs.duke.edu>
wrote:

>
>
> On Tue, May 31, 2016 at 5:29 AM, Steve Loughran <st...@hortonworks.com>
> wrote:
>
>> which s3 endpoint?
>>
>>
> ​I have tried both s3.amazonaws.com and s3-external-1.amazonaws.com​.
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

Re: Accessing s3a files from Spark

Posted by Mayuresh Kunjir <ma...@cs.duke.edu>.
On Tue, May 31, 2016 at 5:29 AM, Steve Loughran <st...@hortonworks.com>
wrote:

> which s3 endpoint?
>
>
I have tried both s3.amazonaws.com and s3-external-1.amazonaws.com.

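For reference, a minimal sketch of how I set each endpoint, assuming this
goes through the standard fs.s3a.endpoint property:

// Sketch: pointing s3a at an explicit endpoint (tried each in turn).
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")
// sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-external-1.amazonaws.com")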


Re: Accessing s3a files from Spark

Posted by Steve Loughran <st...@hortonworks.com>.
which s3 endpoint?




Re: Accessing s3a files from Spark

Posted by Mayuresh Kunjir <ma...@cs.duke.edu>.
On Sun, May 29, 2016 at 7:49 PM, Ted Yu <yu...@gmail.com> wrote:

> Have you seen this thread ?
>
>
> http://search-hadoop.com/m/q3RTthWU8o1MbFC2&subj=Re+Forbidded+Error+Code+403
>
>

Thanks for the pointer. I have followed the thread but had no success.

I am trying out the Spark branch suggested by Teng Qiu in that thread and
will update soon.



Re: Accessing s3a files from Spark

Posted by Ted Yu <yu...@gmail.com>.
Have you seen this thread ?

http://search-hadoop.com/m/q3RTthWU8o1MbFC2&subj=Re+Forbidded+Error+Code+403
