Posted to user@spark.apache.org by "Pagliari, Roberto" <rp...@appcomsci.com> on 2015/07/14 19:50:11 UTC

Spark on EMR with S3 example (Python)

Is there an example about how to load data from a public S3 bucket in Python? I haven't found any.

Thank you,


Re: Spark on EMR with S3 example (Python)

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
I think any request going to an s3*:// path requires credentials. If they
have made the data public over plain HTTP, then you won't need the keys.
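
For illustration, a minimal sketch of that HTTP route (the URL is a made-up
placeholder, and this assumes the file is small enough to pull onto the
driver before parallelizing):

import urllib2

from pyspark import SparkContext

sc = SparkContext()

# fetch the publicly served file on the driver over plain HTTP
raw = urllib2.urlopen("http://mybucket.s3.amazonaws.com/my_file.txt").read()

# split into lines and distribute them as an RDD
lines = sc.parallelize(raw.split("\n"))
print lines.count()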

Thanks
Best Regards

On Wed, Jul 15, 2015 at 2:26 AM, Pagliari, Roberto <rp...@appcomsci.com>
wrote:

> Hi Sujit,
>
> I just wanted to access public datasets on Amazon. Do I still need to
> provide the keys?

Re: Spark on EMR with S3 example (Python)

Posted by Sujit Pal <su...@gmail.com>.
Hi Roberto,

I think you would need to, as Akhil said. I just checked this page:

http://aws.amazon.com/public-data-sets/

and, clicking through a few of the dataset links, all of them are available
on s3 (some are also available via http and ftp). I think the point of these
datasets is that they are usually very large, so having them on s3 makes it
easier to take your code to the data than to bring the data to your code.
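
One hedged aside I have not verified on EMR: with a newer hadoop-aws module
on the classpath, the s3a connector can read public buckets anonymously, so
no keys are needed. The bucket and path below are placeholders:

from pyspark import SparkContext

sc = SparkContext()

# tell s3a to sign nothing, i.e. access the bucket anonymously
sc._jsc.hadoopConfiguration().set(
    "fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

rdd = sc.textFile("s3a://some-public-bucket/some/path")
print rdd.take(5)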

-sujit


On Tue, Jul 14, 2015 at 1:56 PM, Pagliari, Roberto <rp...@appcomsci.com>
wrote:

> Hi Sujit,
>
> I just wanted to access public datasets on Amazon. Do I still need to
> provide the keys?

RE: Spark on EMR with S3 example (Python)

Posted by "Pagliari, Roberto" <rp...@appcomsci.com>.
Hi Sujit,
I just wanted to access public datasets on Amazon. Do I still need to provide the keys?

Thank you,



Re: Spark on EMR with S3 example (Python)

Posted by Sujit Pal <su...@gmail.com>.
Hi Roberto,

I have written PySpark code that reads from private S3 buckets; it should
work similarly for public S3 buckets. You need to set the AWS access and
secret keys on the SparkContext's Hadoop configuration, and then you can
access the S3 folders and files with their s3n:// paths. Something like
this:

from pyspark import SparkContext

sc = SparkContext()

# the s3n connector picks these keys up from the Hadoop configuration
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_key)

# do_something is a placeholder for your own transformation; note that
# saveAsTextFile returns None, so its result should not be assigned, and
# the output path uses s3n:// to match the input
sc.textFile("s3n://mybucket/my_input_folder") \
    .map(lambda x: do_something(x)) \
    .saveAsTextFile("s3n://mybucket/my_output_folder")
...

You can read and write sequence files as well; text and sequence files are
the only two formats I have tried, but I'm sure others like JSON would work
too. Another approach is to embed the AWS access key and secret key directly
into the s3n:// path, as sketched below.
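
For concreteness, a sketch of those two variants (bucket names, folders, and
keys are placeholders; note that credentials embedded in the path can leak
into logs, and a secret key containing a "/" breaks the URI):

# credentials embedded directly in the s3n:// path
path = "s3n://%s:%s@mybucket/my_input_folder" % (aws_access_key, aws_secret_key)
mydata = sc.textFile(path)

# sequence files come back as an RDD of (key, value) pairs, and any RDD of
# pairs can be written back the same way
pairs = sc.sequenceFile("s3n://mybucket/my_seq_input")
pairs.saveAsSequenceFile("s3n://mybucket/my_seq_output")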

I wasn't able to use the s3 protocol, but s3n is roughly equivalent (I
believe one is just an older version of the other, though I'm not sure), and
it works for access.

Hope this helps,
Sujit


On Tue, Jul 14, 2015 at 10:50 AM, Pagliari, Roberto <rpagliari@appcomsci.com> wrote:

> Is there an example about how to load data from a public S3 bucket in
> Python? I haven’t found any.
>
>
>
> Thank you,
>
>
>