Posted to user@spark.apache.org by Aseem Bansal <as...@gmail.com> on 2016/10/12 09:49:18 UTC

Reading from and writing to different S3 buckets in spark

Hi

I want to read a CSV from one bucket, do some processing, and write to a
different bucket. I know how to set S3 credentials using

jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
jssc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)

But the problem is that Spark is lazy, so if I do the following:

   - set credentials 1
   - read input csv
   - do some processing
   - set credentials 2
   - write result csv

Then there is a chance that, due to laziness, the program may try to use
credentials 2 while reading the input CSV.

A solution is to cache the result CSV, but if there is not enough storage
it is possible that the CSV will be re-read. So how should this situation
be handled?
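To make the concern concrete, here is a minimal sketch of that sequence (assuming a spark-shell style session where sc and spark are available; the bucket names, paths, and key strings are placeholders, not taken from the thread):

// Sketch only: bucket names, paths, and key strings below are placeholders.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY_1")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "SECRET_KEY_1")

// Declare the read and the processing; the full scan only happens when an action runs.
val input = spark.read.option("header", "true").csv("s3n://input-bucket/data.csv")
val result = input.na.drop()   // some processing, also lazy

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY_2")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "SECRET_KEY_2")

// The write triggers the whole lineage; the read may now pick up credentials 2,
// which is the failure mode described above.
result.write.csv("s3n://output-bucket/result")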

Re: Reading from and writing to different S3 buckets in spark

Posted by Steve Loughran <st...@hortonworks.com>.
On 12 Oct 2016, at 10:49, Aseem Bansal <as...@gmail.com> wrote:

> Hi
>
> I want to read a CSV from one bucket, do some processing, and write to a different bucket. I know how to set S3 credentials using
>
> jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
> jssc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)
>
> But the problem is that Spark is lazy, so if I do the following:
>
>   *   set credentials 1
>   *   read input csv
>   *   do some processing
>   *   set credentials 2
>   *   write result csv
>
> Then there is a chance that, due to laziness, the program may try to use credentials 2 while reading the input CSV.
>
> A solution is to cache the result CSV, but if there is not enough storage it is possible that the CSV will be re-read. So how should this situation be handled?


1. Use S3A as your destination.
2. Play with bucket configs so you don't need separate accounts to work with them (sketched below).
3. Buffer output to local HDFS and then copy it in (DistCp?).

I'd actually go for #3, as S3 isn't ideal to use as a direct destination for work.
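To make option #2 concrete, here is a minimal sketch assuming a Hadoop/S3A version that supports per-bucket configuration (the bucket names and key strings are placeholders):

// Per-bucket S3A credentials (assumes an S3A version with per-bucket config support;
// bucket names and key strings are placeholders).
sc.hadoopConfiguration.set("fs.s3a.bucket.input-bucket.access.key", "ACCESS_KEY_1")
sc.hadoopConfiguration.set("fs.s3a.bucket.input-bucket.secret.key", "SECRET_KEY_1")
sc.hadoopConfiguration.set("fs.s3a.bucket.output-bucket.access.key", "ACCESS_KEY_2")
sc.hadoopConfiguration.set("fs.s3a.bucket.output-bucket.secret.key", "SECRET_KEY_2")

val input = spark.read.option("header", "true").csv("s3a://input-bucket/data.csv")
input.na.drop().write.csv("s3a://output-bucket/result")

For option #3, the write would target an hdfs:// path instead, and the finished output would be copied to S3 afterwards, for example with hadoop distcp.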

Re: Reading from and writing to different S3 buckets in spark

Posted by Mridul Muralidharan <mr...@gmail.com>.
If using RDDs, you can use saveAsHadoopFile or saveAsNewAPIHadoopFile
with a conf passed in that overrides the keys you need.
For example, you can do:

import org.apache.hadoop.conf.Configuration

val saveConf = new Configuration(sc.hadoopConfiguration)
// configure saveConf with overridden s3 config
rdd.saveAsNewAPIHadoopFile(..., conf = saveConf)
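A slightly fuller sketch of the same idea, assuming rdd is the processed result as an RDD of CSV-formatted strings; the NullWritable/Text key and value classes, the TextOutputFormat, and the bucket name and keys below are illustrative placeholders, not something prescribed in this thread:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Copy the context's configuration and override only the destination credentials.
val saveConf = new Configuration(sc.hadoopConfiguration)
saveConf.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY_2")       // placeholder
saveConf.set("fs.s3n.awsSecretAccessKey", "SECRET_KEY_2")   // placeholder

// saveAsNewAPIHadoopFile needs a pair RDD, so pair each CSV line with a null key.
rdd.map(line => (NullWritable.get(), new Text(line)))
  .saveAsNewAPIHadoopFile(
    "s3n://output-bucket/result",
    classOf[NullWritable],
    classOf[Text],
    classOf[TextOutputFormat[NullWritable, Text]],
    saveConf)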



Regards,
Mridul


On Wed, Oct 12, 2016 at 2:49 AM, Aseem Bansal <as...@gmail.com> wrote:
> Hi
>
> I want to read a CSV from one bucket, do some processing, and write to a
> different bucket. I know how to set S3 credentials using
>
> jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
> jssc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)
>
> But the problem is that Spark is lazy, so if I do the following:
>
> set credentials 1
> read input csv
> do some processing
> set credentials 2
> write result csv
>
> Then there is a chance that, due to laziness, the program may try to use
> credentials 2 while reading the input CSV.
>
> A solution is to cache the result CSV, but if there is not enough storage
> it is possible that the CSV will be re-read. So how should this situation
> be handled?
