Posted to user@spark.apache.org by Andy Davidson <An...@SantaCruzIntegration.com> on 2016/02/24 02:27:38 UTC

streaming spark: is writing results to S3 a good idea?

Currently our streaming apps write their results to HDFS. We are running into
problems with HDFS becoming corrupted and running out of space. It seems
like a better solution might be to write directly to S3. Is this a good
idea?

We plan to continue to write our checkpoints to HDFS.

Are there any issues to be aware of? Maybe performance or something else to
watch out for?

This is our first S3 project. Does storage just grow on demand?
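
Roughly what we have in mind is sketched below. The bucket name, input
source, and paths are only placeholders, not our real setup:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamToS3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamToS3")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Checkpoints would stay on HDFS, as before.
    ssc.checkpoint("hdfs:///checkpoints/stream-to-s3")

    // With our old Hadoop 1.x client we would go through the s3n:// connector;
    // the credentials could also live in core-site.xml instead of being set here.
    ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Placeholder input; our real apps read from a different source.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1L)).reduceByKey(_ + _)

    // Each batch would become a directory of part files under this S3 prefix.
    counts.saveAsTextFiles("s3n://my-results-bucket/stream-output/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}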

Kind regards

Andy


P.S. It turns out we are using an old version of Hadoop (v1.0.4).






Re: streaming spark: is writing results to S3 a good idea?

Posted by Sabarish Sasidharan <sa...@manthan.com>.
And yes, storage grows on demand. No issues with that.

Regards
Sab

Re: streaming spark: is writing results to S3 a good idea?

Posted by Sabarish Sasidharan <sa...@manthan.com>.
Writing to S3 goes over the network, so it will obviously be slower than
writing to local disk. That said, within AWS the network is pretty fast. Still, you might
want to write to S3 only after a certain threshold of data is reached, so
that the writes are efficient. You might also want to use the DirectOutputCommitter,
as it avoids one extra set of writes and can be about twice as fast.
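
For example, something along these lines. The committer class name below is
just a placeholder for whichever direct committer implementation you use
(Spark does not ship one under that name), and the window size is arbitrary:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

// Route the old-API Hadoop output through a "direct" committer that writes
// straight to the final S3 location instead of renaming from _temporary.
// com.example.DirectOutputCommitter is a stand-in class name.
val conf = new SparkConf()
  .setAppName("S3DirectWrites")
  .set("spark.hadoop.mapred.output.committer.class",
       "com.example.DirectOutputCommitter")
val ssc = new StreamingContext(conf, Seconds(60))

// Placeholder source; buffer ten minutes of results per write so that each
// upload to S3 carries a reasonable amount of data.
val results = ssc.socketTextStream("localhost", 9999)
results.window(Minutes(10), Minutes(10))
  .saveAsTextFiles("s3n://my-results-bucket/stream-output/part")

ssc.start()
ssc.awaitTermination()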

Note that when using S3 your data moves over the public Internet, though
it is still HTTPS. If you don't like that, you should look at using VPC
endpoints.

Regards
Sab