Posted to user@spark.apache.org by Grzegorz Białek <gr...@codilime.com> on 2014/09/18 12:19:55 UTC

Spot instances on Amazon EMR

Hi,
I would like to run a Spark application on Amazon EMR. I have some questions
about that:
1. My input data is on another HDFS cluster (not on Amazon). Can I send all
the input data from that cluster to HDFS on the Amazon EMR cluster (if it has
enough storage space), or do I have to send it to Amazon S3 and then load
it onto the EMR cluster where I want to run my application?
2. Which nodes should be on-demand instances and which can be spot
instances? (I don't want to spend too much money, but I also don't want to
lose my data or have to recompute everything after a spot instance
interruption.)
3. Can I use Amazon S3 for input and output data so that I need fewer
on-demand instances and can use more spot instances? (Or maybe there is
another way to lower costs.)

I would like to run this application once, and I think the computation would
take around 30 hours.

Could you answer (at least some of) these questions?

Thanks,
Grzegorz

Re: Spot instances on Amazon EMR

Posted by Patrick Wendell <pw...@gmail.com>.

Hey Grzegorz,

EMR is a service that is not maintained by the Spark community. So
this list isn't the right place to ask EMR questions.

- Patrick

