Posted to user@spark.apache.org by Alex Gittens <sw...@gmail.com> on 2015/07/01 19:38:47 UTC

Re: Need clarification on spark on cluster set up instruction

I have a similar use case, so I wrote a Python script to fix the cluster
configuration that spark-ec2 uses when you use Hadoop 2. Start a cluster
with enough machines that HDFS can hold 1 TB (so use instance types that
have SSDs), then follow the instructions at
http://thousandfold.net/cz/2015/07/01/installing-spark-with-hadoop-2-using-spark-ec2/.
Let me know if you run into any issues.
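
For concreteness, here is a minimal sketch of that kind of launch, driven
from Python. The key pair, identity file, slave count, instance type, and
cluster name are placeholders, not recommendations; pick an SSD-backed
instance type and enough slaves that HDFS has room for your 1 TB:

    # Sketch: launch a spark-ec2 cluster built against Hadoop 2.
    # Flags are those of the Spark 1.x spark-ec2 script; every value
    # below is a placeholder.
    import subprocess

    subprocess.check_call([
        "./spark-ec2",
        "-k", "my-keypair",                # EC2 key pair name (placeholder)
        "-i", "/path/to/my-keypair.pem",   # matching identity file (placeholder)
        "-s", "8",                         # slave count; size it for ~1 TB of HDFS
        "--instance-type=r3.2xlarge",      # an SSD-backed instance type (example)
        "--hadoop-major-version=2",        # build the cluster against Hadoop 2
        "launch", "my-cluster",            # cluster name (placeholder)
    ])

The --hadoop-major-version=2 flag is what makes the cluster use Hadoop 2;
the rest is an ordinary spark-ec2 launch.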

On Mon, Jun 29, 2015 at 4:32 PM, manish ranjan <cs...@gmail.com>
wrote:

>
> Hi All
>
> Here goes my first question:
> Here is my use case:
>
> I have 1 TB of data that I want to process on EC2 using Spark.
> I have uploaded the data to an EBS volume.
> The instructions for the Amazon EC2 setup explain:
> "*If your application needs to access large datasets, the fastest way to
> do that is to load them from Amazon S3 or an Amazon EBS device into an
> instance of the Hadoop Distributed File System (HDFS) on your nodes*"
>
> Now the new Amazon instances don't have any physical volumes:
> http://aws.amazon.com/ec2/instance-types/
>
> So do I need to set up HDFS separately on EC2 (the instructions also say
> "The spark-ec2 script already sets up a HDFS instance for you")? Any
> blog/write-up that can help me understand this better?
>
> ~Manish
>
>
>
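
On the quoted question about getting a large dataset into HDFS: once the
cluster is up, one common pattern is to read the data straight from S3
inside the Spark job and, if it will be scanned repeatedly, stage a copy
in the cluster's HDFS. A minimal PySpark sketch, assuming Spark 1.x-era
APIs, that S3 credentials are already configured on the cluster, and
placeholder bucket and path names:

    # Sketch: read a large text dataset from S3, stage it in HDFS, and
    # run a simple aggregation. Bucket and path names are placeholders.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("process-1tb-dataset")
    sc = SparkContext(conf=conf)

    # Read directly from S3 (s3n:// was the usual scheme on Hadoop 2
    # clusters of this era; needs AWS credentials in the Hadoop config).
    raw = sc.textFile("s3n://my-bucket/my-dataset/")

    # Optionally stage a copy in the cluster's ephemeral HDFS for faster
    # repeated access.
    raw.saveAsTextFile("hdfs:///data/my-dataset")

    # Example processing step: count records per key, assuming
    # tab-separated lines with the key in the first column.
    counts = (sc.textFile("hdfs:///data/my-dataset")
                .map(lambda line: (line.split("\t")[0], 1))
                .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))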