Posted to user@spark.apache.org by mohan <ra...@gmail.com> on 2014/09/30 18:32:35 UTC

Installation question

Sorry to ask another basic question.

Could you point out what I should read to set up a pseudo-distributed
Hadoop, Mahout and Spark cluster? Does it really need something like CDH?

I want to access Mahout and Spark output and display it in Play (outside
CDH). I also want to access Spark output from R. A VM may get in the way
of that.

I have a 4 GB Mac and want to avoid another VM if I can.

Thanks,
Mohan



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Installation-question-tp15412.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Installation question

Posted by Sean Owen <so...@cloudera.com>.
If you want a single-machine 'cluster' to try all of these things, you
don't strictly need a distribution, but it will probably save you a
great deal of time and trouble compared to setting all of this up by
hand.

Naturally I would promote CDH, as it contains Spark and Mahout and
supports them all, but you can find other distributions that you can
get working too.

I don't think this changes the question of running a VM or not, and 4GB
is small for running all of the processes of a Hadoop cluster while
still having room to get work done. Setting it up by hand won't change
that, although I find a distribution makes it easy to turn off services
you don't want and turn down memory settings, for example.

You do not have to consume a one-machine cluster as a VM image. (Note,
you can run R or Play inside the VM or other instance you create.) For
example, in the case of CDH, Cloudera Manager is also the installer
and can set up a cluster on any machine you like. Or, you can connect
to the instance you create as if it were a remote machine and access
the data from R, for example.
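To make that last point concrete: Spark's saveAsTextFile writes a directory of part-NNNNN files, so picking the results up from any external tool is plain file I/O. A minimal sketch in Python (the output directory name is hypothetical; the same idea works with readLines in R):

```python
import glob
import os

def read_spark_text_output(output_dir):
    """Concatenate the part-* files that saveAsTextFile leaves behind."""
    lines = []
    for part in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(part) as f:
            lines.extend(line.rstrip("\n") for line in f)
    return lines

# e.g. read_spark_text_output("/tmp/spark-output")
```

Sorting the part files keeps the lines in the order Spark's partitions wrote them.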

Also consider running an instance in the cloud on Amazon EC2 or GCE,
which you can pause and restart when you want to play with it.

In the case of Spark, you don't strictly need Hadoop at all. It's easy
to play around locally on the local file system.
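For instance, a prebuilt standalone download is all you need to try Spark against the local file system; a session looks roughly like this (the version number and file path are only an example):

```
# Unpack a prebuilt Spark release (version is just an example)
tar xzf spark-1.1.0-bin-hadoop2.4.tgz
cd spark-1.1.0-bin-hadoop2.4

# Start a shell using 2 local threads; no HDFS or cluster involved
./bin/spark-shell --master local[2]

# Inside the shell, read an ordinary local file:
#   val lines = sc.textFile("file:///etc/hosts")
#   lines.count()
```

The local[N] master runs everything inside one JVM, which is plenty for experimenting.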
