Posted to user@spark.apache.org by Niek Tax <ni...@gmail.com> on 2014/07/14 13:00:12 UTC

Running Spark on Microsoft Azure HDInsight

Hi everyone,

Currently I am working on parallelizing a machine learning algorithm using
a Microsoft HDInsight cluster. I tried running my algorithm on Hadoop
MapReduce, but since my algorithm is iterative, the job scheduling overhead
and data loading overhead severely limit its performance in terms of
training time.
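
To make the overhead argument concrete, here is a back-of-the-envelope
sketch. It assumes a simplified cost model in which every MapReduce
iteration is a separate job that is scheduled and reloads its input, while
an in-memory engine pays those costs once. The function names and all
numbers are hypothetical, not measurements from HDInsight.

```python
# Simplified cost model; all figures are hypothetical, chosen only for illustration.

def mapreduce_time(iterations, schedule_s, load_s, compute_s):
    """One job per iteration: each one is scheduled, reloads the data, then computes."""
    return iterations * (schedule_s + load_s + compute_s)


def in_memory_time(iterations, schedule_s, load_s, compute_s):
    """One long-running application: schedule and load once, then only compute per iteration."""
    return schedule_s + load_s + iterations * compute_s


if __name__ == "__main__":
    # Hypothetical workload: 100 iterations, 20 s scheduling, 30 s loading, 5 s compute.
    params = dict(iterations=100, schedule_s=20.0, load_s=30.0, compute_s=5.0)
    print("per-iteration jobs:", mapreduce_time(**params), "seconds")
    print("load once, iterate:", in_memory_time(**params), "seconds")
```

In practice an in-memory engine still has some per-iteration scheduling
cost, so the second function is an optimistic bound rather than a
prediction.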

HDInsight recently added support for Hadoop 2 with YARN, which I thought
would allow me to run Spark jobs, which seem a better fit for my task. So
far, however, I have not been able to find out how to run Apache Spark jobs
on an HDInsight cluster.

It seems that remote job submission (which I would prefer) is not possible
for Spark on HDInsight, as the REST endpoints for Oozie and Templeton do
not seem to support submission of Spark jobs. I also tried to RDP to the
headnode and submit jobs from there. On the headnode drives I can find
other new YARN computation models like Tez, and I managed to run Tez jobs
through YARN. However, Spark seems to be missing. Does this mean that
HDInsight currently does not support Spark, even though it supports Hadoop
versions with YARN? Or do I need to install Spark on the HDInsight cluster
first, in some way? Or is there maybe something else that I'm missing, and
is there some other way to run Spark jobs on HDInsight?

Many thanks in advance!


Kind regards,

Niek Tax

Re: Running Spark on Microsoft Azure HDInsight

Posted by Marco Shaw <ma...@gmail.com>.

Looks like going with cluster mode is not a good idea:
http://azure.microsoft.com/en-us/documentation/articles/hdinsight-administer-use-management-portal/

It seems a non-HDInsight VM might be needed to act as the Spark master
node.

Marco




Re: Running Spark on Microsoft Azure HDInsight

Posted by Marco Shaw <ma...@gmail.com>.

I'm a Spark and HDInsight novice, so I could be wrong...

HDInsight is based on HDP2, so my guess here is that you have the option of
installing/configuring Spark in cluster mode (YARN) or in standalone mode
and packaging the Spark binaries with your job.
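
As a rough sketch of what a YARN submission could look like from a Windows
headnode, the snippet below only assembles a spark-submit command line and
does not run anything. The install path, jar name, class name and resource
sizes are hypothetical placeholders, and whether this works on HDInsight is
untested. Spark's YARN mode also expects HADOOP_CONF_DIR or YARN_CONF_DIR
to point at the cluster's client-side configuration. Spark releases from
around the time of this thread spelled the master as "yarn-cluster" rather
than using a separate --deploy-mode flag.

```python
import ntpath  # Windows path rules, usable on any platform


def build_spark_submit_args(spark_home, app_jar, main_class,
                            num_executors=4, executor_memory="2g",
                            app_args=()):
    """Assemble (but do not execute) a spark-submit command for YARN cluster mode."""
    # spark-submit.cmd is the Windows launcher in Spark's bin directory.
    launcher = ntpath.join(spark_home, "bin", "spark-submit.cmd")
    return [
        launcher,
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        app_jar,
        *app_args,
    ]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    args = build_spark_submit_args(
        spark_home=r"C:\spark",
        app_jar="my-algorithm.jar",
        main_class="example.TrainModel",
        app_args=["100"],
    )
    print(" ".join(args))
```

The resulting list could be handed to subprocess.run once Spark is actually
installed on the machine; building it as a list avoids quoting problems
with paths that contain spaces.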

Everything I look at seems to be related to UNIX shell scripts. So, one
might need to pull apart some of these scripts to work out how to run this
on Windows.

Interesting project...

Marco


