Posted to user@mahout.apache.org by "Pasmanik, Paul" <Pa...@danteinc.com> on 2015/01/06 17:07:58 UTC

running spark-itemsimilarity against HDP sandbox with Spark

Hi, I've been trying to run spark-itemsimilarity against the Hortonworks Sandbox with Spark running in a VM, but have not succeeded yet.

Do I need to install Mahout and run it within the VM, or is there a way to run remotely against a VM where Spark and Hadoop are running?

I tried running the Scala ItemSimilaritySuite test with some modifications, pointing HDFS and Spark at the sandbox, but I am getting various errors; the latest is a ShuffleMapTask hitting an HDFS missing-block exception while trying to read an input file that I uploaded to the HDFS cluster.



Re: running spark-itemsimilarity against HDP sandbox with Spark

Posted by "AlShater, Hani" <ha...@souq.com>.
Great that it helps.

If I were you, I would install HDP 2.2 on EC2 machines and save some cost, then connect one of the machines to my workstation/laptop using Vagrant. When you do so, yarn-client will be better for you.
Good luck.


RE: running spark-itemsimilarity against HDP sandbox with Spark

Posted by "Pasmanik, Paul" <Pa...@danteinc.com>.
Thanks.
I've been using HDP 2.1.5 with Spark 1.1.0, which I believe is built for Hadoop 2.4.0.
Yarn-client works but yarn-cluster does not. I've done everything you described except trying Spark 1.1.1; I'll do that next.
My production deployment will actually be Amazon EMR with Spark and Mahout (I've been using EMR extensively in my other projects), so hopefully yarn-cluster will work there.

I've not looked deeply into how Mahout integrates with Spark yet; something must be different, because the Pi example runs just fine in yarn-cluster mode via 'spark-submit' on my installation.


Re: running spark-itemsimilarity against HDP sandbox with Spark

Posted by "AlShater, Hani" <ha...@souq.com>.
I have tried spark-itemsimilarity on HDP 2.1. Initially I got the same error you are getting, but then resolved it. Here are the steps that worked for me (a condensed sketch follows below):
- Check that $HADOOP_CONF_DIR is pointing to the right Hadoop config dir.
- Get the Spark 1.1.1 binaries precompiled for Hadoop 2.4. If you are using HDP 2.2, I think they should be compiled for Hadoop 2.6. Set $SPARK_HOME to the Spark home dir. Compiling Spark myself with a suitable mvn profile, or using the Spark 1.1.0 binaries, did not work for me.
- Get the Mahout snapshot and compile it using mvn.
- Run spark-itemsimilarity with the master set to yarn-client. Yarn-cluster mode does not work for me; it gave the same error you are seeing.
- If your data is big, you can alter the YARN container sizes from Ambari and then allocate more memory to the Spark executors using the -sem option.

Refer to the Spark docs to read more about tuning Spark; they contain many useful hints about memory, serialization, network, and other parameters you may need to tune.

Hope this will help.
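
To make that concrete, here is the whole sequence condensed into shell commands. The download URL, install directories, and input/output paths are illustrative assumptions, not something verified in this thread:

    export HADOOP_CONF_DIR=/etc/hadoop/conf

    # Spark 1.1.1 prebuilt for Hadoop 2.4 (URL assumed; any Apache mirror works)
    wget http://archive.apache.org/dist/spark/spark-1.1.1/spark-1.1.1-bin-hadoop2.4.tgz
    tar -xzf spark-1.1.1-bin-hadoop2.4.tgz
    export SPARK_HOME=$PWD/spark-1.1.1-bin-hadoop2.4

    # Build the Mahout snapshot from source
    git clone https://github.com/apache/mahout.git && cd mahout
    mvn clean install -DskipTests
    export MAHOUT_HOME=$PWD

    # Run in yarn-client mode; -sem sets the Spark executor memory
    $MAHOUT_HOME/bin/mahout spark-itemsimilarity \
      -i /user/me/input.tsv -o /user/me/indicators \
      -ma yarn-client -sem 2g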



Re: running spark-itemsimilarity against HDP sandbox with Spark

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Did you build Spark from source and deploy it to the cluster?

When you build Mahout, it runs its tests against the artifacts it gets from the Maven repos. When you run Mahout on a cluster, it runs from the artifacts on the cluster. These may not be the same, and there have been problems that present similarly to what you have shown. Building Spark from source has solved these for several people (including me).

I haven’t used YARN since I’m still on Hadoop 1.2.1, but another user on this list was successful after some initial problems. I’m not sure passing the YARN master in is sufficient; there are a bunch of Spark-on-YARN config params you may want to check out, and if any are runtime conf you can pass them to spark-itemsimilarity with the -D:key=value CLI option (see the sketch below).
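
For example (the property names here are standard Spark 1.1 settings, but whether your job needs these particular ones is an assumption, and the paths are placeholders):

    mahout spark-itemsimilarity \
      -i /user/me/input.tsv -o /user/me/indicators \
      -ma yarn-client \
      -D:spark.yarn.executor.memoryOverhead=512 \
      -D:spark.kryoserializer.buffer.mb=64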

RE: running spark-itemsimilarity against HDP sandbox with Spark

Posted by "Pasmanik, Paul" <Pa...@danteinc.com>.
So, when I follow the examples from Hortonworks and run the Spark Pi example using spark-submit, everything works.
I can run mahout spark-itemsimilarity without specifying the master parameter, which means it is running in local mode (right?), and it works. But if I try to run Mahout using the -ma (master) parameter to point at the YARN cluster, it always gets stuck with the following warning:

WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

According to several sources, this error means that Hadoop does not have sufficient memory, but I have plenty, and I tried lowering executor-memory and driver-memory all the way down to 250 MB. I still get that error and nothing is processed.

Did you guys run into this issue?

Thanks.

More stack trace below:

15/01/06 12:14:57 INFO storage.MemoryStore: ensureFreeSpace(4024) called with curMem=87562, maxMem=2061647216
15/01/06 12:14:57 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.9 KB, free 1966.1 MB)
15/01/06 12:14:57 INFO storage.MemoryStore: ensureFreeSpace(2336) called with curMem=91586, maxMem=2061647216
15/01/06 12:14:57 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 1966.1 MB)
15/01/06 12:14:57 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on sandbox.hortonworks.com:53919 (size: 2.3 KB, free: 1966.1 MB)
15/01/06 12:14:57 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/01/06 12:14:57 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 1 (MappedRDD[6] at distinct at TextDelimitedReaderWriter.scala:76)
15/01/06 12:14:57 INFO cluster.YarnClusterScheduler: Adding task set 1.0 with 2 tasks
15/01/06 12:14:57 INFO util.RackResolver: Resolved sandbox.hortonworks.com to /default-rack
15/01/06 12:15:13 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/01/06 12:15:27 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
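
One thing worth checking when this warning loops in yarn-cluster mode (a general suggestion, not something established in this thread) is whether YARN can actually grant a container of the size Spark is asking for:

    yarn node -list   # confirm the NodeManager is registered at all

    # In yarn-site.xml (or via Ambari), the relevant knobs are:
    #   yarn.nodemanager.resource.memory-mb    total memory YARN may hand out per node
    #   yarn.scheduler.maximum-allocation-mb   largest single container
    # Spark asks for executor memory *plus* overhead
    # (spark.yarn.executor.memoryOverhead, default 384 MB in Spark 1.1),
    # so even a 250m executor request needs a container of roughly 634 MB.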

RE: running spark-itemsimilarity against HDP sandbox with Spark

Posted by "Pasmanik, Paul" <Pa...@danteinc.com>.
Thanks, Pat.
I am using HDP with Spark 1.1.0: http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/

The Spark examples run without issues. For Mahout I had to create a couple of env vars (HADOOP_HOME, SPARK_HOME, MAHOUT_HOME); also, to run on the YARN cluster with HDP, -ma yarn-cluster needs to be passed in (sketched below).
Also, the default memory allocated to YARN was not enough out of the box (2g); I increased it to 3g and am now restarting and trying again.
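
In shell terms the setup looks roughly like this; the HDP install paths are assumptions, so adjust them to your sandbox:

    export HADOOP_HOME=/usr/lib/hadoop    # assumed HDP location
    export SPARK_HOME=/usr/lib/spark      # wherever Spark 1.1.0 lives
    export MAHOUT_HOME=/opt/mahout        # the Mahout build directory

    # HDP will not infer the master, so pass it explicitly:
    $MAHOUT_HOME/bin/mahout spark-itemsimilarity \
      -i /user/me/input.tsv -o /user/me/indicators \
      -ma yarn-cluster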


Re: running spark-itemsimilarity against HDP sandbox with Spark

Posted by Pat Ferrel <pa...@occamsmachete.com>.
There are some issues with using Mahout on Windows, so you’ll have to run on a ’nix machine or VM. There shouldn’t be any problem with using VMs as long as your Spark install is set up correctly.

Currently you have to build Spark first and then Mahout from source. Mahout uses Spark 1.1. You’ll need to build Spark from source using “mvn install” rather than their recommended “mvn package”; there were some problems in the Spark artifacts when running from the binary release. Check Mahout’s Spark FAQ for some pointers:
http://mahout.apache.org/users/sparkbindings/faq.html

Verify Spark is running correctly by trying their sample SparkPi job (see the sketch below):
http://spark.apache.org/docs/1.1.1/submitting-applications.html

Spark in general, and spark-itemsimilarity especially, likes lots of memory, so you may have to play with the -sem option to spark-itemsimilarity.
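
A sketch of that build-and-verify flow against the Spark 1.1.1 sources; the checkout tag, Maven profile, and examples-jar path are assumptions for a Hadoop 2.4.x cluster:

    git clone https://github.com/apache/spark.git && cd spark
    git checkout v1.1.1

    # "mvn install" (not "mvn package") so Mahout later resolves
    # against these locally installed Spark artifacts
    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean install

    # Sanity-check the deployment with the bundled SparkPi example
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
      --master yarn-client \
      examples/target/scala-2.10/spark-examples-*.jar 10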
