Posted to user@spark.apache.org by Kal El <pi...@yahoo.com> on 2014/01/22 15:35:32 UTC

Running K-Means on a cluster setup

I have created a cluster setup with 2 workers (one of them is also the master)

Can anyone point me to a tutorial on how to run K-Means, for example, on this cluster (ideally launched from outside the cluster's command line)?

I am mostly interested in how to initialize the SparkContext (what jars do I need to add?):
new SparkContext(master, appName, [sparkHome], [jars]) -- and what other steps I need to run.

I am using the standalone spark cluster.
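For reference, the constructor in question, filled in for a standalone cluster, might look roughly like the sketch below; the master URL, sparkHome path and jar name are hypothetical placeholders, not values from this thread:

```java
import org.apache.spark.api.java.JavaSparkContext;

public class KMeansOnCluster {
    public static void main(String[] args) {
        // All values below are hypothetical placeholders -- substitute your
        // own master URL, SPARK_HOME location and application jar.
        JavaSparkContext sc = new JavaSparkContext(
            "spark://master-host:7077",                // standalone master, not "local"
            "KMeansOnCluster",                         // appName shown in the web UI
            "/opt/spark",                              // sparkHome on the workers
            new String[] { "target/kmeans-job.jar" }); // jar(s) with your K-Means code
        // ... load data and run K-Means here, then:
        sc.stop();
    }
}
```

The same four arguments map one-to-one onto the Scala SparkContext constructor mentioned above.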

Thanks

Re: Running K-Means on a cluster setup

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
Nice!


On Wed, Jan 22, 2014 at 2:58 PM, Mayur Rustagi <ma...@gmail.com>wrote:

> How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
>
> On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski <ognen@nengoiksvelzud.com
> > wrote:
>
>> Hello,
>>
>> I have found that you generally need two separate pools of knowledge to
>> be successful in this game :). One is to have enough knowledge of network
>> topologies, systems, java, scala and whatever else to actually set up the
>> whole system (esp. if your requirements are different than running on a
>> local machine or in the ec2 cluster supported by the scripts that come with
>> spark).
>>
>> The other is actual knowledge of the API and how it works and how to
>> express and solve your problems using the primitives offered by spark.
>>
>> There is also a third: since you can supply any function to a spark
>> primitive, you generally need to know scala or java (or python?) to
>> actually solve your problem.
>>
>> I am not sure this list is viewed as an appropriate place to offer advice on
>> how to actually solve these problems. Not that I would mind seeing various
>> solutions to various problems :) and also optimizations.
>>
>> For example, I am trying to do rudimentary retention analysis. I am a
>> total beginner in the whole map/reduce way of solving problems. I have come
>> up with a solution that is pretty slow but implemented in 5 or 6 lines of
>> code for the simplest problem. However, my files are 20 GB in size each,
>> all JSON strings. Figuring out what the limiting factor is (my suspicion
>> is network bandwidth, since I am accessing things via S3) is somewhat of
>> a black art to me at this point. I think for most of this
>> stuff you will have to read the code. The bigger question after that is
>> optimizing your solutions to be faster :). I would love to see practical
>> tutorials on doing such things and I am willing to put my attempts at
>> solving problems out there to eventually get cannibalized, ridiculed and
>> reimplemented properly :).
>>
>> Sorry for this long winded email, it did not really answer your question
>> anyway :)
>> Ognen
>>
>>
>> On Wed, Jan 22, 2014 at 2:35 PM, Kal El <pi...@yahoo.com> wrote:
>>
>>> I have created a cluster setup with 2 workers (one of them is also the
>>> master)
>>>
>>> Can anyone help me with a tutorial on how to run K-Means for example on
>>> this cluster (it would be better to run it from outside the cluster command
>>> line)?
>>>
>>> I am mostly interested on how do I initiate the sparkcontext (what jars
>>> do I need to add ? :
>>> new SparkContext(master, appName, [sparkHome], [jars])) and what other
>>> steps I need to run.
>>>
>>> I am using the standalone spark cluster.
>>>
>>> Thanks
>>>
>>>
>>>
>>
>

Re: Running K-Means on a cluster setup

Posted by Mayur Rustagi <ma...@gmail.com>.
You can put the file in HDFS and access it from there; it will then be
available on all machines.
You will have to change the parameters in the configuration files and sync
them across the cluster. I believe there is a helper script to do that. Can
somebody help here?
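A minimal sketch of the HDFS route (the master URL, namenode host/port and file path below are hypothetical placeholders):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadFromHdfs {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            "spark://master-host:7077", "ReadFromHdfs");  // hypothetical master
        // Every worker resolves the same HDFS path, so no manual copying:
        JavaRDD<String> lines =
            sc.textFile("hdfs://namenode:9000/data/kmeans-input.txt");
        System.out.println("line count: " + lines.count());
        sc.stop();
    }
}
```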
Regards
Mayur

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Thu, Jan 23, 2014 at 8:23 PM, Kal El <pi...@yahoo.com> wrote:

> I have figured that part out but there are 2 problems remaining:
> 1) I need an input file. If this input file is not present on all workers
> before running the code, I will receive an error. How can I keep the input
> file only on the master and have the slaves download it over the network?
> I was thinking about packing this file in the jar itself but I do not know
> how to read the file in scala (eg. sc.textFile( what path do I put in
> here??? ))
>
> 2) I cannot set the memory used by the jvm on each machine: it appears to
> be 512 MB and I have these settings in "spark-env.sh"
>
> export SPARK_DAEMON_MEMORY=15g
> export SPARK_WORKER_MEMORY=10g
> export SPARK_DAEMON_JAVA_OPTS="-J-Xms20g -J-Xmx20g"
>
> When I am running the code locally I use "-J-Xms20g -J-Xmx20g" as a
> parameter, but this will not work over the cluster so I get an error.
>
> Any help on this ?

Re: Running K-Means on a cluster setup

Posted by Kal El <pi...@yahoo.com>.
I have figured that part out but there are 2 problems remaining:
1) I need an input file. If this input file is not present on all workers before running the code, I will receive an error. How can I keep the input file only on the master and have the slaves download it over the network?
I was thinking about packing this file in the jar itself, but I do not know how to read the file in Scala (e.g. sc.textFile( what path do I put in here??? ))

2) I cannot set the memory used by the JVM on each machine: it appears to be 512 MB, and I have these settings in "spark-env.sh":

export SPARK_DAEMON_MEMORY=15g
export SPARK_WORKER_MEMORY=10g 
export SPARK_DAEMON_JAVA_OPTS="-J-Xms20g -J-Xmx20g"

When I am running the code locally I use "-J-Xms20g -J-Xmx20g" as a parameter, but this does not work on the cluster, so I get an error.

Any help on this ? 
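On problem 1: sc.textFile reads from a filesystem (or HDFS), not from inside a jar, so one workaround is to read the bundled resource off the classpath and hand the lines to Spark with sc.parallelize. A sketch, with a hypothetical resource name:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BundledInput {
    /** Reads a text resource packed into the application jar, e.g. "/kmeans-input.txt". */
    public static List<String> readBundled(String path) {
        InputStream in = BundledInput.class.getResourceAsStream(path);
        if (in == null) {
            throw new IllegalArgumentException("resource not on classpath: " + path);
        }
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            for (String line = r.readLine(); line != null; line = r.readLine()) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new RuntimeException("failed reading " + path, e);
        }
        return lines;
    }
    // Driver-side usage sketch (Spark types elided):
    //   JavaRDD<String> data = sc.parallelize(BundledInput.readBundled("/kmeans-input.txt"));
}
```

Note the driver reads the file locally and ships the data to the workers, so this only suits small inputs; for 20 GB-class files the HDFS route is the realistic one.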
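On problem 2: the "-J" prefix is understood by the scala launcher, not by java, and on standalone clusters of this era the 512 MB per executor is the application-level default, not something the daemon settings control. A hedged spark-env.sh sketch (values are examples; behavior is assumed from Spark 0.8.x-era standalone docs, so treat it as a starting point only):

```shell
# spark-env.sh sketch -- example values, Spark 0.8.x-era standalone mode
export SPARK_WORKER_MEMORY=10g   # most a worker may hand out to executors
export SPARK_DAEMON_MEMORY=1g    # master/worker daemons need little heap
# Plain JVM flags here; the "-J" prefix is for the scala launcher, not java:
export SPARK_DAEMON_JAVA_OPTS="-verbose:gc"
# The 512 MB per executor is the application-side default; raise it with:
export SPARK_MEM=8g              # or set spark.executor.memory in the driver
```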




On Thursday, January 23, 2014 3:53 PM, Mayur Rustagi <ma...@gmail.com> wrote:
 
Are you running the Spark Scala shell?
Applications can add dependencies for themselves through SparkContext.addJar. If you want to add that jar for multiple applications, you can add the jar path to the SPARK_CLASSPATH environment variable before starting the spark shell.
I am not sure if this automatically copies the jar to all workers; if it does not, you might have to do that manually. There was some discussion around bundling the jar to all slaves that you can follow here: https://groups.google.com/forum/?fromgroups=#!topic/spark-users/IBgbLoFWbxw


Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi




Re: Running K-Means on a cluster setup

Posted by Mayur Rustagi <ma...@gmail.com>.
Are you running the Spark Scala shell?
Applications can add dependencies for themselves through
SparkContext.addJar. If you want to add that jar for multiple applications,
you can add the jar path to the SPARK_CLASSPATH environment variable before
starting the spark shell.
I am not sure if this automatically copies the jar to all workers; if it
does not, you might have to do that manually. There was some discussion
around bundling the jar to all slaves that you can follow here:
https://groups.google.com/forum/?fromgroups=#!topic/spark-users/IBgbLoFWbxw
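One common way this is wired up (the jar path and class name below are hypothetical, and shells of this era read the plural ADD_JARS rather than ADD_JAR -- worth double-checking against your release):

```shell
# Sketch: make the application jar visible to the shell and the workers.
export SPARK_CLASSPATH=/path/to/myapp.jar     # driver/executor classpath
ADD_JARS=/path/to/myapp.jar ./spark-shell
# ...then inside the shell:
#   scala> import com.example.Clock            // hypothetical class in the jar
#   scala> Clock.show()
```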

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Thu, Jan 23, 2014 at 5:50 PM, Kal El <pi...@yahoo.com> wrote:

> Ok, so I took a basic code (that shows the clock), packed everything in a
> .jar file, included the path to the jar file in "ADD_JAR" environment
> variable and launched a spark-shell on the cluster.
>
> How do I run the code from the jar file from the console ?

Re: Running K-Means on a cluster setup

Posted by Kal El <pi...@yahoo.com>.
Ok, so I took some basic code (that shows a clock), packed everything in a .jar file, included the path to the jar file in the "ADD_JAR" environment variable, and launched a spark-shell on the cluster.

How do I run the code from the jar file from the console ?




Re: Running K-Means on a cluster setup

Posted by Ewen Cheslack-Postava <me...@ewencp.org>.
I think Mayur pointed to that code because it includes the relevant 
initialization code you were asking about. Running on a cluster doesn't 
require much change: pass the spark:// address of the master instead of 
"local" and add any jars containing your code. You could set the jars 
manually, but the linked code uses

JavaSparkContext.jarOfClass(JavaKMeans.class)

to get the right jar filename.

-Ewen
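Put together, the linked example's initialization boils down to something like this sketch (the master URL is a hypothetical placeholder):

```java
import org.apache.spark.api.java.JavaSparkContext;

public class JarOfClassExample {
    public static void main(String[] args) {
        // jarOfClass locates the jar that contains the given class, so the
        // jar path never has to be hard-coded when building the context.
        String[] jars = JavaSparkContext.jarOfClass(JarOfClassExample.class);
        JavaSparkContext sc = new JavaSparkContext(
            "spark://master-host:7077",     // hypothetical master URL
            "JarOfClassExample",
            System.getenv("SPARK_HOME"),
            jars);
        sc.stop();
    }
}
```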

> Kal El <ma...@yahoo.com>
> January 22, 2014 2:02 PM
> please understand that the code from your link is completely useless 
> to me. It's like someone is trying to solve a differential equation 
> and you tell them the formula for the area of a circle.
>
> I can do that with my code too (the k-means code). The idea is that I 
> want to run it on a cluster ...
>
>
> On Wednesday, January 22, 2014 5:31 PM, Mayur Rustagi 
> <ma...@gmail.com> wrote:
> I am sorry that is not a tutorial. You can take this source code:
>
> https://github.com/apache/incubator-spark/blob/master/examples/src/main/java/org/apache/spark/mllib/examples/JavaKMeans.java
>
> Sync and Build this project:
> https://github.com/apache/incubator-spark/
> You should be able to call the JavaKMeans class; Reynold may be able to 
> shed some light on how to use it.
> If you get somewhere and get stuck, post it back and I can try to 
> help. I hope this helps.
>
> Regards
> Mayur
>
>
>
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
>
> On Wed, Jan 22, 2014 at 8:35 PM, Kal El <pinu.datriciu@yahoo.com 
> <ma...@yahoo.com>> wrote:
>
>
>
> Mayur Rustagi <ma...@gmail.com>
> January 22, 2014 7:19 AM
> I am sorry that is not a tutorial. You can take this source code:
>
> https://github.com/apache/incubator-spark/blob/master/examples/src/main/java/org/apache/spark/mllib/examples/JavaKMeans.java
>
> Sync and Build this project:
> https://github.com/apache/incubator-spark/
> You should be able to call the JavaKMeans class; Reynold may be able to 
> shed some light on how to use it.
> If you get somewhere and get stuck, post it back and I can try to 
> help. I hope this helps.
>
> Regards
> Mayur
>
>
>
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
>
>
> Kal El <ma...@yahoo.com>
> January 22, 2014 7:05 AM
> @Mayur: I do not see any tutorial about how to run MLlib on a cluster, 
> just a basic presentation not related to actually running the algorithm
>
> @Ognen: Thanks, I have figured that out :)) that's why I need some 
> tutorials
>
>
> On Wednesday, January 22, 2014 4:59 PM, Mayur Rustagi 
> <ma...@gmail.com> wrote:
> How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
>
> On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski 
> <ognen@nengoiksvelzud.com <ma...@nengoiksvelzud.com>> wrote:
>
>
>
> Mayur Rustagi <ma...@gmail.com>
> January 22, 2014 6:58 AM
> How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
>
>
> Ognen Duzlevski <ma...@nengoiksvelzud.com>
> January 22, 2014 6:50 AM
> Hello,
>
> I have found that you generally need two separate pools of knowledge 
> to be successful in this game :). One is to have enough knowledge of 
> network topologies, systems, java, scala and whatever else to actually 
> set up the whole system (esp. if your requirements are different than 
> running on a local machine or in the ec2 cluster supported by the 
> scripts that come with spark).
>
> The other is actual knowledge of the API and how it works and how to 
> express and solve your problems using the primitives offered by spark.
>
> There is also a third: since you can supply any function to a spark 
> primitive, you generally need to know scala or java (or python?) to 
> actually solve your problem.
>
> I am not sure this list is viewed as an appropriate place to offer advice 
> on how to actually solve these problems. Not that I would mind seeing 
> various solutions to various problems :) and also optimizations.
>
> For example, I am trying to do rudimentary retention analysis. I am a 
> total beginner in the whole map/reduce way of solving problems. I have 
> come up with a solution that is pretty slow but implemented in 5 or 6 
> lines of code for the simplest problem. However, my files are 20 GB in 
> size each, all json strings. Figuring out what the limiting factor is 
> (my suspicion is network bandwidth, since I am accessing things via S3) 
> is somewhat black magic to me at this point. I think 
> for most of this stuff you will have to read the code. The bigger 
> question after that is optimizing your solutions to be faster :). I 
> would love to see practical tutorials on doing such things and I am 
> willing to put my attempts at solving problems out there to eventually 
> get cannibalized, ridiculed and reimplemented properly :).
>
> Sorry for this long winded email, it did not really answer your 
> question anyway :)
> Ognen
>
>
>

Re: Running K-Means on a cluster setup

Posted by Kal El <pi...@yahoo.com>.
Please understand that the code from your link is completely useless to me. It's like someone is trying to solve a differential equation and you tell them the formula for the area of a circle. 

I can do that with my code too (the K-Means code); the idea is that I want to run it on a cluster ...



On Wednesday, January 22, 2014 5:31 PM, Mayur Rustagi <ma...@gmail.com> wrote:
 
I am sorry that is not a tutorial. You can take this source code: 

https://github.com/apache/incubator-spark/blob/master/examples/src/main/java/org/apache/spark/mllib/examples/JavaKMeans.java


Sync and Build this project: 
https://github.com/apache/incubator-spark/

You should be able to call the JavaKMeans class; Reynold may be able to shed some light on how to use it. 
If you get somewhere and get stuck, post it back and I can try to help. I hope this helps.

Regards
Mayur





Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Wed, Jan 22, 2014 at 8:35 PM, Kal El <pi...@yahoo.com> wrote:

@Mayur: I do not see any tutorial about how to run MLlib on a cluster, just a basic presentation not related to actually running the algorithm
>
>
>@Ognen: Thanks, I have figured that out :)) that's why I need some tutorials
>
>
>
>On Wednesday, January 22, 2014 4:59 PM, Mayur Rustagi <ma...@gmail.com> wrote:
> 
>How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?
>Regards
>Mayur
>
>
>Mayur Rustagi
>Ph: +919632149971
>http://www.sigmoidanalytics.com
>https://twitter.com/mayur_rustagi
>
>
>
>
>On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski <og...@nengoiksvelzud.com> wrote:
>
>Hello,
>>
>>I have found that you generally need two separate pools of knowledge to be successful in this game :). One is to have enough knowledge of network topologies, systems, java, scala and whatever else to actually set up the whole system (esp. if your requirements are different than running on a local machine or in the ec2 cluster supported by the scripts that come with spark).
>>
>>The other is actual knowledge of the API and how it works and how to express and solve your problems using the primitives offered by spark.
>>
>>There is also a third: since you can supply any function to a spark primitive, you generally need to know scala or java (or python?) to actually solve your problem.
>>
>>I am not sure this list is viewed as an appropriate place to offer advice on how to actually solve these problems. Not that I would mind seeing various solutions to various problems :) and also optimizations.
>>
>>For example, I am trying to do rudimentary retention analysis. I am a total beginner in the whole map/reduce way of solving problems. I have come up with a solution that is pretty slow but implemented in 5 or 6 lines of code for the simplest problem. However, my files are 20 GB in size each, all json strings. Figuring out what the limiting factor is (my suspicion is network bandwidth, since I am accessing things via S3) is somewhat black magic to me at this point. I think for most of this stuff you will have to read the code. The bigger question after that is optimizing your solutions to be faster :). I would love to see practical tutorials on doing such things and I am willing to put my attempts at solving problems out there to eventually get cannibalized, ridiculed and reimplemented properly :).
>>
>>Sorry for this long winded email, it did not really answer your question anyway :)
>>
>>Ognen
>>
>>
>>
>>
>>On Wed, Jan 22, 2014 at 2:35 PM, Kal El <pi...@yahoo.com> wrote:
>>
>>I have created a cluster setup with 2 workers (one of them is also the master)
>>>
>>>
>>>Can anyone help me with a tutorial on how to run K-Means for example on this cluster (it would be better to run it from outside the cluster command line)?
>>>
>>>
>>>I am mostly interested in how to initialize the SparkContext (what jars do I need to add?):
>>>new SparkContext(master, appName, [sparkHome], [jars])) and what other steps I need to run.
>>>
>>>
>>>I am using the standalone spark cluster.
>>>
>>>
>>>Thanks
>>>
>>>
>>>
>>>
>>
>
>
>

Re: Running K-Means on a cluster setup

Posted by Mayur Rustagi <ma...@gmail.com>.
I am sorry that is not a tutorial. You can take this source code:

https://github.com/apache/incubator-spark/blob/master/examples/src/main/java/org/apache/spark/mllib/examples/JavaKMeans.java

Sync and Build this project:
https://github.com/apache/incubator-spark/
You should be able to call the JavaKMeans class; Reynold may be able to shed
some light on how to use it.
If you get somewhere and get stuck, post it back and I can try to help.
I hope this helps.
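
For a rough idea of what that looks like end to end, here is a hedged Scala sketch of the same K-Means flow against a standalone master (the master URL, input path, and parameter values are placeholders, and it assumes the 0.9-era MLlib API, where KMeans.train takes an RDD of Array[Double]):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans

object RunKMeans {
  def main(args: Array[String]) {
    // Placeholder master URL; use your standalone master's spark:// address.
    // On a real cluster you would also pass sparkHome and your application jar(s).
    val sc = new SparkContext("spark://master-host:7077", "RunKMeans")
    // Placeholder input: one whitespace-separated numeric vector per line.
    val points = sc.textFile("hdfs://namenode:9000/data/points.txt")
      .map(_.split(' ').map(_.toDouble))
      .cache()
    // k = 2 clusters, at most 20 iterations -- tune both for your data.
    val model = KMeans.train(points, 2, 20)
    model.clusterCenters.foreach(center => println(center.mkString(" ")))
    sc.stop()
  }
}
```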

Regards
Mayur




Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Wed, Jan 22, 2014 at 8:35 PM, Kal El <pi...@yahoo.com> wrote:

> @Mayur: I do not see any tutorial about how to run MLlib on a cluster, just
> a basic presentation not related to actually running the algorithm
>
> @Ognen: Thanks, I have figured that out :)) that's why I need some
> tutorials
>
>
>   On Wednesday, January 22, 2014 4:59 PM, Mayur Rustagi <
> mayur.rustagi@gmail.com> wrote:
> How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
>
> On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski <ognen@nengoiksvelzud.com
> > wrote:
>
> Hello,
>
> I have found that you generally need two separate pools of knowledge to be
> successful in this game :). One is to have enough knowledge of network
> topologies, systems, java, scala and whatever else to actually set up the
> whole system (esp. if your requirements are different than running on a
> local machine or in the ec2 cluster supported by the scripts that come with
> spark).
>
> The other is actual knowledge of the API and how it works and how to
> express and solve your problems using the primitives offered by spark.
>
> There is also a third: since you can supply any function to a spark
> primitive, you generally need to know scala or java (or python?) to
> actually solve your problem.
>
> I am not sure this list is viewed as an appropriate place to offer advice on
> how to actually solve these problems. Not that I would mind seeing various
> solutions to various problems :) and also optimizations.
>
> For example, I am trying to do rudimentary retention analysis. I am a
> total beginner in the whole map/reduce way of solving problems. I have come
> up with a solution that is pretty slow but implemented in 5 or 6 lines of
> code for the simplest problem. However, my files are 20 GB in size each,
> all json strings. Figuring out what the limiting factor is (my suspicion
> is network bandwidth, since I am accessing things via S3) is somewhat
> black magic to me at this point. I think for most of this
> stuff you will have to read the code. The bigger question after that is
> optimizing your solutions to be faster :). I would love to see practical
> tutorials on doing such things and I am willing to put my attempts at
> solving problems out there to eventually get cannibalized, ridiculed and
> reimplemented properly :).
>
> Sorry for this long winded email, it did not really answer your question
> anyway :)
> Ognen
>
>
> On Wed, Jan 22, 2014 at 2:35 PM, Kal El <pi...@yahoo.com> wrote:
>
> I have created a cluster setup with 2 workers (one of them is also the
> master)
>
> Can anyone help me with a tutorial on how to run K-Means for example on
> this cluster (it would be better to run it from outside the cluster command
> line)?
>
> I am mostly interested in how to initialize the SparkContext (what jars do
> I need to add?):
> new SparkContext(master, appName, [sparkHome], [jars])) and what other
> steps I need to run.
>
> I am using the standalone spark cluster.
>
> Thanks
>
>
>
>
>
>
>

Re: Running K-Means on a cluster setup

Posted by Kal El <pi...@yahoo.com>.
@Mayur: I do not see any tutorial about how to run MLlib on a cluster, just a basic presentation not related to actually running the algorithm

@Ognen: Thanks, I have figured that out :)) that's why I need some tutorials



On Wednesday, January 22, 2014 4:59 PM, Mayur Rustagi <ma...@gmail.com> wrote:
 
How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?
Regards
Mayur


Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski <og...@nengoiksvelzud.com> wrote:

Hello,
>
>I have found that you generally need two separate pools of knowledge to be successful in this game :). One is to have enough knowledge of network topologies, systems, java, scala and whatever else to actually set up the whole system (esp. if your requirements are different than running on a local machine or in the ec2 cluster supported by the scripts that come with spark).
>
>The other is actual knowledge of the API and how it works and how to express and solve your problems using the primitives offered by spark.
>
>There is also a third: since you can supply any function to a spark primitive, you generally need to know scala or java (or python?) to actually solve your problem.
>
>I am not sure this list is viewed as an appropriate place to offer advice on how to actually solve these problems. Not that I would mind seeing various solutions to various problems :) and also optimizations.
>
>For example, I am trying to do rudimentary retention analysis. I am a total beginner in the whole map/reduce way of solving problems. I have come up with a solution that is pretty slow but implemented in 5 or 6 lines of code for the simplest problem. However, my files are 20 GB in size each, all json strings. Figuring out what the limiting factor is (my suspicion is network bandwidth, since I am accessing things via S3) is somewhat black magic to me at this point. I think for most of this stuff you will have to read the code. The bigger question after that is optimizing your solutions to be faster :). I would love to see practical tutorials on doing such things and I am willing to put my attempts at solving problems out there to eventually get cannibalized, ridiculed and reimplemented properly :).
>
>Sorry for this long winded email, it did not really answer your question anyway :)
>
>Ognen
>
>
>
>
>On Wed, Jan 22, 2014 at 2:35 PM, Kal El <pi...@yahoo.com> wrote:
>
>I have created a cluster setup with 2 workers (one of them is also the master)
>>
>>
>>Can anyone help me with a tutorial on how to run K-Means for example on this cluster (it would be better to run it from outside the cluster command line)?
>>
>>
>>I am mostly interested in how to initialize the SparkContext (what jars do I need to add?):
>>new SparkContext(master, appName, [sparkHome], [jars])) and what other steps I need to run.
>>
>>
>>I am using the standalone spark cluster.
>>
>>
>>Thanks
>>
>>
>>
>>
>

Re: Running K-Means on a cluster setup

Posted by Mayur Rustagi <ma...@gmail.com>.
How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?
Regards
Mayur

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:

> Hello,
>
> I have found that you generally need two separate pools of knowledge to be
> successful in this game :). One is to have enough knowledge of network
> topologies, systems, java, scala and whatever else to actually set up the
> whole system (esp. if your requirements are different than running on a
> local machine or in the ec2 cluster supported by the scripts that come with
> spark).
>
> The other is actual knowledge of the API and how it works and how to
> express and solve your problems using the primitives offered by spark.
>
> There is also a third: since you can supply any function to a spark
> primitive, you generally need to know scala or java (or python?) to
> actually solve your problem.
>
> I am not sure this list is viewed as an appropriate place to offer advice on
> how to actually solve these problems. Not that I would mind seeing various
> solutions to various problems :) and also optimizations.
>
> For example, I am trying to do rudimentary retention analysis. I am a
> total beginner in the whole map/reduce way of solving problems. I have come
> up with a solution that is pretty slow but implemented in 5 or 6 lines of
> code for the simplest problem. However, my files are 20 GB in size each,
> all json strings. Figuring out what the limiting factor is (my suspicion
> is network bandwidth, since I am accessing things via S3) is somewhat
> black magic to me at this point. I think for most of this
> stuff you will have to read the code. The bigger question after that is
> optimizing your solutions to be faster :). I would love to see practical
> tutorials on doing such things and I am willing to put my attempts at
> solving problems out there to eventually get cannibalized, ridiculed and
> reimplemented properly :).
>
> Sorry for this long winded email, it did not really answer your question
> anyway :)
> Ognen
>
>
> On Wed, Jan 22, 2014 at 2:35 PM, Kal El <pi...@yahoo.com> wrote:
>
>> I have created a cluster setup with 2 workers (one of them is also the
>> master)
>>
>> Can anyone help me with a tutorial on how to run K-Means for example on
>> this cluster (it would be better to run it from outside the cluster command
>> line)?
>>
>> I am mostly interested in how to initialize the SparkContext (what jars
>> do I need to add?):
>> new SparkContext(master, appName, [sparkHome], [jars])) and what other
>> steps I need to run.
>>
>> I am using the standalone spark cluster.
>>
>> Thanks
>>
>>
>>
>

Re: Running K-Means on a cluster setup

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
Hello,

I have found that you generally need two separate pools of knowledge to be
successful in this game :). One is to have enough knowledge of network
topologies, systems, java, scala and whatever else to actually set up the
whole system (esp. if your requirements are different than running on a
local machine or in the ec2 cluster supported by the scripts that come with
spark).

The other is actual knowledge of the API and how it works and how to
express and solve your problems using the primitives offered by spark.

There is also a third: since you can supply any function to a spark
primitive, you generally need to know scala or java (or python?) to
actually solve your problem.

I am not sure this list is viewed as an appropriate place to offer advice on
how to actually solve these problems. Not that I would mind seeing various
solutions to various problems :) and also optimizations.

For example, I am trying to do rudimentary retention analysis. I am a total
beginner in the whole map/reduce way of solving problems. I have come up
with a solution that is pretty slow but implemented in 5 or 6 lines of code
for the simplest problem. However, my files are 20 GB in size each, all
json strings. Figuring out what the limiting factor is (my suspicion is
network bandwidth, since I am accessing things via S3) is somewhat
black magic to me at this point. I think for most of this stuff you
will have to read the code. The bigger question after that is optimizing
your solutions to be faster :). I would love to see practical tutorials on
doing such things and I am willing to put my attempts at solving problems
out there to eventually get cannibalized, ridiculed and reimplemented
properly :).

Sorry for this long-winded email; it did not really answer your question
anyway :)
Ognen


On Wed, Jan 22, 2014 at 2:35 PM, Kal El <pi...@yahoo.com> wrote:

> I have created a cluster setup with 2 workers (one of them is also the
> master)
>
> Can anyone help me with a tutorial on how to run K-Means for example on
> this cluster (it would be better to run it from outside the cluster command
> line)?
>
> I am mostly interested in how to initialize the SparkContext (what jars do
> I need to add?):
> new SparkContext(master, appName, [sparkHome], [jars])) and what other
> steps I need to run.
>
> I am using the standalone spark cluster.
>
> Thanks
>
>
>