Posted to user@spark.apache.org by Aureliano Buendia <bu...@gmail.com> on 2014/01/02 11:40:46 UTC

Spark context jar confusions

Hi,

I do not understand why the Spark context has an option for loading jars at
runtime.

As an example, consider this:
<https://github.com/apache/incubator-spark/blob/50fd8d98c00f7db6aa34183705c9269098c62486/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala#L36>

object BroadcastTest {
  def main(args: Array[String]) {

    val sc = new SparkContext(args(0), "Broadcast Test",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))

  }
}


This is *the* example, or *the* application, that we want to run, so what
is SPARK_EXAMPLES_JAR supposed to be?
In this particular case, the BroadcastTest example is self-contained, so
why would it want to load other, unrelated example jars?

Finally, how does this help a real-world Spark application?

Re: Spark context jar confusions

Posted by Eugen Cepoi <ce...@gmail.com>.
Spark will send the closures to the workers. If you don't have any external
dependency in your closure (using only Spark types and Scala/Java), it will
work fine. But now suppose you use some classes you have defined in your
project, or depend on some common libs like joda-time. The workers don't
know about those classes; they must be on their classpath. Thus you need to
tell the Spark context which jars must be added to the classpath and
shipped to the workers. Building a fat jar is just easier than maintaining a
list of jars.

To test it you can try with the Spark shell; do something like:

sc.makeRDD(Seq(DateTime.now(), DateTime.now())).map(date => date.getMillis -> date).collect

When launching the shell, do:

SPARK_CLASSPATH=path/to/joda-time.jar spark-shell

If you don't also do sc.addJar("path/to/joda-time.jar"), you will get
ClassNotFoundExceptions on the workers.
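
For completeness, a minimal, self-contained sketch of the same idea in a
standalone application rather than the shell, assuming the 0.8-era
SparkContext constructor used elsewhere in this thread; the object name,
master URL and jar path are hypothetical:

import org.apache.spark.SparkContext
import org.joda.time.DateTime

object JodaOnWorkers {
  def main(args: Array[String]) {
    // Pass the extra jars the closures need; Spark ships them to each worker,
    // so the map closure below can resolve org.joda.time.DateTime remotely.
    val sc = new SparkContext("spark://localhost:7077", "Joda test",
      System.getenv("SPARK_HOME"), Seq("path/to/joda-time.jar"))

    // Equivalent alternative once the context exists:
    // sc.addJar("path/to/joda-time.jar")

    val result = sc.makeRDD(Seq(DateTime.now(), DateTime.now()))
      .map(date => date.getMillis -> date)
      .collect()

    result.foreach(println)
    sc.stop()
  }
}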





2014/1/2 Archit Thakur <ar...@gmail.com>

> Eugen, you said Spark sends the jar to each worker if we specify it. What
> if we only create a fat jar and do not do the sc.jarOfClass(class)? If we
> have created a fat jar, won't all of the classes be available on the slave
> nodes? What if we access them in code which is supposed to be executed on one
> of the slave nodes? E.g. object Z, which is present in the fat jar and is
> accessed in the map function (which is executed in a distributed way). Won't
> it be accessible (since it is resolved at compile time)? It usually is, isn't it?
>
>
> On Thu, Jan 2, 2014 at 6:02 PM, Archit Thakur <ar...@gmail.com> wrote:
>
>> Aureliano, it doesn't matter actually. All that specifying "local" as your
>> Spark master does is use a single JVM to run the whole application. Making
>> a cluster and then specifying "spark://localhost:7077" runs it on a set
>> of machines. Running Spark in local mode will be helpful for debugging
>> purposes but will perform much slower than if you have a cluster of 3-4-n
>> machines. If you do not have a set of machines, you can make your same
>> machine a slave and start both master and slave on the same machine. Go
>> through the Apache Spark home page to know more about starting the various nodes. Thx.
>>
>>
>>
>> On Thu, Jan 2, 2014 at 5:21 PM, Aureliano Buendia <bu...@gmail.com> wrote:
>>
>>> When developing the Spark application, do you use "localhost" or
>>> "spark://localhost:7077" for the Spark context master?
>>>
>>> Using "spark://localhost:7077" is a good way to simulate the production
>>> driver, and it provides the web UI. When using "spark://localhost:7077", is
>>> it required to create the fat jar? Wouldn't that significantly slow down
>>> the development cycle?
>>>
>>>
>>> On Thu, Jan 2, 2014 at 11:38 AM, Eugen Cepoi <ce...@gmail.com> wrote:
>>>
>>>> It depends how you deploy, I don't find it so complicated...
>>>>
>>>> 1) To build the fat jar I am using maven (as I am not familiar with
>>>> sbt).
>>>>
>>>> Inside I have something like that, saying which libs should be used in
>>>> the fat jar (the others won't be present in the final artifact).
>>>>
>>>> <plugin>
>>>>                 <groupId>org.apache.maven.plugins</groupId>
>>>>                 <artifactId>maven-shade-plugin</artifactId>
>>>>                 <version>2.1</version>
>>>>                 <executions>
>>>>                     <execution>
>>>>                         <phase>package</phase>
>>>>                         <goals>
>>>>                             <goal>shade</goal>
>>>>                         </goals>
>>>>                         <configuration>
>>>>                             <minimizeJar>true</minimizeJar>
>>>>                             <createDependencyReducedPom>false</createDependencyReducedPom>
>>>>                             <artifactSet>
>>>>                                 <includes>
>>>>                                     <include>org.apache.hbase:*</include>
>>>>                                     <include>org.apache.hadoop:*</include>
>>>>                                     <include>com.typesafe:config</include>
>>>>                                     <include>org.apache.avro:*</include>
>>>>                                     <include>joda-time:*</include>
>>>>                                     <include>org.joda:*</include>
>>>>                                 </includes>
>>>>                             </artifactSet>
>>>>                             <filters>
>>>>                                 <filter>
>>>>                                     <artifact>*:*</artifact>
>>>>                                     <excludes>
>>>>                                         <exclude>META-INF/*.SF</exclude>
>>>>                                         <exclude>META-INF/*.DSA</exclude>
>>>>                                         <exclude>META-INF/*.RSA</exclude>
>>>>                                     </excludes>
>>>>                                 </filter>
>>>>                             </filters>
>>>>                         </configuration>
>>>>                     </execution>
>>>>                 </executions>
>>>>             </plugin>
>>>>
>>>>
>>>> 2) The app is the jar you have built, so you ship it to the driver node
>>>> (it depends a lot on how you are planning to use it: Debian packaging, a
>>>> plain old scp, etc). To run it you can do something like:
>>>>
>>>> SPARK_CLASSPATH=PathToYour.jar $SPARK_HOME/spark-class com.myproject.MyJob
>>>>
>>>> where MyJob is the entry point to your job; it defines a main method.
>>>>
>>>> 3) I don't know what the "common way" is, but I am doing things this way:
>>>> build the fat jar, provide some launch scripts, make Debian packaging, ship
>>>> it to a node that plays the role of the driver, and run it over Mesos using
>>>> the launch scripts + some conf.
>>>>
>>>>
>>>> 2014/1/2 Aureliano Buendia <bu...@gmail.com>
>>>>
>>>>> I wasn't aware of jarOfClass. I wish there were only one good way of
>>>>> deploying in Spark, instead of many ambiguous methods. (It seems Spark
>>>>> has followed Scala in that there is more than one way of accomplishing a
>>>>> job, making Scala an overcomplicated language.)
>>>>>
>>>>> 1. Should sbt assembly be used to make the fat jar? If so, which sbt
>>>>> should be used? My local sbt or that $SPARK_HOME/sbt/sbt? Why is it that
>>>>> Spark is shipped with a separate sbt?
>>>>>
>>>>> 2. Let's say we have the dependencies fat jar which is supposed to be
>>>>> shipped to the workers. Now how do we deploy the main app which is supposed
>>>>> to be executed on the driver? Make another jar out of it? Does sbt
>>>>> assembly also create that jar?
>>>>>
>>>>> 3. Is calling sc.jarOfClass() the most common way of doing this? I
>>>>> cannot find any example by googling. What's the most common way that people
>>>>> use?
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 2, 2014 at 10:58 AM, Eugen Cepoi <ce...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This is the list of the jars you use in your job; the driver will
>>>>>> send all those jars to each worker (otherwise the workers won't have the
>>>>>> classes you need in your job). The easy way to go is to build a fat jar
>>>>>> with your code and all the libs you depend on, and then use this utility to
>>>>>> get the path: SparkContext.jarOfClass(YourJob.getClass)
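
As an illustration of that jarOfClass approach, a minimal sketch follows; the
MyJob name is hypothetical, and jarOfClass returns a Seq or an Option
depending on the Spark version, hence the .toSeq:

import org.apache.spark.SparkContext

object MyJob {
  def main(args: Array[String]) {
    // jarOfClass looks up the jar that contains the given class. When MyJob is
    // run from the fat jar, that resolves to the fat jar itself, which the
    // driver then ships to every worker.
    val jars = SparkContext.jarOfClass(this.getClass).toSeq

    val sc = new SparkContext(args(0), "My Job", System.getenv("SPARK_HOME"), jars)

    println(sc.parallelize(1 to 1000).reduce(_ + _))
    sc.stop()
  }
}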

Re: Spark context jar confusions

Posted by Archit Thakur <ar...@gmail.com>.
Eugen, you said Spark sends the jar to each worker if we specify it. What
if we only create a fat jar and do not do the sc.jarOfClass(class)? If we
have created a fat jar, won't all of the classes be available on the slave
nodes? What if we access them in code which is supposed to be executed on one
of the slave nodes? E.g. object Z, which is present in the fat jar and is
accessed in the map function (which is executed in a distributed way). Won't
it be accessible (since it is resolved at compile time)? It usually is, isn't it?


Re: Spark context jar confusions

Posted by Archit Thakur <ar...@gmail.com>.
Aureliano, it doesn't matter actually. All that specifying "local" as your
Spark master does is use a single JVM to run the whole application. Making
a cluster and then specifying "spark://localhost:7077" runs it on a set of
machines. Running Spark in local mode will be helpful for debugging
purposes but will perform much slower than if you have a cluster of 3-4-n
machines. If you do not have a set of machines, you can make your same
machine a slave and start both master and slave on the same machine. Go
through the Apache Spark home page to know more about starting the various nodes. Thx.
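
For reference, a minimal sketch of how those master strings end up in code,
assuming the 0.8-era constructor used elsewhere in this thread; the object
name is hypothetical and the jars argument is left empty for local runs:

import org.apache.spark.SparkContext

object MasterExample {
  def main(args: Array[String]) {
    // Typical master values:
    //   "local"                  -- everything runs in a single JVM (handy for debugging)
    //   "local[2]"               -- local mode with 2 worker threads
    //   "spark://localhost:7077" -- a standalone cluster; the master and a worker
    //                               can both be started on this same machine
    val master = if (args.nonEmpty) args(0) else "local"

    val sc = new SparkContext(master, "Master example",
      System.getenv("SPARK_HOME"), Nil) // on a real cluster, pass Seq("path/to/your-fat.jar")

    println(sc.parallelize(1 to 100).count())
    sc.stop()
  }
}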



Re: Spark context jar confusions

Posted by Eugen Cepoi <ce...@gmail.com>.
2014/1/2 Aureliano Buendia <bu...@gmail.com>

>
>
>
> On Thu, Jan 2, 2014 at 1:19 PM, Eugen Cepoi <ce...@gmail.com> wrote:
>
>> When developing I am using local[2], which launches a local cluster with 2
>> workers. In most cases it is fine; I just encountered some strange
>> behaviours with broadcast variables: in local mode no broadcast is done
>> (at least in 0.8).
>>
>
> That's not good. This could hide bugs in production.
>

That depends on what you want to test... Spark is really easy to unit test;
IMO when developing you don't need a full cluster.


>
>
>> You also have access to the UI in that case at localhost:4040.
>>
>
> That server has a short life, it dies when the program exits.
>

Sure, but you are developing at that moment; you want to write unit tests
and make sure they pass, no?


>
>>
>> In dev mode I am directly launching my main class from IntelliJ, so no, I
>> don't need to build the fat jar.
>>
>
> Why is it not possible to work with spark://localhost:7077 while
> developing? That would allow me to monitor and review the jobs, while
> keeping a record of past jobs.
>
> I've never been able to connect to spark://localhost:7077 in development;
> I get:
>
> WARN cluster.ClusterScheduler: Initial job has not accepted any resources;
> check your cluster UI to ensure that workers are registered and have
> sufficient memory
>
>
Try setting spark.executor.memory; see
http://spark.incubator.apache.org/docs/latest/configuration.html


> The UI says the workers are alive and they do have plenty of memory. Also,
> I tried the exact Spark master name given by the UI with no luck
> (apparently Akka is too fragile and sensitive to this). Also, turning off
> the firewall on OS X had no effect.
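
A minimal sketch of that suggestion, assuming the 0.8-era convention of
setting spark.* options as Java system properties before the context is
created; the object name and master URL are placeholders:

import org.apache.spark.SparkContext

object ExecutorMemoryExample {
  def main(args: Array[String]) {
    // Ask the standalone cluster for 2g per executor; spark.* properties must
    // be set before the SparkContext is created, or they are ignored.
    System.setProperty("spark.executor.memory", "2g")

    val sc = new SparkContext("spark://localhost:7077", "Memory example",
      System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass).toSeq)

    println(sc.parallelize(1 to 1000).count())
    sc.stop()
  }
}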

Re: Spark context jar confusions

Posted by Aureliano Buendia <bu...@gmail.com>.
On Thu, Jan 2, 2014 at 1:19 PM, Eugen Cepoi <ce...@gmail.com> wrote:

> When developing I am using local[2], which launches a local cluster with 2
> workers. In most cases it is fine; I just encountered some strange
> behaviours with broadcast variables: in local mode no broadcast is done
> (at least in 0.8).
>

That's not good. This could hide bugs in production.


> You also have access to the UI in that case at localhost:4040.
>

That server has a short life, it dies when the program exits.


>
> In dev mode I am directly launching my main class from IntelliJ, so no, I
> don't need to build the fat jar.
>

Why is it not possible to work with spark://localhost:7077 while
developing? That would allow me to monitor and review the jobs, while keeping
a record of past jobs.

I've never been able to connect to spark://localhost:7077 in development; I
get:

WARN cluster.ClusterScheduler: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient memory

The UI says the workers are alive and they do have plenty of memory. Also,
I tried the exact Spark master name given by the UI with no luck
(apparently Akka is too fragile and sensitive to this). Also, turning off
the firewall on OS X had no effect.



Re: Spark context jar confusions

Posted by Eugen Cepoi <ce...@gmail.com>.
When developing I am using local[2], which launches a local cluster with 2
workers. In most cases it is fine; I just encountered some strange
behaviours with broadcast variables: in local mode no broadcast is done
(at least in 0.8). You also have access to the UI in that case at
localhost:4040.

In dev mode I am directly launching my main class from IntelliJ, so no, I
don't need to build the fat jar.
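
A minimal sketch of such a dev-mode entry point, runnable straight from the
IDE without a fat jar; the object name and sample data are made up for
illustration:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object DevModeJob {
  def main(args: Array[String]) {
    // local[2] runs an in-process "cluster" with 2 worker threads, so no fat
    // jar and no jars argument are needed: everything is already on the IDE's
    // classpath. The UI is available at http://localhost:4040 while it runs.
    val sc = new SparkContext("local[2]", "Dev mode job")

    val wordCounts = sc.parallelize(Seq("spark jar", "fat jar", "spark"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()

    wordCounts.foreach(println)
    sc.stop()
  }
}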



Re: Spark context jar confusions

Posted by Aureliano Buendia <bu...@gmail.com>.
When developing the Spark application, do you use "localhost" or
"spark://localhost:7077" for the Spark context master?

Using "spark://localhost:7077" is a good way to simulate the production
driver, and it provides the web UI. When using "spark://localhost:7077", is
it required to create the fat jar? Wouldn't that significantly slow down
the development cycle?



Re: Spark context jar confusions

Posted by Eugen Cepoi <ce...@gmail.com>.
Indeed you don't need it; just make sure that it is on your classpath. But
anyway the jar is not so big; compared to what your job will do next,
sending a few MB over the network seems OK to me.



Re: Spark context jar confusions

Posted by Aureliano Buendia <bu...@gmail.com>.
Eugen, I noticed that you are including Hadoop in your fat jar:

<include>org.apache.hadoop:*</include>

This would take up a big chunk of the fat jar. Isn't this jar already
included in Spark?


On Thu, Jan 2, 2014 at 11:38 AM, Eugen Cepoi <ce...@gmail.com> wrote:

> It depends on how you deploy; I don't find it so complicated...
>
> 1) To build the fat jar I am using maven (as I am not familiar with sbt).
>
> Inside I have something like this, saying which libs should be used in the
> fat jar (the others won't be present in the final artifact).
>
> <plugin>
>                 <groupId>org.apache.maven.plugins</groupId>
>                 <artifactId>maven-shade-plugin</artifactId>
>                 <version>2.1</version>
>                 <executions>
>                     <execution>
>                         <phase>package</phase>
>                         <goals>
>                             <goal>shade</goal>
>                         </goals>
>                         <configuration>
>                             <minimizeJar>true</minimizeJar>
>                             <createDependencyReducedPom>false</createDependencyReducedPom>
>                             <artifactSet>
>                                 <includes>
>                                     <include>org.apache.hbase:*</include>
>                                     <include>org.apache.hadoop:*</include>
>                                     <include>com.typesafe:config</include>
>                                     <include>org.apache.avro:*</include>
>                                     <include>joda-time:*</include>
>                                     <include>org.joda:*</include>
>                                 </includes>
>                             </artifactSet>
>                             <filters>
>                                 <filter>
>                                     <artifact>*:*</artifact>
>                                     <excludes>
>                                         <exclude>META-INF/*.SF</exclude>
>                                         <exclude>META-INF/*.DSA</exclude>
>                                         <exclude>META-INF/*.RSA</exclude>
>                                     </excludes>
>                                 </filter>
>                             </filters>
>                         </configuration>
>                     </execution>
>                 </executions>
>             </plugin>
>
>
> 2) The App is the jar you have built, so you ship it to the driver node
> (it depends a lot on how you are planning to use it: debian packaging, a
> plain old scp, etc). To run it you can do something like:
>
> SPARK_CLASSPATH=PathToYour.jar $SPARK_HOME/spark-class com.myproject.MyJob
>
> where MyJob is the entry point to your job; it defines a main method.
>
> 3) I don't know what's the "common way", but I am doing things this way:
> build the fat jar, provide some launch scripts, make debian packaging, ship
> it to a node that plays the role of the driver, run it over mesos using the
> launch scripts + some conf.
>
>
> 2014/1/2 Aureliano Buendia <bu...@gmail.com>
>
>> I wasn't aware of jarOfClass. I wish there was only one good way of
>> deploying in spark, instead of many ambiguous methods. (It seems like spark
>> has followed scala in that there is more than one way of accomplishing a
>> job, making scala an overcomplicated language.)
>>
>> 1. Should sbt assembly be used to make the fat jar? If so, which sbt
>> should be used? My local sbt or the one at $SPARK_HOME/sbt/sbt? Why is it
>> that spark is shipped with a separate sbt?
>>
>> 2. Let's say we have the dependencies fat jar which is supposed to be
>> shipped to the workers. Now how do we deploy the main app which is supposed
>> to be executed on the driver? Make another jar out of it? Does sbt
>> assembly also create that jar?
>>
>> 3. Is calling sc.jarOfClass() the most common way of doing this? I cannot
>> find any example by googling. What's the most common way that people use?
>>
>>
>>
>> On Thu, Jan 2, 2014 at 10:58 AM, Eugen Cepoi <ce...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> This is the list of the jars you use in your job; the driver will send
>>> all those jars to each worker (otherwise the workers won't have the classes
>>> you need in your job). The easy way to go is to build a fat jar with your
>>> code and all the libs you depend on and then use this utility to get the
>>> path: SparkContext.jarOfClass(YourJob.getClass)
>>>
>>>
>>> 2014/1/2 Aureliano Buendia <bu...@gmail.com>
>>>
>>>> Hi,
>>>>
>>>> I do not understand why spark context has an option for loading jars at
>>>> runtime.
>>>>
>>>> As an example, consider this<https://github.com/apache/incubator-spark/blob/50fd8d98c00f7db6aa34183705c9269098c62486/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala#L36>
>>>> :
>>>>
>>>> object BroadcastTest {
>>>>   def main(args: Array[String]) {
>>>>
>>>>     val sc = new SparkContext(args(0), "Broadcast Test",
>>>>       System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>>>>
>>>>   }
>>>> }
>>>>
>>>>
>>>> This is *the* example, or *the* application that we want to run; what is SPARK_EXAMPLES_JAR supposed to be?
>>>> In this particular case, the BroadcastTest example is self-contained, so why would it want to load other unrelated example jars?
>>>>
>>>> Finally, how does this help a real world spark application?
>>>>
>>>>
>>>
>>
>

Re: Spark context jar confusions

Posted by Eugen Cepoi <ce...@gmail.com>.
It depends on how you deploy; I don't find it so complicated...

1) To build the fat jar I am using maven (as I am not familiar with sbt).

Inside I have something like this, saying which libs should be used in the
fat jar (the others won't be present in the final artifact).

<plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <minimizeJar>true</minimizeJar>
                            <createDependencyReducedPom>false</createDependencyReducedPom>
                            <artifactSet>
                                <includes>
                                    <include>org.apache.hbase:*</include>
                                    <include>org.apache.hadoop:*</include>
                                    <include>com.typesafe:config</include>
                                    <include>org.apache.avro:*</include>
                                    <include>joda-time:*</include>
                                    <include>org.joda:*</include>
                                </includes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>


2) The App is the jar you have built, so you ship it to the driver node (it
depends a lot on how you are planning to use it: debian packaging, a plain
old scp, etc). To run it you can do something like:

SPARK_CLASSPATH=PathToYour.jar $SPARK_HOME/spark-class com.myproject.MyJob

where MyJob is the entry point to your job; it defines a main method (a
minimal sketch of such an entry point is below).

3) I don't know what's the "common way", but I am doing things this way:
build the fat jar, provide some launch scripts, make debian packaging, ship
it to a node that plays the role of the driver, run it over mesos using the
launch scripts + some conf.
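
To make 2) and 3) a bit more concrete, here is a minimal driver-side sketch
(the package and object names just mirror the placeholder com.myproject.MyJob
from the command above, and the RDD work is only an illustration, not part of
any real job); it uses SparkContext.jarOfClass so the fat jar containing the
job gets shipped to the workers:

package com.myproject

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object MyJob {
  def main(args: Array[String]) {
    // args(0) is the master URL, e.g. spark://host:7077 or local[4]
    // jarOfClass locates the jar that contains this class -- the fat jar --
    // so the context can distribute it to the workers.
    val jars = SparkContext.jarOfClass(MyJob.getClass).toSeq

    val sc = new SparkContext(args(0), "MyJob",
      System.getenv("SPARK_HOME"), jars)

    // placeholder job logic, just to have something to run
    val counts = sc.parallelize(1 to 1000).map(_ % 10).countByValue()
    println(counts)

    sc.stop()
  }
}

With that, SPARK_CLASSPATH in the command above only needs to point at the
same fat jar so the driver itself can load com.myproject.MyJob.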


2014/1/2 Aureliano Buendia <bu...@gmail.com>

> I wasn't aware of jarOfClass. I wish there was only one good way of
> deploying in spark, instead of many ambiguous methods. (It seems like spark
> has followed scala in that there is more than one way of accomplishing a
> job, making scala an overcomplicated language.)
>
> 1. Should sbt assembly be used to make the fat jar? If so, which sbt
> should be used? My local sbt or the one at $SPARK_HOME/sbt/sbt? Why is it
> that spark is shipped with a separate sbt?
>
> 2. Let's say we have the dependencies fat jar which is supposed to be
> shipped to the workers. Now how do we deploy the main app which is supposed
> to be executed on the driver? Make another jar out of it? Does sbt
> assembly also create that jar?
>
> 3. Is calling sc.jarOfClass() the most common way of doing this? I cannot
> find any example by googling. What's the most common way that people use?
>
>
>
> On Thu, Jan 2, 2014 at 10:58 AM, Eugen Cepoi <ce...@gmail.com> wrote:
>
>> Hi,
>>
>> This is the list of the jars you use in your job; the driver will send
>> all those jars to each worker (otherwise the workers won't have the classes
>> you need in your job). The easy way to go is to build a fat jar with your
>> code and all the libs you depend on and then use this utility to get the
>> path: SparkContext.jarOfClass(YourJob.getClass)
>>
>>
>> 2014/1/2 Aureliano Buendia <bu...@gmail.com>
>>
>>> Hi,
>>>
>>> I do not understand why spark context has an option for loading jars at
>>> runtime.
>>>
>>> As an example, consider this<https://github.com/apache/incubator-spark/blob/50fd8d98c00f7db6aa34183705c9269098c62486/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala#L36>
>>> :
>>>
>>> object BroadcastTest {
>>>   def main(args: Array[String]) {
>>>
>>>     val sc = new SparkContext(args(0), "Broadcast Test",
>>>       System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>>>
>>>   }
>>> }
>>>
>>>
>>> This is *the* example, or *the* application that we want to run; what is SPARK_EXAMPLES_JAR supposed to be?
>>> In this particular case, the BroadcastTest example is self-contained, so why would it want to load other unrelated example jars?
>>>
>>> Finally, how does this help a real world spark application?
>>>
>>>
>>
>

Re: Spark context jar confusions

Posted by Aureliano Buendia <bu...@gmail.com>.
I wasn't aware of jarOfClass. I wish there was only one good way of
deploying in spark, instead of many ambiguous methods. (It seems like spark
has followed scala in that there is more than one way of accomplishing a
job, making scala an overcomplicated language.)

1. Should sbt assembly be used to make the fat jar (see the sketch below for
the kind of setup I mean)? If so, which sbt should be used? My local sbt or
the one at $SPARK_HOME/sbt/sbt? Why is it that spark is shipped with a
separate sbt?

2. Let's say we have the dependencies fat jar which is supposed to be
shipped to the workers. Now how do we deploy the main app which is supposed
to be executed on the driver? Make another jar out of it? Does sbt
assembly also create that jar?

3. Is calling sc.jarOfClass() the most common way of doing this? I cannot
find any example by googling. What's the most common way that people use?
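
(For what it's worth, the sbt-assembly setup I have in mind for question 1
would look roughly like this -- the plugin, Spark and Scala versions here are
just guesses on my part, not something I have verified:)

// project/plugins.sbt -- the sbt-assembly plugin (version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt
import AssemblyKeys._

assemblySettings

name := "my-spark-job"

scalaVersion := "2.9.3"

libraryDependencies ++= Seq(
  // "provided" keeps spark itself out of the fat jar, since the cluster already ships it
  "org.apache.spark" %% "spark-core" % "0.8.1-incubating" % "provided",
  "joda-time" % "joda-time" % "2.3"
)

// running `sbt assembly` then writes target/scala-2.9.3/my-spark-job-assembly-<version>.jar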



On Thu, Jan 2, 2014 at 10:58 AM, Eugen Cepoi <ce...@gmail.com> wrote:

> Hi,
>
> This is the list of the jars you use in your job; the driver will send all
> those jars to each worker (otherwise the workers won't have the classes you
> need in your job). The easy way to go is to build a fat jar with your code
> and all the libs you depend on and then use this utility to get the path:
> SparkContext.jarOfClass(YourJob.getClass)
>
>
> 2014/1/2 Aureliano Buendia <bu...@gmail.com>
>
>> Hi,
>>
>> I do not understand why spark context has an option for loading jars at
>> runtime.
>>
>> As an example, consider this<https://github.com/apache/incubator-spark/blob/50fd8d98c00f7db6aa34183705c9269098c62486/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala#L36>
>> :
>>
>> object BroadcastTest {
>>   def main(args: Array[String]) {
>>
>>     val sc = new SparkContext(args(0), "Broadcast Test",
>>       System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>>
>>   }
>> }
>>
>>
>> This is *the* example, or *the* application that we want to run; what is SPARK_EXAMPLES_JAR supposed to be?
>> In this particular case, the BroadcastTest example is self-contained, so why would it want to load other unrelated example jars?
>>
>> Finally, how does this help a real world spark application?
>>
>>
>

Re: Spark context jar confusions

Posted by Eugen Cepoi <ce...@gmail.com>.
Hi,

This is the list of the jars you use in your job; the driver will send all
those jars to each worker (otherwise the workers won't have the classes you
need in your job). The easy way to go is to build a fat jar with your code
and all the libs you depend on and then use this utility to get the path:
SparkContext.jarOfClass(YourJob.getClass)
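
As a rough sketch of where that path ends up (YourJob, the master URL and the
extra jar path are placeholders): the result of jarOfClass goes into the jars
argument of the SparkContext constructor, and sc.addJar can register further
jars after the context is created:

import org.apache.spark.SparkContext

object YourJob {
  def main(args: Array[String]) {
    // The jar containing this class -- i.e. your fat jar -- gets shipped to the workers.
    val jars = SparkContext.jarOfClass(YourJob.getClass).toSeq

    val sc = new SparkContext("spark://master:7077", "YourJob",
      System.getenv("SPARK_HOME"), jars)

    // Extra jars can also be registered once the context exists.
    sc.addJar("/path/to/some-extra-lib.jar")

    sc.stop()
  }
}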


2014/1/2 Aureliano Buendia <bu...@gmail.com>

> Hi,
>
> I do not understand why spark context has an option for loading jars at
> runtime.
>
> As an example, consider this<https://github.com/apache/incubator-spark/blob/50fd8d98c00f7db6aa34183705c9269098c62486/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala#L36>
> :
>
> object BroadcastTest {
>   def main(args: Array[String]) {
>
>     val sc = new SparkContext(args(0), "Broadcast Test",
>       System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>
>   }
> }
>
>
> This is *the* example, or *the* application that we want to run; what is SPARK_EXAMPLES_JAR supposed to be?
> In this particular case, the BroadcastTest example is self-contained, so why would it want to load other unrelated example jars?
>
> Finally, how does this help a real world spark application?
>
>