Posted to dev@spark.apache.org by Matt Forbes <ma...@tellapart.com> on 2014/08/04 19:23:00 UTC

Problems running modified spark version on ec2 cluster

I'm trying to run a forked version of mllib where I am experimenting with a
boosted trees implementation. Here is what I've tried, but I can't seem to
get it working properly:

*Directory layout:*

src/spark-dev  (spark github fork)
  pom.xml - I've tried changing the version to 1.2 arbitrarily in core and mllib
src/forestry  (test driver)
  pom.xml - depends on spark-core and spark-mllib with version 1.2

*spark-defaults.conf:*

spark.master                    spark://ec2-54-224-112-117.compute-1.amazonaws.com:7077
spark.verbose                   true
# I've tried both true and false here
spark.files.userClassPathFirst  false
spark.executor.memory           6G
spark.jars                      spark-mllib_2.10-1.2.0-SNAPSHOT.jar,spark-core_2.10-1.2.0-SNAPSHOT.jar,spark-streaming_2.10-1.2.0-SNAPSHOT.jar

*Build and run script:*

MASTER=root@ec2-54-224-112-117.compute-1.amazonaws.com
PRIMARY_JAR=forestry-main-1.0-SNAPSHOT-jar-with-dependencies.jar
FORESTRY_DIR=~/src/forestry-main
SPARK_DIR=~/src/spark-dev
cd $SPARK_DIR
mvn -T8 -DskipTests -pl core,mllib,streaming install
cd $FORESTRY_DIR
mvn -T8 -DskipTests package
rsync --progress ~/src/spark-dev/mllib/target/spark-mllib_2.10-1.2.0-SNAPSHOT.jar $MASTER:
rsync --progress ~/src/spark-dev/core/target/spark-core_2.10-1.2.0-SNAPSHOT.jar $MASTER:
rsync --progress ~/src/spark-dev/streaming/target/spark-streaming_2.10-1.2.0-SNAPSHOT.jar $MASTER:
rsync --progress ~/src/forestry-main/target/$PRIMARY_JAR $MASTER:
rsync --progress ~/src/forestry-main/spark-defaults.conf $MASTER:spark/conf
ssh $MASTER "spark/bin/spark-submit --class forestry.TreeTest --verbose $PRIMARY_JAR"

In spark-dev/mllib I've added a new class, GradientBoostingTree, which I
reference from TreeTest in my test driver. The driver pulls some data from
s3, converts it to LabeledPoint, and then calls
GradientBoostingTree.train(...) the same way DecisionTree is used. This all
works until we call examples.map { x => tree.predict(x.features) }, where
tree is a DecisionTree that I've also modified in my fork. At that point the
workers blow up because they can't find a new method I've added to the
tree.model.Node class. My suspicion is that the workers have deserialized
the DecisionTreeModel against a different version of mllib that doesn't have
my changes.
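
One way to test that suspicion is to compare a checksum of the locally built
jar with the jar a worker actually loads. The following is only a minimal
shell sketch: the function names are made up for illustration and are not
part of Spark, and it assumes sha256sum is installed.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Spark): compare two jar files by checksum.

# Print only the SHA-256 digest of a file.
jar_digest() {
    sha256sum "$1" | cut -d ' ' -f 1
}

# Succeed (exit 0) when both files have identical contents.
same_build() {
    [ "$(jar_digest "$1")" = "$(jar_digest "$2")" ]
}
```

To check a remote copy, the same digest can be printed on the worker over ssh
and compared by eye. Differing digests mean the worker is not running the
jar that was just built.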

Is my setup all wrong? I'm using an EC2 cluster because it is so easy to
start up and manage. Maybe I need to fully distribute my new version of
Spark to all the workers before starting the job? Is there an easy way to
do that?

Re: Problems running modified spark version on ec2 cluster

Posted by Matt Forbes <ma...@tellapart.com>.

After rummaging through the worker instances I noticed they were using the
assembly jar (which I hadn't noticed before). Now, instead of using the core
and mllib jars individually, I'm just overwriting the assembly jar on the
master and using spark-ec2/copy-dir to push it to the workers. For posterity,
my run script is:

MASTER=root@ec2-54-224-110-72.compute-1.amazonaws.com
PRIMARY_JAR=forestry-main-1.0-SNAPSHOT-jar-with-dependencies.jar
ASSEMBLY_SRC=spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar
ASSEMBLY_DEST=spark-assembly-1.0.1-hadoop1.0.4.jar
FORESTRY_DIR=~/src/forestry-main
SPARK_DIR=~/src/spark-dev
cd $SPARK_DIR
mvn -T8 -DskipTests -pl core,mllib,assembly install
cd $FORESTRY_DIR
mvn -T8 -DskipTests package
rsync --progress ~/src/spark-dev/assembly/target/scala-2.10/$ASSEMBLY_SRC $MASTER:spark/lib/$ASSEMBLY_DEST
rsync --progress ~/src/forestry-main/target/$PRIMARY_JAR $MASTER:
rsync --progress ~/src/forestry-main/spark-defaults.conf $MASTER:spark/conf
ssh $MASTER "spark-ec2/copy-dir --delete /root/spark/lib"
ssh $MASTER "spark/bin/spark-submit --class com.ttforbes.TreeTest --verbose $PRIMARY_JAR"
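
For anyone curious what pushing a directory to every worker involves, here is
a minimal shell sketch of the general idea. It is not the source of
spark-ec2/copy-dir: the function name, the one-hostname-per-line hosts file,
and the rsync flags are all assumptions for illustration. It only prints the
commands so they can be reviewed before anything is copied.

```shell
#!/bin/sh
# Hypothetical sketch, not the real spark-ec2/copy-dir script.
# Prints one rsync command per worker listed in a hosts file
# (one hostname per line; blank lines and lines starting with # are skipped).
print_sync_commands() {
    hosts_file=$1
    dir=$2
    # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r host || [ -n "$host" ]; do
        case $host in
            ''|'#'*) continue ;;
        esac
        printf 'rsync -az --delete %s/ root@%s:%s/\n' "$dir" "$host" "$dir"
    done < "$hosts_file"
}
```

Called as print_sync_commands workers.txt /root/spark/lib (workers.txt being
a hypothetical file name), it emits one line per worker. Piping that output
to sh would run the copies, assuming the paths and hostnames contain no
characters that need shell quoting.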


