Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/10/19 19:49:03 UTC

Upgrade to Spark 1.1.0?

Several people have experienced problems using Spark 1.0.2 with Mahout so I tried it. Spark 1.0.1 is no longer a recommended version and so is a little harder to get, and people seem to be using newer versions. I discovered that Mahout compiles with 1.0.2 in the pom and executes the tests, but fails a simple test on a cluster. It hits an anonymous function name error, which causes a class not found. This looks like a Scala thing but I'm not sure. At first blush this means we can't upgrade to Spark 1.0.2 without some relatively deep diving, so I'm giving up on it for now and trying Spark 1.1.0, the current stable version that actually had an RC cycle. It uses the same version of Scala as 1.0.1.

On Spark 1.1.0 Mahout builds and runs its tests fine, but on a cluster I get a class not found for a random number generator used in mahout common. I think it's because it is never packaged as a dependency in the "job" jar assembly, so I tried adding it to the spark pom. I'm not sure if this is the right way to solve this, so if anyone has a better idea please speak up.

Getting off the dubious Spark 1.0.1 version is turning out to be a bit of work. Does anyone object to upgrading our Spark dependency? I’m not sure if Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean upgrading your Spark cluster.   

Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Mahout context does not include _all_ possible transitive dependencies.
It would not be lightning fast if it took on all the legacy etc. dependencies.

There's an "ignored" unit test that asserts context path correctness. You
can "unignore" it and run it to verify it still works as expected. The reason
it is set to "ignored" is that it requires a mahout environment + an already
built mahout in order to run successfully. I can probably look it up if you
don't find it immediately.


Now.
The mahout context only includes what's really used in the drm algebra, which
is just a handful of jars. Apache commons math is not one of them.

But, your driver can add it when creating the mahout context, by tinkering
additionally with the method parameters there (such as the spark config).
However, you may encounter a problem, which may be that the mahout assembly
currently may not build -- and copy -- the commons math jar into any part of
the mahout tree.
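
As a rough sketch of that approach (assuming only the stock Spark 1.x SparkConf/SparkContext API; the jar paths and the exact Mahout context-creation method are placeholders, not verified against the current code):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: explicitly ship the job jar plus commons-math3 to the executors
// by listing them in the SparkConf handed to the context. Paths are hypothetical.
val conf = new SparkConf()
  .setMaster("spark://Maclaurin.local:7077")
  .setAppName("mahout-driver")
  .setJars(Seq(
    "/path/to/mahout-spark_2.10-1.0-SNAPSHOT-job.jar",
    "/path/to/commons-math3.jar"))

// This conf would then be passed to whatever creates the Mahout context
// (createMahoutContext or its equivalent), or used directly:
val sc = new SparkContext(conf)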

Finally, I am against adding commons-math by default, as general algebra
does not depend on it. I'd suggest, in order of preference, (1) getting rid of
the reliance on the commons math random generator (surely, by now we should be
ok with scala.Random or even standard random?), or (2) adding the dependency in
a custom way per the above.

If there's an extremely compelling reason why the commons-math random gen
dependency cannot be eliminated, then a better way is to include commons
math in the assembly (I think right now the only assembly that really copies
in dependencies is the examples one, which is probably wrong, as the examples
are not the core product here), and add it explicitly to the createMahoutContext
(or whatever that method's name was) code.

My understanding is that the random from utils was mainly encouraged because it
is automatically made deterministic in tests. I am unaware of any fundamental
deficiencies of scala random w.r.t. its uses in existing methods. So perhaps
the scala side needs its own "RandomUtils" for testing that does not rely on
commons math.
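
As a rough illustration of that idea, a Scala-side test helper might look something like this minimal sketch (the object name, seed, and test switch are made up for illustration; it assumes only scala.util.Random):

import scala.util.Random

// Hypothetical Scala-side stand-in for the commons-math based RandomUtils:
// normal callers get a normally seeded Random, while tests flip a switch to
// get a fixed seed so results are reproducible.
object ScalaRandomUtils {
  private val TestSeed = 0xdeadbeefL
  @volatile private var deterministic = false

  def useTestSeed(): Unit = { deterministic = true }

  def getRandom: Random =
    if (deterministic) new Random(TestSeed) else new Random()
}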


On Sun, Oct 19, 2014 at 4:36 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Trying to upgrade from Spark 1.0.1 to 1.1.0. Can’t imagine the upgrade is
> the problem but anyway...
>
> I get a NoClassDefFoundError for RandomGenerator when running a driver
> from the CLI. But only when using a named master, even a standalone master.
> If I run using master = local[4] the job executes correctly but if I set
> the master to spark://Maclaurin.local:7077 though they are the same machine
> I get the NoClassDefFoundError. The classpath seems correct on the CLI and
> the jars do indeed contain the offending class (see below). There must be
> some difference in how classes are loaded between local[4] and
> spark://Maclaurin.local:7077?
>
> Any ideas?
>
> ===============
>
> The driver is in mahout-spark_2.10-1.0-SNAPSHOT-job.jar so it’s execution
> means it must be in the classpath. When I look at what’s in the jar I see
> RandomGenerator.
>
> Maclaurin:target pat$ jar tf mahout-spark_2.10-1.0-SNAPSHOT-job.jar | grep
> RandomGenerator
> cern/jet/random/engine/RandomGenerator.class
> org/apache/commons/math3/random/GaussianRandomGenerator.class
> org/apache/commons/math3/random/JDKRandomGenerator.class
> org/apache/commons/math3/random/UniformRandomGenerator.class
> org/apache/commons/math3/random/RandomGenerator.class  <==========!
> org/apache/commons/math3/random/NormalizedRandomGenerator.class
> org/apache/commons/math3/random/AbstractRandomGenerator.class
> org/apache/commons/math3/random/StableRandomGenerator.class
>
> But get the following error executing the job:
>
> 14/10/19 15:39:00 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 6.9 (TID 84, 192.168.0.2): java.lang.NoClassDefFoundError:
> org/apache/commons/math3/random/RandomGenerator
>         org.apache.mahout.common.RandomUtils.getRandom(RandomUtils.java:65)
>
> org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$5.apply(SimilarityAnalysis.scala:272)
>
> org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$5.apply(SimilarityAnalysis.scala:267)
>
> org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
>
> org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
>         scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
>
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
>         org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>         org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>         org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>         org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>         org.apache.spark.scheduler.Task.run(Task.scala:54)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>         java.lang.Thread.run(Thread.java:695)
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>

Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Trying to upgrade from Spark 1.0.1 to 1.1.0. Can’t imagine the upgrade is the problem but anyway...

I get a NoClassDefFoundError for RandomGenerator when running a driver from the CLI, but only when using a named master, even a standalone master. If I run with master = local[4] the job executes correctly, but if I set the master to spark://Maclaurin.local:7077, even though it is the same machine, I get the NoClassDefFoundError. The classpath seems correct on the CLI and the jars do indeed contain the offending class (see below). There must be some difference in how classes are loaded between local[4] and spark://Maclaurin.local:7077?

Any ideas?

===============

The driver is in mahout-spark_2.10-1.0-SNAPSHOT-job.jar, so the fact that it executes means the jar must be in the classpath. When I look at what's in the jar I see RandomGenerator.

Maclaurin:target pat$ jar tf mahout-spark_2.10-1.0-SNAPSHOT-job.jar | grep RandomGenerator
cern/jet/random/engine/RandomGenerator.class
org/apache/commons/math3/random/GaussianRandomGenerator.class
org/apache/commons/math3/random/JDKRandomGenerator.class
org/apache/commons/math3/random/UniformRandomGenerator.class
org/apache/commons/math3/random/RandomGenerator.class  <==========!
org/apache/commons/math3/random/NormalizedRandomGenerator.class
org/apache/commons/math3/random/AbstractRandomGenerator.class
org/apache/commons/math3/random/StableRandomGenerator.class

But I get the following error executing the job:

14/10/19 15:39:00 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 6.9 (TID 84, 192.168.0.2): java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator
        org.apache.mahout.common.RandomUtils.getRandom(RandomUtils.java:65)
        org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$5.apply(SimilarityAnalysis.scala:272)
        org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$5.apply(SimilarityAnalysis.scala:267)
        org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
        org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
        org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
        org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
        org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        java.lang.Thread.run(Thread.java:695)
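
One way to confirm this is an executor-side classpath problem rather than a driver-side one is a quick probe like the sketch below, assuming a live SparkContext sc (e.g. in the Mahout shell) and nothing beyond the stock RDD API; the class name is the one from the stack trace:

// If this throws ClassNotFoundException/NoClassDefFoundError only under
// spark://..., the class is visible to the driver but never shipped to the
// workers, which matches the local[4] vs. standalone-master behavior above.
val probe = sc.parallelize(1 to 4, 4).map { _ =>
  Class.forName("org.apache.commons.math3.random.RandomGenerator").getName
}.collect()
println(probe.mkString(", "))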





Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Tue, Oct 21, 2014 at 10:26 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Sorry to hear. I bet you’ll find a way.
>
> The Spark Jira trail leads to two suggestions:
> 1) use spark-submit to execute code with your own entry point (other than
> spark-shell) One theory points to not loading all needed Spark classes from
> calling code (Mahout in our case). I can hand check the jars for the anon
> function I am missing.\
>

Spark submit is for people who don't care about setting up their SparkConf. We do. In fact, we care very much: we set a whole bunch of things there.
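
For context, a minimal sketch of the kind of programmatic setup being referred to, assuming only the stock Spark 1.x SparkConf API; the specific property keys and values below are illustrative examples, not Mahout's actual settings:

import org.apache.spark.SparkConf

// Illustrative only: settings like these are easier to own in code that builds
// its own SparkConf than to pass through spark-submit command-line flags.
val conf = new SparkConf()
  .setAppName("mahout-driver")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.executor.memory", "2g")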

Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Off the list I've heard of problems using the maven artifacts for Spark even when you are not building Spark. There have been reports of problems with the serialization class UIDs generated when building Mahout. If you encounter those, try the build method in the PR and report them to the Spark folks.
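
If you want to check whether you are hitting that mismatch, a quick probe of the sort below may help; it is only a sketch, assuming the JDK's ObjectStreamClass and a Spark RDD class on the classpath, run once against each build of Spark you have:

import java.io.ObjectStreamClass
import org.apache.spark.rdd.RDD

// Prints the serialVersionUID of the RDD class actually on this classpath;
// if the number differs between the jars Mahout was built against and the
// jars your cluster runs, you get the InvalidClassException quoted below.
println(ObjectStreamClass.lookup(classOf[RDD[_]]).getSerialVersionUID)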

On Oct 21, 2014, at 3:48 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

Right.

Something else has come up, so I haven't tried the shell tutorial yet. If anyone else wants to try it you can build Mahout from this PR:
https://github.com/apache/mahout/pull/61

On Oct 21, 2014, at 3:28 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

hm no they don't push different binary releases to maven. I assume they
only push the default one.

On Tue, Oct 21, 2014 at 3:26 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> ps i remember discussion for packaging binary spark distributions. So
> there's in fact a number of different spark artifact releases. However, i
> am not sure if they are pushing them to mvn repositories. (if they did,
> they might use different maven classifiers for those). If that's the case,
> then one plausible strategy here is to recommend rebuilding mahout with
> dependency to a classifier corresponding to the actual spark binary release
> used.
> 
> On Tue, Oct 21, 2014 at 2:21 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> 
>> if you are using mahout shell or command line drivers (which i dont) it
>> would seem the correct thing to do is for mahout script simply to take
>> spark dependencies from installed $SPARK_HOME rather than from Mahout's
>> assembly. In fact that would be consistent with what other projects are
>> doing in similar situation. it should also probably make things compatible
>> between minor releases of spark.
>> 
>> But i think you are right in a sense that the problem is that spark jars
>> are not uniquely encompassed by maven artifact id and version, unlike with
>> most other products. (e.g. if we see mahout-math-0.9.jar we expect there to
>> be one and only one released artifact in existence -- but one's local build
>> may create incompatible variations).
>> 
>> On Tue, Oct 21, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>> 
>>> The problem is not in building Spark it is in building Mahout using the
>>> correct Spark jars. If you are using CDH and hadoop 2 the correct jars are
>>> in the repos.
>>> 
>>> For the rest of us, though the process below seems like an error prone
>>> hack to me it does work on Linux and BSD/mac. It should really be addressed
>>> by Spark imo.
>>> 
>>> BTW The cache is laid out differently on linux but I don’t think you
>>> need to delete is anyway.
>>> 
>>> On Oct 21, 2014, at 12:27 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>> 
>>> fwiw i never built spark using maven. Always use sbt assembly.
>>> 
>>> On Tue, Oct 21, 2014 at 11:55 AM, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>> 
>>>> Ok, the mystery is solved.
>>>> 
>>>> The safe sequence from my limited testing is:
>>>> 1) delete ~/.m2/repository/org/spark and mahout
>>>> 2) build Spark for your version of Hadoop *but do not use "mvn package
>>>> ...”* use “mvn install …” This will put a copy of the exact bits you
>>> need
>>>> into the maven cache for building mahout against. In my case using
>>> hadoop
>>>> 1.2.1 it was "mvn -Dhadoop.version=1.2.1 -DskipTests clean install” If
>>> you
>>>> run tests on Spark some failures can safely be ignored according to the
>>>> Spark guys so check before giving up.
>>>> 3) build mahout with “mvn clean install"
>>>> 
>>>> This will create mahout from exactly the same bits you will run on your
>>>> cluster. It got rid of a missing anon function for me. The problem
>>> occurs
>>>> when you use a different version of Spark on your cluster than you
>>> used to
>>>> build Mahout and this is rather hidden by Maven. Maven downloads from
>>> repos
>>>> any dependency that is not in the local .m2 cache and so you have to
>>> make
>>>> sure your version of Spark is there so Maven wont download one that is
>>>> incompatible. Unless you really know what you are doing I’d build both
>>>> Spark and Mahout for now
>>>> 
>>>> BTW I will check in the Spark 1.1.0 version of Mahout once I do some
>>> more
>>>> testing.
>>>> 
>>>> On Oct 21, 2014, at 10:26 AM, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>>> 
>>>> Sorry to hear. I bet you’ll find a way.
>>>> 
>>>> The Spark Jira trail leads to two suggestions:
>>>> 1) use spark-submit to execute code with your own entry point (other
>>> than
>>>> spark-shell) One theory points to not loading all needed Spark classes
>>> from
>>>> calling code (Mahout in our case). I can hand check the jars for the
>>> anon
>>>> function I am missing.
>>>> 2) there may be different class names in the running code (created by
>>>> building Spark locally) and the  version referenced in the Mahout POM.
>>> If
>>>> this turns out to be true it means we can’t rely on building Spark
>>> locally.
>>>> Is there a maven target that puts the artifacts of the Spark build in
>>> the
>>>> .m2/repository local cache? That would be an easy way to test this
>>> theory.
>>>> 
>>>> either of these could cause missing classes.
>>>> 
>>>> 
>>>> On Oct 21, 2014, at 9:52 AM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>>> 
>>>> no i havent used it with anything but 1.0.1 and 0.9.x .
>>>> 
>>>> on a side note, I just have changed my employer. It is one of these big
>>>> guys that make it very difficult to do any contributions. So I am not
>>> sure
>>>> how much of anything i will be able to share/contribute.
>>>> 
>>>> On Tue, Oct 21, 2014 at 9:43 AM, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>>> 
>>>>> But unless you have the time to devote to errors avoid it. I’ve built
>>>>> everything from scratch using 1.0.2 and 1.1.0 and am getting these and
>>>>> missing class errors. The 1.x branch seems to have some kind of
>>> peculiar
>>>>> build order dependencies. The errors sometimes don’t show up until
>>>> runtime,
>>>>> passing all build tests.
>>>>> 
>>>>> Dmitriy, have you successfully used any Spark version other than
>>> 1.0.1 on
>>>>> a cluster? If so do you recall the exact order and from what sources
>>> you
>>>>> built?
>>>>> 
>>>>> 
>>>>> On Oct 21, 2014, at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>>>> 
>>>>> You can't use spark client of one version and have the backend of
>>>> another.
>>>>> You can try to change spark dependency in mahout poms to match your
>>>> backend
>>>>> (or vice versa, you can change your backend to match what's on the
>>>> client).
>>>>> 
>>>>> On Tue, Oct 21, 2014 at 7:12 AM, Mahesh Balija <
>>>> balijamahesh.mca@gmail.com
>>>>>> 
>>>>> wrote:
>>>>> 
>>>>>> Hi All,
>>>>>> 
>>>>>> Here are the errors I get which I run in a pseudo distributed mode,
>>>>>> 
>>>>>> Spark 1.0.2 and Mahout latest code (Clone)
>>>>>> 
>>>>>> When I run the command in page,
>>>>>> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
>>>>>> 
>>>>>> val drmX = drmData(::, 0 until 4)
>>>>>> 
>>>>>> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class
>>>>>> incompatible: stream classdesc serialVersionUID = 385418487991259089,
>>>>>> local class serialVersionUID = -6766554341038829528
>>>>>>   at
>>>>>> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>>>>>>   at
>>>>>> 
>>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>>>>>   at
>>>>>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>>>>>   at
>>>>>> 
>>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>>>>>   at
>>>>>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>>>>>   at
>>>>>> 
>>>> 
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>>>>>>   at
>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>>>>>   at
>>>>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>>>>>>   at
>>>>>> 
>>> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>>>>>>   at
>>>>>> 
>>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>>>>>>   at
>>>>>> 
>>>> 
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>>>>>>   at
>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>>>>>   at
>>>>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>>>>>>   at
>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>   at java.lang.Thread.run(Thread.java:701)
>>>>>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
>>>>>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
>>>>>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
>>>>>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
>>>>>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
>>>>>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
>>>>>> org.apache.spark.SparkException: Job aborted due to stage failure:
>>>>>> Task 0.0:0 failed 4 times, most recent failure: Exception failure in
>>>>>> TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException:
>>>>>> org.apache.spark.rdd.RDD; local class incompatible: stream classdesc
>>>>>> serialVersionUID = 385418487991259089, local class serialVersionUID =
>>>>>> -6766554341038829528
>>>>>> 
>>> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>>>>>> 
>>>>>> 
>>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>>>>> 
>>>>>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>>>>> 
>>>>>> 
>>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>>>>> 
>>>>>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>>>>> 
>>>>>> 
>>>> 
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>>>>>> 
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>>>>>   java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>>>>>> 
>>>>>> 
>>> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>>>>>> 
>>>>>> 
>>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>>>>>> 
>>>>>> 
>>>> 
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>>>>>> 
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>>>>>   java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>>>>>> 
>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>   java.lang.Thread.run(Thread.java:701)
>>>>>> Driver stacktrace:
>>>>>>   at org.apache.spark.scheduler.DAGScheduler.org
>>>>>> 
>>>>> 
>>>> 
>>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>>   at
>>>>>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>>>>   at scala.Option.foreach(Option.scala:236)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>>>>>>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>>>>>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>>>>>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>>>>>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>>>>>   at
>>>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>>>   at
>>>>>> 
>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>>   at
>>>>>> 
>>>>> 
>>>> 
>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>> 
>>>>>> Best,
>>>>>> Mahesh Balija.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Is anyone else nervous about ignoring this issue or relying on
>>>>>> non-build
>>>>>>>> (hand run) test driven transitive dependency checking. I hope
>>> someone
>>>>>>> else
>>>>>>>> will chime in.
>>>>>>>> 
>>>>>>>> As to running unit tests on a TEST_MASTER I’ll look into it. Can we
>>>> set
>>>>>>> up
>>>>>>>> the build machine to do this? I’d feel better about eyeballing
>>> deps if
>>>>>> we
>>>>>>>> could have a TEST_MASTER automatically run during builds at Apache.
>>>>>> Maybe
>>>>>>>> the regular unit tests are OK for building locally ourselves.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>> 
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <
>>> pat@occamsmachete.com>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Maybe a more fundamental issue is that we don’t know for sure
>>>>>> whether
>>>>>>> we
>>>>>>>>>> have missing classes or not. The job.jar at least used the pom
>>>>>>>> dependencies
>>>>>>>>>> to guarantee every needed class was present. So the job.jar
>>> seems to
>>>>>>>> solve
>>>>>>>>>> the problem but may ship some unnecessary duplicate code, right?
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> No, as i wrote spark doesn't  work with job jar format. Neither
>>> as it
>>>>>>>> turns
>>>>>>>>> out more recent hadoop MR btw.
>>>>>>>> 
>>>>>>>> Not speaking literally of the format. Spark understands jars and
>>> maven
>>>>>>> can
>>>>>>>> build one from transitive dependencies.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Yes, this is A LOT of duplicate code (will take normally MINUTES
>>> to
>>>>>>>> startup
>>>>>>>>> tasks with all of it just on copy time). This is absolutely not
>>> the
>>>>>> way
>>>>>>>> to
>>>>>>>>> go with this.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Lack of guarantee to load seems like a bigger problem than startup
>>>>>> time.
>>>>>>>> Clearly we can’t just ignore this.
>>>>>>>> 
>>>>>>> 
>>>>>>> Nope. given highly iterative nature and dynamic task allocation in
>>> this
>>>>>>> environment, one is looking to effects similar to Map Reduce. This
>>> is
>>>>> not
>>>>>>> the only reason why I never go to MR anymore, but that's one of main
>>>>>> ones.
>>>>>>> 
>>>>>>> How about experiment: why don't you create assembly that copies ALL
>>>>>>> transitive dependencies in one folder, and then try to broadcast it
>>>> from
>>>>>>> single point (front end) to well... let's start with 20 machines.
>>> (of
>>>>>>> course we ideally want to into 10^3 ..10^4 range -- but why bother
>>> if
>>>> we
>>>>>>> can't do it for 20).
>>>>>>> 
>>>>>>> Or, heck, let's try to simply parallel-copy it between too machines
>>> 20
>>>>>>> times that are not collocated on the same subnet.
>>>>>>> 
>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> There may be any number of bugs waiting for the time we try
>>> running
>>>>>>> on a
>>>>>>>>>> node machine that doesn’t have some class in it’s classpath.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> No. Assuming any given method is tested on all its execution
>>> paths,
>>>>>>> there
>>>>>>>>> will be no bugs. The bugs of that sort will only appear if the
>>> user
>>>>>> is
>>>>>>>>> using algebra directly and calls something that is not on the
>>> path,
>>>>>>> from
>>>>>>>>> the closure. In which case our answer to this is the same as for
>>> the
>>>>>>>> solver
>>>>>>>>> methodology developers -- use customized SparkConf while creating
>>>>>>> context
>>>>>>>>> to include stuff you really want.
>>>>>>>>> 
>>>>>>>>> Also another right answer to this is that we probably should
>>>>>> reasonably
>>>>>>>>> provide the toolset here. For example, all the stats stuff found
>>> in R
>>>>>>>> base
>>>>>>>>> and R stat packages so the user is not compelled to go non-native.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Huh? this is not true. The one I ran into was found by calling
>>>>>> something
>>>>>>>> in math from something in math-scala. It led outside and you can
>>>>>>> encounter
>>>>>>>> such things even in algebra.  In fact you have no idea if these
>>>>>> problems
>>>>>>>> exists except for the fact you have used it a lot personally.
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> You ran it with your own code that never existed before.
>>>>>>> 
>>>>>>> But there's difference between released Mahout code (which is what
>>> you
>>>>>> are
>>>>>>> working on) and the user code. Released code must run thru remote
>>> tests
>>>>>> as
>>>>>>> you suggested and thus guarantee there are no such problems with
>>> post
>>>>>>> release code.
>>>>>>> 
>>>>>>> For users, we only can provide a way for them to load stuff that
>>> they
>>>>>>> decide to use. We don't have apriori knowledge what they will use.
>>> It
>>>> is
>>>>>>> the same thing that spark does, and the same thing that MR does,
>>>> doesn't
>>>>>>> it?
>>>>>>> 
>>>>>>> Of course mahout should drop rigorously the stuff it doesn't load,
>>> from
>>>>>> the
>>>>>>> scala scope. No argue about that. In fact that's what i suggested
>>> as #1
>>>>>>> solution. But there's nothing much to do here but to go dependency
>>>>>>> cleansing for math and spark code. Part of the reason there's so
>>> much
>>>> is
>>>>>>> because newer modules still bring in everything from mrLegacy.
>>>>>>> 
>>>>>>> You are right in saying it is hard to guess what else dependencies
>>> are
>>>>> in
>>>>>>> the util/legacy code that are actually used. but that's not a
>>>>>> justification
>>>>>>> for brute force "copy them all" approach that virtually guarantees
>>>>>> ruining
>>>>>>> one of the foremost legacy issues this work intended to address.
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
> 



Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Right.

Something else has come up, so I haven't tried the shell tutorial yet. If anyone else wants to try it you can build Mahout from this PR:
https://github.com/apache/mahout/pull/61

On Oct 21, 2014, at 3:28 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

hm no they don't push different binary releases to maven. I assume they
only push the default one.

On Tue, Oct 21, 2014 at 3:26 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> ps i remember discussion for packaging binary spark distributions. So
> there's in fact a number of different spark artifact releases. However, i
> am not sure if they are pushing them to mvn repositories. (if they did,
> they might use different maven classifiers for those). If that's the case,
> then one plausible strategy here is to recommend rebuilding mahout with
> dependency to a classifier corresponding to the actual spark binary release
> used.
> 
> On Tue, Oct 21, 2014 at 2:21 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> 
>> if you are using mahout shell or command line drivers (which i dont) it
>> would seem the correct thing to do is for mahout script simply to take
>> spark dependencies from installed $SPARK_HOME rather than from Mahout's
>> assembly. In fact that would be consistent with what other projects are
>> doing in similar situation. it should also probably make things compatible
>> between minor releases of spark.
>> 
>> But i think you are right in a sense that the problem is that spark jars
>> are not uniquely encompassed by maven artifact id and version, unlike with
>> most other products. (e.g. if we see mahout-math-0.9.jar we expect there to
>> be one and only one released artifact in existence -- but one's local build
>> may create incompatible variations).


Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Either way, for now, compiling Spark (with a push to the local Maven repo) and
then Mahout (which would use the local Maven artifacts) on the same machine,
and then re-distributing the artifacts to the worker nodes, should work
regardless of the compilation parameters.
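
Roughly, that same-machine workflow would look something like the following
(the Hadoop version, worker host names, and install paths here are only
placeholders, not a tested recipe):

  # 1) build Spark for the cluster's Hadoop version and install it into the
  #    local Maven repository (~/.m2), so Mahout resolves exactly these bits
  cd spark-1.1.0
  mvn -Dhadoop.version=1.2.1 -DskipTests clean install

  # 2) build Mahout on the same machine against the locally installed Spark
  cd ../mahout
  mvn clean install

  # 3) push the identical bits out to the workers (hosts and paths made up)
  cd ..
  for host in worker1 worker2 worker3; do
    rsync -a spark-1.1.0 mahout $host:/opt/
  done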

On Tue, Oct 21, 2014 at 3:28 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Hm, no, they don't push different binary releases to Maven. I assume they
> only push the default one.

Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Hm, no, they don't push different binary releases to Maven. I assume they
only push the default one.

On Tue, Oct 21, 2014 at 3:26 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> PS I remember a discussion about packaging binary Spark distributions, so
> there are in fact a number of different Spark artifact releases. However, I
> am not sure whether they push them to Maven repositories (if they did, they
> might use different Maven classifiers for those). If that's the case, then
> one plausible strategy here is to recommend rebuilding Mahout with a
> dependency on the classifier corresponding to the actual Spark binary
> release used.

Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS I remember a discussion about packaging binary Spark distributions, so
there are in fact a number of different Spark artifact releases. However, I
am not sure whether they push them to Maven repositories (if they did, they
might use different Maven classifiers for those). If that's the case, then
one plausible strategy here is to recommend rebuilding Mahout with a
dependency on the classifier corresponding to the actual Spark binary
release used.
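
If someone wants to check whether such classified Spark artifacts are actually
published, a quick probe along these lines should answer it (the "hadoop2.4"
classifier below is purely hypothetical -- just an example of what such a
classifier might be called):

  # try to fetch spark-core 1.1.0 with a hypothetical binary classifier
  mvn dependency:get -Dartifact=org.apache.spark:spark-core_2.10:1.1.0:jar:hadoop2.4

  # for comparison, the default (unclassified) artifact
  mvn dependency:get -Dartifact=org.apache.spark:spark-core_2.10:1.1.0

If the first command fails and the second succeeds, only the default artifact
is published, and the classifier strategy would require either Spark to start
publishing classified jars or users to install them into their local repos.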

On Tue, Oct 21, 2014 at 2:21 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> If you are using the Mahout shell or command-line drivers (which I don't), it
> would seem the correct thing to do is for the mahout script simply to take
> the Spark dependencies from the installed $SPARK_HOME rather than from
> Mahout's assembly. In fact that would be consistent with what other projects
> are doing in similar situations. It should also probably make things
> compatible between minor releases of Spark.
>
> But I think you are right in the sense that the problem is that Spark jars
> are not uniquely identified by Maven artifact id and version, unlike with
> most other products (e.g. if we see mahout-math-0.9.jar we expect there to
> be one and only one released artifact in existence -- but one's local build
> may create incompatible variations).
>
> On Tue, Oct 21, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
>> The problem is not in building Spark it is in building Mahout using the
>> correct Spark jars. If you are using CDH and hadoop 2 the correct jars are
>> in the repos.
>>
>> For the rest of us, though the process below seems like an error prone
>> hack to me it does work on Linux and BSD/mac. It should really be addressed
>> by Spark imo.
>>
>> BTW The cache is laid out differently on linux but I don’t think you need
>> to delete it anyway.
>>
>> On Oct 21, 2014, at 12:27 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> fwiw i never built spark using maven. Always use sbt assembly.
>>
>> On Tue, Oct 21, 2014 at 11:55 AM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>
>> > Ok, the mystery is solved.
>> >
>> > The safe sequence from my limited testing is:
>> > 1) delete ~/.m2/repository/org/spark and mahout
>> > 2) build Spark for your version of Hadoop *but do not use "mvn package
>> > ...”* use “mvn install …” This will put a copy of the exact bits you
>> need
>> > into the maven cache for building mahout against. In my case using
>> hadoop
>> > 1.2.1 it was "mvn -Dhadoop.version=1.2.1 -DskipTests clean install” If
>> you
>> > run tests on Spark some failures can safely be ignored according to the
>> > Spark guys so check before giving up.
>> > 3) build mahout with “mvn clean install"
>> >
>> > This will create mahout from exactly the same bits you will run on your
>> > cluster. It got rid of a missing anon function for me. The problem
>> occurs
>> > when you use a different version of Spark on your cluster than you used
>> to
>> > build Mahout and this is rather hidden by Maven. Maven downloads from
>> repos
>> > any dependency that is not in the local .m2 cache and so you have to
>> make
>> > sure your version of Spark is there so Maven won't download one that is
>> > incompatible. Unless you really know what you are doing I’d build both
>> > Spark and Mahout for now
>> >
>> > BTW I will check in the Spark 1.1.0 version of Mahout once I do some
>> more
>> > testing.
>> >
>> > On Oct 21, 2014, at 10:26 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> >
>> > Sorry to hear. I bet you’ll find a way.
>> >
>> > The Spark Jira trail leads to two suggestions:
>> > 1) use spark-submit to execute code with your own entry point (other
>> than
>> > spark-shell) One theory points to not loading all needed Spark classes
>> from
>> > calling code (Mahout in our case). I can hand check the jars for the
>> anon
>> > function I am missing.
>> > 2) there may be different class names in the running code (created by
>> > building Spark locally) and the  version referenced in the Mahout POM.
>> If
>> > this turns out to be true it means we can’t rely on building Spark
>> locally.
>> > Is there a maven target that puts the artifacts of the Spark build in
>> the
>> > .m2/repository local cache? That would be an easy way to test this
>> theory.
>> >
>> > either of these could cause missing classes.
>> >
>> >
>> > On Oct 21, 2014, at 9:52 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> >
>> > no i havent used it with anything but 1.0.1 and 0.9.x .
>> >
>> > on a side note, I just have changed my employer. It is one of these big
>> > guys that make it very difficult to do any contributions. So I am not
>> sure
>> > how much of anything i will be able to share/contribute.
>> >
>> > On Tue, Oct 21, 2014 at 9:43 AM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>> >
>> >> But unless you have the time to devote to errors avoid it. I’ve built
>> >> everything from scratch using 1.0.2 and 1.1.0 and am getting these and
>> >> missing class errors. The 1.x branch seems to have some kind of
>> peculiar
>> >> build order dependencies. The errors sometimes don’t show up until
>> > runtime,
>> >> passing all build tests.
>> >>
>> >> Dmitriy, have you successfully used any Spark version other than 1.0.1
>> on
>> >> a cluster? If so do you recall the exact order and from what sources
>> you
>> >> built?
>> >>
>> >>
>> >> On Oct 21, 2014, at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> >>
>> >> You can't use spark client of one version and have the backend of
>> > another.
>> >> You can try to change spark dependency in mahout poms to match your
>> > backend
>> >> (or vice versa, you can change your backend to match what's on the
>> > client).
>> >>
>> >> On Tue, Oct 21, 2014 at 7:12 AM, Mahesh Balija <
>> > balijamahesh.mca@gmail.com
>> >>>
>> >> wrote:
>> >>
>> >>> Hi All,
>> >>>
>> >>> Here are the errors I get which I run in a pseudo distributed mode,
>> >>>
>> >>> Spark 1.0.2 and Mahout latest code (Clone)
>> >>>
>> >>> When I run the command in page,
>> >>> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
>> >>>
>> >>> val drmX = drmData(::, 0 until 4)
>> >>>
>> >>> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class
>> >>> incompatible: stream classdesc serialVersionUID = 385418487991259089,
>> >>> local class serialVersionUID = -6766554341038829528
>> >>>     at
>> >>> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>> >>>     at
>> >>>
>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>> >>>     at
>> >>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>> >>>     at
>> >>>
>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>> >>>     at
>> >>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>> >>>     at
>> >>>
>> >
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>> >>>     at
>> >>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>> >>>     at
>> >> java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>> >>>     at
>> >>>
>> >>
>> >
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>> >>>     at
>> >>>
>> >>
>> >
>> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>> >>>     at
>> >>>
>> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>> >>>     at
>> >>>
>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>> >>>     at
>> >>>
>> >
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>> >>>     at
>> >>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>> >>>     at
>> >> java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>> >>>     at
>> >>>
>> >>
>> >
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>> >>>     at
>> >>>
>> >>
>> >
>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>> >>>     at
>> >>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>> >>>     at
>> >>>
>> >>
>> >
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>> >>>     at
>> >>>
>> >>
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>>     at java.lang.Thread.run(Thread.java:701)
>> >>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
>> >>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
>> >>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
>> >>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
>> >>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
>> >>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
>> >>> org.apache.spark.SparkException: Job aborted due to stage failure:
>> >>> Task 0.0:0 failed 4 times, most recent failure: Exception failure in
>> >>> TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException:
>> >>> org.apache.spark.rdd.RDD; local class incompatible: stream classdesc
>> >>> serialVersionUID = 385418487991259089, local class serialVersionUID =
>> >>> -6766554341038829528
>> >>>     java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>> >>>
>> >>>
>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>> >>>
>> >>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>> >>>
>> >>>
>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>> >>>
>> >>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>> >>>
>> >>>
>> >
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>> >>>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>> >>>     java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>> >>>
>> >>>
>> >>
>> >
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>> >>>
>> >>>
>> >>
>> >
>> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>> >>>
>> >>>
>> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>> >>>
>> >>>
>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>> >>>
>> >>>
>> >
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>> >>>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>> >>>     java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>> >>>
>> >>>
>> >>
>> >
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>> >>>
>> >>>
>> >>
>> >
>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>> >>>
>> >>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>> >>>
>> >>>
>> >>
>> >
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>> >>>
>> >>>
>> >>
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>>     java.lang.Thread.run(Thread.java:701)
>> >>> Driver stacktrace:
>> >>>     at org.apache.spark.scheduler.DAGScheduler.org
>> >>>
>> >>
>> >
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>> >>>     at
>> >>>
>> >>
>> >
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>> >>>     at
>> >>>
>> >>
>> >
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>> >>>     at
>> >>>
>> >>
>> >
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> >>>     at
>> >>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> >>>     at
>> >>>
>> >>
>> >
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>> >>>     at
>> >>>
>> >>
>> >
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>> >>>     at
>> >>>
>> >>
>> >
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>> >>>     at scala.Option.foreach(Option.scala:236)
>> >>>     at
>> >>>
>> >>
>> >
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>> >>>     at
>> >>>
>> >>
>> >
>> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>> >>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>> >>>     at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>> >>>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>> >>>     at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>> >>>     at
>> >>>
>> >>
>> >
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>> >>>     at
>> >>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> >>>     at
>> >>>
>> >>
>> >
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> >>>     at
>> >>>
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> >>>     at
>> >>>
>> >>
>> >
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> >>>
>> >>> Best,
>> >>> Mahesh Balija.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com>
>> >>> wrote:
>> >>>>
>> >>>>> Is anyone else nervous about ignoring this issue or relying on
>> >>> non-build
>> >>>>> (hand run) test driven transitive dependency checking. I hope
>> someone
>> >>>> else
>> >>>>> will chime in.
>> >>>>>
>> >>>>> As to running unit tests on a TEST_MASTER I’ll look into it. Can we
>> > set
>> >>>> up
>> >>>>> the build machine to do this? I’d feel better about eyeballing deps
>> if
>> >>> we
>> >>>>> could have a TEST_MASTER automatically run during builds at Apache.
>> >>> Maybe
>> >>>>> the regular unit tests are OK for building locally ourselves.
>> >>>>>
>> >>>>>>
>> >>>>>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <
>> pat@occamsmachete.com>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>>> Maybe a more fundamental issue is that we don’t know for sure
>> >>> whether
>> >>>> we
>> >>>>>>> have missing classes or not. The job.jar at least used the pom
>> >>>>> dependencies
>> >>>>>>> to guarantee every needed class was present. So the job.jar seems
>> to
>> >>>>> solve
>> >>>>>>> the problem but may ship some unnecessary duplicate code, right?
>> >>>>>>>
>> >>>>>>
>> >>>>>> No, as i wrote spark doesn't  work with job jar format. Neither as
>> it
>> >>>>> turns
>> >>>>>> out more recent hadoop MR btw.
>> >>>>>
>> >>>>> Not speaking literally of the format. Spark understands jars and
>> maven
>> >>>> can
>> >>>>> build one from transitive dependencies.
>> >>>>>
>> >>>>>>
>> >>>>>> Yes, this is A LOT of duplicate code (will take normally MINUTES to
>> >>>>> startup
>> >>>>>> tasks with all of it just on copy time). This is absolutely not the
>> >>> way
>> >>>>> to
>> >>>>>> go with this.
>> >>>>>>
>> >>>>>
>> >>>>> Lack of guarantee to load seems like a bigger problem than startup
>> >>> time.
>> >>>>> Clearly we can’t just ignore this.
>> >>>>>
>> >>>>
>> >>>> Nope. given highly iterative nature and dynamic task allocation in
>> this
>> >>>> environment, one is looking at effects similar to MapReduce. This is
>> >> not
>> >>>> the only reason why I never go to MR anymore, but that's one of main
>> >>> ones.
>> >>>>
>> >>>> How about experiment: why don't you create assembly that copies ALL
>> >>>> transitive dependencies in one folder, and then try to broadcast it
>> > from
>> >>>> single point (front end) to well... let's start with 20 machines. (of
>> >>>> course we ideally want to get into the 10^3..10^4 range -- but why bother if
>> > we
>> >>>> can't do it for 20).
>> >>>>
>> >>>> Or, heck, let's try to simply parallel-copy it between two machines
>> 20
>> >>>> times that are not collocated on the same subnet.
>> >>>>
>> >>>>
>> >>>>>>
>> >>>>>>> There may be any number of bugs waiting for the time we try
>> running
>> >>>> on a
>> >>>>>>> node machine that doesn’t have some class in its classpath.
>> >>>>>>
>> >>>>>>
>> >>>>>> No. Assuming any given method is tested on all its execution paths,
>> >>>> there
>> >>>>>> will be no bugs. The bugs of that sort will only appear if the user
>> >>> is
>> >>>>>> using algebra directly and calls something that is not on the path,
>> >>>> from
>> >>>>>> the closure. In which case our answer to this is the same as for
>> the
>> >>>>> solver
>> >>>>>> methodology developers -- use customized SparkConf while creating
>> >>>> context
>> >>>>>> to include stuff you really want.
>> >>>>>>
>> >>>>>> Also another right answer to this is that we probably should
>> >>> reasonably
>> >>>>>> provide the toolset here. For example, all the stats stuff found
>> in R
>> >>>>> base
>> >>>>>> and R stat packages so the user is not compelled to go non-native.
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>> Huh? this is not true. The one I ran into was found by calling
>> >>> something
>> >>>>> in math from something in math-scala. It led outside and you can
>> >>>> encounter
>> >>>>> such things even in algebra.  In fact you have no idea if these
>> >>> problems
>> >>>>> exists except for the fact you have used it a lot personally.
>> >>>>>
>> >>>>
>> >>>>
>> >>>> You ran it with your own code that never existed before.
>> >>>>
>> >>>> But there's difference between released Mahout code (which is what
>> you
>> >>> are
>> >>>> working on) and the user code. Released code must run thru remote
>> tests
>> >>> as
>> >>>> you suggested and thus guarantee there are no such problems with post
>> >>>> release code.
>> >>>>
>> >>>> For users, we only can provide a way for them to load stuff that they
>> >>>> decide to use. We don't have apriori knowledge what they will use. It
>> > is
>> >>>> the same thing that spark does, and the same thing that MR does,
>> > doesn't
>> >>>> it?
>> >>>>
>> >>>> Of course mahout should drop rigorously the stuff it doesn't load,
>> from
>> >>> the
>> >>>> scala scope. No argument about that. In fact that's what i suggested as
>> #1
>> >>>> solution. But there's nothing much to do here but to go dependency
>> >>>> cleansing for math and spark code. Part of the reason there's so much
>> > is
>> >>>> because newer modules still bring in everything from mrLegacy.
>> >>>>
>> >>>> You are right in saying it is hard to guess what else dependencies
>> are
>> >> in
>> >>>> the util/legacy code that are actually used. but that's not a
>> >>> justification
>> >>>> for brute force "copy them all" approach that virtually guarantees
>> >>> ruining
>> >>>> one of the foremost legacy issues this work intended to address.
>> >>>>
>> >>>
>> >>
>> >>
>> >
>> >
>> >
>>
>>
>
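
On the spark-submit suggestion quoted above: a bare-bones sketch of what such
an entry point might look like follows. The class name, the toy data and the
exact mahoutSparkContext signature are placeholders and assumptions; the point
is only that the driver is a plain main() launched via spark-submit instead of
the shell:

  import org.apache.mahout.math.scalabindings._
  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._
  import org.apache.mahout.sparkbindings._

  // Launched with something along the lines of:
  //   spark-submit --class MyMahoutJob my-mahout-job.jar
  object MyMahoutJob {
    def main(args: Array[String]): Unit = {
      implicit val mc = mahoutSparkContext(
        masterUrl = "spark://master:7077",
        appName   = "my-mahout-job")

      // Same kind of column slice as in the shell walkthrough referenced in
      // this thread, just driven from a standalone entry point.
      val drmData = drmParallelize(dense(
        (2.0, 2.0, 10.5, 10.0, 29.5),
        (1.0, 2.0, 12.3, 12.0, 18.0)), 2)
      val drmX = drmData(::, 0 until 4)

      println(drmX.collect)
    }
  }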

Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
if you are using the mahout shell or command line drivers (which i don't), it
would seem the correct thing to do is for the mahout script to simply take
its spark dependencies from the installed $SPARK_HOME rather than from
Mahout's assembly. In fact that would be consistent with what other projects
are doing in a similar situation. it should also probably make things
compatible between minor releases of spark.

But i think you are right in the sense that the problem is that spark jars
are not uniquely identified by maven artifact id and version, unlike with
most other products (e.g. if we see mahout-math-0.9.jar we expect there to
be one and only one released artifact in existence -- but one's local build
may create incompatible variations).
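
One cheap guard against exactly that kind of client/backend mismatch (it shows
up later in this thread as serialVersionUID errors) is to have the driver log
the Spark version it was linked against and compare it with what the cluster
is known to run. A rough sketch; the expected version string is hard-coded
purely for illustration and sc.version is assumed to be available in the Spark
release being linked:

  import org.apache.spark.{SparkConf, SparkContext}

  object SparkVersionCheck {
    def main(args: Array[String]): Unit = {
      val expected = "1.1.0"  // whatever the cluster actually runs
      val sc = new SparkContext(new SparkConf()
        .setMaster("spark://master:7077")
        .setAppName("spark-version-check"))

      // sc.version reports the Spark the client classpath was built against;
      // a mismatch is an early warning before InvalidClassException shows up
      // on the executors.
      if (sc.version != expected)
        println(s"WARNING: client Spark ${sc.version}, cluster expects $expected")

      sc.stop()
    }
  }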

On Tue, Oct 21, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> The problem is not in building Spark it is in building Mahout using the
> correct Spark jars. If you are using CDH and hadoop 2 the correct jars are
> in the repos.
>
> For the rest of us, though the process below seems like an error prone
> hack to me it does work on Linux and BSD/mac. It should really be addressed
> by Spark imo.
>
> BTW The cache is laid out differently on linux but I don’t think you need
> to delete it anyway.
>
> On Oct 21, 2014, at 12:27 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> fwiw i never built spark using maven. Always use sbt assembly.
>
> On Tue, Oct 21, 2014 at 11:55 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>
> > Ok, the mystery is solved.
> >
> > The safe sequence from my limited testing is:
> > 1) delete ~/.m2/repository/org/spark and mahout
> > 2) build Spark for your version of Hadoop *but do not use "mvn package
> > ...”* use “mvn install …” This will put a copy of the exact bits you need
> > into the maven cache for building mahout against. In my case using hadoop
> > 1.2.1 it was "mvn -Dhadoop.version=1.2.1 -DskipTests clean install” If
> you
> > run tests on Spark some failures can safely be ignored according to the
> > Spark guys so check before giving up.
> > 3) build mahout with “mvn clean install"
> >
> > This will create mahout from exactly the same bits you will run on your
> > cluster. It got rid of a missing anon function for me. The problem occurs
> > when you use a different version of Spark on your cluster than you used
> to
> > build Mahout and this is rather hidden by Maven. Maven downloads from
> repos
> > any dependency that is not in the local .m2 cache and so you have to make
> > sure your version of Spark is there so Maven won't download one that is
> > incompatible. Unless you really know what you are doing I’d build both
> > Spark and Mahout for now
> >
> > BTW I will check in the Spark 1.1.0 version of Mahout once I do some more
> > testing.
> >
> > On Oct 21, 2014, at 10:26 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> >
> > Sorry to hear. I bet you’ll find a way.
> >
> > The Spark Jira trail leads to two suggestions:
> > 1) use spark-submit to execute code with your own entry point (other than
> > spark-shell) One theory points to not loading all needed Spark classes
> from
> > calling code (Mahout in our case). I can hand check the jars for the anon
> > function I am missing.
> > 2) there may be different class names in the running code (created by
> > building Spark locally) and the  version referenced in the Mahout POM. If
> > this turns out to be true it means we can’t rely on building Spark
> locally.
> > Is there a maven target that puts the artifacts of the Spark build in the
> > .m2/repository local cache? That would be an easy way to test this
> theory.
> >
> > either of these could cause missing classes.
> >
> >
> > On Oct 21, 2014, at 9:52 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >
> > no i havent used it with anything but 1.0.1 and 0.9.x .
> >
> > on a side note, I just have changed my employer. It is one of these big
> > guys that make it very difficult to do any contributions. So I am not
> sure
> > how much of anything i will be able to share/contribute.
> >
> > On Tue, Oct 21, 2014 at 9:43 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> >
> >> But unless you have the time to devote to errors avoid it. I’ve built
> >> everything from scratch using 1.0.2 and 1.1.0 and am getting these and
> >> missing class errors. The 1.x branch seems to have some kind of peculiar
> >> build order dependencies. The errors sometimes don’t show up until
> > runtime,
> >> passing all build tests.
> >>
> >> Dmitriy, have you successfully used any Spark version other than 1.0.1
> on
> >> a cluster? If so do you recall the exact order and from what sources you
> >> built?
> >>
> >>
> >> On Oct 21, 2014, at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >>
> >> You can't use spark client of one version and have the backend of
> > another.
> >> You can try to change spark dependency in mahout poms to match your
> > backend
> >> (or vice versa, you can change your backend to match what's on the
> > client).
> >>
> >> On Tue, Oct 21, 2014 at 7:12 AM, Mahesh Balija <
> > balijamahesh.mca@gmail.com
> >>>
> >> wrote:
> >>
> >>> Hi All,
> >>>
> >>> Here are the errors I get which I run in a pseudo distributed mode,
> >>>
> >>> Spark 1.0.2 and Mahout latest code (Clone)
> >>>
> >>> When I run the command in page,
> >>> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
> >>>
> >>> val drmX = drmData(::, 0 until 4)
> >>>
> >>> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class
> >>> incompatible: stream classdesc serialVersionUID = 385418487991259089,
> >>> local class serialVersionUID = -6766554341038829528
> >>>     at
> >>> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
> >>>     at
> >>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> >>>     at
> >>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> >>>     at
> >>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> >>>     at
> >>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> >>>     at
> >>>
> > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
> >>>     at
> >>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> >>>     at
> >> java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> >>>     at
> >>>
> >>
> >
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >>>     at
> >>>
> >>
> >
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
> >>>     at
> >>>
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
> >>>     at
> >>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
> >>>     at
> >>>
> > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
> >>>     at
> >>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> >>>     at
> >> java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> >>>     at
> >>>
> >>
> >
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >>>     at
> >>>
> >>
> >
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
> >>>     at
> >>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
> >>>     at
> >>>
> >>
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> >>>     at
> >>>
> >>
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>>     at java.lang.Thread.run(Thread.java:701)
> >>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
> >>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
> >>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
> >>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
> >>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
> >>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
> >>> org.apache.spark.SparkException: Job aborted due to stage failure:
> >>> Task 0.0:0 failed 4 times, most recent failure: Exception failure in
> >>> TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException:
> >>> org.apache.spark.rdd.RDD; local class incompatible: stream classdesc
> >>> serialVersionUID = 385418487991259089, local class serialVersionUID =
> >>> -6766554341038829528
> >>>     java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
> >>>
> >>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> >>>
> >>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> >>>
> >>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> >>>
> >>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> >>>
> >>>
> > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
> >>>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> >>>     java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> >>>
> >>>
> >>
> >
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >>>
> >>>
> >>
> >
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
> >>>
> >>>
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
> >>>
> >>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
> >>>
> >>>
> > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
> >>>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> >>>     java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> >>>
> >>>
> >>
> >
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >>>
> >>>
> >>
> >
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
> >>>
> >>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
> >>>
> >>>
> >>
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> >>>
> >>>
> >>
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>>     java.lang.Thread.run(Thread.java:701)
> >>> Driver stacktrace:
> >>>     at org.apache.spark.scheduler.DAGScheduler.org
> >>>
> >>
> >
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
> >>>     at
> >>>
> >>
> >
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
> >>>     at
> >>>
> >>
> >
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
> >>>     at
> >>>
> >>
> >
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >>>     at
> >>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> >>>     at
> >>>
> >>
> >
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
> >>>     at
> >>>
> >>
> >
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
> >>>     at
> >>>
> >>
> >
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
> >>>     at scala.Option.foreach(Option.scala:236)
> >>>     at
> >>>
> >>
> >
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
> >>>     at
> >>>
> >>
> >
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
> >>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> >>>     at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> >>>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> >>>     at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> >>>     at
> >>>
> >>
> >
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> >>>     at
> >>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> >>>     at
> >>>
> >>
> >
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> >>>     at
> >>>
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> >>>     at
> >>>
> >>
> >
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> >>>
> >>> Best,
> >>> Mahesh Balija.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dl...@gmail.com>
> >>> wrote:
> >>>
> >>>> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com>
> >>> wrote:
> >>>>
> >>>>> Is anyone else nervous about ignoring this issue or relying on
> >>> non-build
> >>>>> (hand run) test driven transitive dependency checking. I hope someone
> >>>> else
> >>>>> will chime in.
> >>>>>
> >>>>> As to running unit tests on a TEST_MASTER I’ll look into it. Can we
> > set
> >>>> up
> >>>>> the build machine to do this? I’d feel better about eyeballing deps
> if
> >>> we
> >>>>> could have a TEST_MASTER automatically run during builds at Apache.
> >>> Maybe
> >>>>> the regular unit tests are OK for building locally ourselves.
> >>>>>
> >>>>>>
> >>>>>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dl...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <pat@occamsmachete.com
> >
> >>>>> wrote:
> >>>>>>
> >>>>>>> Maybe a more fundamental issue is that we don’t know for sure
> >>> whether
> >>>> we
> >>>>>>> have missing classes or not. The job.jar at least used the pom
> >>>>> dependencies
> >>>>>>> to guarantee every needed class was present. So the job.jar seems
> to
> >>>>> solve
> >>>>>>> the problem but may ship some unnecessary duplicate code, right?
> >>>>>>>
> >>>>>>
> >>>>>> No, as i wrote spark doesn't  work with job jar format. Neither as
> it
> >>>>> turns
> >>>>>> out more recent hadoop MR btw.
> >>>>>
> >>>>> Not speaking literally of the format. Spark understands jars and
> maven
> >>>> can
> >>>>> build one from transitive dependencies.
> >>>>>
> >>>>>>
> >>>>>> Yes, this is A LOT of duplicate code (will take normally MINUTES to
> >>>>> startup
> >>>>>> tasks with all of it just on copy time). This is absolutely not the
> >>> way
> >>>>> to
> >>>>>> go with this.
> >>>>>>
> >>>>>
> >>>>> Lack of guarantee to load seems like a bigger problem than startup
> >>> time.
> >>>>> Clearly we can’t just ignore this.
> >>>>>
> >>>>
> >>>> Nope. given highly iterative nature and dynamic task allocation in
> this
> >>>> environment, one is looking at effects similar to MapReduce. This is
> >> not
> >>>> the only reason why I never go to MR anymore, but that's one of main
> >>> ones.
> >>>>
> >>>> How about experiment: why don't you create assembly that copies ALL
> >>>> transitive dependencies in one folder, and then try to broadcast it
> > from
> >>>> single point (front end) to well... let's start with 20 machines. (of
> >>>> course we ideally want to get into the 10^3..10^4 range -- but why bother if
> > we
> >>>> can't do it for 20).
> >>>>
> >>>> Or, heck, let's try to simply parallel-copy it between two machines 20
> >>>> times that are not collocated on the same subnet.
> >>>>
> >>>>
> >>>>>>
> >>>>>>> There may be any number of bugs waiting for the time we try running
> >>>> on a
> >>>>>>> node machine that doesn’t have some class in its classpath.
> >>>>>>
> >>>>>>
> >>>>>> No. Assuming any given method is tested on all its execution paths,
> >>>> there
> >>>>>> will be no bugs. The bugs of that sort will only appear if the user
> >>> is
> >>>>>> using algebra directly and calls something that is not on the path,
> >>>> from
> >>>>>> the closure. In which case our answer to this is the same as for the
> >>>>> solver
> >>>>>> methodology developers -- use customized SparkConf while creating
> >>>> context
> >>>>>> to include stuff you really want.
> >>>>>>
> >>>>>> Also another right answer to this is that we probably should
> >>> reasonably
> >>>>>> provide the toolset here. For example, all the stats stuff found in
> R
> >>>>> base
> >>>>>> and R stat packages so the user is not compelled to go non-native.
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> Huh? this is not true. The one I ran into was found by calling
> >>> something
> >>>>> in math from something in math-scala. It led outside and you can
> >>>> encounter
> >>>>> such things even in algebra.  In fact you have no idea if these
> >>> problems
> >>>>> exists except for the fact you have used it a lot personally.
> >>>>>
> >>>>
> >>>>
> >>>> You ran it with your own code that never existed before.
> >>>>
> >>>> But there's difference between released Mahout code (which is what you
> >>> are
> >>>> working on) and the user code. Released code must run thru remote
> tests
> >>> as
> >>>> you suggested and thus guarantee there are no such problems with post
> >>>> release code.
> >>>>
> >>>> For users, we only can provide a way for them to load stuff that they
> >>>> decide to use. We don't have apriori knowledge what they will use. It
> > is
> >>>> the same thing that spark does, and the same thing that MR does,
> > doesn't
> >>>> it?
> >>>>
> >>>> Of course mahout should drop rigorously the stuff it doesn't load,
> from
> >>> the
> >>>> scala scope. No argument about that. In fact that's what i suggested as
> #1
> >>>> solution. But there's nothing much to do here but to go dependency
> >>>> cleansing for math and spark code. Part of the reason there's so much
> > is
> >>>> because newer modules still bring in everything from mrLegacy.
> >>>>
> >>>> You are right in saying it is hard to guess what else dependencies are
> >> in
> >>>> the util/legacy code that are actually used. but that's not a
> >>> justification
> >>>> for brute force "copy them all" approach that virtually guarantees
> >>> ruining
> >>>> one of the foremost legacy issues this work intended to address.
> >>>>
> >>>
> >>
> >>
> >
> >
> >
>
>

Re: Upgrade to Spark 1.1.0?

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Oct 21, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> The problem is not in building Spark it is in building Mahout using the
> correct Spark jars. If you are using CDH and hadoop 2 the correct jars are
> in the repos.
>

This should be true for MapR as well.

Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
The problem is not in building Spark; it is in building Mahout using the correct Spark jars. If you are using CDH and Hadoop 2 the correct jars are in the repos.

For the rest of us, though the process below seems like an error-prone hack to me, it does work on Linux and BSD/Mac. It should really be addressed by Spark imo.

BTW the cache is laid out differently on Linux but I don’t think you need to delete it anyway.

On Oct 21, 2014, at 12:27 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

fwiw i never built spark using maven. Always use sbt assembly.

On Tue, Oct 21, 2014 at 11:55 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Ok, the mystery is solved.
> 
> The safe sequence from my limited testing is:
> 1) delete ~/.m2/repository/org/spark and mahout
> 2) build Spark for your version of Hadoop *but do not use "mvn package
> ...”* use “mvn install …” This will put a copy of the exact bits you need
> into the maven cache for building mahout against. In my case using hadoop
> 1.2.1 it was "mvn -Dhadoop.version=1.2.1 -DskipTests clean install” If you
> run tests on Spark some failures can safely be ignored according to the
> Spark guys so check before giving up.
> 3) build mahout with “mvn clean install"
> 
> This will create mahout from exactly the same bits you will run on your
> cluster. It got rid of a missing anon function for me. The problem occurs
> when you use a different version of Spark on your cluster than you used to
> build Mahout and this is rather hidden by Maven. Maven downloads from repos
> any dependency that is not in the local .m2 cache and so you have to make
> sure your version of Spark is there so Maven won't download one that is
> incompatible. Unless you really know what you are doing I’d build both
> Spark and Mahout for now
> 
> BTW I will check in the Spark 1.1.0 version of Mahout once I do some more
> testing.
> 
> On Oct 21, 2014, at 10:26 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> Sorry to hear. I bet you’ll find a way.
> 
> The Spark Jira trail leads to two suggestions:
> 1) use spark-submit to execute code with your own entry point (other than
> spark-shell) One theory points to not loading all needed Spark classes from
> calling code (Mahout in our case). I can hand check the jars for the anon
> function I am missing.
> 2) there may be different class names in the running code (created by
> building Spark locally) and the  version referenced in the Mahout POM. If
> this turns out to be true it means we can’t rely on building Spark locally.
> Is there a maven target that puts the artifacts of the Spark build in the
> .m2/repository local cache? That would be an easy way to test this theory.
> 
> either of these could cause missing classes.
> 
> 
> On Oct 21, 2014, at 9:52 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> no i havent used it with anything but 1.0.1 and 0.9.x .
> 
> on a side note, I just have changed my employer. It is one of these big
> guys that make it very difficult to do any contributions. So I am not sure
> how much of anything i will be able to share/contribute.
> 
> On Tue, Oct 21, 2014 at 9:43 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
>> But unless you have the time to devote to errors avoid it. I’ve built
>> everything from scratch using 1.0.2 and 1.1.0 and am getting these and
>> missing class errors. The 1.x branch seems to have some kind of peculiar
>> build order dependencies. The errors sometimes don’t show up until
> runtime,
>> passing all build tests.
>> 
>> Dmitriy, have you successfully used any Spark version other than 1.0.1 on
>> a cluster? If so do you recall the exact order and from what sources you
>> built?
>> 
>> 
>> On Oct 21, 2014, at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> 
>> You can't use spark client of one version and have the backend of
> another.
>> You can try to change spark dependency in mahout poms to match your
> backend
>> (or vice versa, you can change your backend to match what's on the
> client).
>> 
>> On Tue, Oct 21, 2014 at 7:12 AM, Mahesh Balija <
> balijamahesh.mca@gmail.com
>>> 
>> wrote:
>> 
>>> Hi All,
>>> 
>>> Here are the errors I get which I run in a pseudo distributed mode,
>>> 
>>> Spark 1.0.2 and Mahout latest code (Clone)
>>> 
>>> When I run the command in page,
>>> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
>>> 
>>> val drmX = drmData(::, 0 until 4)
>>> 
>>> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class
>>> incompatible: stream classdesc serialVersionUID = 385418487991259089,
>>> local class serialVersionUID = -6766554341038829528
>>>     at
>>> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>>>     at
>>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>>     at
>>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>>     at
>>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>>     at
>>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>>     at
>>> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>>>     at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>>     at
>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>>     at
>>> 
>> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>>     at
>>> 
>> 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>>>     at
>>> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>>>     at
>>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>>>     at
>>> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>>>     at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>>     at
>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>>     at
>>> 
>> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>>     at
>>> 
>> 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>>>     at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>>>     at
>>> 
>> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>>>     at
>>> 
>> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     at java.lang.Thread.run(Thread.java:701)
>>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
>>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
>>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
>>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
>>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
>>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
>>> org.apache.spark.SparkException: Job aborted due to stage failure:
>>> Task 0.0:0 failed 4 times, most recent failure: Exception failure in
>>> TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException:
>>> org.apache.spark.rdd.RDD; local class incompatible: stream classdesc
>>> serialVersionUID = 385418487991259089, local class serialVersionUID =
>>> -6766554341038829528
>>>     java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>>> 
>>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>> 
>>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>> 
>>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>> 
>>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>> 
>>> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>>>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>>     java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>> 
>>> 
>> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>> 
>>> 
>> 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>>> 
>>> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>>> 
>>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>>> 
>>> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>>>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>>     java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>> 
>>> 
>> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>> 
>>> 
>> 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>>> 
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>>> 
>>> 
>> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>>> 
>>> 
>> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     java.lang.Thread.run(Thread.java:701)
>>> Driver stacktrace:
>>>     at org.apache.spark.scheduler.DAGScheduler.org
>>> 
>> 
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>>>     at
>>> 
>> 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>>>     at
>>> 
>> 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>>>     at
>>> 
>> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>     at
>>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>     at
>>> 
>> 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>>>     at
>>> 
>> 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>     at
>>> 
>> 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>     at scala.Option.foreach(Option.scala:236)
>>>     at
>>> 
>> 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>>>     at
>>> 
>> 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>>     at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>>     at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>>     at
>>> 
>> 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>>     at
>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>     at
>>> 
>> 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>     at
>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>     at
>>> 
>> 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> 
>>> Best,
>>> Mahesh Balija.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>> 
>>>> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>>> 
>>>>> Is anyone else nervous about ignoring this issue or relying on
>>> non-build
>>>>> (hand run) test driven transitive dependency checking. I hope someone
>>>> else
>>>>> will chime in.
>>>>> 
>>>>> As to running unit tests on a TEST_MASTER I’ll look into it. Can we
> set
>>>> up
>>>>> the build machine to do this? I’d feel better about eyeballing deps if
>>> we
>>>>> could have a TEST_MASTER automatically run during builds at Apache.
>>> Maybe
>>>>> the regular unit tests are OK for building locally ourselves.
>>>>> 
>>>>>> 
>>>>>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <pa...@occamsmachete.com>
>>>>> wrote:
>>>>>> 
>>>>>>> Maybe a more fundamental issue is that we don’t know for sure
>>> whether
>>>> we
>>>>>>> have missing classes or not. The job.jar at least used the pom
>>>>> dependencies
>>>>>>> to guarantee every needed class was present. So the job.jar seems to
>>>>> solve
>>>>>>> the problem but may ship some unnecessary duplicate code, right?
>>>>>>> 
>>>>>> 
>>>>>> No, as i wrote spark doesn't  work with job jar format. Neither as it
>>>>> turns
>>>>>> out more recent hadoop MR btw.
>>>>> 
>>>>> Not speaking literally of the format. Spark understands jars and maven
>>>> can
>>>>> build one from transitive dependencies.
>>>>> 
>>>>>> 
>>>>>> Yes, this is A LOT of duplicate code (will take normally MINUTES to
>>>>> startup
>>>>>> tasks with all of it just on copy time). This is absolutely not the
>>> way
>>>>> to
>>>>>> go with this.
>>>>>> 
>>>>> 
>>>>> Lack of guarantee to load seems like a bigger problem than startup
>>> time.
>>>>> Clearly we can’t just ignore this.
>>>>> 
>>>> 
>>>> Nope. given highly iterative nature and dynamic task allocation in this
>>>> environment, one is looking to effects similar to Map Reduce. This is
>> not
>>>> the only reason why I never go to MR anymore, but that's one of main
>>> ones.
>>>> 
>>>> How about experiment: why don't you create assembly that copies ALL
>>>> transitive dependencies in one folder, and then try to broadcast it
> from
>>>> single point (front end) to well... let's start with 20 machines. (of
>>>> course we ideally want to into 10^3 ..10^4 range -- but why bother if
> we
>>>> can't do it for 20).
>>>> 
>>>> Or, heck, let's try to simply parallel-copy it between two machines 20
>>>> times that are not collocated on the same subnet.
>>>> 
>>>> 
>>>>>> 
>>>>>>> There may be any number of bugs waiting for the time we try running
>>>> on a
>>>>>>> node machine that doesn’t have some class in it’s classpath.
>>>>>> 
>>>>>> 
>>>>>> No. Assuming any given method is tested on all its execution paths,
>>>> there
>>>>>> will be no bugs. The bugs of that sort will only appear if the user
>>> is
>>>>>> using algebra directly and calls something that is not on the path,
>>>> from
>>>>>> the closure. In which case our answer to this is the same as for the
>>>>> solver
>>>>>> methodology developers -- use customized SparkConf while creating
>>>> context
>>>>>> to include stuff you really want.
>>>>>> 
>>>>>> Also another right answer to this is that we probably should
>>> reasonably
>>>>>> provide the toolset here. For example, all the stats stuff found in R
>>>>> base
>>>>>> and R stat packages so the user is not compelled to go non-native.
>>>>>> 
>>>>>> 
>>>>> 
>>>>> Huh? this is not true. The one I ran into was found by calling
>>> something
>>>>> in math from something in math-scala. It led outside and you can
>>>> encounter
>>>>> such things even in algebra.  In fact you have no idea if these
>>> problems
>>>>> exists except for the fact you have used it a lot personally.
>>>>> 
>>>> 
>>>> 
>>>> You ran it with your own code that never existed before.
>>>> 
>>>> But there's difference between released Mahout code (which is what you
>>> are
>>>> working on) and the user code. Released code must run thru remote tests
>>> as
>>>> you suggested and thus guarantee there are no such problems with post
>>>> release code.
>>>> 
>>>> For users, we only can provide a way for them to load stuff that they
>>>> decide to use. We don't have apriori knowledge what they will use. It
> is
>>>> the same thing that spark does, and the same thing that MR does,
> doesn't
>>>> it?
>>>> 
>>>> Of course mahout should drop rigorously the stuff it doesn't load, from
>>> the
>>>> scala scope. No argue about that. In fact that's what i suggested as #1
>>>> solution. But there's nothing much to do here but to go dependency
>>>> cleansing for math and spark code. Part of the reason there's so much
> is
>>>> because newer modules still bring in everything from mrLegacy.
>>>> 
>>>> You are right in saying it is hard to guess what else dependencies are
>> in
>>>> the util/legacy code that are actually used. but that's not a
>>> justification
>>>> for brute force "copy them all" approach that virtually guarantees
>>> ruining
>>>> one of the foremost legacy issues this work intended to address.
>>>> 
>>> 
>> 
>> 
> 
> 
> 


Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
fwiw i never built spark using maven. Always use sbt assembly.

On Tue, Oct 21, 2014 at 11:55 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Ok, the mystery is solved.
>
> The safe sequence from my limited testing is:
> 1) delete ~/.m2/repository/org/spark and mahout
> 2) build Spark for your version of Hadoop *but do not use "mvn package
> ...”* use “mvn install …” This will put a copy of the exact bits you need
> into the maven cache for building mahout against. In my case using hadoop
> 1.2.1 it was "mvn -Dhadoop.version=1.2.1 -DskipTests clean install” If you
> run tests on Spark some failures can safely be ignored according to the
> Spark guys so check before giving up.
> 3) build mahout with “mvn clean install"
>
> This will create mahout from exactly the same bits you will run on your
> cluster. It got rid of a missing anon function for me. The problem occurs
> when you use a different version of Spark on your cluster than you used to
> build Mahout and this is rather hidden by Maven. Maven downloads from repos
> any dependency that is not in the local .m2 cache and so you have to make
> sure your version of Spark is there so Maven wont download one that is
> incompatible. Unless you really know what you are doing I’d build both
> Spark and Mahout for now
>
> BTW I will check in the Spark 1.1.0 version of Mahout once I do some more
> testing.
>
> On Oct 21, 2014, at 10:26 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> Sorry to hear. I bet you’ll find a way.
>
> The Spark Jira trail leads to two suggestions:
> 1) use spark-submit to execute code with your own entry point (other than
> spark-shell) One theory points to not loading all needed Spark classes from
> calling code (Mahout in our case). I can hand check the jars for the anon
> function I am missing.
> 2) there may be different class names in the running code (created by
> building Spark locally) and the  version referenced in the Mahout POM. If
> this turns out to be true it means we can’t rely on building Spark locally.
> Is there a maven target that puts the artifacts of the Spark build in the
> .m2/repository local cache? That would be an easy way to test this theory.
>
> either of these could cause missing classes.
>
>
> On Oct 21, 2014, at 9:52 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> no i havent used it with anything but 1.0.1 and 0.9.x .
>
> on a side note, I just have changed my employer. It is one of these big
> guys that make it very difficult to do any contributions. So I am not sure
> how much of anything i will be able to share/contribute.
>
> On Tue, Oct 21, 2014 at 9:43 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> > But unless you have the time to devote to errors avoid it. I’ve built
> > everything from scratch using 1.0.2 and 1.1.0 and am getting these and
> > missing class errors. The 1.x branch seems to have some kind of peculiar
> > build order dependencies. The errors sometimes don’t show up until
> runtime,
> > passing all build tests.
> >
> > Dmitriy, have you successfully used any Spark version other than 1.0.1 on
> > a cluster? If so do you recall the exact order and from what sources you
> > built?
> >
> >
> > On Oct 21, 2014, at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >
> > You can't use spark client of one version and have the backend of
> another.
> > You can try to change spark dependency in mahout poms to match your
> backend
> > (or vice versa, you can change your backend to match what's on the
> client).
> >
> > On Tue, Oct 21, 2014 at 7:12 AM, Mahesh Balija <
> balijamahesh.mca@gmail.com
> >>
> > wrote:
> >
> >> Hi All,
> >>
> >> Here are the errors I get which I run in a pseudo distributed mode,
> >>
> >> Spark 1.0.2 and Mahout latest code (Clone)
> >>
> >> When I run the command in page,
> >> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
> >>
> >> val drmX = drmData(::, 0 until 4)
> >>
> >> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class
> >> incompatible: stream classdesc serialVersionUID = 385418487991259089,
> >> local class serialVersionUID = -6766554341038829528
> >>      at
> >> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
> >>      at
> >> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> >>      at
> >> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> >>      at
> >> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> >>      at
> >> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> >>      at
> >>
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
> >>      at
> >> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> >>      at
> > java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> >>      at
> >>
> >
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >>      at
> >>
> >
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
> >>      at
> >> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
> >>      at
> >> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
> >>      at
> >>
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
> >>      at
> >> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> >>      at
> > java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> >>      at
> >>
> >
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >>      at
> >>
> >
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
> >>      at
> >> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
> >>      at
> >>
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> >>      at
> >>
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>      at java.lang.Thread.run(Thread.java:701)
> >> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
> >> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
> >> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
> >> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
> >> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
> >> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
> >> org.apache.spark.SparkException: Job aborted due to stage failure:
> >> Task 0.0:0 failed 4 times, most recent failure: Exception failure in
> >> TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException:
> >> org.apache.spark.rdd.RDD; local class incompatible: stream classdesc
> >> serialVersionUID = 385418487991259089, local class serialVersionUID =
> >> -6766554341038829528
> >>      java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
> >>
> >> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> >>
> >> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> >>
> >> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> >>
> >> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> >>
> >>
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
> >>      java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> >>      java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> >>
> >>
> >
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >>
> >>
> >
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
> >>
> >> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
> >>
> >> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
> >>
> >>
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
> >>      java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> >>      java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> >>
> >>
> >
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >>
> >>
> >
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
> >>
> >> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
> >>
> >>
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> >>
> >>
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>      java.lang.Thread.run(Thread.java:701)
> >> Driver stacktrace:
> >>      at org.apache.spark.scheduler.DAGScheduler.org
> >>
> >
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
> >>      at
> >>
> >
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
> >>      at
> >>
> >
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
> >>      at
> >>
> >
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >>      at
> >> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> >>      at
> >>
> >
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
> >>      at
> >>
> >
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
> >>      at
> >>
> >
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
> >>      at scala.Option.foreach(Option.scala:236)
> >>      at
> >>
> >
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
> >>      at
> >>
> >
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
> >>      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> >>      at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> >>      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> >>      at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> >>      at
> >>
> >
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> >>      at
> >> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> >>      at
> >>
> >
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> >>      at
> >> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> >>      at
> >>
> >
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> >>
> >> Best,
> >> Mahesh Balija.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dl...@gmail.com>
> >> wrote:
> >>
> >>> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com>
> >> wrote:
> >>>
> >>>> Is anyone else nervous about ignoring this issue or relying on
> >> non-build
> >>>> (hand run) test driven transitive dependency checking. I hope someone
> >>> else
> >>>> will chime in.
> >>>>
> >>>> As to running unit tests on a TEST_MASTER I’ll look into it. Can we
> set
> >>> up
> >>>> the build machine to do this? I’d feel better about eyeballing deps if
> >> we
> >>>> could have a TEST_MASTER automatically run during builds at Apache.
> >> Maybe
> >>>> the regular unit tests are OK for building locally ourselves.
> >>>>
> >>>>>
> >>>>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dl...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <pa...@occamsmachete.com>
> >>>> wrote:
> >>>>>
> >>>>>> Maybe a more fundamental issue is that we don’t know for sure
> >> whether
> >>> we
> >>>>>> have missing classes or not. The job.jar at least used the pom
> >>>> dependencies
> >>>>>> to guarantee every needed class was present. So the job.jar seems to
> >>>> solve
> >>>>>> the problem but may ship some unnecessary duplicate code, right?
> >>>>>>
> >>>>>
> >>>>> No, as i wrote spark doesn't  work with job jar format. Neither as it
> >>>> turns
> >>>>> out more recent hadoop MR btw.
> >>>>
> >>>> Not speaking literally of the format. Spark understands jars and maven
> >>> can
> >>>> build one from transitive dependencies.
> >>>>
> >>>>>
> >>>>> Yes, this is A LOT of duplicate code (will take normally MINUTES to
> >>>> startup
> >>>>> tasks with all of it just on copy time). This is absolutely not the
> >> way
> >>>> to
> >>>>> go with this.
> >>>>>
> >>>>
> >>>> Lack of guarantee to load seems like a bigger problem than startup
> >> time.
> >>>> Clearly we can’t just ignore this.
> >>>>
> >>>
> >>> Nope. given highly iterative nature and dynamic task allocation in this
> >>> environment, one is looking to effects similar to Map Reduce. This is
> > not
> >>> the only reason why I never go to MR anymore, but that's one of main
> >> ones.
> >>>
> >>> How about experiment: why don't you create assembly that copies ALL
> >>> transitive dependencies in one folder, and then try to broadcast it
> from
> >>> single point (front end) to well... let's start with 20 machines. (of
> >>> course we ideally want to into 10^3 ..10^4 range -- but why bother if
> we
> >>> can't do it for 20).
> >>>
> >>> Or, heck, let's try to simply parallel-copy it between two machines 20
> >>> times that are not collocated on the same subnet.
> >>>
> >>>
> >>>>>
> >>>>>> There may be any number of bugs waiting for the time we try running
> >>> on a
> >>>>>> node machine that doesn’t have some class in it’s classpath.
> >>>>>
> >>>>>
> >>>>> No. Assuming any given method is tested on all its execution paths,
> >>> there
> >>>>> will be no bugs. The bugs of that sort will only appear if the user
> >> is
> >>>>> using algebra directly and calls something that is not on the path,
> >>> from
> >>>>> the closure. In which case our answer to this is the same as for the
> >>>> solver
> >>>>> methodology developers -- use customized SparkConf while creating
> >>> context
> >>>>> to include stuff you really want.
> >>>>>
> >>>>> Also another right answer to this is that we probably should
> >> reasonably
> >>>>> provide the toolset here. For example, all the stats stuff found in R
> >>>> base
> >>>>> and R stat packages so the user is not compelled to go non-native.
> >>>>>
> >>>>>
> >>>>
> >>>> Huh? this is not true. The one I ran into was found by calling
> >> something
> >>>> in math from something in math-scala. It led outside and you can
> >>> encounter
> >>>> such things even in algebra.  In fact you have no idea if these
> >> problems
> >>>> exists except for the fact you have used it a lot personally.
> >>>>
> >>>
> >>>
> >>> You ran it with your own code that never existed before.
> >>>
> >>> But there's difference between released Mahout code (which is what you
> >> are
> >>> working on) and the user code. Released code must run thru remote tests
> >> as
> >>> you suggested and thus guarantee there are no such problems with post
> >>> release code.
> >>>
> >>> For users, we only can provide a way for them to load stuff that they
> >>> decide to use. We don't have apriori knowledge what they will use. It
> is
> >>> the same thing that spark does, and the same thing that MR does,
> doesn't
> >>> it?
> >>>
> >>> Of course mahout should drop rigorously the stuff it doesn't load, from
> >> the
> >>> scala scope. No argue about that. In fact that's what i suggested as #1
> >>> solution. But there's nothing much to do here but to go dependency
> >>> cleansing for math and spark code. Part of the reason there's so much
> is
> >>> because newer modules still bring in everything from mrLegacy.
> >>>
> >>> You are right in saying it is hard to guess what else dependencies are
> > in
> >>> the util/legacy code that are actually used. but that's not a
> >> justification
> >>> for brute force "copy them all" approach that virtually guarantees
> >> ruining
> >>> one of the foremost legacy issues this work intended to address.
> >>>
> >>
> >
> >
>
>
>

Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Ok, the mystery is solved. 

The safe sequence from my limited testing is:
1) delete ~/.m2/repository/org/spark and mahout
2) build Spark for your version of Hadoop *but do not use “mvn package ...”*; use “mvn install ...” instead. This will put a copy of the exact bits you need into the Maven cache for building Mahout against. In my case, using Hadoop 1.2.1, it was “mvn -Dhadoop.version=1.2.1 -DskipTests clean install”. If you run tests on Spark, some failures can safely be ignored according to the Spark guys, so check before giving up.
3) build mahout with “mvn clean install"

This will create Mahout from exactly the same bits you will run on your cluster. It got rid of a missing anon function for me. The problem occurs when you use a different version of Spark on your cluster than you used to build Mahout, and this is rather hidden by Maven: Maven downloads from the repos any dependency that is not in the local .m2 cache, so you have to make sure your version of Spark is there so Maven won’t download one that is incompatible. Unless you really know what you are doing, I’d build both Spark and Mahout for now.
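
If you want to double-check that the driver and the cluster really are running the same Spark bits, a minimal sketch (assuming an active SparkContext named sc, e.g. in the Mahout spark-shell) is to compare the serialVersionUID of a core Spark class on both sides; a mismatch here is exactly what the java.io.InvalidClassException quoted further down the thread is complaining about:

    import java.io.ObjectStreamClass
    import org.apache.spark.rdd.RDD

    def rddUid: Long =
      ObjectStreamClass.lookup(classOf[RDD[_]]).getSerialVersionUID

    // UID of the RDD class on the driver (the Spark Mahout was built/launched with).
    println("driver uid:   " + rddUid)

    // UID of the RDD class on the executors (the Spark deployed on the cluster).
    sc.parallelize(1 to 1).map(_ => rddUid).collect()
      .foreach(uid => println("executor uid: " + uid))

If the two numbers differ, the builds are incompatible; if they are badly incompatible, the little job above may itself die with the same InvalidClassException, which answers the question just as well.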

BTW I will check in the Spark 1.1.0 version of Mahout once I do some more testing.

On Oct 21, 2014, at 10:26 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

Sorry to hear. I bet you’ll find a way.

The Spark Jira trail leads to two suggestions:
1) use spark-submit to execute code with your own entry point (other than spark-shell) One theory points to not loading all needed Spark classes from calling code (Mahout in our case). I can hand check the jars for the anon function I am missing.
2) there may be different class names in the running code (created by building Spark locally) and the  version referenced in the Mahout POM. If this turns out to be true it means we can’t rely on building Spark locally. Is there a maven target that puts the artifacts of the Spark build in the .m2/repository local cache? That would be an easy way to test this theory.

either of these could cause missing classes.


On Oct 21, 2014, at 9:52 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

no i havent used it with anything but 1.0.1 and 0.9.x .

on a side note, I just have changed my employer. It is one of these big
guys that make it very difficult to do any contributions. So I am not sure
how much of anything i will be able to share/contribute.

On Tue, Oct 21, 2014 at 9:43 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> But unless you have the time to devote to errors avoid it. I’ve built
> everything from scratch using 1.0.2 and 1.1.0 and am getting these and
> missing class errors. The 1.x branch seems to have some kind of peculiar
> build order dependencies. The errors sometimes don’t show up until runtime,
> passing all build tests.
> 
> Dmitriy, have you successfully used any Spark version other than 1.0.1 on
> a cluster? If so do you recall the exact order and from what sources you
> built?
> 
> 
> On Oct 21, 2014, at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> You can't use spark client of one version and have the backend of another.
> You can try to change spark dependency in mahout poms to match your backend
> (or vice versa, you can change your backend to match what's on the client).
> 
> On Tue, Oct 21, 2014 at 7:12 AM, Mahesh Balija <balijamahesh.mca@gmail.com
>> 
> wrote:
> 
>> Hi All,
>> 
>> Here are the errors I get which I run in a pseudo distributed mode,
>> 
>> Spark 1.0.2 and Mahout latest code (Clone)
>> 
>> When I run the command in page,
>> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
>> 
>> val drmX = drmData(::, 0 until 4)
>> 
>> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class
>> incompatible: stream classdesc serialVersionUID = 385418487991259089,
>> local class serialVersionUID = -6766554341038829528
>>      at
>> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>>      at
>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>      at
>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>      at
>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>      at
>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>      at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>>      at
>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>      at
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>      at
>> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>      at
>> 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>>      at
>> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>>      at
>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>>      at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>>      at
>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>      at
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>      at
>> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>      at
>> 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>>      at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>>      at
>> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>>      at
>> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>      at java.lang.Thread.run(Thread.java:701)
>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
>> org.apache.spark.SparkException: Job aborted due to stage failure:
>> Task 0.0:0 failed 4 times, most recent failure: Exception failure in
>> TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException:
>> org.apache.spark.rdd.RDD; local class incompatible: stream classdesc
>> serialVersionUID = 385418487991259089, local class serialVersionUID =
>> -6766554341038829528
>>      java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>> 
>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>> 
>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>> 
>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>> 
>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>> 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>>      java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>      java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>> 
>> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>> 
>> 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>> 
>> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>> 
>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>> 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>>      java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>      java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>> 
>> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>> 
>> 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>> 
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>> 
>> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>> 
>> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>      java.lang.Thread.run(Thread.java:701)
>> Driver stacktrace:
>>      at org.apache.spark.scheduler.DAGScheduler.org
>> 
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>>      at
>> 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>>      at
>> 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>>      at
>> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>      at
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>      at
>> 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>>      at
>> 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>      at
>> 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>      at scala.Option.foreach(Option.scala:236)
>>      at
>> 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>>      at
>> 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>>      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>      at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>      at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>      at
>> 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>      at
>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>      at
>> 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>      at
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>      at
>> 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> 
>> Best,
>> Mahesh Balija.
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> 
>>> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>> 
>>>> Is anyone else nervous about ignoring this issue or relying on non-build
>>>> (hand-run) test-driven transitive dependency checking? I hope someone
>>>> else will chime in.
>>>> 
>>>> As to running unit tests on a TEST_MASTER I’ll look into it. Can we set
>>> up
>>>> the build machine to do this? I’d feel better about eyeballing deps if
>> we
>>>> could have a TEST_MASTER automatically run during builds at Apache.
>> Maybe
>>>> the regular unit tests are OK for building locally ourselves.
>>>> 
>>>>> 
>>>>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <pa...@occamsmachete.com>
>>>> wrote:
>>>>> 
>>>>>> Maybe a more fundamental issue is that we don’t know for sure
>> whether
>>> we
>>>>>> have missing classes or not. The job.jar at least used the pom
>>>> dependencies
>>>>>> to guarantee every needed class was present. So the job.jar seems to
>>>> solve
>>>>>> the problem but may ship some unnecessary duplicate code, right?
>>>>>> 
>>>>> 
>>>>> No, as I wrote, Spark doesn't work with the job jar format. Neither,
>>>>> as it turns out, does more recent Hadoop MR, btw.
>>>> 
>>>> Not speaking literally of the format. Spark understands jars and maven
>>> can
>>>> build one from transitive dependencies.
>>>> 
>>>>> 
>>>>> Yes, this is A LOT of duplicate code (it will normally take MINUTES to
>>>>> start up tasks with all of it, just on copy time). This is absolutely
>>>>> not the way to go with this.
>>>>> 
>>>> 
>>>> Lack of guarantee to load seems like a bigger problem than startup
>> time.
>>>> Clearly we can’t just ignore this.
>>>> 
>>> 
>>> Nope. Given the highly iterative nature and dynamic task allocation in this
>>> environment, one is looking at effects similar to MapReduce. This is not
>>> the only reason why I never go to MR anymore, but it's one of the main ones.
>>> 
>>> How about an experiment: why don't you create an assembly that copies ALL
>>> transitive dependencies into one folder, and then try to broadcast it from
>>> a single point (the front end) to, well... let's start with 20 machines. (Of
>>> course we ideally want to get into the 10^3..10^4 range -- but why bother if
>>> we can't do it for 20.)
>>> 
>>> Or, heck, let's try to simply parallel-copy it between two machines 20
>>> times when they are not collocated on the same subnet.
>>> 
>>> 
>>>>> 
>>>>>> There may be any number of bugs waiting for the time we try running
>>>>>> on a node machine that doesn't have some class in its classpath.
>>>>> 
>>>>> 
>>>>> No. Assuming any given method is tested on all its execution paths,
>>> there
>>>>> will be no bugs. The bugs of that sort will only appear if the user
>> is
>>>>> using algebra directly and calls something that is not on the path,
>>> from
>>>>> the closure. In which case our answer to this is the same as for the
>>>> solver
>>>>> methodology developers -- use customized SparkConf while creating
>>> context
>>>>> to include stuff you really want.
>>>>> 
>>>>> Also, another right answer to this is that we should probably provide a
>>>>> reasonable toolset here: for example, all the stats stuff found in R base
>>>>> and the R stat packages, so the user is not compelled to go non-native.
>>>>> 
>>>>> 
>>>> 
>>>> Huh? This is not true. The one I ran into was found by calling something
>>>> in math from something in math-scala. It led outside, and you can encounter
>>>> such things even in algebra. In fact you have no idea whether these problems
>>>> exist except for the fact that you have used it a lot personally.
>>>> 
>>> 
>>> 
>>> You ran it with your own code that never existed before.
>>> 
>>> But there's a difference between released Mahout code (which is what you
>>> are working on) and the user code. Released code must run through remote
>>> tests, as you suggested, and thus guarantee there are no such problems with
>>> post-release code.
>>> 
>>> For users, we can only provide a way for them to load the stuff that they
>>> decide to use. We don't have a priori knowledge of what they will use. It is
>>> the same thing that Spark does, and the same thing that MR does, isn't
>>> it?
>>> 
>>> Of course Mahout should rigorously drop the stuff it doesn't load from the
>>> Scala scope. No argument about that. In fact that's what I suggested as the
>>> #1 solution. But there's not much to do here except go dependency
>>> cleansing for the math and spark code. Part of the reason there's so much is
>>> because newer modules still bring in everything from mrLegacy.
>>> 
>>> You are right in saying it is hard to guess which other dependencies in
>>> the util/legacy code are actually used, but that's not a justification
>>> for a brute-force "copy them all" approach that virtually guarantees
>>> reintroducing one of the foremost legacy issues this work was intended to
>>> address.
>>> 
>> 
> 
> 



Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Ok, the mystery is solved. 

The safe sequence from my limited testing is:
1) delete the cached Spark and Mahout artifacts from the local Maven repository (~/.m2/repository/org/apache/spark and ~/.m2/repository/org/apache/mahout)
2) build Spark for your version of Hadoop, *but do not use “mvn package …”*; use “mvn install …” instead. This puts a copy of the exact bits you need into the local Maven cache for building Mahout against. In my case, using Hadoop 1.2.1, it was “mvn -Dhadoop.version=1.2.1 -DskipTests clean install”. If you run the Spark tests, some failures can safely be ignored according to the Spark guys, so check before giving up.
3) build Mahout with “mvn clean install”
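
For reference, here is the same sequence as shell commands. This is only a sketch of what I ran: the checkout directory names and the Hadoop version below are from my setup, so adjust them to yours.

# 1) clear the locally cached Spark and Mahout artifacts (default ~/.m2 location assumed)
rm -rf ~/.m2/repository/org/apache/spark ~/.m2/repository/org/apache/mahout

# 2) build *and install* Spark for your Hadoop version so the exact bits land in the local Maven cache
cd ~/spark-1.1.0    # wherever your Spark source checkout lives
mvn -Dhadoop.version=1.2.1 -DskipTests clean install

# 3) build Mahout against the Spark artifacts that were just installed
cd ~/mahout         # wherever your Mahout source checkout lives
mvn clean install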

This builds Mahout from exactly the same bits you will run on your cluster, and it got rid of a missing anonymous function for me. The problem occurs when the Spark version on your cluster differs from the one Mahout was built against, and Maven rather hides this: it downloads from remote repos any dependency that is not in the local .m2 cache, so you have to make sure your own Spark build is there or Maven will fetch one that is incompatible. Unless you really know what you are doing, I’d build both Spark and Mahout this way for now.
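
If you want to double-check which Spark bits the Mahout build actually resolved, something like the following should show it. I’m assuming you run it from the top of the Mahout source tree and that the Spark bindings live in the spark/ module directory; adjust if your layout differs.

cd spark
mvn dependency:tree -Dincludes=org.apache.spark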

BTW I will check in the Spark 1.1.0 version of Mahout once I do some more testing.




Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Sorry to hear. I bet you’ll find a way.

The Spark Jira trail leads to two suggestions:
1) use spark-submit to execute code with your own entry point (other than spark-shell). One theory points to the calling code (Mahout in our case) not loading all the needed Spark classes. I can hand-check the jars for the anonymous function I am missing.
2) there may be different class names in the running code (created by building Spark locally) and in the version referenced in the Mahout POM. If this turns out to be true, it means we can’t rely on building Spark locally. Is there a Maven target that puts the artifacts of the Spark build into the .m2/repository local cache? That would be an easy way to test this theory.

Either of these could cause missing classes.
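
For anyone who wants to try either of these, here is a rough sketch. The jar path, class name pattern, master URL, and entry point are only placeholders, not something Mahout ships.

# 1) hand-check a jar that gets shipped to the cluster for the anonymous-function classes
jar tf /path/to/the/jar/shipped/to/the/cluster.jar | grep 'anonfun'

# 2) run the same code through spark-submit with an explicit entry point instead of spark-shell
spark-submit --class com.example.MyMahoutDriver \
  --master spark://your-master:7077 \
  /path/to/your-driver-assembly.jar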




RE: Upgrade to Spark 1.1.0?

Posted by Saikat Kanjilal <sx...@hotmail.com>.
Dmitriy, I've been spending quite a bit of time on the Mahout-Spark interface, and I was wondering if it makes sense to do a Google Hangout with some Q&A for new folks who are going to be making changes to the code. For my part, I'd like to understand the long-term goal of the Scala bindings: it doesn't seem to make sense to have them backed by the old Mahout Colt libraries, so is there any thought of just having the core Mahout code written in Scala with a DSL on top?
Regarding the Google Hangout Q&A, I can help organize it if necessary.


Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
No, I haven't used it with anything but 1.0.1 and 0.9.x.

On a side note, I have just changed employers. It is one of those big
companies that make it very difficult to do any contributions, so I am not sure
how much of anything I will be able to share/contribute.

On Tue, Oct 21, 2014 at 9:43 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> But unless you have the time to devote to errors avoid it. I’ve built
> everything from scratch using 1.0.2 and 1.1.0 and am getting these and
> missing class errors. The 1.x branch seems to have some kind of peculiar
> build order dependencies. The errors sometimes don’t show up until runtime,
> passing all build tests.
>
> Dmitriy, have you successfully used any Spark version other than 1.0.1 on
> a cluster? If so do you recall the exact order and from what sources you
> built?
> >> from
> >>>> the closure. In which case our answer to this is the same as for the
> >>> solver
> >>>> methodology developers -- use customized SparkConf while creating
> >> context
> >>>> to include stuff you really want.
> >>>>
> >>>> Also another right answer to this is that we probably should
> > reasonably
> >>>> provide the toolset here. For example, all the stats stuff found in R
> >>> base
> >>>> and R stat packages so the user is not compelled to go non-native.
> >>>>
> >>>>
> >>>
> >>> Huh? this is not true. The one I ran into was found by calling
> > something
> >>> in math from something in math-scala. It led outside and you can
> >> encounter
> >>> such things even in algebra.  In fact you have no idea if these
> > problems
> >>> exists except for the fact you have used it a lot personally.
> >>>
> >>
> >>
> >> You ran it with your own code that never existed before.
> >>
> >> But there's difference between released Mahout code (which is what you
> > are
> >> working on) and the user code. Released code must run thru remote tests
> > as
> >> you suggested and thus guarantee there are no such problems with post
> >> release code.
> >>
> >> For users, we only can provide a way for them to load stuff that they
> >> decide to use. We don't have apriori knowledge what they will use. It is
> >> the same thing that spark does, and the same thing that MR does, doesn't
> >> it?
> >>
> >> Of course mahout should drop rigorously the stuff it doesn't load, from
> > the
> >> scala scope. No argue about that. In fact that's what i suggested as #1
> >> solution. But there's nothing much to do here but to go dependency
> >> cleansing for math and spark code. Part of the reason there's so much is
> >> because newer modules still bring in everything from mrLegacy.
> >>
> >> You are right in saying it is hard to guess what else dependencies are
> in
> >> the util/legacy code that are actually used. but that's not a
> > justification
> >> for brute force "copy them all" approach that virtually guarantees
> > ruining
> >> one of the foremost legacy issues this work intended to address.
> >>
> >
>
>

Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
But unless you have the time to devote to chasing errors, avoid it. I’ve built everything from scratch using 1.0.2 and 1.1.0 and am getting these serialization errors as well as missing-class errors. The 1.x branch seems to have some peculiar build-order dependencies, and the errors sometimes don’t show up until runtime, after all build tests have passed.

Dmitriy, have you successfully used any Spark version other than 1.0.1 on a cluster? If so, do you recall the exact build order and which sources you built from?
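
In case it helps anyone reproduce this: the quickest check I know of is a single trivial distributed job run against a real master (the TEST_MASTER idea discussed below) rather than local mode. A minimal sketch, with names of my own choosing, not existing Mahout test code:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal smoke test, illustrative names only: run one trivial distributed
// job against a real master (taken from a TEST_MASTER env var, falling back
// to local[2]) so classpath and version problems surface here instead of
// only at runtime on a cluster.
object ClusterSmokeTest {
  def main(args: Array[String]): Unit = {
    val master = sys.env.getOrElse("TEST_MASTER", "local[2]")
    val conf = new SparkConf().setMaster(master).setAppName("mahout-smoke-test")
    val sc = new SparkContext(conf)
    try {
      // Making executors deserialize a closure and return a result exercises
      // exactly the path that blows up with the InvalidClassException in this
      // thread when the client and backend Spark versions differ.
      val n = sc.parallelize(1 to 1000, 4).map(_ * 2).count()
      require(n == 1000L, "unexpected count: " + n)
      println("smoke test passed against " + master)
    } finally {
      sc.stop()
    }
  }
}

When the client and the backend disagree, this fails immediately with the same InvalidClassException Mahesh posted, instead of only after a full build has passed.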


On Oct 21, 2014, at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

You can't use spark client of one version and have the backend of another.
You can try to change spark dependency in mahout poms to match your backend
(or vice versa, you can change your backend to match what's on the client).

On Tue, Oct 21, 2014 at 7:12 AM, Mahesh Balija <ba...@gmail.com>
wrote:

> Hi All,
> 
> Here are the errors I get which I run in a pseudo distributed mode,
> 
> Spark 1.0.2 and Mahout latest code (Clone)
> 
> When I run the command in page,
> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
> 
> val drmX = drmData(::, 0 until 4)
> 
> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class
> incompatible: stream classdesc serialVersionUID = 385418487991259089,
> local class serialVersionUID = -6766554341038829528
>        at
> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>        at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>        at
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>        at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>        at
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>        at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>        at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>        at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>        at
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>        at
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>        at
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>        at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>        at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>        at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>        at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>        at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>        at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>        at java.lang.Thread.run(Thread.java:701)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0.0:0 failed 4 times, most recent failure: Exception failure in
> TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException:
> org.apache.spark.rdd.RDD; local class incompatible: stream classdesc
> serialVersionUID = 385418487991259089, local class serialVersionUID =
> -6766554341038829528
>        java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
> 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> 
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> 
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>        java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
> 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
> 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>        java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
> 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>        java.lang.Thread.run(Thread.java:701)
> Driver stacktrace:
>        at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>        at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>        at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>        at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>        at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>        at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>        at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>        at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>        at scala.Option.foreach(Option.scala:236)
>        at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>        at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>        at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>        at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>        at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>        at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>        at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 
> Best,
> Mahesh Balija.
> 
> 
> 
> 
> 
> 
> 
> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> 
>> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>> 
>>> Is anyone else nervous about ignoring this issue or relying on
> non-build
>>> (hand run) test driven transitive dependency checking. I hope someone
>> else
>>> will chime in.
>>> 
>>> As to running unit tests on a TEST_MASTER I’ll look into it. Can we set
>> up
>>> the build machine to do this? I’d feel better about eyeballing deps if
> we
>>> could have a TEST_MASTER automatically run during builds at Apache.
> Maybe
>>> the regular unit tests are OK for building locally ourselves.
>>> 
>>>> 
>>>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>>> 
>>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>>> 
>>>>> Maybe a more fundamental issue is that we don’t know for sure
> whether
>> we
>>>>> have missing classes or not. The job.jar at least used the pom
>>> dependencies
>>>>> to guarantee every needed class was present. So the job.jar seems to
>>> solve
>>>>> the problem but may ship some unnecessary duplicate code, right?
>>>>> 
>>>> 
>>>> No, as i wrote spark doesn't  work with job jar format. Neither as it
>>> turns
>>>> out more recent hadoop MR btw.
>>> 
>>> Not speaking literally of the format. Spark understands jars and maven
>> can
>>> build one from transitive dependencies.
>>> 
>>>> 
>>>> Yes, this is A LOT of duplicate code (will take normally MINUTES to
>>> startup
>>>> tasks with all of it just on copy time). This is absolutely not the
> way
>>> to
>>>> go with this.
>>>> 
>>> 
>>> Lack of guarantee to load seems like a bigger problem than startup
> time.
>>> Clearly we can’t just ignore this.
>>> 
>> 
>> Nope. given highly iterative nature and dynamic task allocation in this
>> environment, one is looking to effects similar to Map Reduce. This is not
>> the only reason why I never go to MR anymore, but that's one of main
> ones.
>> 
>> How about experiment: why don't you create assembly that copies ALL
>> transitive dependencies in one folder, and then try to broadcast it from
>> single point (front end) to well... let's start with 20 machines. (of
>> course we ideally want to into 10^3 ..10^4 range -- but why bother if we
>> can't do it for 20).
>> 
>> Or, heck, let's try to simply parallel-copy it between too machines 20
>> times that are not collocated on the same subnet.
>> 
>> 
>>>> 
>>>>> There may be any number of bugs waiting for the time we try running
>> on a
>>>>> node machine that doesn’t have some class in it’s classpath.
>>>> 
>>>> 
>>>> No. Assuming any given method is tested on all its execution paths,
>> there
>>>> will be no bugs. The bugs of that sort will only appear if the user
> is
>>>> using algebra directly and calls something that is not on the path,
>> from
>>>> the closure. In which case our answer to this is the same as for the
>>> solver
>>>> methodology developers -- use customized SparkConf while creating
>> context
>>>> to include stuff you really want.
>>>> 
>>>> Also another right answer to this is that we probably should
> reasonably
>>>> provide the toolset here. For example, all the stats stuff found in R
>>> base
>>>> and R stat packages so the user is not compelled to go non-native.
>>>> 
>>>> 
>>> 
>>> Huh? this is not true. The one I ran into was found by calling
> something
>>> in math from something in math-scala. It led outside and you can
>> encounter
>>> such things even in algebra.  In fact you have no idea if these
> problems
>>> exists except for the fact you have used it a lot personally.
>>> 
>> 
>> 
>> You ran it with your own code that never existed before.
>> 
>> But there's difference between released Mahout code (which is what you
> are
>> working on) and the user code. Released code must run thru remote tests
> as
>> you suggested and thus guarantee there are no such problems with post
>> release code.
>> 
>> For users, we only can provide a way for them to load stuff that they
>> decide to use. We don't have apriori knowledge what they will use. It is
>> the same thing that spark does, and the same thing that MR does, doesn't
>> it?
>> 
>> Of course mahout should drop rigorously the stuff it doesn't load, from
> the
>> scala scope. No argue about that. In fact that's what i suggested as #1
>> solution. But there's nothing much to do here but to go dependency
>> cleansing for math and spark code. Part of the reason there's so much is
>> because newer modules still bring in everything from mrLegacy.
>> 
>> You are right in saying it is hard to guess what else dependencies are in
>> the util/legacy code that are actually used. but that's not a
> justification
>> for brute force "copy them all" approach that virtually guarantees
> ruining
>> one of the foremost legacy issues this work intended to address.
>> 
> 


Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
You can't use spark client of one version and have the backend of another.
You can try to change spark dependency in mahout poms to match your backend
(or vice versa, you can change your backend to match what's on the client).
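
For example -- just a sketch, and assuming your Spark build exposes SparkContext.version (recent 1.x releases do; if yours does not, read the version from the spark-core jar manifest instead) -- you can print the version the client side is actually linked against and compare it with what you know the backend runs:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: "expectedBackend" is a placeholder you fill in by hand with
// the version installed on your cluster.
object SparkVersionCheck {
  def main(args: Array[String]): Unit = {
    val expectedBackend = "1.0.1"
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("spark-version-check"))
    try {
      val clientVersion = sc.version  // version of spark-core on the driver classpath
      println("client linked against Spark " + clientVersion + ", backend expected to be " + expectedBackend)
      if (clientVersion != expectedBackend) {
        println("versions differ -- rebuild mahout against the backend version, or upgrade the cluster")
      }
    } finally {
      sc.stop()
    }
  }
}

If they differ, fix the poms (or the cluster) first; chasing serialization errors before that is wasted time.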

On Tue, Oct 21, 2014 at 7:12 AM, Mahesh Balija <ba...@gmail.com>
wrote:

> Hi All,
>
> Here are the errors I get which I run in a pseudo distributed mode,
>
> Spark 1.0.2 and Mahout latest code (Clone)
>
> When I run the command in page,
> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
>
> val drmX = drmData(::, 0 until 4)
>
> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class
> incompatible: stream classdesc serialVersionUID = 385418487991259089,
> local class serialVersionUID = -6766554341038829528
>         at
> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>         at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>         at
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>         at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>         at
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>         at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>         at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>         at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>         at
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>         at
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>         at
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>         at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>         at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>         at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>         at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>         at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:701)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0.0:0 failed 4 times, most recent failure: Exception failure in
> TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException:
> org.apache.spark.rdd.RDD; local class incompatible: stream classdesc
> serialVersionUID = 385418487991259089, local class serialVersionUID =
> -6766554341038829528
>         java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>         java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>         java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>         java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>         java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         java.lang.Thread.run(Thread.java:701)
> Driver stacktrace:
>         at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>         at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>         at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>         at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>         at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>         at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>         at scala.Option.foreach(Option.scala:236)
>         at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>         at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> Best,
> Mahesh Balija.
>
>
>
>
>
>
>
> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> >
> > > Is anyone else nervous about ignoring this issue or relying on
> non-build
> > > (hand run) test driven transitive dependency checking. I hope someone
> > else
> > > will chime in.
> > >
> > > As to running unit tests on a TEST_MASTER I’ll look into it. Can we set
> > up
> > > the build machine to do this? I’d feel better about eyeballing deps if
> we
> > > could have a TEST_MASTER automatically run during builds at Apache.
> Maybe
> > > the regular unit tests are OK for building locally ourselves.
> > >
> > > >
> > > > On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > > >
> > > > On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <pa...@occamsmachete.com>
> > > wrote:
> > > >
> > > >> Maybe a more fundamental issue is that we don’t know for sure
> whether
> > we
> > > >> have missing classes or not. The job.jar at least used the pom
> > > dependencies
> > > >> to guarantee every needed class was present. So the job.jar seems to
> > > solve
> > > >> the problem but may ship some unnecessary duplicate code, right?
> > > >>
> > > >
> > > > No, as i wrote spark doesn't  work with job jar format. Neither as it
> > > turns
> > > > out more recent hadoop MR btw.
> > >
> > > Not speaking literally of the format. Spark understands jars and maven
> > can
> > > build one from transitive dependencies.
> > >
> > > >
> > > > Yes, this is A LOT of duplicate code (will take normally MINUTES to
> > > startup
> > > > tasks with all of it just on copy time). This is absolutely not the
> way
> > > to
> > > > go with this.
> > > >
> > >
> > > Lack of guarantee to load seems like a bigger problem than startup
> time.
> > > Clearly we can’t just ignore this.
> > >
> >
> > Nope. given highly iterative nature and dynamic task allocation in this
> > environment, one is looking to effects similar to Map Reduce. This is not
> > the only reason why I never go to MR anymore, but that's one of main
> ones.
> >
> > How about experiment: why don't you create assembly that copies ALL
> > transitive dependencies in one folder, and then try to broadcast it from
> > single point (front end) to well... let's start with 20 machines. (of
> > course we ideally want to into 10^3 ..10^4 range -- but why bother if we
> > can't do it for 20).
> >
> > Or, heck, let's try to simply parallel-copy it between too machines 20
> > times that are not collocated on the same subnet.
> >
> >
> > > >
> > > >> There may be any number of bugs waiting for the time we try running
> > on a
> > > >> node machine that doesn’t have some class in it’s classpath.
> > > >
> > > >
> > > > No. Assuming any given method is tested on all its execution paths,
> > there
> > > > will be no bugs. The bugs of that sort will only appear if the user
> is
> > > > using algebra directly and calls something that is not on the path,
> > from
> > > > the closure. In which case our answer to this is the same as for the
> > > solver
> > > > methodology developers -- use customized SparkConf while creating
> > context
> > > > to include stuff you really want.
> > > >
> > > > Also another right answer to this is that we probably should
> reasonably
> > > > provide the toolset here. For example, all the stats stuff found in R
> > > base
> > > > and R stat packages so the user is not compelled to go non-native.
> > > >
> > > >
> > >
> > > Huh? this is not true. The one I ran into was found by calling
> something
> > > in math from something in math-scala. It led outside and you can
> > encounter
> > > such things even in algebra.  In fact you have no idea if these
> problems
> > > exists except for the fact you have used it a lot personally.
> > >
> >
> >
> > You ran it with your own code that never existed before.
> >
> > But there's difference between released Mahout code (which is what you
> are
> > working on) and the user code. Released code must run thru remote tests
> as
> > you suggested and thus guarantee there are no such problems with post
> > release code.
> >
> > For users, we only can provide a way for them to load stuff that they
> > decide to use. We don't have apriori knowledge what they will use. It is
> > the same thing that spark does, and the same thing that MR does, doesn't
> > it?
> >
> > Of course mahout should drop rigorously the stuff it doesn't load, from
> the
> > scala scope. No argue about that. In fact that's what i suggested as #1
> > solution. But there's nothing much to do here but to go dependency
> > cleansing for math and spark code. Part of the reason there's so much is
> > because newer modules still bring in everything from mrLegacy.
> >
> > You are right in saying it is hard to guess what else dependencies are in
> > the util/legacy code that are actually used. but that's not a
> justification
> > for brute force "copy them all" approach that virtually guarantees
> ruining
> > one of the foremost legacy issues this work intended to address.
> >
>

Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Until we get this sorted out, I suggest staying on Spark 1.0.1.

There are multiple problems when trying to use anything newer. So far I suspect that Spark and Mahout must be built in a very particular manner, and I haven’t discovered quite what that is yet.

The error below is often caused by running against a version of Spark that Mahout was not built against, which causes the serialization class UIDs not to match. We've heard several reports of problems running the shell examples and the CLI on 1.0.2 and 1.1.0.

I’ll try to put together a bulletproof set of build steps IF I can get it working.

In the meantime, thanks for any stack traces and build-process descriptions. If someone wants to create a JIRA that collects all of these under one ticket, that would be fine.
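
For anyone who wants to confirm the mismatch directly, here is a small diagnostic sketch (not Mahout code; the object name is mine) that prints the serialVersionUID the local spark-core computes for org.apache.spark.rdd.RDD. Run it once with the jars the Mahout shell uses and once with the jars installed on a worker; if the two numbers differ, the two sides are not on the same Spark build, which is exactly what the InvalidClassException below is complaining about.

import java.io.ObjectStreamClass

import org.apache.spark.rdd.RDD

// The "stream classdesc" UID in the exception comes from the sending side,
// the "local class" UID from the receiving side; both JVMs must compute the
// same number for deserialization to succeed.
object RddUidCheck {
  def main(args: Array[String]): Unit = {
    val uid = ObjectStreamClass.lookup(classOf[RDD[_]]).getSerialVersionUID
    println("local org.apache.spark.rdd.RDD serialVersionUID = " + uid)
  }
}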


On Oct 21, 2014, at 7:15 AM, Mahesh Balija <ba...@gmail.com> wrote:

Also if I use any other versions of Spark there are incompatible method
signatures due to which Mahout Spark-shell itself is NOT started.

On Tue, Oct 21, 2014 at 7:42 PM, Mahesh Balija <ba...@gmail.com>
wrote:

> Hi All,
> 
> Here are the errors I get which I run in a pseudo distributed mode,
> 
> Spark 1.0.2 and Mahout latest code (Clone)
> 
> When I run the command in page,
> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
> 
> val drmX = drmData(::, 0 until 4)
> 
> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
> 	at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
> 	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> 	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> 	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> 	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> 	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
> 	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> 	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> 	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> 	at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
> 	at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
> 	at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
> 	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
> 	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> 	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> 	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> 	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:701)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
>        java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>        java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>        java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>        java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>        java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>        java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>        org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>        org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>        java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>        java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>        org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>        java.lang.Thread.run(Thread.java:701)
> Driver stacktrace:
> 	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
> 	at scala.Option.foreach(Option.scala:236)
> 	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> 	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> 	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 
> Best,
> Mahesh Balija.
> 
> 
> 
> 
> 
> 
> 
> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> 
>> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>> 
>>> Is anyone else nervous about ignoring this issue or relying on non-build
>>> (hand run) test driven transitive dependency checking. I hope someone
>> else
>>> will chime in.
>>> 
>>> As to running unit tests on a TEST_MASTER I’ll look into it. Can we set
>> up
>>> the build machine to do this? I’d feel better about eyeballing deps if
>> we
>>> could have a TEST_MASTER automatically run during builds at Apache.
>> Maybe
>>> the regular unit tests are OK for building locally ourselves.
>>> 
>>>> 
>>>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>>> 
>>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <pa...@occamsmachete.com>
>>> wrote:
>>>> 
>>>>> Maybe a more fundamental issue is that we don’t know for sure
>> whether we
>>>>> have missing classes or not. The job.jar at least used the pom
>>> dependencies
>>>>> to guarantee every needed class was present. So the job.jar seems to
>>> solve
>>>>> the problem but may ship some unnecessary duplicate code, right?
>>>>> 
>>>> 
>>>> No, as i wrote spark doesn't  work with job jar format. Neither as it
>>> turns
>>>> out more recent hadoop MR btw.
>>> 
>>> Not speaking literally of the format. Spark understands jars and maven
>> can
>>> build one from transitive dependencies.
>>> 
>>>> 
>>>> Yes, this is A LOT of duplicate code (will take normally MINUTES to
>>> startup
>>>> tasks with all of it just on copy time). This is absolutely not the
>> way
>>> to
>>>> go with this.
>>>> 
>>> 
>>> Lack of guarantee to load seems like a bigger problem than startup time.
>>> Clearly we can’t just ignore this.
>>> 
>> 
>> Nope. given highly iterative nature and dynamic task allocation in this
>> environment, one is looking to effects similar to Map Reduce. This is not
>> the only reason why I never go to MR anymore, but that's one of main ones.
>> 
>> How about experiment: why don't you create assembly that copies ALL
>> transitive dependencies in one folder, and then try to broadcast it from
>> single point (front end) to well... let's start with 20 machines. (of
>> course we ideally want to into 10^3 ..10^4 range -- but why bother if we
>> can't do it for 20).
>> 
>> Or, heck, let's try to simply parallel-copy it between too machines 20
>> times that are not collocated on the same subnet.
>> 
>> 
>>>> 
>>>>> There may be any number of bugs waiting for the time we try running
>> on a
>>>>> node machine that doesn’t have some class in it’s classpath.
>>>> 
>>>> 
>>>> No. Assuming any given method is tested on all its execution paths,
>> there
>>>> will be no bugs. The bugs of that sort will only appear if the user is
>>>> using algebra directly and calls something that is not on the path,
>> from
>>>> the closure. In which case our answer to this is the same as for the
>>> solver
>>>> methodology developers -- use customized SparkConf while creating
>> context
>>>> to include stuff you really want.
>>>> 
>>>> Also another right answer to this is that we probably should
>> reasonably
>>>> provide the toolset here. For example, all the stats stuff found in R
>>> base
>>>> and R stat packages so the user is not compelled to go non-native.
>>>> 
>>>> 
>>> 
>>> Huh? this is not true. The one I ran into was found by calling something
>>> in math from something in math-scala. It led outside and you can
>> encounter
>>> such things even in algebra.  In fact you have no idea if these problems
>>> exists except for the fact you have used it a lot personally.
>>> 
>> 
>> 
>> You ran it with your own code that never existed before.
>> 
>> But there's difference between released Mahout code (which is what you are
>> working on) and the user code. Released code must run thru remote tests as
>> you suggested and thus guarantee there are no such problems with post
>> release code.
>> 
>> For users, we only can provide a way for them to load stuff that they
>> decide to use. We don't have apriori knowledge what they will use. It is
>> the same thing that spark does, and the same thing that MR does, doesn't
>> it?
>> 
>> Of course mahout should drop rigorously the stuff it doesn't load, from
>> the
>> scala scope. No argue about that. In fact that's what i suggested as #1
>> solution. But there's nothing much to do here but to go dependency
>> cleansing for math and spark code. Part of the reason there's so much is
>> because newer modules still bring in everything from mrLegacy.
>> 
>> You are right in saying it is hard to guess what else dependencies are in
>> the util/legacy code that are actually used. but that's not a
>> justification
>> for brute force "copy them all" approach that virtually guarantees ruining
>> one of the foremost legacy issues this work intended to address.
>> 
> 
> 


Re: Upgrade to Spark 1.1.0?

Posted by Mahesh Balija <ba...@gmail.com>.
Also, if I use any other version of Spark, there are incompatible method
signatures, because of which the Mahout Spark shell itself does NOT even start.

On Tue, Oct 21, 2014 at 7:42 PM, Mahesh Balija <ba...@gmail.com>
wrote:

> Hi All,
>
> Here are the errors I get which I run in a pseudo distributed mode,
>
> Spark 1.0.2 and Mahout latest code (Clone)
>
> When I run the command in page,
> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
>
> val drmX = drmData(::, 0 until 4)
>
> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
> 	at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
> 	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> 	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> 	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> 	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> 	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
> 	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> 	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> 	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> 	at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
> 	at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
> 	at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
> 	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
> 	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> 	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> 	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> 	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:701)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
>         java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>         java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>         java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>         java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>         java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>         java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>         java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>         java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>         org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>         org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>         org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>         java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>         java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>         java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>         java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>         org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>         org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         java.lang.Thread.run(Thread.java:701)
> Driver stacktrace:
> 	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
> 	at scala.Option.foreach(Option.scala:236)
> 	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> 	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> 	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> Best,
> Mahesh Balija.
>
>
>
>
>
>
>
> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
>> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>
>> > Is anyone else nervous about ignoring this issue or relying on non-build
>> > (hand run) test driven transitive dependency checking. I hope someone
>> else
>> > will chime in.
>> >
>> > As to running unit tests on a TEST_MASTER I’ll look into it. Can we set
>> up
>> > the build machine to do this? I’d feel better about eyeballing deps if
>> we
>> > could have a TEST_MASTER automatically run during builds at Apache.
>> Maybe
>> > the regular unit tests are OK for building locally ourselves.
>> >
>> > >
>> > > On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> > wrote:
>> > >
>> > > On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <pa...@occamsmachete.com>
>> > wrote:
>> > >
>> > >> Maybe a more fundamental issue is that we don’t know for sure
>> whether we
>> > >> have missing classes or not. The job.jar at least used the pom
>> > dependencies
>> > >> to guarantee every needed class was present. So the job.jar seems to
>> > solve
>> > >> the problem but may ship some unnecessary duplicate code, right?
>> > >>
>> > >
>> > > No, as i wrote spark doesn't  work with job jar format. Neither as it
>> > turns
>> > > out more recent hadoop MR btw.
>> >
>> > Not speaking literally of the format. Spark understands jars and maven
>> can
>> > build one from transitive dependencies.
>> >
>> > >
>> > > Yes, this is A LOT of duplicate code (will take normally MINUTES to
>> > startup
>> > > tasks with all of it just on copy time). This is absolutely not the
>> way
>> > to
>> > > go with this.
>> > >
>> >
>> > Lack of guarantee to load seems like a bigger problem than startup time.
>> > Clearly we can’t just ignore this.
>> >
>>
>> Nope. given highly iterative nature and dynamic task allocation in this
>> environment, one is looking to effects similar to Map Reduce. This is not
>> the only reason why I never go to MR anymore, but that's one of main ones.
>>
>> How about experiment: why don't you create assembly that copies ALL
>> transitive dependencies in one folder, and then try to broadcast it from
>> single point (front end) to well... let's start with 20 machines. (of
>> course we ideally want to into 10^3 ..10^4 range -- but why bother if we
>> can't do it for 20).
>>
>> Or, heck, let's try to simply parallel-copy it between too machines 20
>> times that are not collocated on the same subnet.
>>
>>
>> > >
>> > >> There may be any number of bugs waiting for the time we try running
>> on a
>> > >> node machine that doesn’t have some class in it’s classpath.
>> > >
>> > >
>> > > No. Assuming any given method is tested on all its execution paths,
>> there
>> > > will be no bugs. The bugs of that sort will only appear if the user is
>> > > using algebra directly and calls something that is not on the path,
>> from
>> > > the closure. In which case our answer to this is the same as for the
>> > solver
>> > > methodology developers -- use customized SparkConf while creating
>> context
>> > > to include stuff you really want.
>> > >
>> > > Also another right answer to this is that we probably should
>> reasonably
>> > > provide the toolset here. For example, all the stats stuff found in R
>> > base
>> > > and R stat packages so the user is not compelled to go non-native.
>> > >
>> > >
>> >
>> > Huh? this is not true. The one I ran into was found by calling something
>> > in math from something in math-scala. It led outside and you can
>> encounter
>> > such things even in algebra.  In fact you have no idea if these problems
>> > exists except for the fact you have used it a lot personally.
>> >
>>
>>
>> You ran it with your own code that never existed before.
>>
>> But there's difference between released Mahout code (which is what you are
>> working on) and the user code. Released code must run thru remote tests as
>> you suggested and thus guarantee there are no such problems with post
>> release code.
>>
>> For users, we only can provide a way for them to load stuff that they
>> decide to use. We don't have apriori knowledge what they will use. It is
>> the same thing that spark does, and the same thing that MR does, doesn't
>> it?
>>
>> Of course mahout should drop rigorously the stuff it doesn't load, from
>> the
>> scala scope. No argue about that. In fact that's what i suggested as #1
>> solution. But there's nothing much to do here but to go dependency
>> cleansing for math and spark code. Part of the reason there's so much is
>> because newer modules still bring in everything from mrLegacy.
>>
>> You are right in saying it is hard to guess what else dependencies are in
>> the util/legacy code that are actually used. but that's not a
>> justification
>> for brute force "copy them all" approach that virtually guarantees ruining
>> one of the foremost legacy issues this work intended to address.
>>
>
>

Re: Upgrade to Spark 1.1.0?

Posted by Mahesh Balija <ba...@gmail.com>.
Hi All,

Here are the errors I get when I run in pseudo-distributed mode,

Spark 1.0.2 and the latest Mahout code (a fresh clone).

When I run the command from the page
https://mahout.apache.org/users/sparkbindings/play-with-shell.html

val drmX = drmData(::, 0 until 4)

java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class
incompatible: stream classdesc serialVersionUID = 385418487991259089,
local class serialVersionUID = -6766554341038829528
	at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
	at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
	at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
	at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:701)
14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0.0:0 failed 4 times, most recent failure: Exception failure in
TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException:
org.apache.spark.rdd.RDD; local class incompatible: stream classdesc
serialVersionUID = 385418487991259089, local class serialVersionUID =
-6766554341038829528
        java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
        java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
        java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
        java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
        java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
        java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
        org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
        org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
        java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
        java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
        org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:701)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Best,
Mahesh Balija.







On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> > Is anyone else nervous about ignoring this issue or relying on non-build
> > (hand run) test driven transitive dependency checking. I hope someone
> else
> > will chime in.
> >
> > As to running unit tests on a TEST_MASTER I’ll look into it. Can we set
> up
> > the build machine to do this? I’d feel better about eyeballing deps if we
> > could have a TEST_MASTER automatically run during builds at Apache. Maybe
> > the regular unit tests are OK for building locally ourselves.
> >
> > >
> > > On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> > >
> > > On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <pa...@occamsmachete.com>
> > wrote:
> > >
> > >> Maybe a more fundamental issue is that we don’t know for sure whether
> we
> > >> have missing classes or not. The job.jar at least used the pom
> > dependencies
> > >> to guarantee every needed class was present. So the job.jar seems to
> > solve
> > >> the problem but may ship some unnecessary duplicate code, right?
> > >>
> > >
> > > No, as i wrote spark doesn't  work with job jar format. Neither as it
> > turns
> > > out more recent hadoop MR btw.
> >
> > Not speaking literally of the format. Spark understands jars and maven
> can
> > build one from transitive dependencies.
> >
> > >
> > > Yes, this is A LOT of duplicate code (will take normally MINUTES to
> > startup
> > > tasks with all of it just on copy time). This is absolutely not the way
> > to
> > > go with this.
> > >
> >
> > Lack of guarantee to load seems like a bigger problem than startup time.
> > Clearly we can’t just ignore this.
> >
>
> Nope. given highly iterative nature and dynamic task allocation in this
> environment, one is looking to effects similar to Map Reduce. This is not
> the only reason why I never go to MR anymore, but that's one of main ones.
>
> How about experiment: why don't you create assembly that copies ALL
> transitive dependencies in one folder, and then try to broadcast it from
> single point (front end) to well... let's start with 20 machines. (of
> course we ideally want to into 10^3 ..10^4 range -- but why bother if we
> can't do it for 20).
>
> Or, heck, let's try to simply parallel-copy it between too machines 20
> times that are not collocated on the same subnet.
>
>
> > >
> > >> There may be any number of bugs waiting for the time we try running
> on a
> > >> node machine that doesn’t have some class in it’s classpath.
> > >
> > >
> > > No. Assuming any given method is tested on all its execution paths,
> there
> > > will be no bugs. The bugs of that sort will only appear if the user is
> > > using algebra directly and calls something that is not on the path,
> from
> > > the closure. In which case our answer to this is the same as for the
> > solver
> > > methodology developers -- use customized SparkConf while creating
> context
> > > to include stuff you really want.
> > >
> > > Also another right answer to this is that we probably should reasonably
> > > provide the toolset here. For example, all the stats stuff found in R
> > base
> > > and R stat packages so the user is not compelled to go non-native.
> > >
> > >
> >
> > Huh? this is not true. The one I ran into was found by calling something
> > in math from something in math-scala. It led outside and you can
> encounter
> > such things even in algebra.  In fact you have no idea if these problems
> > exists except for the fact you have used it a lot personally.
> >
>
>
> You ran it with your own code that never existed before.
>
> But there's difference between released Mahout code (which is what you are
> working on) and the user code. Released code must run thru remote tests as
> you suggested and thus guarantee there are no such problems with post
> release code.
>
> For users, we only can provide a way for them to load stuff that they
> decide to use. We don't have apriori knowledge what they will use. It is
> the same thing that spark does, and the same thing that MR does, doesn't
> it?
>
> Of course mahout should drop rigorously the stuff it doesn't load, from the
> scala scope. No argue about that. In fact that's what i suggested as #1
> solution. But there's nothing much to do here but to go dependency
> cleansing for math and spark code. Part of the reason there's so much is
> because newer modules still bring in everything from mrLegacy.
>
> You are right in saying it is hard to guess what else dependencies are in
> the util/legacy code that are actually used. but that's not a justification
> for brute force "copy them all" approach that virtually guarantees ruining
> one of the foremost legacy issues this work intended to address.
>

Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Is anyone else nervous about ignoring this issue or relying on non-build
> (hand run) test driven transitive dependency checking. I hope someone else
> will chime in.
>
> As to running unit tests on a TEST_MASTER I’ll look into it. Can we set up
> the build machine to do this? I’d feel better about eyeballing deps if we
> could have a TEST_MASTER automatically run during builds at Apache. Maybe
> the regular unit tests are OK for building locally ourselves.
>
> >
> > On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >
> > On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> >
> >> Maybe a more fundamental issue is that we don’t know for sure whether we
> >> have missing classes or not. The job.jar at least used the pom
> dependencies
> >> to guarantee every needed class was present. So the job.jar seems to
> solve
> >> the problem but may ship some unnecessary duplicate code, right?
> >>
> >
> > No, as i wrote spark doesn't  work with job jar format. Neither as it
> turns
> > out more recent hadoop MR btw.
>
> Not speaking literally of the format. Spark understands jars and maven can
> build one from transitive dependencies.
>
> >
> > Yes, this is A LOT of duplicate code (will take normally MINUTES to
> startup
> > tasks with all of it just on copy time). This is absolutely not the way
> to
> > go with this.
> >
>
> Lack of guarantee to load seems like a bigger problem than startup time.
> Clearly we can’t just ignore this.
>

Nope. Given the highly iterative nature and dynamic task allocation in this
environment, one is looking at effects similar to MapReduce. This is not
the only reason why I never go to MR anymore, but that's one of the main ones.

How about an experiment: why don't you create an assembly that copies ALL
transitive dependencies into one folder, and then try to broadcast it from a
single point (the front end) to, well... let's start with 20 machines. (Of
course we ideally want to get into the 10^3..10^4 range -- but why bother if
we can't do it for 20.)

Or, heck, let's try to simply parallel-copy it 20 times between two machines
that are not collocated on the same subnet.
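
(For reference, one quick way to materialize such a folder of all transitive
dependencies -- assuming the stock maven-dependency-plugin, purely as an
illustration -- is

mvn -pl spark dependency:copy-dependencies

which, by default, drops the module's transitive dependency jars under
spark/target/dependency; then time shipping that folder around.)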


> >
> >> There may be any number of bugs waiting for the time we try running on a
> >> node machine that doesn’t have some class in it’s classpath.
> >
> >
> > No. Assuming any given method is tested on all its execution paths, there
> > will be no bugs. The bugs of that sort will only appear if the user is
> > using algebra directly and calls something that is not on the path, from
> > the closure. In which case our answer to this is the same as for the
> solver
> > methodology developers -- use customized SparkConf while creating context
> > to include stuff you really want.
> >
> > Also another right answer to this is that we probably should reasonably
> > provide the toolset here. For example, all the stats stuff found in R
> base
> > and R stat packages so the user is not compelled to go non-native.
> >
> >
>
> Huh? this is not true. The one I ran into was found by calling something
> in math from something in math-scala. It led outside and you can encounter
> such things even in algebra.  In fact you have no idea if these problems
> exists except for the fact you have used it a lot personally.
>


You ran it with your own code that never existed before.

But there's a difference between released Mahout code (which is what you are
working on) and user code. Released code must run through remote tests, as
you suggested, and thus guarantee there are no such problems with
post-release code.

For users, we can only provide a way for them to load stuff that they
decide to use. We don't have a priori knowledge of what they will use. It is
the same thing that Spark does, and the same thing that MR does, isn't it?

Of course Mahout should rigorously drop the stuff it doesn't load from the
Scala scope. No argument about that. In fact, that's what I suggested as the
#1 solution. But there's nothing much to do here but to go dependency
cleansing for the math and spark code. Part of the reason there's so much is
because newer modules still bring in everything from mrLegacy.

You are right in saying it is hard to guess what other dependencies in the
util/legacy code are actually used, but that's not a justification for a
brute-force "copy them all" approach that virtually guarantees ruining one
of the foremost legacy issues this work intended to address.

Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Is anyone else nervous about ignoring this issue, or about relying on non-build (hand-run), test-driven transitive dependency checking? I hope someone else will chime in. 

As to running unit tests on a TEST_MASTER, I’ll look into it. Can we set up the build machine to do this? I’d feel better about eyeballing deps if we could have a TEST_MASTER run automatically during builds at Apache. Maybe the regular unit tests are OK for building locally ourselves.

> 
> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
>> Maybe a more fundamental issue is that we don’t know for sure whether we
>> have missing classes or not. The job.jar at least used the pom dependencies
>> to guarantee every needed class was present. So the job.jar seems to solve
>> the problem but may ship some unnecessary duplicate code, right?
>> 
> 
> No, as i wrote spark doesn't  work with job jar format. Neither as it turns
> out more recent hadoop MR btw.

Not speaking literally of the format. Spark understands jars and maven can build one from transitive dependencies.

> 
> Yes, this is A LOT of duplicate code (will take normally MINUTES to startup
> tasks with all of it just on copy time). This is absolutely not the way to
> go with this.
> 

Lack of guarantee to load seems like a bigger problem than startup time. Clearly we can’t just ignore this.

> 
>> There may be any number of bugs waiting for the time we try running on a
>> node machine that doesn’t have some class in it’s classpath.
> 
> 
> No. Assuming any given method is tested on all its execution paths, there
> will be no bugs. The bugs of that sort will only appear if the user is
> using algebra directly and calls something that is not on the path, from
> the closure. In which case our answer to this is the same as for the solver
> methodology developers -- use customized SparkConf while creating context
> to include stuff you really want.
> 
> Also another right answer to this is that we probably should reasonably
> provide the toolset here. For example, all the stats stuff found in R base
> and R stat packages so the user is not compelled to go non-native.
> 
> 

Huh? This is not true. The one I ran into was found by calling something in math from something in math-scala. It led outside, and you can encounter such things even in algebra. In fact you have no idea whether these problems exist except for the fact that you have used it a lot personally. 

Tests catch these only if they are _not_ local unit tests but run on a TEST_MASTER, and we would need very high test coverage. That makes it test-dependent rather than linker/statically checked; isn’t that why we love type safety? 

Actually I personally have no problem with non-type-safe, test-driven guarantees, just so we’re all clear about what this implies. JavaScript seems to thrive after all...

> 
> 
>> This is exactly what happened with RandomGenerator when it was dropped
>> from Spark. If I hadn’t run a test by hand on the cluster it would never
>> have shown up in the unit tests. I suspect that this may have led to other
>> odd error reports.
>> 
>> Would a script to run all unit tests on a cluster help find out whether we
>> have missing classes or not? As I understand it without a job.jar we can’t
>> really be sure.
>> 
> 
> this is probably a good idea, indeed. in fact, i may have introduced some
> of those when i transitioned stochastic stuff to mahout random utils
> without retesting it in distributed setting.
> 
> But i would think all one'd need is some little mod to standard spark-based
> test trait that creates the context in order to check out for something
> like TEST_MASTER in the environment and use the $TEST_MASTER master instead
> of local if one is found. once that tweak is done, one can easily rerun
> unit tests simply by giving
> 
> TEST_MASTER=spark://localhost:7077 mvn test
> 
> (similarly the way master is overriden for shell -- but of course we don't
> want tests to react to global MASTER variable just in case it is defined,
> so we need aptly named but different one).
> 

Yeah should be simple. I’ll look into this.


Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Maybe a more fundamental issue is that we don’t know for sure whether we
> have missing classes or not. The job.jar at least used the pom dependencies
> to guarantee every needed class was present. So the job.jar seems to solve
> the problem but may ship some unnecessary duplicate code, right?
>

No, as I wrote, Spark doesn't work with the job jar format. Neither, as it
turns out, does more recent Hadoop MR, btw.

Yes, this is A LOT of duplicate code (it will normally take MINUTES to start
up tasks with all of it, just on copy time). This is absolutely not the way
to go with this.



> There may be any number of bugs waiting for the time we try running on a
> node machine that doesn’t have some class in it’s classpath.


No. Assuming any given method is tested on all its execution paths, there
will be no bugs. Bugs of that sort will only appear if the user is using
algebra directly and calls, from the closure, something that is not on the
path. In that case our answer is the same as for the solver methodology
developers -- use a customized SparkConf while creating the context to
include the stuff you really want.

Another right answer to this is that we probably should provide a reasonable
toolset here: for example, all the stats stuff found in the R base and R
stats packages, so the user is not compelled to go non-native.




> This is exactly what happened with RandomGenerator when it was dropped
> from Spark. If I hadn’t run a test by hand on the cluster it would never
> have shown up in the unit tests. I suspect that this may have led to other
> odd error reports.
>
> Would a script to run all unit tests on a cluster help find out whether we
> have missing classes or not? As I understand it without a job.jar we can’t
> really be sure.
>

This is probably a good idea, indeed. In fact, I may have introduced some of
those when I transitioned the stochastic stuff to Mahout random utils
without retesting it in a distributed setting.

But I would think all one would need is a small mod to the standard
Spark-based test trait that creates the context, so that it checks for
something like TEST_MASTER in the environment and uses the $TEST_MASTER
master instead of local if one is found. Once that tweak is done, one can
easily rerun the unit tests simply by giving

TEST_MASTER=spark://localhost:7077 mvn test

(similarly to the way the master is overridden for the shell -- but of
course we don't want tests to react to the global MASTER variable just in
case it is defined, so we need an aptly named but different one).
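
A minimal sketch of what that trait tweak might look like (the trait and
method names here are hypothetical, assuming a ScalaTest-style suite that
builds the context per test):

  trait DistributedMahoutSuite {
    // Hypothetical helper: use the master given in TEST_MASTER, or fall
    // back to local mode so a plain `mvn test` keeps working unchanged.
    protected def testSparkMaster: String =
      sys.env.getOrElse("TEST_MASTER", "local[2]")

    // The suite's setup would then pass testSparkMaster (instead of a
    // hardcoded "local") when it creates the Mahout/Spark context.
  }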


>
> On Oct 20, 2014, at 11:16 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> PS
>
> all jar-finding routines rely on MAHOUT_HOME variable to find jars. so if
> you add some logic to add custom mahout jar to context, it should rely on
> it too.
>
> Perhaps the solution could be along the following lines.
>
> findMahoutJars() finds minimally required set of jars to run. Perhaps we
> can add all Mahout transitive dependencies (bar stuff like hadoop and hbase
> which already present in Spark) to some folder in mahout tree, say
> $MAHOUT_HOME/libManaged (similar to SBT).
>
> Knowing that, we perhaps can add a helper, findMahoutDependencyJars(),
> which will accept one or more artifact name for finding jars from
> $MAHOUT_HOME/libManged, similarly to how findMahoutJars() do it.
>
> findMahoutDependencyJars() should assert that it found all jars requested.
>
> Then driver code could use that helper to create additoinal jars in
> SparkConf before requesting Spark context.
>
> So for example, in your case driver should say
>
> findMahoutDependencyJars( "commons-math" :: Nil )
>
>
> and then add the result to SparkConf.
>
>
>
> On Mon, Oct 20, 2014 at 11:05 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> >
> >
> > On Mon, Oct 20, 2014 at 10:49 AM, Pat Ferrel <pa...@occamsmachete.com>
> > wrote:
> >
> >> I agree it’s just that different classes, required by mahout are missing
> >> from the environment depending on what happens to be in Spark. These
> deps
> >> should be supplied in the job.jar assemblies, right?
> >>
> >
> > No. They should be physically available as jars, somewhere. E.g. in
> > compiled mahout tree.
> >
> > the "job.xml" assembly in the "spark" module is but a left over from an
> > experiment i ran on job jars with Spark long ago. It's just hanging
> around
> > there but not actually being built. Sorry for confusion. DRM doesn't use
> > job jars. As far as I have established, Spark does not understand job
> jars
> > (it's purely a Hadoop notion -- but even there it has been unsupported or
> > depricated for a long time now).
> >
> > So. we can e.g. create a new assembly for spark, such as "optional
> > dependencies" jars, and put it somewhere into the compiled tree. (I guess
> > similar to "managed libraries" notion in SBT.).
> >
> > Then, if you need any of those, your driver code needs to do the
> > following. The mahoutSparkContext() method accepts optional SparkConf
> > parameter. Additional jars could be added to SparkConf before passing on
> to
> > mahoutSparkContext. If you don't supply SparkConf, the method will create
> > default one. If you do, it will merge all mahout specific settings and
> > standard jars to the context information you supply.
> >
> > As far as i see, by default context includes only math, math-scala, spark
> > and mrlegacy jars. No third party jars. (line 212 in sparkbindings
> > package). The test that checks that is in SparkBindingsSuite.scala. (yes
> > you are correct, the one you mentioned.)
> >
> >
> >
> >
> >
> >
> >>
> >> Trying out the
> >>  test("context jars") {
> >>  }
> >>
> >> findMahoutContextJars(closeables) gets the .jars, and seems to
> explicitly
> >> filter out the job.jars. The job.jars include needed dependencies so
> for a
> >> clustered environment shouldn’t these be the only ones used?
> >>
> >>
> >> On Oct 20, 2014, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >>
> >> either way i don't believe there's something specific to 1.0.1, 1.0.2 or
> >> 1.1.0 that is causing/not causing classpath errors. it's just jars are
> >> picked by explicitly hardcoded artifact "opt-in" policy, not the other
> way
> >> around.
> >>
> >> It is not enough just to modify pom in order for something to appear in
> >> task classpath.
> >>
> >> On Mon, Oct 20, 2014 at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com>
> >> wrote:
> >>
> >>> Note that classpaths for "cluster" environment is tested trivially by
> >>> starting 1-2 workers and standalone spark manager processes locally. No
> >>> need to build anything "real". Workers would not know anything about
> >> mahout
> >>> so unless proper jars are exposed in context, they would have no way of
> >>> "faking" the access to classes.
> >>>
> >>> On Mon, Oct 20, 2014 at 9:28 AM, Pat Ferrel <pa...@occamsmachete.com>
> >> wrote:
> >>>
> >>>> Yes, asap.
> >>>>
> >>>> To test this right it has to run on a cluster so I’m upgrading. When
> >>>> ready it will just be a “mvn clean install" if you already have Spark
> >> 1.1.0
> >>>> running.
> >>>>
> >>>> I would have only expected errors on the CLI drivers so if anyone else
> >>>> sees runtime errors please let us know. Some errors are very hard to
> >> unit
> >>>> test since the environment is different for local(unit tests) and
> >> cluster
> >>>> execution.
> >>>>
> >>>>
> >>>> On Oct 20, 2014, at 9:14 AM, Mahesh Balija <
> balijamahesh.mca@gmail.com
> >>>
> >>>> wrote:
> >>>>
> >>>> Hi Pat,
> >>>>
> >>>> Can you please give detailed steps to build Mahout against Spark
> 1.1.0.
> >>>> I build against 1.1.0 but still had class not found errors, thats why
> I
> >>>> reverted back to Spark 1.0.2 even though first few steps are
> successful
> >>>> but still facing some issues in running Mahout spark-shell sample
> >> commands
> >>>> (drmData) throws some errors even on 1.0.2.
> >>>>
> >>>> Best,
> >>>> Mahesh.B.
> >>>>
> >>>> On Mon, Oct 20, 2014 at 1:46 AM, peng <pc...@uowmail.edu.au> wrote:
> >>>>
> >>>>> From my experience 1.1.0 is quite stable, plus some performance
> >>>>> improvements that totally worth the effort.
> >>>>>
> >>>>>
> >>>>> On 10/19/2014 06:30 PM, Ted Dunning wrote:
> >>>>>
> >>>>>> On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel <pa...@occamsmachete.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Getting off the dubious Spark 1.0.1 version is turning out to be a
> >> bit
> >>>> of
> >>>>>>> work. Does anyone object to upgrading our Spark dependency? I’m not
> >>>> sure
> >>>>>>> if
> >>>>>>> Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean
> >>>> upgrading
> >>>>>>> your Spark cluster.
> >>>>>>>
> >>>>>>
> >>>>>> It is going to have to happen sooner or later.
> >>>>>>
> >>>>>> Sooner may actually be less total pain.
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
>
>

Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
BTW I converted to scala.util.Random, so we don’t even know if that class is missing now
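
Purely as an illustration of the kind of change involved (not the actual diff; the values below are made up):

  import scala.util.Random

  // scala.util.Random instead of the commons-math RandomGenerator, so no
  // commons-math class has to be shipped to the workers; a fixed seed keeps
  // the sampling deterministic for tests.
  val rng = new Random(1234L)
  val noise = Array.fill(10)(rng.nextGaussian())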

On Oct 20, 2014, at 11:44 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

Maybe a more fundamental issue is that we don’t know for sure whether we have missing classes or not. The job.jar at least used the pom dependencies to guarantee every needed class was present. So the job.jar seems to solve the problem but may ship some unnecessary duplicate code, right? 

There may be any number of bugs waiting for the time we try running on a node machine that doesn’t have some class in it’s classpath. This is exactly what happened with RandomGenerator when it was dropped from Spark. If I hadn’t run a test by hand on the cluster it would never have shown up in the unit tests. I suspect that this may have led to other odd error reports.

Would a script to run all unit tests on a cluster help find out whether we have missing classes or not? As I understand it without a job.jar we can’t really be sure.

On Oct 20, 2014, at 11:16 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

PS

all jar-finding routines rely on MAHOUT_HOME variable to find jars. so if
you add some logic to add custom mahout jar to context, it should rely on
it too.

Perhaps the solution could be along the following lines.

findMahoutJars() finds minimally required set of jars to run. Perhaps we
can add all Mahout transitive dependencies (bar stuff like hadoop and hbase
which already present in Spark) to some folder in mahout tree, say
$MAHOUT_HOME/libManaged (similar to SBT).

Knowing that, we perhaps can add a helper, findMahoutDependencyJars(),
which will accept one or more artifact name for finding jars from
$MAHOUT_HOME/libManged, similarly to how findMahoutJars() do it.

findMahoutDependencyJars() should assert that it found all jars requested.

Then driver code could use that helper to create additoinal jars in
SparkConf before requesting Spark context.

So for example, in your case driver should say

findMahoutDependencyJars( "commons-math" :: Nil )


and then add the result to SparkConf.



On Mon, Oct 20, 2014 at 11:05 AM, Dmitriy Lyubimov <dl...@gmail.com>
wrote:

> 
> 
> On Mon, Oct 20, 2014 at 10:49 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> 
>> I agree it’s just that different classes, required by mahout are missing
>> from the environment depending on what happens to be in Spark. These deps
>> should be supplied in the job.jar assemblies, right?
>> 
> 
> No. They should be physically available as jars, somewhere. E.g. in
> compiled mahout tree.
> 
> the "job.xml" assembly in the "spark" module is but a left over from an
> experiment i ran on job jars with Spark long ago. It's just hanging around
> there but not actually being built. Sorry for confusion. DRM doesn't use
> job jars. As far as I have established, Spark does not understand job jars
> (it's purely a Hadoop notion -- but even there it has been unsupported or
> depricated for a long time now).
> 
> So. we can e.g. create a new assembly for spark, such as "optional
> dependencies" jars, and put it somewhere into the compiled tree. (I guess
> similar to "managed libraries" notion in SBT.).
> 
> Then, if you need any of those, your driver code needs to do the
> following. The mahoutSparkContext() method accepts optional SparkConf
> parameter. Additional jars could be added to SparkConf before passing on to
> mahoutSparkContext. If you don't supply SparkConf, the method will create
> default one. If you do, it will merge all mahout specific settings and
> standard jars to the context information you supply.
> 
> As far as i see, by default context includes only math, math-scala, spark
> and mrlegacy jars. No third party jars. (line 212 in sparkbindings
> package). The test that checks that is in SparkBindingsSuite.scala. (yes
> you are correct, the one you mentioned.)
> 
> 
> 
> 
> 
> 
>> 
>> Trying out the
>> test("context jars") {
>> }
>> 
>> findMahoutContextJars(closeables) gets the .jars, and seems to explicitly
>> filter out the job.jars. The job.jars include needed dependencies so for a
>> clustered environment shouldn’t these be the only ones used?
>> 
>> 
>> On Oct 20, 2014, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> 
>> either way i don't believe there's something specific to 1.0.1, 1.0.2 or
>> 1.1.0 that is causing/not causing classpath errors. it's just jars are
>> picked by explicitly hardcoded artifact "opt-in" policy, not the other way
>> around.
>> 
>> It is not enough just to modify pom in order for something to appear in
>> task classpath.
>> 
>> On Mon, Oct 20, 2014 at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> 
>>> Note that classpaths for "cluster" environment is tested trivially by
>>> starting 1-2 workers and standalone spark manager processes locally. No
>>> need to build anything "real". Workers would not know anything about
>> mahout
>>> so unless proper jars are exposed in context, they would have no way of
>>> "faking" the access to classes.
>>> 
>>> On Mon, Oct 20, 2014 at 9:28 AM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>> 
>>>> Yes, asap.
>>>> 
>>>> To test this right it has to run on a cluster so I’m upgrading. When
>>>> ready it will just be a “mvn clean install" if you already have Spark
>> 1.1.0
>>>> running.
>>>> 
>>>> I would have only expected errors on the CLI drivers so if anyone else
>>>> sees runtime errors please let us know. Some errors are very hard to
>> unit
>>>> test since the environment is different for local(unit tests) and
>> cluster
>>>> execution.
>>>> 
>>>> 
>>>> On Oct 20, 2014, at 9:14 AM, Mahesh Balija <balijamahesh.mca@gmail.com
>>> 
>>>> wrote:
>>>> 
>>>> Hi Pat,
>>>> 
>>>> Can you please give detailed steps to build Mahout against Spark 1.1.0.
>>>> I build against 1.1.0 but still had class not found errors, thats why I
>>>> reverted back to Spark 1.0.2 even though first few steps are successful
>>>> but still facing some issues in running Mahout spark-shell sample
>> commands
>>>> (drmData) throws some errors even on 1.0.2.
>>>> 
>>>> Best,
>>>> Mahesh.B.
>>>> 
>>>> On Mon, Oct 20, 2014 at 1:46 AM, peng <pc...@uowmail.edu.au> wrote:
>>>> 
>>>>> From my experience 1.1.0 is quite stable, plus some performance
>>>>> improvements that totally worth the effort.
>>>>> 
>>>>> 
>>>>> On 10/19/2014 06:30 PM, Ted Dunning wrote:
>>>>> 
>>>>>> On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel <pa...@occamsmachete.com>
>>>>>> wrote:
>>>>>> 
>>>>>> Getting off the dubious Spark 1.0.1 version is turning out to be a
>> bit
>>>> of
>>>>>>> work. Does anyone object to upgrading our Spark dependency? I’m not
>>>> sure
>>>>>>> if
>>>>>>> Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean
>>>> upgrading
>>>>>>> your Spark cluster.
>>>>>>> 
>>>>>> 
>>>>>> It is going to have to happen sooner or later.
>>>>>> 
>>>>>> Sooner may actually be less total pain.
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 



Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Maybe a more fundamental issue is that we don’t know for sure whether we have missing classes or not. The job.jar at least used the pom dependencies to guarantee every needed class was present. So the job.jar seems to solve the problem but may ship some unnecessary duplicate code, right? 

There may be any number of bugs waiting for the time we try running on a node machine that doesn’t have some class in its classpath. This is exactly what happened with RandomGenerator when it was dropped from Spark. If I hadn’t run a test by hand on the cluster it would never have shown up in the unit tests. I suspect that this may have led to other odd error reports.

Would a script to run all unit tests on a cluster help find out whether we have missing classes or not? As I understand it without a job.jar we can’t really be sure.

On Oct 20, 2014, at 11:16 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

PS

all jar-finding routines rely on MAHOUT_HOME variable to find jars. so if
you add some logic to add custom mahout jar to context, it should rely on
it too.

Perhaps the solution could be along the following lines.

findMahoutJars() finds minimally required set of jars to run. Perhaps we
can add all Mahout transitive dependencies (bar stuff like hadoop and hbase
which already present in Spark) to some folder in mahout tree, say
$MAHOUT_HOME/libManaged (similar to SBT).

Knowing that, we perhaps can add a helper, findMahoutDependencyJars(),
which will accept one or more artifact name for finding jars from
$MAHOUT_HOME/libManged, similarly to how findMahoutJars() do it.

findMahoutDependencyJars() should assert that it found all jars requested.

Then driver code could use that helper to create additoinal jars in
SparkConf before requesting Spark context.

So for example, in your case driver should say

findMahoutDependencyJars( "commons-math" :: Nil )


and then add the result to SparkConf.



On Mon, Oct 20, 2014 at 11:05 AM, Dmitriy Lyubimov <dl...@gmail.com>
wrote:

> 
> 
> On Mon, Oct 20, 2014 at 10:49 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> 
>> I agree it’s just that different classes, required by mahout are missing
>> from the environment depending on what happens to be in Spark. These deps
>> should be supplied in the job.jar assemblies, right?
>> 
> 
> No. They should be physically available as jars, somewhere. E.g. in
> compiled mahout tree.
> 
> the "job.xml" assembly in the "spark" module is but a left over from an
> experiment i ran on job jars with Spark long ago. It's just hanging around
> there but not actually being built. Sorry for confusion. DRM doesn't use
> job jars. As far as I have established, Spark does not understand job jars
> (it's purely a Hadoop notion -- but even there it has been unsupported or
> depricated for a long time now).
> 
> So. we can e.g. create a new assembly for spark, such as "optional
> dependencies" jars, and put it somewhere into the compiled tree. (I guess
> similar to "managed libraries" notion in SBT.).
> 
> Then, if you need any of those, your driver code needs to do the
> following. The mahoutSparkContext() method accepts optional SparkConf
> parameter. Additional jars could be added to SparkConf before passing on to
> mahoutSparkContext. If you don't supply SparkConf, the method will create
> default one. If you do, it will merge all mahout specific settings and
> standard jars to the context information you supply.
> 
> As far as i see, by default context includes only math, math-scala, spark
> and mrlegacy jars. No third party jars. (line 212 in sparkbindings
> package). The test that checks that is in SparkBindingsSuite.scala. (yes
> you are correct, the one you mentioned.)
> 
> 
> 
> 
> 
> 
>> 
>> Trying out the
>>  test("context jars") {
>>  }
>> 
>> findMahoutContextJars(closeables) gets the .jars, and seems to explicitly
>> filter out the job.jars. The job.jars include needed dependencies so for a
>> clustered environment shouldn’t these be the only ones used?
>> 
>> 
>> On Oct 20, 2014, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> 
>> either way i don't believe there's something specific to 1.0.1, 1.0.2 or
>> 1.1.0 that is causing/not causing classpath errors. it's just jars are
>> picked by explicitly hardcoded artifact "opt-in" policy, not the other way
>> around.
>> 
>> It is not enough just to modify pom in order for something to appear in
>> task classpath.
>> 
>> On Mon, Oct 20, 2014 at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> 
>>> Note that classpaths for "cluster" environment is tested trivially by
>>> starting 1-2 workers and standalone spark manager processes locally. No
>>> need to build anything "real". Workers would not know anything about
>> mahout
>>> so unless proper jars are exposed in context, they would have no way of
>>> "faking" the access to classes.
>>> 
>>> On Mon, Oct 20, 2014 at 9:28 AM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>> 
>>>> Yes, asap.
>>>> 
>>>> To test this right it has to run on a cluster so I’m upgrading. When
>>>> ready it will just be a “mvn clean install" if you already have Spark
>> 1.1.0
>>>> running.
>>>> 
>>>> I would have only expected errors on the CLI drivers so if anyone else
>>>> sees runtime errors please let us know. Some errors are very hard to
>> unit
>>>> test since the environment is different for local(unit tests) and
>> cluster
>>>> execution.
>>>> 
>>>> 
>>>> On Oct 20, 2014, at 9:14 AM, Mahesh Balija <balijamahesh.mca@gmail.com
>>> 
>>>> wrote:
>>>> 
>>>> Hi Pat,
>>>> 
>>>> Can you please give detailed steps to build Mahout against Spark 1.1.0.
>>>> I build against 1.1.0 but still had class not found errors, thats why I
>>>> reverted back to Spark 1.0.2 even though first few steps are successful
>>>> but still facing some issues in running Mahout spark-shell sample
>> commands
>>>> (drmData) throws some errors even on 1.0.2.
>>>> 
>>>> Best,
>>>> Mahesh.B.
>>>> 
>>>> On Mon, Oct 20, 2014 at 1:46 AM, peng <pc...@uowmail.edu.au> wrote:
>>>> 
>>>>> From my experience 1.1.0 is quite stable, plus some performance
>>>>> improvements that totally worth the effort.
>>>>> 
>>>>> 
>>>>> On 10/19/2014 06:30 PM, Ted Dunning wrote:
>>>>> 
>>>>>> On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel <pa...@occamsmachete.com>
>>>>>> wrote:
>>>>>> 
>>>>>> Getting off the dubious Spark 1.0.1 version is turning out to be a
>> bit
>>>> of
>>>>>>> work. Does anyone object to upgrading our Spark dependency? I’m not
>>>> sure
>>>>>>> if
>>>>>>> Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean
>>>> upgrading
>>>>>>> your Spark cluster.
>>>>>>> 
>>>>>> 
>>>>>> It is going to have to happen sooner or later.
>>>>>> 
>>>>>> Sooner may actually be less total pain.
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 


Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS

All jar-finding routines rely on the MAHOUT_HOME variable to find jars, so
if you add some logic to add a custom Mahout jar to the context, it should
rely on it too.

Perhaps the solution could be along the following lines.

findMahoutJars() finds the minimally required set of jars to run. Perhaps we
can add all Mahout transitive dependencies (bar stuff like Hadoop and HBase,
which are already present in Spark) to some folder in the Mahout tree, say
$MAHOUT_HOME/libManaged (similar to SBT).

Knowing that, we perhaps can add a helper, findMahoutDependencyJars(),
which will accept one or more artifact names for finding jars from
$MAHOUT_HOME/libManaged, similarly to how findMahoutJars() does it.

findMahoutDependencyJars() should assert that it found all jars requested.

Then driver code could use that helper to add the additional jars to
SparkConf before requesting the Spark context.

So, for example, in your case the driver should say

findMahoutDependencyJars( "commons-math" :: Nil )


and then add the result to SparkConf.
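
A rough sketch of what such a helper could look like (this method does not
exist yet; the folder layout and the fail-fast behaviour are just the
assumptions described above):

  def findMahoutDependencyJars(artifactNames: Seq[String]): Seq[String] = {
    // Resolve the proposed $MAHOUT_HOME/libManaged folder.
    val mahoutHome = sys.env.getOrElse("MAHOUT_HOME", sys.error("MAHOUT_HOME is not set"))
    val libManaged = new java.io.File(mahoutHome, "libManaged")
    val jars = Option(libManaged.listFiles()).getOrElse(Array.empty[java.io.File])
      .filter(_.getName.endsWith(".jar"))

    // Assert that every requested artifact maps to a jar found there.
    artifactNames.map { name =>
      jars.find(_.getName.startsWith(name))
        .getOrElse(sys.error(s"No jar for artifact '$name' found in $libManaged"))
        .getAbsolutePath
    }
  }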



On Mon, Oct 20, 2014 at 11:05 AM, Dmitriy Lyubimov <dl...@gmail.com>
wrote:

>
>
> On Mon, Oct 20, 2014 at 10:49 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>
>> I agree it’s just that different classes, required by mahout are missing
>> from the environment depending on what happens to be in Spark. These deps
>> should be supplied in the job.jar assemblies, right?
>>
>
> No. They should be physically available as jars, somewhere. E.g. in
> compiled mahout tree.
>
> the "job.xml" assembly in the "spark" module is but a left over from an
> experiment i ran on job jars with Spark long ago. It's just hanging around
> there but not actually being built. Sorry for confusion. DRM doesn't use
> job jars. As far as I have established, Spark does not understand job jars
> (it's purely a Hadoop notion -- but even there it has been unsupported or
> depricated for a long time now).
>
> So. we can e.g. create a new assembly for spark, such as "optional
> dependencies" jars, and put it somewhere into the compiled tree. (I guess
> similar to "managed libraries" notion in SBT.).
>
> Then, if you need any of those, your driver code needs to do the
> following. The mahoutSparkContext() method accepts optional SparkConf
> parameter. Additional jars could be added to SparkConf before passing on to
> mahoutSparkContext. If you don't supply SparkConf, the method will create
> default one. If you do, it will merge all mahout specific settings and
> standard jars to the context information you supply.
>
> As far as i see, by default context includes only math, math-scala, spark
> and mrlegacy jars. No third party jars. (line 212 in sparkbindings
> package). The test that checks that is in SparkBindingsSuite.scala. (yes
> you are correct, the one you mentioned.)
>
>
>
>
>
>
>>
>> Trying out the
>>   test("context jars") {
>>   }
>>
>> findMahoutContextJars(closeables) gets the .jars, and seems to explicitly
>> filter out the job.jars. The job.jars include needed dependencies so for a
>> clustered environment shouldn’t these be the only ones used?
>>
>>
>> On Oct 20, 2014, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> either way i don't believe there's something specific to 1.0.1, 1.0.2 or
>> 1.1.0 that is causing/not causing classpath errors. it's just jars are
>> picked by explicitly hardcoded artifact "opt-in" policy, not the other way
>> around.
>>
>> It is not enough just to modify pom in order for something to appear in
>> task classpath.
>>
>> On Mon, Oct 20, 2014 at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>>
>> > Note that classpaths for "cluster" environment is tested trivially by
>> > starting 1-2 workers and standalone spark manager processes locally. No
>> > need to build anything "real". Workers would not know anything about
>> mahout
>> > so unless proper jars are exposed in context, they would have no way of
>> > "faking" the access to classes.
>> >
>> > On Mon, Oct 20, 2014 at 9:28 AM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>> >
>> >> Yes, asap.
>> >>
>> >> To test this right it has to run on a cluster so I’m upgrading. When
>> >> ready it will just be a “mvn clean install" if you already have Spark
>> 1.1.0
>> >> running.
>> >>
>> >> I would have only expected errors on the CLI drivers so if anyone else
>> >> sees runtime errors please let us know. Some errors are very hard to
>> unit
>> >> test since the environment is different for local(unit tests) and
>> cluster
>> >> execution.
>> >>
>> >>
>> >> On Oct 20, 2014, at 9:14 AM, Mahesh Balija <balijamahesh.mca@gmail.com
>> >
>> >> wrote:
>> >>
>> >> Hi Pat,
>> >>
>> >> Can you please give detailed steps to build Mahout against Spark 1.1.0.
>> >> I build against 1.1.0 but still had class not found errors, thats why I
>> >> reverted back to Spark 1.0.2 even though first few steps are successful
>> >> but still facing some issues in running Mahout spark-shell sample
>> commands
>> >> (drmData) throws some errors even on 1.0.2.
>> >>
>> >> Best,
>> >> Mahesh.B.
>> >>
>> >> On Mon, Oct 20, 2014 at 1:46 AM, peng <pc...@uowmail.edu.au> wrote:
>> >>
>> >>> From my experience 1.1.0 is quite stable, plus some performance
>> >>> improvements that totally worth the effort.
>> >>>
>> >>>
>> >>> On 10/19/2014 06:30 PM, Ted Dunning wrote:
>> >>>
>> >>>> On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel <pa...@occamsmachete.com>
>> >>>> wrote:
>> >>>>
>> >>>> Getting off the dubious Spark 1.0.1 version is turning out to be a
>> bit
>> >> of
>> >>>>> work. Does anyone object to upgrading our Spark dependency? I’m not
>> >> sure
>> >>>>> if
>> >>>>> Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean
>> >> upgrading
>> >>>>> your Spark cluster.
>> >>>>>
>> >>>>
>> >>>> It is going to have to happen sooner or later.
>> >>>>
>> >>>> Sooner may actually be less total pain.
>> >>>>
>> >>>>
>> >>>
>> >>
>> >>
>> >
>>
>>
>

Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mon, Oct 20, 2014 at 10:49 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> I agree it’s just that different classes, required by mahout are missing
> from the environment depending on what happens to be in Spark. These deps
> should be supplied in the job.jar assemblies, right?
>

No. They should be physically available as jars somewhere, e.g. in the
compiled Mahout tree.

The "job.xml" assembly in the "spark" module is but a leftover from an
experiment I ran on job jars with Spark long ago. It's just hanging around
there but not actually being built. Sorry for the confusion. DRM doesn't use
job jars. As far as I have established, Spark does not understand job jars
(it's purely a Hadoop notion -- but even there it has been unsupported or
deprecated for a long time now).

So we can, e.g., create a new assembly for spark, such as an "optional
dependencies" set of jars, and put it somewhere in the compiled tree (I
guess similar to the "managed libraries" notion in SBT).

Then, if you need any of those, your driver code needs to do the following.
The mahoutSparkContext() method accepts an optional SparkConf parameter.
Additional jars can be added to that SparkConf before passing it on to
mahoutSparkContext. If you don't supply a SparkConf, the method will create
a default one. If you do, it will merge all Mahout-specific settings and
standard jars into the context information you supply.

As far as I see, by default the context includes only the math, math-scala,
spark and mrlegacy jars. No third-party jars (line 212 in the sparkbindings
package). The test that checks that is in SparkBindingsSuite.scala (yes,
you are correct, the one you mentioned).
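
As an illustration only -- the exact mahoutSparkContext() signature may
differ, so the import, the argument names and the jar path below are
assumptions, not the definitive API:

  import org.apache.spark.SparkConf
  import org.apache.mahout.sparkbindings._   // assumed home of mahoutSparkContext

  // Register the extra jar(s) the driver needs shipped to the workers;
  // the path is a placeholder, not a real artifact location.
  val conf = new SparkConf().setJars(Seq("/path/to/extra-dependency.jar"))

  // mahoutSparkContext then merges Mahout's own settings and standard jars
  // into this conf, as described above (argument names are assumed here).
  implicit val sdc = mahoutSparkContext(
    masterUrl = "spark://your-master:7077",
    appName   = "my-mahout-driver",
    sparkConf = conf)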






>
> Trying out the
>   test("context jars") {
>   }
>
> findMahoutContextJars(closeables) gets the .jars, and seems to explicitly
> filter out the job.jars. The job.jars include needed dependencies so for a
> clustered environment shouldn’t these be the only ones used?
>
>
> On Oct 20, 2014, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> either way i don't believe there's something specific to 1.0.1, 1.0.2 or
> 1.1.0 that is causing/not causing classpath errors. it's just jars are
> picked by explicitly hardcoded artifact "opt-in" policy, not the other way
> around.
>
> It is not enough just to modify pom in order for something to appear in
> task classpath.
>
> On Mon, Oct 20, 2014 at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Note that classpaths for "cluster" environment is tested trivially by
> > starting 1-2 workers and standalone spark manager processes locally. No
> > need to build anything "real". Workers would not know anything about
> mahout
> > so unless proper jars are exposed in context, they would have no way of
> > "faking" the access to classes.
> >
> > On Mon, Oct 20, 2014 at 9:28 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
> >
> >> Yes, asap.
> >>
> >> To test this right it has to run on a cluster so I’m upgrading. When
> >> ready it will just be a “mvn clean install" if you already have Spark
> 1.1.0
> >> running.
> >>
> >> I would have only expected errors on the CLI drivers so if anyone else
> >> sees runtime errors please let us know. Some errors are very hard to
> unit
> >> test since the environment is different for local(unit tests) and
> cluster
> >> execution.
> >>
> >>
> >> On Oct 20, 2014, at 9:14 AM, Mahesh Balija <ba...@gmail.com>
> >> wrote:
> >>
> >> Hi Pat,
> >>
> >> Can you please give detailed steps to build Mahout against Spark 1.1.0.
> >> I build against 1.1.0 but still had class not found errors, thats why I
> >> reverted back to Spark 1.0.2 even though first few steps are successful
> >> but still facing some issues in running Mahout spark-shell sample
> commands
> >> (drmData) throws some errors even on 1.0.2.
> >>
> >> Best,
> >> Mahesh.B.
> >>
> >> On Mon, Oct 20, 2014 at 1:46 AM, peng <pc...@uowmail.edu.au> wrote:
> >>
> >>> From my experience 1.1.0 is quite stable, plus some performance
> >>> improvements that totally worth the effort.
> >>>
> >>>
> >>> On 10/19/2014 06:30 PM, Ted Dunning wrote:
> >>>
> >>>> On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel <pa...@occamsmachete.com>
> >>>> wrote:
> >>>>
> >>>> Getting off the dubious Spark 1.0.1 version is turning out to be a bit
> >> of
> >>>>> work. Does anyone object to upgrading our Spark dependency? I’m not
> >> sure
> >>>>> if
> >>>>> Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean
> >> upgrading
> >>>>> your Spark cluster.
> >>>>>
> >>>>
> >>>> It is going to have to happen sooner or later.
> >>>>
> >>>> Sooner may actually be less total pain.
> >>>>
> >>>>
> >>>
> >>
> >>
> >
>
>

Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I agree; it’s just that different classes required by Mahout are missing from the environment, depending on what happens to be in Spark. These deps should be supplied in the job.jar assemblies, right? 

Trying out the 
  test("context jars") {
  }

findMahoutContextJars(closeables) gets the .jars, and seems to explicitly filter out the job.jars. The job.jars include needed dependencies so for a clustered environment shouldn’t these be the only ones used?


On Oct 20, 2014, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

either way i don't believe there's something specific to 1.0.1, 1.0.2 or
1.1.0 that is causing/not causing classpath errors. it's just jars are
picked by explicitly hardcoded artifact "opt-in" policy, not the other way
around.

It is not enough just to modify pom in order for something to appear in
task classpath.

On Mon, Oct 20, 2014 at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Note that classpaths for "cluster" environment is tested trivially by
> starting 1-2 workers and standalone spark manager processes locally. No
> need to build anything "real". Workers would not know anything about mahout
> so unless proper jars are exposed in context, they would have no way of
> "faking" the access to classes.
> 
> On Mon, Oct 20, 2014 at 9:28 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
>> Yes, asap.
>> 
>> To test this right it has to run on a cluster so I’m upgrading. When
>> ready it will just be a “mvn clean install" if you already have Spark 1.1.0
>> running.
>> 
>> I would have only expected errors on the CLI drivers so if anyone else
>> sees runtime errors please let us know. Some errors are very hard to unit
>> test since the environment is different for local(unit tests) and cluster
>> execution.
>> 
>> 
>> On Oct 20, 2014, at 9:14 AM, Mahesh Balija <ba...@gmail.com>
>> wrote:
>> 
>> Hi Pat,
>> 
>> Can you please give detailed steps to build Mahout against Spark 1.1.0.
>> I build against 1.1.0 but still had class not found errors, thats why I
>> reverted back to Spark 1.0.2 even though first few steps are successful
>> but still facing some issues in running Mahout spark-shell sample commands
>> (drmData) throws some errors even on 1.0.2.
>> 
>> Best,
>> Mahesh.B.
>> 
>> On Mon, Oct 20, 2014 at 1:46 AM, peng <pc...@uowmail.edu.au> wrote:
>> 
>>> From my experience 1.1.0 is quite stable, plus some performance
>>> improvements that totally worth the effort.
>>> 
>>> 
>>> On 10/19/2014 06:30 PM, Ted Dunning wrote:
>>> 
>>>> On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel <pa...@occamsmachete.com>
>>>> wrote:
>>>> 
>>>> Getting off the dubious Spark 1.0.1 version is turning out to be a bit
>> of
>>>>> work. Does anyone object to upgrading our Spark dependency? I’m not
>> sure
>>>>> if
>>>>> Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean
>> upgrading
>>>>> your Spark cluster.
>>>>> 
>>>> 
>>>> It is going to have to happen sooner or later.
>>>> 
>>>> Sooner may actually be less total pain.
>>>> 
>>>> 
>>> 
>> 
>> 
> 


Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Either way, I don't believe there is anything specific to 1.0.1, 1.0.2, or
1.1.0 that is causing (or not causing) the classpath errors. Jars are picked
up by an explicitly hardcoded artifact "opt-in" policy, not the other way
around.

It is not enough just to modify the pom for something to appear on the task
classpath.
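
For what it's worth, a minimal sketch of that custom route using plain Spark API, nothing Mahout-specific; the jar path and master URL below are only placeholders:

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch: explicitly opt an extra jar into the task classpath from the driver,
  // rather than only declaring it in the pom. Path and master URL are assumptions.
  val extraJar = "/opt/mahout/lib/commons-math3-3.2.jar"

  val conf = new SparkConf()
    .setMaster("spark://localhost:7077")
    .setAppName("mahout-extra-jar")
    .setJars(Seq(extraJar))          // shipped to the executors along with the app

  val sc = new SparkContext(conf)
  // or, once a context already exists:
  // sc.addJar(extraJar)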

On Mon, Oct 20, 2014 at 9:35 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Note that classpaths for "cluster" environment is tested trivially by
> starting 1-2 workers and standalone spark manager processes locally. No
> need to build anything "real". Workers would not know anything about mahout
> so unless proper jars are exposed in context, they would have no way of
> "faking" the access to classes.
>
> On Mon, Oct 20, 2014 at 9:28 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
>> Yes, asap.
>>
>> To test this right it has to run on a cluster so I’m upgrading. When
>> ready it will just be a “mvn clean install" if you already have Spark 1.1.0
>> running.
>>
>> I would have only expected errors on the CLI drivers so if anyone else
>> sees runtime errors please let us know. Some errors are very hard to unit
>> test since the environment is different for local(unit tests) and cluster
>> execution.
>>
>>
>> On Oct 20, 2014, at 9:14 AM, Mahesh Balija <ba...@gmail.com>
>> wrote:
>>
>> Hi Pat,
>>
>> Can you please give detailed steps to build Mahout against Spark 1.1.0.
>> I build against 1.1.0 but still had class not found errors, thats why I
>> reverted back to Spark 1.0.2 even though first few steps are successful
>> but still facing some issues in running Mahout spark-shell sample commands
>> (drmData) throws some errors even on 1.0.2.
>>
>> Best,
>> Mahesh.B.
>>
>> On Mon, Oct 20, 2014 at 1:46 AM, peng <pc...@uowmail.edu.au> wrote:
>>
>> > From my experience 1.1.0 is quite stable, plus some performance
>> > improvements that totally worth the effort.
>> >
>> >
>> > On 10/19/2014 06:30 PM, Ted Dunning wrote:
>> >
>> >> On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel <pa...@occamsmachete.com>
>> >> wrote:
>> >>
>> >> Getting off the dubious Spark 1.0.1 version is turning out to be a bit
>> of
>> >>> work. Does anyone object to upgrading our Spark dependency? I’m not
>> sure
>> >>> if
>> >>> Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean
>> upgrading
>> >>> your Spark cluster.
>> >>>
>> >>
>> >> It is going to have to happen sooner or later.
>> >>
>> >> Sooner may actually be less total pain.
>> >>
>> >>
>> >
>>
>>
>

Re: Upgrade to Spark 1.1.0?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Note that classpaths for the "cluster" environment can be tested trivially by
starting 1-2 workers and a standalone Spark master process locally. No need
to build anything "real". The workers know nothing about Mahout, so unless
the proper jars are exposed in the context, they have no way of "faking"
access to the classes.

On Mon, Oct 20, 2014 at 9:28 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Yes, asap.
>
> To test this right it has to run on a cluster so I’m upgrading. When ready
> it will just be a “mvn clean install" if you already have Spark 1.1.0
> running.
>
> I would have only expected errors on the CLI drivers so if anyone else
> sees runtime errors please let us know. Some errors are very hard to unit
> test since the environment is different for local(unit tests) and cluster
> execution.
>
>
> On Oct 20, 2014, at 9:14 AM, Mahesh Balija <ba...@gmail.com>
> wrote:
>
> Hi Pat,
>
> Can you please give detailed steps to build Mahout against Spark 1.1.0.
> I build against 1.1.0 but still had class not found errors, thats why I
> reverted back to Spark 1.0.2 even though first few steps are successful
> but still facing some issues in running Mahout spark-shell sample commands
> (drmData) throws some errors even on 1.0.2.
>
> Best,
> Mahesh.B.
>
> On Mon, Oct 20, 2014 at 1:46 AM, peng <pc...@uowmail.edu.au> wrote:
>
> > From my experience 1.1.0 is quite stable, plus some performance
> > improvements that totally worth the effort.
> >
> >
> > On 10/19/2014 06:30 PM, Ted Dunning wrote:
> >
> >> On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel <pa...@occamsmachete.com>
> >> wrote:
> >>
> >> Getting off the dubious Spark 1.0.1 version is turning out to be a bit
> of
> >>> work. Does anyone object to upgrading our Spark dependency? I’m not
> sure
> >>> if
> >>> Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean upgrading
> >>> your Spark cluster.
> >>>
> >>
> >> It is going to have to happen sooner or later.
> >>
> >> Sooner may actually be less total pain.
> >>
> >>
> >
>
>

Re: Upgrade to Spark 1.1.0?

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Yes, asap.

To test this right it has to run on a cluster, so I’m upgrading. When ready it will just be a “mvn clean install” if you already have Spark 1.1.0 running.

I would have only expected errors in the CLI drivers, so if anyone else sees runtime errors please let us know. Some errors are very hard to unit test since the environment differs between local (unit tests) and cluster execution.


On Oct 20, 2014, at 9:14 AM, Mahesh Balija <ba...@gmail.com> wrote:

Hi Pat,

Can you please give detailed steps to build Mahout against Spark 1.1.0.
I build against 1.1.0 but still had class not found errors, thats why I
reverted back to Spark 1.0.2 even though first few steps are successful
but still facing some issues in running Mahout spark-shell sample commands
(drmData) throws some errors even on 1.0.2.

Best,
Mahesh.B.

On Mon, Oct 20, 2014 at 1:46 AM, peng <pc...@uowmail.edu.au> wrote:

> From my experience 1.1.0 is quite stable, plus some performance
> improvements that totally worth the effort.
> 
> 
> On 10/19/2014 06:30 PM, Ted Dunning wrote:
> 
>> On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>> 
>> Getting off the dubious Spark 1.0.1 version is turning out to be a bit of
>>> work. Does anyone object to upgrading our Spark dependency? I’m not sure
>>> if
>>> Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean upgrading
>>> your Spark cluster.
>>> 
>> 
>> It is going to have to happen sooner or later.
>> 
>> Sooner may actually be less total pain.
>> 
>> 
> 


Re: Upgrade to Spark 1.1.0?

Posted by Mahesh Balija <ba...@gmail.com>.
Hi Pat,

Can you please give detailed steps to build Mahout against Spark 1.1.0?
I built against 1.1.0 but still had class-not-found errors, which is why I
reverted back to Spark 1.0.2. Even there, the first few steps succeed, but I
am still facing issues running the Mahout spark-shell sample commands;
(drmData) throws errors even on 1.0.2.
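
For reference, the drmData commands in question look roughly like this when run inside the Mahout spark-shell, which pre-imports the DSL and the distributed context (the matrix values below are arbitrary sample data, not the tutorial's exact numbers):

  val drmData = drmParallelize(dense(
    (2.0, 2.0, 10.5),
    (1.0, 2.0, 12.0),
    (1.0, 1.0,  7.0)), numPartitions = 2)

  drmData.collect   // bring it back in-core; executor classpath problems tend to surface here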

Best,
Mahesh.B.

On Mon, Oct 20, 2014 at 1:46 AM, peng <pc...@uowmail.edu.au> wrote:

> From my experience 1.1.0 is quite stable, plus some performance
> improvements that totally worth the effort.
>
>
> On 10/19/2014 06:30 PM, Ted Dunning wrote:
>
>> On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel <pa...@occamsmachete.com>
>> wrote:
>>
>>  Getting off the dubious Spark 1.0.1 version is turning out to be a bit of
>>> work. Does anyone object to upgrading our Spark dependency? I’m not sure
>>> if
>>> Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean upgrading
>>> your Spark cluster.
>>>
>>
>> It is going to have to happen sooner or later.
>>
>> Sooner may actually be less total pain.
>>
>>
>

Re: Upgrade to Spark 1.1.0?

Posted by peng <pc...@uowmail.edu.au>.
From my experience 1.1.0 is quite stable, plus it has some performance
improvements that make it totally worth the effort.

On 10/19/2014 06:30 PM, Ted Dunning wrote:
> On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
>> Getting off the dubious Spark 1.0.1 version is turning out to be a bit of
>> work. Does anyone object to upgrading our Spark dependency? I’m not sure if
>> Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean upgrading
>> your Spark cluster.
>
> It is going to have to happen sooner or later.
>
> Sooner may actually be less total pain.
>


Re: Upgrade to Spark 1.1.0?

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Getting off the dubious Spark 1.0.1 version is turning out to be a bit of
> work. Does anyone object to upgrading our Spark dependency? I’m not sure if
> Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean upgrading
> your Spark cluster.


It is going to have to happen sooner or later.

Sooner may actually be less total pain.
