Posted to user@spark.apache.org by Paul Wais <pw...@yelp.com> on 2014/09/18 10:06:25 UTC

Unable to find proto buffer class error with RDD

Dear List,

I'm writing an application where I have RDDs of protobuf messages.
When I run the app via bin/spark-submit with --master local
--driver-class-path path/to/my/uber.jar, Spark is able to
ser/deserialize the messages correctly.

However, if I run WITHOUT --driver-class-path path/to/my/uber.jar or I
try --master spark://my.master:7077 , then I run into errors that make
it look like my protobuf message classes are not on the classpath:

Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most
recent failure: Lost task 0.0 in stage 1.0 (TID 0, localhost):
java.lang.RuntimeException: Unable to find proto buffer class
        com.google.protobuf.GeneratedMessageLite$SerializedForm.readResolve(GeneratedMessageLite.java:775)
        sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        java.lang.reflect.Method.invoke(Method.java:606)
        java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1807)
        ...

Why do I need --driver-class-path in the local scenario?  And how can
I ensure my classes are on the classpath no matter how my app is
submitted via bin/spark-submit (e.g. --master spark://my.master:7077 )
?  I've tried poking through the shell scripts and SparkSubmit.scala
and unfortunately I haven't been able to grok exactly what Spark is
doing with the remote/local JVMs.

Cheers,
-Paul
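
(A minimal sketch of one way to ship an application jar to executors programmatically, roughly equivalent to passing --jars to bin/spark-submit. The jar path and class name below are placeholders; as the rest of this thread shows, getting the jar shipped is not by itself enough if the deserializer consults the wrong classloader.)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ShipJarExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("myapp")
            // Ship the application jar to executors; much like
            // `bin/spark-submit --jars path/to/my/uber.jar`.
            .setJars(new String[] { "path/to/my/uber.jar" });
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build RDDs of protobuf messages here ...
        sc.stop();
    }
}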

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Unable to find proto buffer class error with RDD

Posted by Paul Wais <pw...@yelp.com>.
Derp, one caveat to my "solution":  I guess Spark doesn't use Kryo for
Function serde :(
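
(Since closures go through Java serialization regardless of the Kryo registration quoted below, one way to sidestep the same classloader problem for a protobuf message captured by a Function is to capture its bytes and reparse them inside the task. A sketch only; MyProtoMessage and getName() are hypothetical stand-ins for a generated class.)

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

public class CaptureBytesExample {
    public static JavaRDD<String> label(JavaRDD<MyProtoMessage> messages,
                                        MyProtoMessage config) {
        // Capture bytes, not the message object, so the closure's Java
        // serialization never carries a protobuf instance (and never hits
        // GeneratedMessageLite$SerializedForm.readResolve).
        final byte[] configBytes = config.toByteArray();

        return messages.map(new Function<MyProtoMessage, String>() {
            @Override
            public String call(MyProtoMessage m) throws Exception {
                MyProtoMessage cfg = MyProtoMessage.parseFrom(configBytes);
                return cfg.getName() + "/" + m.getName();
            }
        });
    }
}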

On Fri, Sep 19, 2014 at 12:44 AM, Paul Wais <pw...@yelp.com> wrote:
> Well it looks like this is indeed a protobuf issue.  Poked a little more
> with Kryo.  Since protobuf messages are serializable, I tried just making
> Kryo use the JavaSerializer for my messages.  The resulting stack trace made
> it look like protobuf GeneratedMessageLite is actually using the classloader
> that loaded it, which I believe would be the root loader?
>
>  *
> https://code.google.com/p/protobuf/source/browse/trunk/java/src/main/java/com/google/protobuf/GeneratedMessageLite.java?r=425#775
>  *
> http://hg.openjdk.java.net/jdk7u/jdk7u6/jdk/file/8c2c5d63a17e/src/share/classes/java/lang/Class.java#l186
>  *
> http://hg.openjdk.java.net/jdk7u/jdk7u6/jdk/file/8c2c5d63a17e/src/share/classes/java/lang/ClassLoader.java#l1529
>  * See note:
> http://hg.openjdk.java.net/jdk7u/jdk7u6/jdk/file/8c2c5d63a17e/src/share/classes/java/lang/Class.java#l220
>
> So I guess protobuf java serialization is sensitive to the class loader.  I
> wonder if Kenton ever saw this one coming :)  I do have a solution, though
> (see way below)
>
>
> Here's the code and stack trace:
>
> SparkConf sparkConf = new SparkConf();
> sparkConf.setAppName("myapp");
> sparkConf.set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer");
> sparkConf.set("spark.kryo.registrator", "MyKryoRegistrator");
>
> ...
>
> public class MyKryoRegistrator implements KryoRegistrator {
>     public void registerClasses(Kryo kryo) {
>         kryo.register(MyProtoMessage.class, new JavaSerializer());
>     }
> }
>
> ...
>
> 14/09/19 05:39:12 ERROR Executor: Exception in task 2.0 in stage 1.0 (TID 2)
> com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
>         at
> com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:42)
>         at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>         at
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
>         at
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
>         at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>         at
> com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:34)
>         at
> com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:21)
>         at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>         at
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
>         at
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:80)
>         at
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:80)
>         at
> org.apache.spark.util.Utils$.deserializeViaNestedStream(Utils.scala:123)
>         at
> org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:80)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>         at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>         at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>         at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>         at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>         at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>         at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>         at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>         at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>         at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>         at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.RuntimeException: Unable to find proto buffer class
>         at
> com.google.protobuf.GeneratedMessageLite$SerializedForm.readResolve(GeneratedMessageLite.java:775)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at
> java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
>         at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1807)
>         at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>         at
> com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:40)
>         ... 31 more
> Caused by: java.lang.ClassNotFoundException: MyProtos$MyProto
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:190)
>         at
> com.google.protobuf.GeneratedMessageLite$SerializedForm.readResolve(GeneratedMessageLite.java:768)
>         ... 40 more
> 14/09/19 05:39:12 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 1,
> localhost): com.esotericsoftware.kryo.KryoException: Error during Java
> deserialization.
>
> com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:42)
>         com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
>
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
>         com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>
> com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:34)
>
> com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:21)
>         com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
>
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:80)
>
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:80)
>
> org.apache.spark.util.Utils$.deserializeViaNestedStream(Utils.scala:123)
>
> org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:80)
>         sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         java.lang.reflect.Method.invoke(Method.java:606)
>
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>         java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>         java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>         java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         java.lang.Thread.run(Thread.java:744)
>
>
>
>
>
>
>
>
> I was able to work around this by writing a Kryo Serializer that just uses
> protobuf's delimited binary message stream protocol.  This is super brittle
> (protobuf lib could break, protobuf generated code could break, could fail
> only at runtime), but appears to work:
>
>
>
> public class MyKryoRegistrator implements KryoRegistrator {
>     public void registerClasses(Kryo kryo) {
>         kryo.register(MyProtobufMessage.class,
>             new ProtobufSerializer<MyProtobufMessage>());
>     }
>
>     public static class ProtobufSerializer<T extends
>             com.google.protobuf.GeneratedMessage> extends Serializer<T> {
>
>         @Override
>         public void write(Kryo kryo, Output output, T msg) {
>             try {
>                 msg.writeDelimitedTo(output);
>             } catch (Exception ex) {
>                 throw new KryoException("Error during protobuf serialization.", ex);
>             }
>         }
>
>         @Override
>         @SuppressWarnings("unchecked")
>         public T read(Kryo kryo, Input input, Class<T> type) {
>             try {
>                 return (T) type.getDeclaredMethod("parseDelimitedFrom",
>                     java.io.InputStream.class).invoke(null, input);
>             } catch (Exception ex) {
>                 throw new KryoException("Error during protobuf deserialization.", ex);
>             }
>         }
>     }
> }
>
>
>
> Note that one needs to use this wrapper/serializer for *all* protobuf
> classes used in the app, not just if the pb class is used in an RDD.
>
>
> The only other real alternative to the above hack would be to ship my
> uber.jar to all my worker machines and somehow make Spark load it... that
> would be pretty ugly and a decent amount of work/complexity even if I had
> the help of Mesos, libswarm/Docker, etc.   (I'm actually already running
> this Spark app in a cluster of Docker containers, but the assets in these
> containers are static across app runs / recompiles and it makes no sense to
> get into the nitty-gritty of pushing assets per-build just yet.)   Spark's
> code-shipping machinery is certainly crucial for its REPL, but it might be
> worth adding some notes in the docs for app developers detailing how Spark
> ships code, when it serializes things, etc.  It's not super obvious that
> Spark would interact this badly with protobuf, especially since protobuf is
> technically one of its dependencies :)
>
>
>



Re: Unable to find proto buffer class error with RDD

Posted by Paul Wais <pw...@yelp.com>.
Well it looks like this is indeed a protobuf issue.  Poked a little more
with Kryo.  Since protobuf messages are serializable, I tried just making
Kryo use the JavaSerializer for my messages.  The resulting stack trace
made it look like protobuf GeneratedMessageLite is actually using the
classloader that loaded it, which I believe would be the root loader?

 *
https://code.google.com/p/protobuf/source/browse/trunk/java/src/main/java/com/google/protobuf/GeneratedMessageLite.java?r=425#775
 *
http://hg.openjdk.java.net/jdk7u/jdk7u6/jdk/file/8c2c5d63a17e/src/share/classes/java/lang/Class.java#l186
 *
http://hg.openjdk.java.net/jdk7u/jdk7u6/jdk/file/8c2c5d63a17e/src/share/classes/java/lang/ClassLoader.java#l1529
 * See note:
http://hg.openjdk.java.net/jdk7u/jdk7u6/jdk/file/8c2c5d63a17e/src/share/classes/java/lang/Class.java#l220

So I guess protobuf java serialization is sensitive to the class loader.  I
wonder if Kenton ever saw this one coming :)  I do have a solution, though
(see way below)


Here's the code and stack trace:

SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("myapp");
sparkConf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.kryo.registrator", "MyKryoRegistrator");

...

public class MyKryoRegistrator implements KryoRegistrator {
    public void registerClasses(Kryo kryo) {
        kryo.register(MyProtoMessage.class, new JavaSerializer());
    }
}

...

14/09/19 05:39:12 ERROR Executor: Exception in task 2.0 in stage 1.0 (TID 2)
com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
        at
com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:42)
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
        at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
        at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
        at
com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:34)
        at
com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:21)
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
        at
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
        at
org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:80)
        at
org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:80)
        at
org.apache.spark.util.Utils$.deserializeViaNestedStream(Utils.scala:123)
        at
org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:80)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
        at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
        at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.RuntimeException: Unable to find proto buffer class
        at
com.google.protobuf.GeneratedMessageLite$SerializedForm.readResolve(GeneratedMessageLite.java:775)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at
java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
        at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1807)
        at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at
com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:40)
        ... 31 more
Caused by: java.lang.ClassNotFoundException: MyProtos$MyProto
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:190)
        at
com.google.protobuf.GeneratedMessageLite$SerializedForm.readResolve(GeneratedMessageLite.java:768)
        ... 40 more
14/09/19 05:39:12 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 1,
localhost): com.esotericsoftware.kryo.KryoException: Error during Java
deserialization.

com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:42)
        com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)

com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)

com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
        com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)

com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:34)

com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:21)
        com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)

org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)

org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:80)

org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:80)

org.apache.spark.util.Utils$.deserializeViaNestedStream(Utils.scala:123)

org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:80)
        sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        java.lang.reflect.Method.invoke(Method.java:606)

java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)

java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)

java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)

java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:744)








I was able to work around this by writing a Kryo Serializer that just uses
protobuf's delimited binary message stream protocol.  This is super brittle
(protobuf lib could break, protobuf generated code could break, could fail
only at runtime), but appears to work:



public class MyKryoRegistrator implements KryoRegistrator {
    public void registerClasses(Kryo kryo) {
        kryo.register(MyProtobufMessage.class,
            new ProtobufSerializer<MyProtobufMessage>());
    }

    public static class ProtobufSerializer<T extends
            com.google.protobuf.GeneratedMessage> extends Serializer<T> {

        @Override
        public void write(Kryo kryo, Output output, T msg) {
            try {
                msg.writeDelimitedTo(output);
            } catch (Exception ex) {
                throw new KryoException("Error during protobuf serialization.", ex);
            }
        }

        @Override
        @SuppressWarnings("unchecked")
        public T read(Kryo kryo, Input input, Class<T> type) {
            try {
                return (T) type.getDeclaredMethod("parseDelimitedFrom",
                    java.io.InputStream.class).invoke(null, input);
            } catch (Exception ex) {
                throw new KryoException("Error during protobuf deserialization.", ex);
            }
        }
    }
}



Note that one needs to use this wrapper/serializer for *all* protobuf
classes used in the app, not just if the pb class is used in an RDD.
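
(If registering every generated class one-by-one becomes unwieldy, Kryo can attach a serializer class to an entire type hierarchy via addDefaultSerializer. A sketch only, untested here; it assumes the static, no-arg-constructible ProtobufSerializer above and a hypothetical registrator name.)

import com.esotericsoftware.kryo.Kryo;
import org.apache.spark.serializer.KryoRegistrator;

public class AllProtosKryoRegistrator implements KryoRegistrator {
    public void registerClasses(Kryo kryo) {
        // Use ProtobufSerializer for every com.google.protobuf.GeneratedMessage
        // subclass, instead of registering a serializer per message class.
        kryo.addDefaultSerializer(com.google.protobuf.GeneratedMessage.class,
                                  MyKryoRegistrator.ProtobufSerializer.class);

        // Registering concrete classes is still worthwhile so Kryo writes
        // compact class IDs instead of full class names.
        kryo.register(MyProtobufMessage.class);
    }
}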


The only other real alternative to the above hack would be to ship my
uber.jar to all my worker machines and somehow make Spark load it... that
would be pretty ugly and a decent amount of work/complexity even if I had
the help of Mesos, libswarm/Docker, etc.   (I'm actually already running
this Spark app in a cluster of Docker containers, but the assets in these
containers are static across app runs / recompiles and it makes no sense to
get into the nitty-gritty of pushing assets per-build just yet.)   Spark's
code-shipping machinery is certainly crucial for its REPL, but it might be
worth adding some notes in the docs for app developers detailing how Spark
ships code, when it serializes things, etc.  It's not super obvious that
Spark would interact this badly with protobuf, especially since protobuf is
technically one of its dependencies :)

Re: Unable to find proto buffer class error with RDD

Posted by Paul Wais <pw...@yelp.com>.
It turns out Kryo doesn't play well with protobuf.  Out of the box I see:

com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException

Serialization trace:
extra_ (com.foo.bar.MyMessage)
        com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
...

Maybe I can fix this, not sure.




I dug a bit deeper into the Java serialization issue and I believe
there's a bug in Spark introduced by a recent change in v1.1.


For reference, the full stack trace I see in my app (sorry, the one in the
first email was truncated) is:

java.lang.RuntimeException: Unable to find proto buffer class
        at com.google.protobuf.GeneratedMessageLite$SerializedForm.readResolve(GeneratedMessageLite.java:775)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1807)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
        at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:74)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)


So, the Executor appears to be using the right classloader here in Spark 1.1:

https://github.com/apache/spark/blob/2f9b2bd7844ee8393dc9c319f4fefedf95f5e460/core/src/main/scala/org/apache/spark/executor/Executor.scala#L159


I added some code to print the available classes in the initializer of
the map() function and I *do* see my protobuf message class printed,
but *only* for the current thread classloader and *not* for SparkEnv's
class loader.

try {
  ClassLoader cl = Thread.currentThread().getContextClassLoader();
  // also tried: SparkEnv.getThreadLocal().getClass().getClassLoader();
  Field f = ClassLoader.class.getDeclaredField("classes");
  f.setAccessible(true);

  Vector<Class> classes = (Vector<Class>) f.get(cl);
  for (int i = 0; i < classes.size(); ++i) {
    Class c = classes.get(i);
    String name = c.toString();
    if (name.indexOf("myPackage") != -1) {
      System.err.println("class: " + name);
    }
  }
} catch (NoSuchFieldException e1) {
  e1.printStackTrace();
} catch (IllegalAccessException e1) {
  e1.printStackTrace();
}



So my code is available to the thread's classloader, and there's perhaps
something wrong with the classloader that
ParallelCollectionPartition.readObject() is using.  I note that if I
run spark-submit with --driver-class-path uber.jar, then the error
goes away (for a local master only, though; for a remote master the error
returns).


I think the root problem might be related to this change:
https://github.com/apache/spark/commit/cc3648774e9a744850107bb187f2828d447e0a48#diff-7b43397a89d8249663cbd13374a48db0R42

That change did not appear to touch ParallelCollectionRDD, which I
believe is using the root classloader (which would explain why
--driver-class-path fixes the problem):
https://github.com/apache/spark/blob/2f9b2bd7844ee8393dc9c319f4fefedf95f5e460/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L74

If uber.jar is on the classpath, then the root classloader would have
the code, hence why --driver-class-path fixes the bug.
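
(For reference, the usual way to make plain Java deserialization honor a specific loader is to override ObjectInputStream.resolveClass, roughly as sketched below; Spark's JavaDeserializationStream does something similar. Note that this alone would not help with the failure above, because GeneratedMessageLite$SerializedForm.readResolve calls Class.forName itself using protobuf's own defining loader.)

import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;

// Sketch: an ObjectInputStream that resolves classes against a chosen loader
// (e.g. an executor's task classloader) instead of the default one.
public class LoaderAwareObjectInputStream extends ObjectInputStream {
    private final ClassLoader loader;

    public LoaderAwareObjectInputStream(InputStream in, ClassLoader loader)
            throws IOException {
        super(in);
        this.loader = loader;
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc)
            throws IOException, ClassNotFoundException {
        try {
            return Class.forName(desc.getName(), false, loader);
        } catch (ClassNotFoundException e) {
            return super.resolveClass(desc);
        }
    }
}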




On Thu, Sep 18, 2014 at 5:42 PM, Paul Wais <pw...@yelp.com> wrote:
> hmmmmmm would using Kryo help me here?
>
>
> On Thursday, September 18, 2014, Paul Wais <pw...@yelp.com> wrote:
>>
>> Ah, can one NOT create an RDD of any arbitrary Serializable type?  It
>> looks like I might be getting bitten by the same
>> "java.io.ObjectInputStream uses root class loader only" bugs mentioned
>> in:
>>
>> *
>> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-td3259.html
>> * https://github.com/apache/spark/pull/181
>>
>> *
>> http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3C7F6AA9E820F55D4A96946A87E086EF4A4BCDF9AF@EAGH-ERFPMBX41.ERF.thomson.com%3E
>> * https://groups.google.com/forum/#!topic/spark-users/Q66UOeA2u-I
>>
>>
>>
>>
>> On Thu, Sep 18, 2014 at 4:51 PM, Paul Wais <pw...@yelp.com> wrote:
>> > Well, it looks like Spark is just not loading my code into the
>> > driver/executors.... E.g.:
>> >
>> > // bars is a JavaRDD<MyMessage>
>> > List<String> foo = bars.map(
>> >     new Function<MyMessage, String>() {
>> >
>> >     {
>> >         System.err.println("classpath: " +
>> >             System.getProperty("java.class.path"));
>> >
>> >         CodeSource src =
>> >             com.google.protobuf.GeneratedMessageLite.class.getProtectionDomain().getCodeSource();
>> >         if (src != null) {
>> >             URL jar = src.getLocation();
>> >             System.err.println("com.google.protobuf.GeneratedMessageLite"
>> >                 + " from jar: " + jar.toString());
>> >         }
>> >     }
>> >
>> >     @Override
>> >     public String call(MyMessage v1) throws Exception {
>> >         return v1.getString();
>> >     }
>> > }).collect();
>> >
>> > prints:
>> > classpath:
>> > ::/opt/spark/conf:/opt/spark/lib/spark-assembly-1.1.0-hadoop2.3.0.jar:/opt/spark/lib/datanucleus-api-jdo-3.2.1.jar:/opt/spark/lib/datanucleus-rdbms-3.2.1.jar:/opt/spark/lib/datanucleus-core-3.2.2.jar
>> > com.google.protobuf.GeneratedMessageLite from jar:
>> > file:/opt/spark/lib/spark-assembly-1.1.0-hadoop2.3.0.jar
>> >
>> > I do see after those lines:
>> > 14/09/18 23:28:09 INFO Executor: Adding
>> > file:/tmp/spark-cc147338-183f-46f6-b698-5b897e808a08/uber.jar to class
>> > loader
>> >
>> >
>> > This is with:
>> >
>> > spark-submit --master local --class MyClass --jars uber.jar  uber.jar
>> >
>> >
>> > My uber.jar has protobuf 2.5; I expected GeneratedMessageLite would
>> > come from there.  I'm using spark 1.1 and hadoop 2.3; hadoop 2.3
>> > should use protobuf 2.5[1] and even shade it properly.  I read claims
>> > in this list that Spark shades protobuf correctly since 0.9.? and
>> > looking thru the pom.xml on github it looks like Spark includes
>> > protobuf 2.5 in the hadoop 2.3 profile.
>> >
>> >
>> > I guess I'm still at "What's the deal with getting Spark to distribute
>> > and load code from my jar correctly?"
>> >
>> >
>> > [1]
>> > http://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.3.0/hadoop-project/pom.xml
>> >
>> > On Thu, Sep 18, 2014 at 1:06 AM, Paul Wais <pw...@yelp.com> wrote:
>> >> Dear List,
>> >>
>> >> I'm writing an application where I have RDDs of protobuf messages.
>> >> When I run the app via bin/spark-submit with --master local
>> >> --driver-class-path path/to/my/uber.jar, Spark is able to
>> >> ser/deserialize the messages correctly.
>> >>
>> >> However, if I run WITHOUT --driver-class-path path/to/my/uber.jar or I
>> >> try --master spark://my.master:7077 , then I run into errors that make
>> >> it look like my protobuf message classes are not on the classpath:
>> >>
>> >> Exception in thread "main" org.apache.spark.SparkException: Job
>> >> aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most
>> >> recent failure: Lost task 0.0 in stage 1.0 (TID 0, localhost):
>> >> java.lang.RuntimeException: Unable to find proto buffer class
>> >>
>> >> com.google.protobuf.GeneratedMessageLite$SerializedForm.readResolve(GeneratedMessageLite.java:775)
>> >>         sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >>
>> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> >>
>> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >>         java.lang.reflect.Method.invoke(Method.java:606)
>> >>
>> >> java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
>> >>
>> >> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1807)
>> >>         ...
>> >>
>> >> Why do I need --driver-class-path in the local scenario?  And how can
>> >> I ensure my classes are on the classpath no matter how my app is
>> >> submitted via bin/spark-submit (e.g. --master spark://my.master:7077 )
>> >> ?  I've tried poking through the shell scripts and SparkSubmit.scala
>> >> and unfortunately I haven't been able to grok exactly what Spark is
>> >> doing with the remote/local JVMs.
>> >>
>> >> Cheers,
>> >> -Paul



Re: Unable to find proto buffer class error with RDD

Posted by Paul Wais <pw...@yelp.com>.
hmmmmmm would using Kryo help me here?

On Thursday, September 18, 2014, Paul Wais <pw...@yelp.com> wrote:

> Ah, can one NOT create an RDD of any arbitrary Serializable type?  It
> looks like I might be getting bitten by the same
> "java.io.ObjectInputStream uses root class loader only" bugs mentioned
> in:
>
> *
> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-td3259.html
> * https://github.com/apache/spark/pull/181
>
> *
> http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3C7F6AA9E820F55D4A96946A87E086EF4A4BCDF9AF@EAGH-ERFPMBX41.ERF.thomson.com%3E
> * https://groups.google.com/forum/#!topic/spark-users/Q66UOeA2u-I
>
>
>
>
> On Thu, Sep 18, 2014 at 4:51 PM, Paul Wais <pwais@yelp.com> wrote:
> > Well, it looks like Spark is just not loading my code into the
> > driver/executors.... E.g.:
> >
> > // bars is a JavaRDD<MyMessage>
> > List<String> foo = bars.map(
> >     new Function<MyMessage, String>() {
> >
> >     {
> >         System.err.println("classpath: " +
> >             System.getProperty("java.class.path"));
> >
> >         CodeSource src =
> >             com.google.protobuf.GeneratedMessageLite.class.getProtectionDomain().getCodeSource();
> >         if (src != null) {
> >             URL jar = src.getLocation();
> >             System.err.println("com.google.protobuf.GeneratedMessageLite"
> >                 + " from jar: " + jar.toString());
> >         }
> >     }
> >
> >     @Override
> >     public String call(MyMessage v1) throws Exception {
> >         return v1.getString();
> >     }
> > }).collect();
> >
> > prints:
> > classpath:
> ::/opt/spark/conf:/opt/spark/lib/spark-assembly-1.1.0-hadoop2.3.0.jar:/opt/spark/lib/datanucleus-api-jdo-3.2.1.jar:/opt/spark/lib/datanucleus-rdbms-3.2.1.jar:/opt/spark/lib/datanucleus-core-3.2.2.jar
> > com.google.protobuf.GeneratedMessageLite from jar:
> > file:/opt/spark/lib/spark-assembly-1.1.0-hadoop2.3.0.jar
> >
> > I do see after those lines:
> > 14/09/18 23:28:09 INFO Executor: Adding
> > file:/tmp/spark-cc147338-183f-46f6-b698-5b897e808a08/uber.jar to class
> > loader
> >
> >
> > This is with:
> >
> > spark-submit --master local --class MyClass --jars uber.jar  uber.jar
> >
> >
> > My uber.jar has protobuf 2.5; I expected GeneratedMessageLite would
> > come from there.  I'm using spark 1.1 and hadoop 2.3; hadoop 2.3
> > should use protobuf 2.5[1] and even shade it properly.  I read claims
> > in this list that Spark shades protobuf correctly since 0.9.? and
> > looking thru the pom.xml on github it looks like Spark includes
> > protobuf 2.5 in the hadoop 2.3 profile.
> >
> >
> > I guess I'm still at "What's the deal with getting Spark to distribute
> > and load code from my jar correctly?"
> >
> >
> > [1]
> http://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.3.0/hadoop-project/pom.xml
> >
> > On Thu, Sep 18, 2014 at 1:06 AM, Paul Wais <pwais@yelp.com> wrote:
> >> Dear List,
> >>
> >> I'm writing an application where I have RDDs of protobuf messages.
> >> When I run the app via bin/spark-submit with --master local
> >> --driver-class-path path/to/my/uber.jar, Spark is able to
> >> ser/deserialize the messages correctly.
> >>
> >> However, if I run WITHOUT --driver-class-path path/to/my/uber.jar or I
> >> try --master spark://my.master:7077 , then I run into errors that make
> >> it look like my protobuf message classes are not on the classpath:
> >>
> >> Exception in thread "main" org.apache.spark.SparkException: Job
> >> aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most
> >> recent failure: Lost task 0.0 in stage 1.0 (TID 0, localhost):
> >> java.lang.RuntimeException: Unable to find proto buffer class
> >>
>  com.google.protobuf.GeneratedMessageLite$SerializedForm.readResolve(GeneratedMessageLite.java:775)
> >>         sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>
>  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >>
>  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>         java.lang.reflect.Method.invoke(Method.java:606)
> >>
>  java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
> >>
>  java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1807)
> >>         ...
> >>
> >> Why do I need --driver-class-path in the local scenario?  And how can
> >> I ensure my classes are on the classpath no matter how my app is
> >> submitted via bin/spark-submit (e.g. --master spark://my.master:7077 )
> >> ?  I've tried poking through the shell scripts and SparkSubmit.scala
> >> and unfortunately I haven't been able to grok exactly what Spark is
> >> doing with the remote/local JVMs.
> >>
> >> Cheers,
> >> -Paul
>

Re: Unable to find proto buffer class error with RDD

Posted by Paul Wais <pw...@yelp.com>.
Ah, can one NOT create an RDD of any arbitrary Serializable type?  It
looks like I might be getting bitten by the same
"java.io.ObjectInputStream uses root class loader only" bugs mentioned
in:

* http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-td3259.html
* https://github.com/apache/spark/pull/181

* http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3C7F6AA9E820F55D4A96946A87E086EF4A4BCDF9AF@EAGH-ERFPMBX41.ERF.thomson.com%3E
* https://groups.google.com/forum/#!topic/spark-users/Q66UOeA2u-I




On Thu, Sep 18, 2014 at 4:51 PM, Paul Wais <pw...@yelp.com> wrote:
> Well, it looks like Spark is just not loading my code into the
> driver/executors.... E.g.:
>
> // bars is a JavaRDD<MyMessage>
> List<String> foo = bars.map(
>     new Function<MyMessage, String>() {
>
>     {
>         System.err.println("classpath: " +
>             System.getProperty("java.class.path"));
>
>         CodeSource src =
>             com.google.protobuf.GeneratedMessageLite.class.getProtectionDomain().getCodeSource();
>         if (src != null) {
>             URL jar = src.getLocation();
>             System.err.println("com.google.protobuf.GeneratedMessageLite"
>                 + " from jar: " + jar.toString());
>         }
>     }
>
>     @Override
>     public String call(MyMessage v1) throws Exception {
>         return v1.getString();
>     }
> }).collect();
>
> prints:
> classpath: ::/opt/spark/conf:/opt/spark/lib/spark-assembly-1.1.0-hadoop2.3.0.jar:/opt/spark/lib/datanucleus-api-jdo-3.2.1.jar:/opt/spark/lib/datanucleus-rdbms-3.2.1.jar:/opt/spark/lib/datanucleus-core-3.2.2.jar
> com.google.protobuf.GeneratedMessageLite from jar:
> file:/opt/spark/lib/spark-assembly-1.1.0-hadoop2.3.0.jar
>
> I do see after those lines:
> 14/09/18 23:28:09 INFO Executor: Adding
> file:/tmp/spark-cc147338-183f-46f6-b698-5b897e808a08/uber.jar to class
> loader
>
>
> This is with:
>
> spark-submit --master local --class MyClass --jars uber.jar  uber.jar
>
>
> My uber.jar has protobuf 2.5; I expected GeneratedMessageLite would
> come from there.  I'm using spark 1.1 and hadoop 2.3; hadoop 2.3
> should use protobuf 2.5[1] and even shade it properly.  I read claims
> in this list that Spark shades protobuf correctly since 0.9.? and
> looking thru the pom.xml on github it looks like Spark includes
> protobuf 2.5 in the hadoop 2.3 profile.
>
>
> I guess I'm still at "What's the deal with getting Spark to distribute
> and load code from my jar correctly?"
>
>
> [1] http://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.3.0/hadoop-project/pom.xml
>
> On Thu, Sep 18, 2014 at 1:06 AM, Paul Wais <pw...@yelp.com> wrote:
>> Dear List,
>>
>> I'm writing an application where I have RDDs of protobuf messages.
>> When I run the app via bin/spark-submit with --master local
>> --driver-class-path path/to/my/uber.jar, Spark is able to
>> ser/deserialize the messages correctly.
>>
>> However, if I run WITHOUT --driver-class-path path/to/my/uber.jar or I
>> try --master spark://my.master:7077 , then I run into errors that make
>> it look like my protobuf message classes are not on the classpath:
>>
>> Exception in thread "main" org.apache.spark.SparkException: Job
>> aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most
>> recent failure: Lost task 0.0 in stage 1.0 (TID 0, localhost):
>> java.lang.RuntimeException: Unable to find proto buffer class
>>         com.google.protobuf.GeneratedMessageLite$SerializedForm.readResolve(GeneratedMessageLite.java:775)
>>         sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>         sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         java.lang.reflect.Method.invoke(Method.java:606)
>>         java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
>>         java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1807)
>>         ...
>>
>> Why do I need --driver-class-path in the local scenario?  And how can
>> I ensure my classes are on the classpath no matter how my app is
>> submitted via bin/spark-submit (e.g. --master spark://my.master:7077 )
>> ?  I've tried poking through the shell scripts and SparkSubmit.scala
>> and unfortunately I haven't been able to grok exactly what Spark is
>> doing with the remote/local JVMs.
>>
>> Cheers,
>> -Paul



Re: Unable to find proto buffer class error with RDD

Posted by Paul Wais <pw...@yelp.com>.
Well, it looks like Spark is just not loading my code into the
driver/executors.... E.g.:

// bars is a JavaRDD<MyMessage>
List<String> foo = bars.map(
    new Function<MyMessage, String>() {

    // debug: print the classpath and where GeneratedMessageLite was loaded from
    {
        System.err.println("classpath: " +
            System.getProperty("java.class.path"));

        CodeSource src =
            com.google.protobuf.GeneratedMessageLite.class.getProtectionDomain().getCodeSource();
        if (src != null) {
            URL jar = src.getLocation();
            System.err.println("com.google.protobuf.GeneratedMessageLite"
                + " from jar: " + jar.toString());
        }
    }

    @Override
    public String call(MyMessage v1) throws Exception {
        return v1.getString();
    }
}).collect();

prints:
classpath: ::/opt/spark/conf:/opt/spark/lib/spark-assembly-1.1.0-hadoop2.3.0.jar:/opt/spark/lib/datanucleus-api-jdo-3.2.1.jar:/opt/spark/lib/datanucleus-rdbms-3.2.1.jar:/opt/spark/lib/datanucleus-core-3.2.2.jar
com.google.protobuf.GeneratedMessageLite from jar:
file:/opt/spark/lib/spark-assembly-1.1.0-hadoop2.3.0.jar

I do see after those lines:
14/09/18 23:28:09 INFO Executor: Adding
file:/tmp/spark-cc147338-183f-46f6-b698-5b897e808a08/uber.jar to class
loader


This is with:

spark-submit --master local --class MyClass --jars uber.jar  uber.jar


My uber.jar has protobuf 2.5; I expected GeneratedMessageLite would
come from there.  I'm using spark 1.1 and hadoop 2.3; hadoop 2.3
should use protobuf 2.5[1] and even shade it properly.  I read claims
in this list that Spark shades protobuf correctly since 0.9.? and
looking thru the pom.xml on github it looks like Spark includes
protobuf 2.5 in the hadoop 2.3 profile.


I guess I'm still at "What's the deal with getting Spark to distribute
and load code from my jar correctly?"


[1] http://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.3.0/hadoop-project/pom.xml
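
(A small probe like the following can be dropped into a map() or the initializer above to see which loaders can actually resolve the generated class; the class name here matches the one in the ClassNotFoundException quoted earlier in this archive.)

// Probe: which classloaders can see the generated protobuf class?
String protoClass = "MyProtos$MyProto";
ClassLoader[] loaders = {
    Thread.currentThread().getContextClassLoader(),
    com.google.protobuf.GeneratedMessageLite.class.getClassLoader()
};
for (ClassLoader cl : loaders) {
    try {
        Class.forName(protoClass, false, cl);
        System.err.println(cl + " CAN load " + protoClass);
    } catch (ClassNotFoundException e) {
        System.err.println(cl + " CANNOT load " + protoClass);
    }
}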

On Thu, Sep 18, 2014 at 1:06 AM, Paul Wais <pw...@yelp.com> wrote:
> Dear List,
>
> I'm writing an application where I have RDDs of protobuf messages.
> When I run the app via bin/spark-submit with --master local
> --driver-class-path path/to/my/uber.jar, Spark is able to
> ser/deserialize the messages correctly.
>
> However, if I run WITHOUT --driver-class-path path/to/my/uber.jar or I
> try --master spark://my.master:7077 , then I run into errors that make
> it look like my protobuf message classes are not on the classpath:
>
> Exception in thread "main" org.apache.spark.SparkException: Job
> aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most
> recent failure: Lost task 0.0 in stage 1.0 (TID 0, localhost):
> java.lang.RuntimeException: Unable to find proto buffer class
>         com.google.protobuf.GeneratedMessageLite$SerializedForm.readResolve(GeneratedMessageLite.java:775)
>         sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         java.lang.reflect.Method.invoke(Method.java:606)
>         java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
>         java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1807)
>         ...
>
> Why do I need --driver-class-path in the local scenario?  And how can
> I ensure my classes are on the classpath no matter how my app is
> submitted via bin/spark-submit (e.g. --master spark://my.master:7077 )
> ?  I've tried poking through the shell scripts and SparkSubmit.scala
> and unfortunately I haven't been able to grok exactly what Spark is
> doing with the remote/local JVMs.
>
> Cheers,
> -Paul
