Posted to user@spark.apache.org by aecc <al...@gmail.com> on 2014/11/24 19:15:26 UTC

Using Spark Context as an attribute of a class cannot be used

Hello guys,

I'm using Spark 1.0.0 and Kryo serialization.
In the Spark shell, when I create a class that has the SparkContext as an
attribute, like this:

class AAA(val s: SparkContext) { }
val aaa = new AAA(sc)

and I execute any action using that attribute like:

val myNumber = 5
aaa.s.textFile("FILE").filter(_ == myNumber.toString).count
or
aaa.s.parallelize(1 to 10).filter(_ == myNumber).count

I get a NotSerializableException:

org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$AAA
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
    at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Any thoughts on how to solve this issue or work around it? I'm actually
developing an API that will need to use this SparkContext several times in
different places, so it needs to be accessible.

Thanks a lot for the cooperation






Re: Using Spark Context as an attribute of a class cannot be used

Posted by Marcelo Vanzin <va...@cloudera.com>.
That's an interesting question for which I do not know the answer.
Probably a question for someone with more knowledge of the internals
of the shell interpreter...

On Mon, Nov 24, 2014 at 2:19 PM, aecc <al...@gmail.com> wrote:
> Ok, great, I'm going to do it that way, thanks :). However, I still don't
> understand why this object needs to be serialized and shipped.
>
> aaa.s and sc are both the same object org.apache.spark.SparkContext@1f222881
>
> However, this:
> aaa.s.parallelize(1 to 10).filter(_ == myNumber).count
>
> needs to be serialized, while this:
>
> sc.parallelize(1 to 10).filter(_ == myNumber).count
>
> does not.


-- 
Marcelo



Re: Using Spark Context as an attribute of a class cannot be used

Posted by aecc <al...@gmail.com>.
Ok, great, I'm going to do it that way, thanks :). However, I still don't
understand why this object needs to be serialized and shipped.

aaa.s and sc are both the same object org.apache.spark.SparkContext@1f222881

However, this:
aaa.s.parallelize(1 to 10).filter(_ == myNumber).count

needs to be serialized, while this:

sc.parallelize(1 to 10).filter(_ == myNumber).count

does not.

2014-11-24 23:13 GMT+01:00 Marcelo Vanzin [via Apache Spark User List] <
ml-node+s1001560n19692h43@n3.nabble.com>:

> On Mon, Nov 24, 2014 at 1:56 PM, aecc <[hidden email]> wrote:
> > I checked SQLContext; they use it in the same way I would like to use my
> > class: they make the class Serializable with the context marked transient.
> > Does this affect the data pipeline somehow? I mean, will I get performance
> > issues from doing this, because now the class will be serialized for some
> > reason that I still don't understand?
>
> If you want to do the same thing, your "AAA" needs to be serializable
> and you need to mark all non-serializable fields as "@transient". The
> only performance penalty you'll be paying is the serialization /
> deserialization of the "AAA" instance, which most probably will be
> really small compared to the actual work the task will be doing.
>
> Unless your class is holding a whole lot of data, in which case you
> should start thinking about using a broadcast instead.
>
> --
> Marcelo
>



-- 
Alessandro Chacón
Aecc_ORG





Re: Using Spark Context as an attribute of a class cannot be used

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Mon, Nov 24, 2014 at 1:56 PM, aecc <al...@gmail.com> wrote:
> I checked SQLContext; they use it in the same way I would like to use my
> class: they make the class Serializable with the context marked transient.
> Does this affect the data pipeline somehow? I mean, will I get performance
> issues from doing this, because now the class will be serialized for some
> reason that I still don't understand?

If you want to do the same thing, your "AAA" needs to be serializable
and you need to mark all non-serializable fields as "@transient". The
only performance penalty you'll be paying is the serialization /
deserialization of the "AAA" instance, which most probably will be
really small compared to the actual work the task will be doing.
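
For illustration, a minimal sketch of that approach (the class name AAA comes
from the thread; the extra "label" field is made up, and only the SparkContext
is excluded from serialization):

import org.apache.spark.SparkContext

class AAA(@transient val s: SparkContext) extends Serializable {
  // The context is only meaningful on the driver, so it is marked @transient;
  // after deserialization on an executor, `s` would simply be null.
  val label: String = "my-api"   // hypothetical serializable attribute
}

val aaa = new AAA(sc)
val myNumber = 5
aaa.s.parallelize(1 to 10).filter(_ == myNumber).count

This mirrors how SQLContext declares its context, so the AAA instance (minus
the context) just rides along with each task.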

Unless your class is holding a whole lot of data, in which case you
should start thinking about using a broadcast instead.
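
For completeness, a rough sketch of the broadcast alternative, in case the
wrapper would otherwise drag a lot of data into every task (the lookup table
here is invented):

val lookup = (1 to 100000).map(i => i -> i.toString).toMap   // large driver-side data
val lookupBc = sc.broadcast(lookup)                          // shipped to each executor once

sc.parallelize(1 to 10).map(i => lookupBc.value.getOrElse(i, "?")).count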

-- 
Marcelo



Re: Using Spark Context as an attribute of a class cannot be used

Posted by aecc <al...@gmail.com>.
Yes, I'm running this in the shell. In my compiled jar it works perfectly; the
issue is that I need to do this in the shell.

Any available workarounds?

I checked SQLContext; they use it in the same way I would like to use my
class: they make the class Serializable with the context marked transient.
Does this affect the data pipeline somehow? I mean, will I get performance
issues from doing this, because now the class will be serialized for some
reason that I still don't understand?


2014-11-24 22:33 GMT+01:00 Marcelo Vanzin [via Apache Spark User List] <
ml-node+s1001560n19687h88@n3.nabble.com>:

> Hello,
>
> On Mon, Nov 24, 2014 at 12:07 PM, aecc <[hidden email]> wrote:
> > This is the stacktrace:
> >
> > org.apache.spark.SparkException: Job aborted due to stage failure: Task not
> > serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$AAA
> >         - field (class "$iwC$$iwC$$iwC$$iwC", name: "aaa", type: "class $iwC$$iwC$$iwC$$iwC$AAA")
>
> Ah. Looks to me that you're trying to run this in spark-shell, right?
>
> I'm not 100% sure of how it works internally, but I think the Scala
> repl works a little differently than regular Scala code in this
> regard. When you declare a "val" in the shell it will behave
> differently than a "val" inside a method in a compiled Scala class -
> the former will behave like an instance variable, the latter like a
> local variable. So, this is probably why you're running into this.
>
> Try compiling your code and running it outside the shell to see how it
> goes. I'm not sure whether there's a workaround for this when trying
> things out in the shell - maybe declare an `object` to hold your
> constants? Never really tried, so YMMV.
>
> --
> Marcelo
>



-- 
Alessandro Chacón
Aecc_ORG





Re: Using Spark Context as an attribute of a class cannot be used

Posted by Marcelo Vanzin <va...@cloudera.com>.
Hello,

On Mon, Nov 24, 2014 at 12:07 PM, aecc <al...@gmail.com> wrote:
> This is the stacktrace:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$AAA
>         - field (class "$iwC$$iwC$$iwC$$iwC", name: "aaa", type: "class $iwC$$iwC$$iwC$$iwC$AAA")

Ah. Looks to me that you're trying to run this in spark-shell, right?

I'm not 100% sure of how it works internally, but I think the Scala
repl works a little differently than regular Scala code in this
regard. When you declare a "val" in the shell it will behave
differently than a "val" inside a method in a compiled Scala class -
the former will behave like an instance variable, the latter like a
local variable. So, this is probably why you're running into this.
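
Very roughly, and only as an illustration (this is not the actual code the
interpreter generates), a top-level val in the shell ends up as a field of a
wrapper object, something like:

// Illustration only; stands in for the REPL's generated $iwC classes.
class ShellWrapper extends Serializable {
  val myNumber = 5   // a shell `val` becomes an instance field
  // every other val defined at the prompt, including `aaa`, lives here too
}
// So `filter(_ == myNumber)` effectively means
// `filter(_ == ShellWrapper.this.myNumber)`, and the closure drags the whole
// wrapper -- with the non-serializable `aaa` field -- along with it, which is
// what the "$iwC...AAA" in the stack trace is pointing at.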

Try compiling your code and running it outside the shell to see how it
goes. I'm not sure whether there's a workaround for this when trying
things out in the shell - maybe declare an `object` to hold your
constants? Never really tried, so YMMV.
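
If you do try the `object` idea, a sketch of what it might look like (whether
it actually keeps the shell wrapper out of the closure is untested, as said):

object MyConstants extends Serializable {
  val myNumber = 5
}

aaa.s.parallelize(1 to 10).filter(_ == MyConstants.myNumber).count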

-- 
Marcelo



Re: Using Spark Context as an attribute of a class cannot be used

Posted by aecc <al...@gmail.com>.
If, instead of using myNumber, I use the literal value 5, the exception is not
thrown. E.g.:

aaa.s.parallelize(1 to 10).filter(_ == 5).count

Works perfectly






Re: Using Spark Context as an attribute of a class cannot be used

Posted by aecc <al...@gmail.com>.
Marcelo Vanzin wrote
> Do you expect to be able to use the spark context on the remote task?

Not at all. What I want to create is a wrapper around the SparkContext, to be
used only on the driver node.
I would like this "AAA" wrapper to hold several attributes, such as the
SparkContext and other configuration for my project.

I tested using -Dsun.io.serialization.extendedDebugInfo=true

This is the stacktrace:

org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$AAA
    - field (class "$iwC$$iwC$$iwC$$iwC", name: "aaa", type: "class $iwC$$iwC$$iwC$$iwC$AAA")
    - object (class "$iwC$$iwC$$iwC$$iwC", $iwC$$iwC$$iwC$$iwC@24e57dcb)
    - field (class "$iwC$$iwC$$iwC", name: "$iw", type: "class $iwC$$iwC$$iwC$$iwC")
    - object (class "$iwC$$iwC$$iwC", $iwC$$iwC$$iwC@178cc62b)
    - field (class "$iwC$$iwC", name: "$iw", type: "class $iwC$$iwC$$iwC")
    - object (class "$iwC$$iwC", $iwC$$iwC@1e9f5eeb)
    - field (class "$iwC", name: "$iw", type: "class $iwC$$iwC")
    - object (class "$iwC", $iwC@37d8e87e)
    - field (class "$line18.$read", name: "$iw", type: "class $iwC")
    - object (class "$line18.$read", $line18.$read@124551f)
    - field (class "$iwC$$iwC$$iwC", name: "$VAL15", type: "class $line18.$read")
    - object (class "$iwC$$iwC$$iwC", $iwC$$iwC$$iwC@2e846e6b)
    - field (class "$iwC$$iwC$$iwC$$iwC", name: "$outer", type: "class $iwC$$iwC$$iwC")
    - object (class "$iwC$$iwC$$iwC$$iwC", $iwC$$iwC$$iwC$$iwC@4b31ba1b)
    - field (class "$iwC$$iwC$$iwC$$iwC$$anonfun$1", name: "$outer", type: "class $iwC$$iwC$$iwC$$iwC")
    - object (class "$iwC$$iwC$$iwC$$iwC$$anonfun$1", <function1>)
    - field (class "org.apache.spark.rdd.FilteredRDD", name: "f", type: "interface scala.Function1")
    - root object (class "org.apache.spark.rdd.FilteredRDD", FilteredRDD[3] at filter at <console>:20)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)

I actually don't understand much about this stack trace. If you can help me,
I would appreciate it.

Marking the field @transient didn't work either.

Thanks a lot






Re: Using Spark Context as an attribute of a class cannot be used

Posted by Marcelo Vanzin <va...@cloudera.com>.
Do you expect to be able to use the spark context on the remote task?

If you do, that won't work. You'll need to rethink what it is you're
trying to do, since SparkContext is not serializable and it doesn't
make sense to make it so. If you don't, you could mark the field as
@transient.
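
As a quick sketch of that distinction (the wrapper name is invented, and the
class also needs to be serializable for @transient to have any effect, as
comes up elsewhere in the thread):

import org.apache.spark.SparkContext

// Driver-only usage: exclude the context from serialization entirely.
class Wrapper(@transient val s: SparkContext) extends Serializable

val w = new Wrapper(sc)
w.s.parallelize(1 to 10).count   // fine: this call runs on the driver

// Using w.s *inside* a task can never work, because SparkContext is not
// shipped to executors:
// w.s.parallelize(1 to 10).map(i => w.s.defaultParallelism).count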

But the two examples you posted shouldn't be creating a reference to
the "aaa" variable in the serialized task. You could use
-Dsun.io.serialization.extendedDebugInfo=true to debug these things.


On Mon, Nov 24, 2014 at 10:15 AM, aecc <al...@gmail.com> wrote:
> Hello guys,
>
> I'm using Spark 1.0.0 and Kryo serialization
> In the Spark Shell, when I create a class that contains as an attribute the
> SparkContext, in this way:
>
> class AAA(val s: SparkContext) { }
> val aaa = new AAA(sc)
>
> and I execute any action using that attribute like:
>
> val myNumber = 5
> aaa.s.textFile("FILE").filter(_ == myNumber.toString).count
> or
> aaa.s.parallelize(1 to 10).filter(_ == myNumber).count
>
> Returns a NotSerializableException:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$AAA
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
>         at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
> Any thoughts on how to solve this issue or work around it? I'm actually
> developing an API that will need to use this SparkContext several times in
> different places, so it needs to be accessible.
>
> Thanks a lot for the cooperation
>
>
>



-- 
Marcelo
