Posted to user@spark.apache.org by Sheel Pancholi <sh...@gmail.com> on 2021/02/26 09:22:35 UTC

Spark closures behavior in local mode in IDEs

Hi,

I am observing unexpected behavior of Spark closures in local mode on my
machine vs. a 3-node cluster (Spark 2.4.5).

Following is the piece of code:

object Example {
  val num = 5

  def myfunc = {
    // sc is assumed to be an existing SparkContext (e.g. the one provided by spark-shell)
    sc.parallelize(1 to 4).map(_ + num).foreach(println)
  }
}

I expected this to fail regardless, since the variable *num* is needed in the
closure and therefore the *Example* object would need to be serialized; but it
cannot be, since it does not extend the *Serializable* interface.

   - When I run the same piece of code from spark-shell on my local
   machine, it fails with the error expected from the rationale above
   (screenshot: https://i.stack.imgur.com/KgCRU.png).
   - When I run the same piece of code *in yarn mode* on a 3-node EMR
   cluster, it fails with the exact same error as in the screenshot above,
   for the same reason.
   - When I run the same piece of code *in local mode* on the same cluster
   (i.e. on the master node), it also fails. The same rationale still holds
   true.
   - However, when I run it from an sbt project *(not a Spark installation
   or anything...just the Spark libraries added to my sbt project and a
   conf.master(local[..]) used)* in local mode, it runs fine and gives me an
   output of 6, 7, 8, 9 (screenshot: https://i.stack.imgur.com/yUCdp.png).
   A minimal sketch of this setup follows this list.
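
For reference, here is a minimal, self-contained sketch of that last setup (a
sketch only: the driver object, app name, and local[*] master are illustrative
assumptions, not taken from the actual project):

import org.apache.spark.{SparkConf, SparkContext}

object Example {
  val num = 5

  def myfunc(sc: SparkContext): Unit =
    sc.parallelize(1 to 4).map(_ + num).foreach(println)
}

// Plain main method run directly from the IDE/sbt (no spark-submit); this is
// the case that prints 6, 7, 8, 9 instead of failing.
object Main {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("closure-test").setMaster("local[*]")
    val sc = new SparkContext(conf)
    Example.myfunc(sc)
    sc.stop()
  }
}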

This means it fails everywhere except when you run it by adding the Spark
dependencies to an sbt project and running in local mode. The question is:
what explains the different local-mode behavior when running your Spark code
by simply adding the Spark libraries to an sbt project?


Regards,

Sheel

Re: Spark closures behavior in local mode in IDEs

Posted by Sheel Pancholi <sh...@gmail.com>.
Thanks Owen

Agreed! The only explanation I have "made peace with" is that a Scala
"object", being a static singleton, natively requires no serialization: it is
available across the threads within the JVM, and would need to be serialized
only when the singleton has to be moved to a different machine. When you
instead run the same code with spark-shell or spark-submit, behavior is
consistent between local and cluster mode, as though a layer of the Spark
engine takes over the decision of whether and when an object needs to be
serialized.
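
As a plain Scala illustration of that point (nothing Spark-specific; the demo
object and thread count are made up for illustration), a singleton object is
reachable from any thread in the same JVM without ever being serialized:

object SingletonDemo {
  object Example { val num = 5 }

  def main(args: Array[String]): Unit = {
    // Every thread in this JVM reads the same singleton instance directly;
    // no serialization is involved.
    val threads = (1 to 4).map { i =>
      new Thread(new Runnable {
        def run(): Unit = println(s"thread $i sees num = ${Example.num}")
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}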

If the behavior I am observing in local mode (anyone can try this out) is not
intended when running with a plain java command (rather than the
spark-submit/spark-shell wrappers), then this may require a fix. If it instead
has a legitimate explanation other than my understanding outlined at the start
of this mail, that would be great to know.


On Fri, 26 Feb, 2021, 10:29 PM Sean Owen, <sr...@gmail.com> wrote:

> [quoted message trimmed]

Re: Spark closures behavior in local mode in IDEs

Posted by Sean Owen <sr...@gmail.com>.
Yeah, this is a good question. It certainly has to do with executing within
the same JVM, but even I'd have to dig into the code to explain why the
spark-shell version operates differently, as that also appears to be local.
To be clear, this 'shouldn't' work; it just happens not to fail in local
execution.

On Fri, Feb 26, 2021 at 10:40 AM Sheel Pancholi <sh...@gmail.com>
wrote:

> [quoted message trimmed]

Re: Spark closures behavior in local mode in IDEs

Posted by Sheel Pancholi <sh...@gmail.com>.
I am afraid that might at best be partially true. What would then explain
spark-shell in local mode also throwing the same error? It should have run
fine by that logic. On digging more, it became apparent why this is happening.

When you run your code by simply adding the Spark libraries to your project
and running in local mode, you are not really going through spark-submit, and
your static/singleton "object" is available across tasks inside the same JVM.
When you instead run the same code from an instance of a class in your IDE in
local mode, you see the exact same serialization issue, since the instance has
to be distributed to the tasks. Technically, staying within the same JVM
should not strictly require serialization, but somehow the flow of control
leads the Spark libraries to treat this as an issue. One can extrapolate and
hypothesize (unless the creators spare some time answering on this thread)
why a singleton object is treated differently when using spark-submit in
local mode: possibly to bring homogeneity to the way spark-submit treats
local and cluster mode, since even a static/singleton might need to be moved
to a different machine in cluster mode, bringing its case on par with a
regular class instance. A minimal sketch of the class-instance case is shown
below.
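
For contrast, a minimal sketch of the class-instance case described above (the
class name and constructor are illustrative, not from the actual project):

import org.apache.spark.SparkContext

// Same logic as Example, but in a regular class: the closure `_ + num` now
// captures `this`, so the NumAdder instance has to be serialized to ship the
// task closure. This is the case that fails with the serialization error
// described above, even in local mode.
class NumAdder(sc: SparkContext) {
  val num = 5

  def myfunc(): Unit =
    sc.parallelize(1 to 4).map(_ + num).foreach(println)
}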


On Fri, Feb 26, 2021 at 9:07 PM Lalwani, Jayesh <jl...@amazon.com> wrote:

> [quoted message trimmed]


-- 

Best Regards,

Sheel Pancholi

*Mob: +91 9620474620*

*Connect with me on: Twitter <https://twitter.com/sheelstera> | LinkedIn
<http://in.linkedin.com/in/sheelstera>*

*Write to me at:Sheel@Yahoo!! <sh...@yahoo.com> | Sheel@Gmail!!
<sh...@gmail.com> | Sheel@Windows Live <sh...@live.com>*

*Save a tree* - please do not print this email unless you really need to!

Re: Spark closures behavior in local mode in IDEs

Posted by "Lalwani, Jayesh" <jl...@amazon.com.INVALID>.
Yes, as you found, in local mode, Spark won’t serialize your objects. It will just pass the reference to the closure. This means that it is possible to write code that works in local mode, but doesn’t when you run distributed.
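
One way to keep such code portable between local and distributed runs (a
minimal sketch; passing sc explicitly and the localNum name are illustrative)
is to copy the field into a local val, so the closure captures only that value
rather than the enclosing object:

import org.apache.spark.SparkContext

object Example {
  val num = 5

  def myfunc(sc: SparkContext): Unit = {
    // Copying the field into a local val means the closure captures only
    // this value, not the Example object, so the job behaves the same in
    // local mode and on a cluster.
    val localNum = num
    sc.parallelize(1 to 4).map(_ + localNum).foreach(println)
  }
}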

From: Sheel Pancholi <sh...@gmail.com>
Date: Friday, February 26, 2021 at 4:24 AM
To: user <us...@spark.apache.org>
Subject: [EXTERNAL] Spark closures behavior in local mode in IDEs


[quoted original message trimmed]