
Posted to user@spark.apache.org by Mohit Jaggi <mo...@gmail.com> on 2014/11/04 18:22:15 UTC

MEMORY_ONLY_SER question

Folks,
If I have an RDD persisted in MEMORY_ONLY_SER mode and then it is needed
for a transformation/action later, is the whole partition of the RDD
deserialized into Java objects first before my transform/action code works
on it? Or is it deserialized in a streaming manner as the iterator moves
over the partition? Is this behavior customizable? I generally use the Kryo
serializer.

Mohit.

Re: MEMORY_ONLY_SER question

Posted by Mohit Jaggi <mo...@gmail.com>.

Thanks, Jerry and Tathagata. Does anyone know how Kryo compresses data? Are
there any other serializers that work with Spark and have good compression
for basic data types?

On Tue, Nov 4, 2014 at 10:29 PM, Shao, Saisai <sa...@intel.com> wrote:

>  From my understanding, Spark uses Kryo in a streaming manner for RDD
> partitions: deserialization proceeds as the iteration moves forward.
> Whether Kryo internally deserializes all the objects at once or
> incrementally is a behavior of Kryo itself, but I guess Kryo will not
> deserialize all the objects at once.
>
>
>
> Thanks
>
> Jerry

RE: MEMORY_ONLY_SER question

Posted by "Shao, Saisai" <sa...@intel.com>.

From my understanding, Spark uses Kryo in a streaming manner for RDD partitions: deserialization proceeds as the iteration moves forward. Whether Kryo internally deserializes all the objects at once or incrementally is a behavior of Kryo itself, but I guess Kryo will not deserialize all the objects at once.

Thanks
Jerry

From: Mohit Jaggi [mailto:mohitjaggi@gmail.com]
Sent: Wednesday, November 05, 2014 2:01 PM
To: Tathagata Das
Cc: user@spark.apache.org
Subject: Re: MEMORY_ONLY_SER question

I used the word "streaming" but I did not mean to refer to Spark Streaming. I meant: if a partition containing 10 objects was Kryo-serialized into a single buffer, then in a mapPartitions() call, as I call iter.next() 10 times to access these objects one at a time, does the deserialization happen
a) once to get all 10 objects,
b) 10 times "incrementally" to get an object at a time, or
c) 10 times to get 10 objects and discard the "wrong" 9 objects [ I doubt this would be a design anyone would have adopted ]
I think your answer is option (a) and you referred to Spark Streaming to indicate that there is no difference in its behavior from Spark core...right?

If it is indeed option (a), I am happy with it and don't need to customize. If it is (b), I would like to have (a) instead.

I am also wondering if Kryo is good at compression of strings and numbers. Often I have the data type as "Double" but it could be encoded in far fewer bits.

Re: MEMORY_ONLY_SER question

Posted by Mohit Jaggi <mo...@gmail.com>.

I used the word "streaming" but I did not mean to refer to Spark Streaming.
I meant: if a partition containing 10 objects was Kryo-serialized into a
single buffer, then in a mapPartitions() call, as I call iter.next() 10
times to access these objects one at a time, does the deserialization happen
a) once to get all 10 objects,
b) 10 times "incrementally" to get an object at a time, or
c) 10 times to get 10 objects and discard the "wrong" 9 objects [ I doubt
this would be a design anyone would have adopted ]
I think your answer is option (a) and you referred to Spark Streaming to
indicate that there is no difference in its behavior from Spark
core...right?

If it is indeed option (a), I am happy with it and don't need to customize.
If it is (b), I would like to have (a) instead.

I am also wondering if Kryo is good at compression of strings and numbers.
Often I have the data type as "Double" but it could be encoded in far
fewer bits.
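
To make the "fewer bits" point concrete, here is a rough sketch in plain
Python. It does not use Kryo or Spark and is not a description of Kryo's
wire format; it only compares a fixed 8-byte double encoding with a 4-byte
float and a variable-length integer encoding for values that happen to be
small whole numbers. The varint() helper and the sample values are made up
for illustration. Separately, Spark has a spark.rdd.compress setting that
compresses serialized RDD partitions, which is a different mechanism from
choosing a narrower encoding per value.

```python
import struct


def varint(n):
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more groups follow
        else:
            out.append(byte)
            return bytes(out)


# Hypothetical data: typed as doubles, but really small whole numbers.
values = [3.0, 41.0, 250.0]

as_double = b"".join(struct.pack(">d", v) for v in values)   # 8 bytes each
as_float = b"".join(struct.pack(">f", v) for v in values)    # 4 bytes each
as_varint = b"".join(varint(int(v)) for v in values)         # 1 or 2 bytes each

sizes = (len(as_double), len(as_float), len(as_varint))
print(sizes)
```

The narrower encodings are only safe when the data really fits: a float
loses precision for many doubles, and the varint path assumes non-negative
whole numbers.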

On Tue, Nov 4, 2014 at 1:02 PM, Tathagata Das <ta...@gmail.com>
wrote:

> It is deserialized in a streaming manner as the iterator moves over the
> partition. This is a functionality of core Spark, and Spark Streaming just
> uses it as is.
> What do you want to customize it to?

Re: MEMORY_ONLY_SER question

Posted by Tathagata Das <ta...@gmail.com>.

It is deserialized in a streaming manner as the iterator moves over the
partition. This is a functionality of core Spark, and Spark Streaming just
uses it as is.
What do you want to customize it to?
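
The behavior described here matches option (b) from the question elsewhere
in this thread: one object is decoded per iter.next(). A minimal sketch of
that pattern follows, in plain Python with pickle standing in for Kryo. It
is an analogy under those assumptions, not Spark's actual code path;
serialize_partition(), iter_deserialize() and the counter are hypothetical
names used only for illustration.

```python
import io
import pickle


def serialize_partition(records):
    """Write every record back to back into one byte buffer."""
    buf = io.BytesIO()
    for record in records:
        pickle.dump(record, buf)
    return buf.getvalue()


def iter_deserialize(data, counter):
    """Yield records lazily; each next() decodes exactly one record."""
    stream = io.BytesIO(data)
    while True:
        try:
            record = pickle.load(stream)
        except EOFError:
            return  # buffer exhausted
        counter["decoded"] += 1
        yield record


partition = serialize_partition(range(10))
counter = {"decoded": 0}
it = iter_deserialize(partition, counter)

first = next(it)
decoded_after_first = counter["decoded"]  # only one record decoded so far

rest = list(it)  # materializing the iterator decodes the remaining records
decoded_after_all = counter["decoded"]

print(first, decoded_after_first, decoded_after_all)
```

If option (a) is what you want, the last step shows the usual way to get
it: materialize the iterator into a collection at the start of your
mapPartitions() function, at the cost of holding the whole partition as
objects in memory.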

On Tue, Nov 4, 2014 at 9:22 AM, Mohit Jaggi <mo...@gmail.com> wrote:

> Folks,
> If I have an RDD persisted in MEMORY_ONLY_SER mode and then it is needed
> for a transformation/action later, is the whole partition of the RDD
> deserialized into Java objects first before my transform/action code works
> on it? Or is it deserialized in a streaming manner as the iterator moves
> over the partition? Is this behavior customizable? I generally use the Kryo
> serializer.
>
> Mohit.
>