Posted to user@spark.apache.org by Mark Heimann <ma...@kard.info> on 2015/08/12 12:32:46 UTC

What is the Effect of Serialization within Stages?

Hello everyone,

I am wondering what the effect of serialization is within a stage.

My understanding of Spark as an execution engine is that the data flow
graph is divided into stages, and a new stage always starts after an
operation/transformation that cannot be pipelined (such as groupBy or join),
because such an operation can only complete after the whole data set has
"been taken care of". At the end of a stage shuffle files are written, and
at the beginning of the next stage they are read back.

Within a stage, my understanding is that pipelining is used, so I wonder
whether there is any serialization overhead involved when no shuffling
takes place. I am also assuming that my data set fits into memory and does
not need to be spilled to disk.

So if I chain multiple *map* or *flatMap* operations and they end up in
the same stage, will there be any serialization overhead for piping the
result of the first *map* operation as a parameter into the following *map*
operation?
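
For concreteness, here is a minimal sketch of the kind of chain I mean
(toy data, local master, made-up app name; any chain of narrow
transformations would do):

    import org.apache.spark.{SparkConf, SparkContext}

    object ChainedMaps {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("chained-maps").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Three narrow transformations in a row: no shuffle is needed,
        // so all of them should end up pipelined into a single stage.
        val result = sc.parallelize(1 to 10)
          .map(_ + 1)
          .flatMap(x => Seq(x, x * 2))
          .map(_.toString)

        result.collect().foreach(println)
        sc.stop()
      }
    }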

Any ideas and feedback appreciated, thanks a lot.

Best regards,
Mark

Re: What is the Effect of Serialization within Stages?

Posted by Mark Heimann <ma...@kard.info>.
Thanks a lot guys, that's exactly what I hoped for :-).

Cheers,
Mark

2015-08-13 6:35 GMT+02:00 Hemant Bhanawat <he...@gmail.com>:

> A chain of map and flatMap does not cause any
> serialization-deserialization.

Re: What is the Effect of Serialization within Stages?

Posted by Zoltán Zvara <zo...@gmail.com>.
Within a stage, serialization only occurs when you are using Python, and
as far as I know only in the first stage, when the data is read and handed
to the Python interpreter for the first time.

Chained operations are just compositions of plain *map* and *flatMap*
operators at the task level, applied to an ordinary Scala *Iterator[T]*,
where T is the element type of the RDD.
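
As a plain-Scala sketch of that (the partition contents are made up for
illustration; no Spark is needed to see the mechanism):

    object PipeliningSketch {
      def main(args: Array[String]): Unit = {
        // Stand-in for one partition's records; in a real task the
        // iterator comes from the input source or the shuffle reader.
        val partition: Iterator[String] = Iterator("a b", "c d")

        // What rdd.flatMap(_.split(" ")).map(_.toUpperCase) boils down
        // to inside a single task: lazy Iterator composition. Records
        // flow from one closure to the next as plain JVM references,
        // with no serialization in between.
        val pipelined = partition.flatMap(_.split(" ")).map(_.toUpperCase)

        pipelined.foreach(println) // prints A, B, C, D
      }
    }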

On Thu, Aug 13, 2015 at 4:09 PM Hemant Bhanawat <he...@gmail.com>
wrote:

> A chain of map and flatMap does not cause any
> serialization-deserialization.

Re: What is the Effect of Serialization within Stages?

Posted by Hemant Bhanawat <he...@gmail.com>.
A chain of map and flatMap does not cause any
serialization-deserialization.
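
One quick way to convince yourself, in spark-shell (where sc is
predefined): toDebugString prints the RDD lineage and indents at shuffle
boundaries, and a pure map/flatMap chain shows no such boundary, i.e. a
single stage.

    // In spark-shell; sc is provided by the shell.
    val chained = sc.parallelize(1 to 4).map(_ + 1).flatMap(x => Seq(x, -x))
    // The lineage prints as one un-indented chain of MapPartitionsRDDs:
    // no shuffle dependency, hence one stage.
    println(chained.toDebugString)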


