You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Patrick Brown <pa...@gmail.com> on 2018/10/24 20:09:37 UTC

KryoSerializer Implementation - Not using KryoPool

Hi,

I am wondering about the implementation of KryoSerializer, specifically the
lack of use of KryoPool, which is recommended by Kryo themselves.

Looking at the code, it seems that frequently KryoSerializer.newInstance is
called, followed by a serialize and then this instance goes out of scope,
this seems like it causes frequent creation of Kryo instances, something
which the Kryo documentation says is expensive.

By doing flame graphs on our own running software (it processes a lot of
small jobs) it seems like a good amount of time is spent on this.

I have a small patch we are using internally which implements a reused
KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to
avoid the creation of many Kryo instances. I am wonder if I am missing
something as to why this isn't done already. If not I am wondering if this
might be a patch that Spark would be interested in merging in, and how I
might go about that.

Thanks,

Patrick

Re: KryoSerializer Implementation - Not using KryoPool

Posted by Sean Owen <sr...@gmail.com>.

It's not so much the KryoSerializerInstance that's the problem, but
that it will always make a new Kryo (although at most 1). You mean to
supply it with a reference to a pool instead, shared across all
KryoSerializerInstance? plausible yeah.

See https://spark.apache.org/contributing.html for guidance but
basically you'd make a JIRA and then a pull request at apache/spark,
with the JIRA number in the title and some other conventions.
On Thu, Oct 25, 2018 at 11:51 AM Patrick Brown
<pa...@gmail.com> wrote:
>
> Based on my, limited, read through of the code that uses this, it seems like often a new KryoSerializerInstance is created for whatever task and then it falls out of scope, instead of being reused. I did notice that comment about a pool size of 1, however if the use is generally how I just described, I don't think that really does anything.
>
> I would be happy to open a PR but, pardon my ignorance, how would I go about doing that properly? Do I need to open a JIRA issue first? Also how would I demonstrate performance gains? Do you guys use something like ScalaMeter?
>
> Thanks for your help!
>
> On Wed, Oct 24, 2018 at 2:37 PM Sean Owen <sr...@gmail.com> wrote:
>>
>> I don't know; possibly just because it wasn't available whenever Kryo
>> was first used in the project.
>>
>> Skimming the code, the KryoSerializerInstance looks like a wrapper
>> that provides a Kryo object to do work. It already maintains a 'pool'
>> of just 1 instance. Is the point that KryoSerializer can share a
>> KryoPool across KryoSerializerInstances that provides them with a Kryo
>> rather than allocate a new one? makes sense, though I believe the
>> concern is always whether that somehow shares state or config in a way
>> that breaks something. I see there's already a reset() call in here to
>> try to avoid that.
>>
>> Well, seems worth a PR, especially if you can demonstrate some
>> performance gains.
>>
>> On Wed, Oct 24, 2018 at 3:09 PM Patrick Brown
>> <pa...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I am wondering about the implementation of KryoSerializer, specifically the lack of use of KryoPool, which is recommended by Kryo themselves.
>> >
>> > Looking at the code, it seems that frequently KryoSerializer.newInstance is called, followed by a serialize and then this instance goes out of scope, this seems like it causes frequent creation of Kryo instances, something which the Kryo documentation says is expensive.
>> >
>> > By doing flame graphs on our own running software (it processes a lot of small jobs) it seems like a good amount of time is spent on this.
>> >
>> > I have a small patch we are using internally which implements a reused KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to avoid the creation of many Kryo instances. I am wonder if I am missing something as to why this isn't done already. If not I am wondering if this might be a patch that Spark would be interested in merging in, and how I might go about that.
>> >
>> > Thanks,
>> >
>> > Patrick

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: KryoSerializer Implementation - Not using KryoPool

Posted by Patrick Brown <pa...@gmail.com>.

Based on my, limited, read through of the code that uses this, it seems
like often a new KryoSerializerInstance is created for whatever task and
then it falls out of scope, instead of being reused. I did notice that
comment about a pool size of 1, however if the use is generally how I just
described, I don't think that really does anything.

I would be happy to open a PR but, pardon my ignorance, how would I go
about doing that properly? Do I need to open a JIRA issue first? Also how
would I demonstrate performance gains? Do you guys use something like
ScalaMeter?

Thanks for your help!

On Wed, Oct 24, 2018 at 2:37 PM Sean Owen <sr...@gmail.com> wrote:

> I don't know; possibly just because it wasn't available whenever Kryo
> was first used in the project.
>
> Skimming the code, the KryoSerializerInstance looks like a wrapper
> that provides a Kryo object to do work. It already maintains a 'pool'
> of just 1 instance. Is the point that KryoSerializer can share a
> KryoPool across KryoSerializerInstances that provides them with a Kryo
> rather than allocate a new one? makes sense, though I believe the
> concern is always whether that somehow shares state or config in a way
> that breaks something. I see there's already a reset() call in here to
> try to avoid that.
>
> Well, seems worth a PR, especially if you can demonstrate some
> performance gains.
>
> On Wed, Oct 24, 2018 at 3:09 PM Patrick Brown
> <pa...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am wondering about the implementation of KryoSerializer, specifically
> the lack of use of KryoPool, which is recommended by Kryo themselves.
> >
> > Looking at the code, it seems that frequently KryoSerializer.newInstance
> is called, followed by a serialize and then this instance goes out of
> scope, this seems like it causes frequent creation of Kryo instances,
> something which the Kryo documentation says is expensive.
> >
> > By doing flame graphs on our own running software (it processes a lot of
> small jobs) it seems like a good amount of time is spent on this.
> >
> > I have a small patch we are using internally which implements a reused
> KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to
> avoid the creation of many Kryo instances. I am wonder if I am missing
> something as to why this isn't done already. If not I am wondering if this
> might be a patch that Spark would be interested in merging in, and how I
> might go about that.
> >
> > Thanks,
> >
> > Patrick
>

Re: KryoSerializer Implementation - Not using KryoPool

Posted by Sean Owen <sr...@gmail.com>.

I don't know; possibly just because it wasn't available whenever Kryo
was first used in the project.

Skimming the code, the KryoSerializerInstance looks like a wrapper
that provides a Kryo object to do work. It already maintains a 'pool'
of just 1 instance. Is the point that KryoSerializer can share a
KryoPool across KryoSerializerInstances that provides them with a Kryo
rather than allocate a new one? makes sense, though I believe the
concern is always whether that somehow shares state or config in a way
that breaks something. I see there's already a reset() call in here to
try to avoid that.

Well, seems worth a PR, especially if you can demonstrate some
performance gains.

On Wed, Oct 24, 2018 at 3:09 PM Patrick Brown
<pa...@gmail.com> wrote:
>
> Hi,
>
> I am wondering about the implementation of KryoSerializer, specifically the lack of use of KryoPool, which is recommended by Kryo themselves.
>
> Looking at the code, it seems that frequently KryoSerializer.newInstance is called, followed by a serialize and then this instance goes out of scope, this seems like it causes frequent creation of Kryo instances, something which the Kryo documentation says is expensive.
>
> By doing flame graphs on our own running software (it processes a lot of small jobs) it seems like a good amount of time is spent on this.
>
> I have a small patch we are using internally which implements a reused KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to avoid the creation of many Kryo instances. I am wonder if I am missing something as to why this isn't done already. If not I am wondering if this might be a patch that Spark would be interested in merging in, and how I might go about that.
>
> Thanks,
>
> Patrick

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org