Posted to dev@pulsar.apache.org by Matteo Merli <mm...@apache.org> on 2020/12/23 17:35:14 UTC

[PROPOSAL] PIP 75: Replace protobuf code generator

## Motivation

In the Pulsar wire protocol, we are using Google Protobuf in order to perform
serialization/deserialization of the commands that are exchanged between
clients and brokers.

Because of the overhead of the regular Protobuf implementation, we have been
using a modified version of Protobuf 2.4.1 since very early on. The
modifications make the generated serialization code more efficient by using
thread-local caches for the objects used in the process.

There are a few issues with the current approach:
 1. The patch to the Protobuf code generator is based on version 2.4.1 only and
    cannot be upgraded to newer Protobuf versions.
 2. The newer Protobuf versions, 3.x, have the same performance issues as the
    2.x versions.
 3. The thread-local approach for reusing objects is not ideal. Thread-local
    access is not free, and it would be better to cache only the root objects
    instead.
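
To make point 3 concrete, here is a minimal sketch contrasting the two reuse
strategies. All class names (`Command`, `ThreadLocalReuse`, `ConnectionHandler`)
are hypothetical and are not taken from the Pulsar code base. The sketch also
assumes that each handler is only ever used from a single thread, which is what
makes a plain field safe.

```java
// Hypothetical illustration; none of these classes exist in Pulsar.
final class Command {
    long requestId;
    final StringBuilder topic = new StringBuilder();

    // Reset the fields so the same instance can carry the next command.
    Command clear() {
        requestId = 0;
        topic.setLength(0);
        return this;
    }
}

// Strategy 1: every acquisition pays for a ThreadLocal lookup.
final class ThreadLocalReuse {
    private static final ThreadLocal<Command> CACHE =
            ThreadLocal.withInitial(Command::new);

    static Command acquire() {
        return CACHE.get().clear();
    }
}

// Strategy 2: the owner keeps one root object in a plain field.
// Assumption: a handler instance is confined to one thread.
final class ConnectionHandler {
    private final Command reusable = new Command();

    Command acquire() {
        return reusable.clear();
    }
}

final class ReuseDemo {
    public static void main(String[] args) {
        ConnectionHandler handler = new ConnectionHandler();
        Command first = handler.acquire();
        first.requestId = 7;
        Command second = handler.acquire();
        System.out.println(first == second);  // the same instance is handed out again
        System.out.println(second.requestId); // the field was reset by clear()
    }
}
```

In both strategies no new `Command` is allocated per use; the difference is
only how the cached instance is reached.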

## Goal

Have an efficient and maintainable way to perform serialization/deserialization
of the Pulsar protocol.

The current proposal is to switch from the patched Protobuf 2.4.1 to a
different code generator, Splunk LightProto:
https://github.com/splunk/lightproto.

This code generator has the following features/goals:

 1. Generate the fastest possible Java code for Protobuf SerDe
 2. 100% compatible with proto2 definition and wire protocol
 3. Zero-copy deserialization using Netty ByteBuf
 4. Deserialize from direct memory
 5. Zero heap allocations in serialization / deserialization
 6. Lazy deserialization of strings and bytes
 7. Reusable mutable objects
 8. No runtime library dependency
 9. Java-based code generator with Maven plugin

There is extensive testing to ensure that the generated code serializes and
parses the same bytes in the same way as Google Protobuf does.
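
As a rough illustration of features 4, 6 and 7, the sketch below parses a tiny
message from a direct buffer into a reusable object and decodes its string
field only on first access. This is hand-written code, not LightProto's
generated output or API. The `Ping` message and its fields are hypothetical,
and `java.nio.ByteBuffer` stands in for Netty's `ByteBuf` so the example has no
dependencies. Unlike the stated goal, the string accessor here does allocate
when it is called.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hand-written stand-in for generated code; NOT LightProto's actual output.
// Hypothetical proto2 message:
//   message Ping { required uint64 id = 1; optional string name = 2; }
final class Ping {
    private long id;
    private ByteBuffer source;   // buffer the message was parsed from (not copied)
    private int nameOffset = -1; // absolute index of the name bytes, -1 if absent
    private int nameLength;
    private String name;         // decoded on first access only

    long getId() { return id; }

    boolean hasName() { return nameOffset >= 0; }

    // Lazy decoding: the UTF-8 bytes become a String only when requested.
    String getName() {
        if (name == null && nameOffset >= 0) {
            byte[] tmp = new byte[nameLength];
            for (int i = 0; i < nameLength; i++) {
                tmp[i] = source.get(nameOffset + i); // absolute read
            }
            name = new String(tmp, StandardCharsets.UTF_8);
        }
        return name;
    }

    // Parse `size` bytes from the buffer's current position, overwriting
    // whatever this instance held before (reusable mutable object).
    Ping parseFrom(ByteBuffer buf, int size) {
        id = 0;
        name = null;
        nameOffset = -1;
        nameLength = 0;
        source = buf;
        int end = buf.position() + size;
        while (buf.position() < end) {
            int tag = (int) readVarint(buf);
            switch (tag) {
                case 0x08: // field 1, wire type 0 (varint)
                    id = readVarint(buf);
                    break;
                case 0x12: // field 2, wire type 2 (length-delimited)
                    nameLength = (int) readVarint(buf);
                    nameOffset = buf.position();
                    buf.position(nameOffset + nameLength); // skip, do not copy
                    break;
                default:
                    throw new IllegalArgumentException("unexpected tag " + tag);
            }
        }
        return this;
    }

    // Standard Protobuf base-128 varint decoding.
    static long readVarint(ByteBuffer buf) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = buf.get();
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("malformed varint");
    }
}

final class PingDemo {
    public static void main(String[] args) {
        // id = 300 is encoded as the varint AC 02; name = "hi"
        byte[] wire = {0x08, (byte) 0xAC, 0x02, 0x12, 0x02, 'h', 'i'};
        ByteBuffer direct = ByteBuffer.allocateDirect(wire.length);
        direct.put(wire);
        direct.flip();

        Ping ping = new Ping(); // allocated once, reused for every message
        ping.parseFrom(direct, wire.length);
        System.out.println(ping.getId() + " " + ping.getName()); // prints "300 hi"
    }
}
```

Calling `parseFrom` again on the same `Ping` instance replaces its contents, so
a long-lived owner can keep a single instance instead of allocating one per
message.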


--
Matteo Merli
<mm...@apache.org>

Re: [PROPOSAL] PIP 75: Replace protobuf code generator

Posted by Matteo Merli <ma...@gmail.com>.
On Wed, Dec 23, 2020 at 11:45 AM Enrico Olivelli <eo...@gmail.com> wrote:
> I guess we cannot leverage it in BK, as we are on Protobuf 3.x

The initial focus was on proto2, though with a few changes we can also
support the proto3 syntax.


--
Matteo Merli
<ma...@gmail.com>

Re: [PROPOSAL] PIP 75: Replace protobuf code generator

Posted by Enrico Olivelli <eo...@gmail.com>.
Dave,

On Wed, Dec 23, 2020, 19:58 Dave Fisher <wa...@apache.org> wrote:

>
>
> > On Dec 23, 2020, at 10:55 AM, Matteo Merli <ma...@gmail.com>
> wrote:
> >
> > Hi Dave,
> >
> > these are not dependencies that are needed (or would be pulled) from
> > the Pulsar build.
> >
> > The JMH is only used if someone wants to run a benchmark on the
> > LightProto code. Pulsar does not depend on that.
>
> Excellent. You’ll need to note that for developers if benchmarking is
> exposed in Pulsar.
>

JMH is only for microbenchmarks. That is something users don't run; only
Pulsar (Splunk) developers do.

Matteo,
Very nice work!

Looking forward to seeing this land in the community code.

I guess we cannot leverage it in BK, as we are on Protobuf 3.x

Enrico





> >
> > From Pulsar build we would need:
> > 1. At build time -> The LightProto Maven Plugin with the generator module
> > 2. At runtime (and in Apache release) -> No dependencies
> >
> >
> > Matteo
> >
> >
> >
> > --
> > Matteo Merli
> > <ma...@gmail.com>
> >
> > On Wed, Dec 23, 2020 at 10:20 AM Dave Fisher <wa...@apache.org> wrote:
> >>
> >> Hi Matteo,
> >>
> >> I looked at the dependencies in Splunk LightProto and the following are
> not acceptable in an Apache Binary:
> >>
> >>            <dependency>
> >>                <groupId>org.openjdk.jmh</groupId>
> >>                <artifactId>jmh-core</artifactId>
> >>                <version>${jmh.version}</version>
> >>            </dependency>
> >>            <dependency>
> >>                <groupId>org.openjdk.jmh</groupId>
> >>                <artifactId>jmh-generator-annprocess</artifactId>
> >>                <version>${jmh.version}</version>
> >>                <scope>provided</scope>
> >>            </dependency>
> >>
> >> These are GPL according to mvnrepository.com.
> >>
> >> The other dependencies are fine and all class A or B.
> >>
> >> Have a look at https://www.apache.org/legal/resolved.html
> >>
> >> Regards,
> >> Dave
>
>

Re: [PROPOSAL] PIP 75: Replace protobuf code generator

Posted by Dave Fisher <wa...@apache.org>.

> On Dec 23, 2020, at 10:55 AM, Matteo Merli <ma...@gmail.com> wrote:
> 
> Hi Dave,
> 
> these are not dependencies that are needed (or would be pulled) from
> the Pulsar build.
> 
> The JMH is only used if someone wants to run a benchmark on the
> LightProto code. Pulsar does not depend on that.

Excellent. You’ll need to note that for developers if benchmarking is exposed in Pulsar.

> 
> From Pulsar build we would need:
> 1. At build time -> The LightProto Maven Plugin with the generator module
> 2. At runtime (and in Apache release) -> No dependencies
> 
> 
> Matteo
> 
> 
> 
> --
> Matteo Merli
> <ma...@gmail.com>
> 
> On Wed, Dec 23, 2020 at 10:20 AM Dave Fisher <wa...@apache.org> wrote:
>> 
>> Hi Matteo,
>> 
>> I looked at the dependencies in Splunk LightProto and the following are not acceptable in an Apache Binary:
>> 
>>            <dependency>
>>                <groupId>org.openjdk.jmh</groupId>
>>                <artifactId>jmh-core</artifactId>
>>                <version>${jmh.version}</version>
>>            </dependency>
>>            <dependency>
>>                <groupId>org.openjdk.jmh</groupId>
>>                <artifactId>jmh-generator-annprocess</artifactId>
>>                <version>${jmh.version}</version>
>>                <scope>provided</scope>
>>            </dependency>
>> 
>> These are GPL according to mvnrepository.com.
>> 
>> The other dependencies are fine and all class A or B.
>> 
>> Have a look at https://www.apache.org/legal/resolved.html
>> 
>> Regards,
>> Dave


Re: [PROPOSAL] PIP 75: Replace protobuf code generator

Posted by Matteo Merli <ma...@gmail.com>.
Hi Dave,

These are not dependencies that are needed (or would be pulled in) by
the Pulsar build.

JMH is only used if someone wants to run a benchmark on the
LightProto code. Pulsar does not depend on it.

For the Pulsar build we would need:
 1. At build time -> The LightProto Maven Plugin with the generator module
 2. At runtime (and in Apache release) -> No dependencies


Matteo



--
Matteo Merli
<ma...@gmail.com>

On Wed, Dec 23, 2020 at 10:20 AM Dave Fisher <wa...@apache.org> wrote:
>
> Hi Matteo,
>
> I looked at the dependencies in Splunk LightProto and the following are not acceptable in an Apache Binary:
>
>             <dependency>
>                 <groupId>org.openjdk.jmh</groupId>
>                 <artifactId>jmh-core</artifactId>
>                 <version>${jmh.version}</version>
>             </dependency>
>             <dependency>
>                 <groupId>org.openjdk.jmh</groupId>
>                 <artifactId>jmh-generator-annprocess</artifactId>
>                 <version>${jmh.version}</version>
>                 <scope>provided</scope>
>             </dependency>
>
> These are GPL according to mvnrepository.com.
>
> The other dependencies are fine and all class A or B.
>
> Have a look at https://www.apache.org/legal/resolved.html
>
> Regards,
> Dave

Re: [PROPOSAL] PIP 75: Replace protobuf code generator

Posted by Dave Fisher <wa...@apache.org>.
Hi Matteo,

I looked at the dependencies in Splunk LightProto and the following are not acceptable in an Apache Binary:

            <dependency>
                <groupId>org.openjdk.jmh</groupId>
                <artifactId>jmh-core</artifactId>
                <version>${jmh.version}</version>
            </dependency>
            <dependency>
                <groupId>org.openjdk.jmh</groupId>
                <artifactId>jmh-generator-annprocess</artifactId>
                <version>${jmh.version}</version>
                <scope>provided</scope>
            </dependency>

These are GPL according to mvnrepository.com.

The other dependencies are fine and all class A or B.

Have a look at https://www.apache.org/legal/resolved.html

Regards,
Dave
