Posted to dev@thrift.apache.org by David Nadlinger <co...@klickverbot.at> on 2011/04/30 01:36:31 UTC

Pluggable Serializers

Hello list,

as this is my first post here, let me quickly introduce myself first: My 
name is David Nadlinger, I'm a student from Austria, and I am going to 
work on a Thrift-related project during this year's Google Summer of 
Code under the umbrella of Digital Mars: a Thrift implementation for/in 
the D programming language. [1]

While preparing my project proposal, I came across a JIRA entry which 
discusses the idea of pluggable serializers [2], and as I will implement 
a new language library during the course of the project, this obviously 
caught my attention. As I am somewhat familiar with the way 
serialization is currently implemented, I can see the limitations of the 
existing approach, but are there any details on what exactly the design 
of the proposed new solution would look like? Maybe there is some 
previous discussion on the topic I missed while looking through the 
mailing list archives? Otherwise, Bryan, would you mind quickly 
sketching how you envision the design?

As I am currently thinking about the library design for D, I would be 
grateful for any feedback, also regarding any other lessons learned 
about the current C++/Java library design.

Thanks a lot,
David


[1] http://klickverbot.at/code/gsoc/thrift/ (nothing of interest there yet)

[2] https://issues.apache.org/jira/browse/THRIFT-769

Re: Pluggable Serializers

Posted by Bryan Duxbury <br...@rapleaf.com>.

I certainly wouldn't want to see the global interface design you proposed.
That would be awkward.

I was thinking something more along the lines of your factory idea, but
without hardcoding the serializer styles. Instead, it would be up to the
factory to select the appropriate serializer. Then it becomes an exercise in
classifying protocols so that it's clear which kind of serializer they
should use. For instance, Binary and (existing) Compact both use the
"context free" serializer; JSON would probably use a custom serializer, as
would a rewritten Compact.
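
One way to read the paragraph above, as a hedged Python sketch rather than anything from the Thrift codebase: each protocol carries a "style" tag, and a per-struct factory looks up the serializer registered for that style. The `style` attribute, the class names, and `serializer_for` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Foo:
    a: int


# Hypothetical protocols; `style` is an assumed classification tag.
class BinaryProtocol:
    style = "context-free"


class CompactProtocol:
    style = "context-free"


class JsonProtocol:
    style = "json"


class FooContextFreeSerializer:
    """Emits a flat list of (field id, value) pairs."""

    def write(self, foo):
        return [(1, foo.a)]


class FooJsonSerializer:
    """Emits a nested, JSON-shaped dict directly."""

    def write(self, foo):
        return {"Foo": {"a": foo.a}}


class FooSerializerFactory:
    """Selects a serializer by protocol style via a registry,
    instead of exposing one hardcoded getter per style."""

    def __init__(self):
        self._by_style = {
            "context-free": FooContextFreeSerializer,
            "json": FooJsonSerializer,
        }

    def serializer_for(self, protocol):
        return self._by_style[protocol.style]()


factory = FooSerializerFactory()
flat = factory.serializer_for(BinaryProtocol()).write(Foo(a=7))
nested = factory.serializer_for(JsonProtocol()).write(Foo(a=7))
```

With this shape, adding a new protocol style means registering one more entry in the factory rather than widening a shared interface.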

I'm not sure how to go into too much more detail without writing code. Are
you in a position where you want to start hacking on this project? If so, we
can chat offline about how to get a prototype going.

-Bryan

On Sun, May 22, 2011 at 10:51 AM, David Nadlinger <co...@klickverbot.at> wrote:

> Hey Bryan,
>
> First, I'd like to thank you a lot for your offer – I very much appreciate
> any help from more experienced Thrift users or developers.
>
> I thought a bit more about this issue, and while I agree that the current
> scheme makes it really hard to implement alternative protocols differing
> from the flat, »context-free« nature of the default binary protocol, I'm not
> sure how pluggable serializers would be implemented in your idea.
>
> More specifically, I can't quite see how structs would really be serialized
> after the change. Would you propose to replace the protocol interface by a
> project-specific generated serializer interface having a write method for
> all defined struct types, like in the following example?
>
> ---
> struct Foo {
>   int a;
>   // No read/write method here.
> }
>
> struct Bar { … }
>
> interface TSerializer {
>   void writeFoo(Foo f);
>   void writeBar(Bar b);
> }
>
> class TBinarySerializer implements TSerializer {
>   this (TTransport t) { … }
>   void writeFoo(Foo f) { … }
>   void writeBar(Bar b) { … }
> }
>
> class TJsonSerializer implements TSerializer { … }
> ---
>
> Having such a single global interface doesn't seem quite right to me
> (extensibility, etc.) even if it would be generated, and indeed you wrote
> about serializer classes being generated for each struct. But how would you
> connect serializers to protocols then, or what would the protocol interface
> (i.e. TProtocol and friends) look like in the first place to allow for
> writing protocol-agnostic code? It appears to me that somewhere all possible
> »protocol styles« (i.e. serializer types) would have to be enumerated,
> because otherwise there would be no way for the write() methods to be able
> to select the correct serializer, which doesn't seem like a great solution
> either.
>
> To clarify what I mean, here is another example of how I think this approach
> could be implemented:
>
> ---
> interface TProtocol {
>   void writeStruct(TStructSerializerFactory s);
>   …
> }
>
> class TBinaryProtocol implements TProtocol {
>   void writeStruct(TStructSerializerFactory s) {
>      s.getBinarySerializer().writeTo(this);
>   }
>   …
> }
>
> interface TBinarySerializer { void writeTo(TBinaryProtocol t); }
> interface TJsonSerializer { void writeTo(TJsonProtocol t); }
>
> interface TStructSerializerFactory {
>   // Have to enumerate all possible protocol »styles« here.
>   TBinarySerializer getBinarySerializer();
>   TJsonSerializer getJsonSerializer();
>   …
> }
>
> struct Foo {
>   int a;
>   void write(TProtocol t) {
>      t.writeStruct(new FooSerializerFactory(this));
>   }
> }
>
> class FooSerializerFactory implements TStructSerializerFactory {
>   Foo f_;
>   this(Foo f) {
>      f_ = f;
>   }
>   TBinarySerializer getBinarySerializer() {
>      return new FooBinarySerializer(f_);
>   }
>   // other factory methods
>   …
> }
>
> class FooBinarySerializer implements TBinarySerializer {
>   Foo f_;
>   this(Foo f) {
>      f_ = f;
>   }
>   void writeTo(TBinaryProtocol t) {
>      // The code currently generated into Foo.write().
>      …
>   }
> }
> ---
>
> There are of course a few other possible ways to implement this, but I
> couldn't really come up with a design to connect serializers and protocols
> that doesn't seem hackish or overly complex.
>
> But isn't the problem really just that the current TProtocol interface
> makes it hard to implement protocols that have some kind of »scope« or
> »nesting«, like JSON does, because everything is »flattened« to a single
> layer, only to painstakingly reconstruct the structure from the
> write*Begin() and write*End() calls later?
>
> I think it would help quite a bit to just replace all the pairs of *Begin()
> and *End() calls with a single function, e.g. writeStruct(), which takes a
> delegate/lambda (or whatever it is called in the respective language) for
> writing the children. A little piece of D-style pseudocode to illustrate
> what I mean:
>
> ---
> interface TProtocol {
>   void writeStruct(string name, void delegate() writeMembers);
>   …
> }
>
> class TJsonProtocol implements TProtocol {
>   void writeStruct(string name, void delegate() writeMembers) {
>      // Do some setup work, open a new JSON object.
>      …
>      // Call the passed in delegate, which calls other write* functions
>      // on this protocol instance to write out all the members.
>      writeMembers();
>
>      // Do some cleanup work, close the JSON object definition, being
>      // able to access any data stored in local variables above.
>   }
>
>   …
> }
>
> struct Foo {
>   int a;
>   void write(TProtocol t) {
>      t.writeStruct("Foo", {
>         // Write all the members of Foo to t, just like we do now:
>         t.writeField(1, …);
>      } );
>   }
> }
> ---
>
> This way, the structure is simply mapped to recursive function calls, so you
> don't need an excessive amount of bookkeeping to persist information about it
> across the different calls, and there is still a simple common interface for
> all protocols. I'll give it a try when implementing the protocols in D; let's
> see how this works out…
>
> Thanks for reading through all this,
> David

Re: Pluggable Serializers

Posted by David Nadlinger <co...@klickverbot.at>.

Hey Bryan,

First, I'd like to thank you a lot for your offer – I very much 
appreciate any help from more experienced Thrift users or developers.

I thought a bit more about this issue, and while I agree that the 
current scheme makes it really hard to implement alternative protocols 
differing from the flat, »context-free« nature of the default binary 
protocol, I'm not sure how pluggable serializers would be implemented in 
your idea.

More specifically, I can't quite see how structs would really be 
serialized after the change. Would you propose to replace the protocol 
interface by a project-specific generated serializer interface having a 
write method for all defined struct types, like in the following example?

---
struct Foo {
    int a;
    // No read/write method here.
}

struct Bar { … }

interface TSerializer {
    void writeFoo(Foo f);
    void writeBar(Bar b);
}

class TBinarySerializer implements TSerializer {
    this (TTransport t) { … }
    void writeFoo(Foo f) { … }
    void writeBar(Bar b) { … }
}

class TJsonSerializer implements TSerializer { … }
---

Having such a single global interface doesn't seem quite right to me 
(extensibility, etc.) even if it would be generated, and indeed you 
wrote about serializer classes being generated for each struct. But how 
would you connect serializers to protocols then, or what would the 
protocol interface (i.e. TProtocol and friends) look like in the first 
place to allow for writing protocol-agnostic code? It appears to me that 
somewhere all possible »protocol styles« (i.e. serializer types) would 
have to be enumerated, because otherwise there would be no way for the 
write() methods to be able to select the correct serializer, which 
doesn't seem like a great solution either.

To clarify what I mean, here is another example of how I think this 
approach could be implemented:

---
interface TProtocol {
    void writeStruct(TStructSerializerFactory s);
    …
}

class TBinaryProtocol implements TProtocol {
    void writeStruct(TStructSerializerFactory s) {
       s.getBinarySerializer().writeTo(this);
    }
    …
}

interface TBinarySerializer { void writeTo(TBinaryProtocol t); }
interface TJsonSerializer { void writeTo(TJsonProtocol t); }

interface TStructSerializerFactory {
    // Have to enumerate all possible protocol »styles« here.
    TBinarySerializer getBinarySerializer();
    TJsonSerializer getJsonSerializer();
    …
}

struct Foo {
    int a;
    void write(TProtocol t) {
       t.writeStruct(new FooSerializerFactory(this));
    }
}

class FooSerializerFactory implements TStructSerializerFactory {
    Foo f_;
    this(Foo f) {
       f_ = f;
    }
    TBinarySerializer getBinarySerializer() {
       return new FooBinarySerializer(f_);
    }
    // other factory methods
    …
}

class FooBinarySerializer implements TBinarySerializer {
    Foo f_;
    this(Foo f) {
       f_ = f;
    }
    void writeTo(TBinaryProtocol t) {
       // The code currently generated into Foo.write().
       …
    }
}
---

There are of course a few other possible ways to implement this, but I 
couldn't really come up with a design to connect serializers and 
protocols that doesn't seem hackish or overly complex.

But isn't the problem really just that the current TProtocol interface 
makes it hard to implement protocols that have some kind of »scope« or 
»nesting«, like JSON does, because everything is »flattened« to a single 
layer, only to painstakingly reconstruct the structure from the 
write*Begin() and write*End() calls later?
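
The bookkeeping described above can be made concrete with a small Python sketch. It is not Thrift's actual JSON protocol: the class `FlatJsonProtocol` and its simplified method signatures are assumptions, chosen only to show that a begin/end interface forces the protocol to keep an explicit stack in order to rebuild the nesting.

```python
import json


class FlatJsonProtocol:
    """Hypothetical begin/end style protocol that tracks nesting by hand."""

    def __init__(self):
        self._open = []       # stack of (field name in parent, dict being filled)
        self._pending = None  # field name announced by write_field_begin
        self.result = None

    def write_struct_begin(self):
        # Remember under which field this struct will be attached later.
        self._open.append((self._pending, {}))
        self._pending = None

    def write_field_begin(self, name):
        self._pending = name

    def write_i32(self, value):
        self._open[-1][1][self._pending] = value
        self._pending = None

    def write_struct_end(self):
        name, finished = self._open.pop()
        if self._open:
            # Nested struct: attach it to its parent.
            self._open[-1][1][name] = finished
        else:
            # Outermost struct: only now can the output be produced.
            self.result = json.dumps(finished)


p = FlatJsonProtocol()
p.write_struct_begin()
p.write_field_begin("a")
p.write_i32(1)
p.write_field_begin("inner")
p.write_struct_begin()
p.write_field_begin("b")
p.write_i32(2)
p.write_struct_end()
p.write_struct_end()
```

All state that relates one call to the next (`_open`, `_pending`) has to live on the protocol object, because no single call sees the whole struct.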

I think it would help quite a bit to just replace all the pairs of 
*Begin() and *End() calls with a single function, e.g. writeStruct(), 
which takes a delegate/lambda (or whatever it is called in the 
respective language) for writing the children. A little piece of D-style 
pseudocode to illustrate what I mean:

---
interface TProtocol {
    void writeStruct(string name, void delegate() writeMembers);
    …
}

class TJsonProtocol implements TProtocol {
    void writeStruct(string name, void delegate() writeMembers) {
       // Do some setup work, open a new JSON object.
       …
       // Call the passed in delegate, which calls other write* functions
       // on this protocol instance to write out all the members.
       writeMembers();

       // Do some cleanup work, close the JSON object definition, being
       // able to access any data stored in local variables above.
    }

    …
}

struct Foo {
    int a;
    void write(TProtocol t) {
       t.writeStruct("Foo", {
          // Write all the members of Foo to t, just like we do now:
          t.writeField(1, …);
       } );
    }
}
---

This way, the structure is simply mapped to recursive function calls, so 
you don't need an excessive amount of bookkeeping to persist information 
about it across the different calls, and there is still a simple common 
interface for all protocols. I'll give it a try when implementing the 
protocols in D; let's see how this works out…
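
For readers who want to run the idea, here is a minimal Python sketch of the callback-based interface, not the D implementation itself. The method names `write_struct`, `write_field` and `write_i32`, and the choice to pass a callback to `write_field` as well, are assumptions for illustration; the pseudocode above only specifies `writeStruct`.

```python
import json
from dataclasses import dataclass


class JsonProtocol:
    """Hypothetical streaming JSON writer driven by callbacks."""

    def __init__(self):
        self.out = []        # emitted text fragments
        self._first = True   # is the next field the first in its struct?

    def write_struct(self, name, write_members):
        outer_first = self._first  # enclosing state lives in a local variable
        self._first = True
        self.out.append("{")
        write_members()            # members call back into this protocol
        self.out.append("}")
        self._first = outer_first  # restored without an explicit stack

    def write_field(self, name, write_value):
        if not self._first:
            self.out.append(",")
        self._first = False
        self.out.append(json.dumps(name) + ":")
        write_value()

    def write_i32(self, value):
        self.out.append(str(value))


@dataclass
class Inner:
    b: int

    def write(self, t):
        t.write_struct("Inner", lambda: t.write_field("b", lambda: t.write_i32(self.b)))


@dataclass
class Foo:
    a: int
    inner: Inner

    def write(self, t):
        def members():
            t.write_field("a", lambda: t.write_i32(self.a))
            t.write_field("inner", lambda: self.inner.write(t))

        t.write_struct("Foo", members)


p = JsonProtocol()
Foo(a=1, inner=Inner(b=2)).write(p)
text = "".join(p.out)
```

The call stack does the work that an explicit stack would otherwise do: each `write_struct` invocation keeps its own `outer_first` until the nested calls return.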

Thanks for reading through all this,
David

On 4/30/11 11:26 PM, Bryan Duxbury wrote:
> Hey David -
>
> I don't think it's been explored in great detail anywhere yet, but my idea
> was that we'd introduce a layer of abstraction between struct and protocol
> called serializer. This new object would basically take the guts of the
> write() and read() methods and move them into a separate class, which the
> compiler would generate for each struct.
>
> The first draft of this would just be an exercise in refactoring, but once
> the code was generated in a different class, we could extend the model to
> generate different kinds of serializers that work better with different
> protocols. For instance, I could imagine a "CompactSerializer" that meant we
> didn't have to keep a stateful Protocol, or a JsonSerializer that just made
> JSON without all the existing machinations.
>
> I wish I had more to offer here, but I just haven't had the time to
> experiment. If you're starting from scratch on a new language
> implementation, I'd recommend just porting the Java library as directly as
> you can manage. It's extremely mature and robust - and it has pretty decent
> tests.
>
> Let me know if you run into specific roadblocks. I'm always happy to help
> new languages come on board!
>
> -Bryan

Re: Pluggable Serializers

Posted by Bryan Duxbury <br...@rapleaf.com>.

Hey David -

I don't think it's been explored in great detail anywhere yet, but my idea
was that we'd introduce a layer of abstraction between struct and protocol
called serializer. This new object would basically take the guts of the
write() and read() methods and move them into a separate class, which the
compiler would generate for each struct.

The first draft of this would just be an exercise in refactoring, but once
the code was generated in a different class, we could extend the model to
generate different kinds of serializers that work better with different
protocols. For instance, I could imagine a "CompactSerializer" that meant we
didn't have to keep a stateful Protocol, or a JsonSerializer that just made
JSON without all the existing machinations.
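
The first refactoring step could look roughly like the following Python sketch. It assumes a stand-in `RecordingProtocol` and a generated `FooSerializer` class; neither name comes from Thrift, and the protocol methods are simplified.

```python
from dataclasses import dataclass


class RecordingProtocol:
    """Stand-in protocol that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def write_struct_begin(self, name):
        self.calls.append(("struct_begin", name))

    def write_field(self, field_id, value):
        self.calls.append(("field", field_id, value))

    def write_struct_end(self):
        self.calls.append(("struct_end",))


@dataclass
class Foo:
    a: int
    # No read/write methods on the struct itself any more.


class FooSerializer:
    """What the compiler might generate: the former body of Foo.write()."""

    def write(self, foo, protocol):
        protocol.write_struct_begin("Foo")
        protocol.write_field(1, foo.a)
        protocol.write_struct_end()


protocol = RecordingProtocol()
FooSerializer().write(Foo(a=42), protocol)
```

At this stage the emitted calls are the same as before; the point is only that the serialization logic now sits in a class that could later be swapped for a protocol-specific variant.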

I wish I had more to offer here, but I just haven't had the time to
experiment. If you're starting from scratch on a new language
implementation, I'd recommend just porting the Java library as directly as
you can manage. It's extremely mature and robust - and it has pretty decent
tests.

Let me know if you run into specific roadblocks. I'm always happy to help
new languages come on board!

-Bryan
