Posted to dev@thrift.apache.org by Torsten Curdt <tc...@apache.org> on 2008/08/26 01:36:11 UTC

why...

Hey guys,

I've looked into Thrift recently and a few questions came up:

1. Why a native compiler? Would it be a little simpler to have the
compiler/code generator written in Java? No language debate - just a
curious question about the reasoning :)

2. Wouldn't it make sense to have better separation than having all
the code mixed together in the t_*_generator.cc files? Maybe more of a
template approach, so that adjusting the code that gets generated
becomes a little easier?

3. Why not use the hash code of the attribute names as the sequence id?

4. Why only composition? Even a flattening model of multiple
inheritance should be quite easy to implement (if overloading is
forbidden). While in OOP I am a big fan of composition over inheritance,
it makes the generated API kind of ugly. Maybe an include mechanism
would be another way of simplifying composed structures. (Although I
do realize that with the current model of sequence ids that might be a
PITA to maintain.)

5. If I observed correctly, the names of the attributes are included
when serialized. Why is that? Shouldn't knowing the sequence id be
good enough?

6. How do you suggest dealing with deterministic semantic
changes? Let's say you have

struct test {
   required string a;
   required string b;
}

and then you want to combine those values into one attribute

struct test {
   required string ab; // = a + b
}

There are a couple of problems I see here. For one, ab will have to
have a different sequence id. And I guess the 'required' will then
become a problem for the sequence ids of a and b(?). And finally, the
conversion ab = a + b needs to be handled at the application level,
while the rule is perfectly straightforward and deterministic and
*could* be expressed in a more generic manner.
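One way to handle this upgrade today is an application-level shim that accepts either version of the struct and applies the deterministic rule itself. A minimal sketch (plain dicts stand in for the deserialized structs; this is not generated Thrift code):

```python
def read_ab(struct):
    """Return the combined value whether the struct is old-style
    (separate 'a' and 'b' fields) or new-style (single 'ab' field)."""
    if "ab" in struct:
        return struct["ab"]
    # Old-style struct: apply the deterministic rule ab = a + b here.
    return struct["a"] + struct["b"]

old_version = {"a": "foo", "b": "bar"}
new_version = {"ab": "foobar"}
```

The rule lives in one place, but as the question points out, every client that may still receive old-style structs has to carry it.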

7. Wouldn't it make sense to separate out the service and exception  
stuff from the actual message versioning/serialization code?

cheers
--
Torsten

RE: why...

Posted by Justin Lebar <ju...@rearden.com>.
With apologies for soapboxing...

---------
From: Chad Walters [mailto:chad@powerset.com] 

> I would actually love to see some mechanism to allow for the compiler to
be abstracted to the point where we could implement it in a broad choice
of languages (C++, Java, Ruby, etc.) and still produce the same target
language bindings. This would free non-C++ shops from needing the C++ tool
chain.
---------

I hate C++ as much as the next guy, but at least the compiler is somewhat
cross-platform: you can build it under Windows using Cygwin.
That the C++ client libraries are not cross-platform, however, I see as a
much more significant issue.

-Justin

On 8/25/08 4:36 PM, "Torsten Curdt" <tc...@apache.org> wrote:

<snip/>



Re: why...

Posted by Chad Walters <ch...@powerset.com>.
I don't think the challenge here is coming up with an IR for the IDL. The challenge is in the backend code generation...

Chad


On 8/27/08 10:26 PM, "Phillip Pearson" <pp...@myelin.co.nz> wrote:

<snip/>




Re: why...

Posted by Phillip Pearson <pp...@myelin.co.nz>.
+1

That would be awesome.  You could even make the runtime for the 
individual language read the intermediate representation and not require 
generated code at all.

Cheers,
Phil

Bryan Duxbury wrote:
> <snip/>


Re: why...

Posted by Bryan Duxbury <br...@rapleaf.com>.
To the idea of using multiple language generators instead of C++,  
I've been thinking that if the compiler itself generated to some  
common intermediate language like JSON, it would be really easy to  
write a generator. JSON (or XML or YAML or something like it)  
probably already has a parser in most languages, so you'd just treat  
it like an AST and generate code however you want. It could be hooked  
up via stdin/stdout. Then, I could generate my Ruby classes with a  
Ruby script :).
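As a concrete sketch of that idea: assuming a hypothetical JSON shape for the intermediate representation (the keys below are invented for illustration, not an actual Thrift compiler format), a backend becomes a short walk over the tree:

```python
import json

# Hypothetical IR for the 'test' struct discussed earlier in the thread.
ir = json.loads("""
{
  "structs": [
    {"name": "test",
     "fields": [
       {"id": 1, "req": "required", "type": "string", "name": "a"},
       {"id": 2, "req": "required", "type": "string", "name": "b"}
     ]}
  ]
}
""")

def generate(ir):
    """Emit a trivial class stub per struct -- a stand-in for a real backend."""
    out = []
    for s in ir["structs"]:
        args = ", ".join(f["name"] for f in s["fields"])
        out.append("class %s:" % s["name"])
        out.append("    def __init__(self, %s):" % args)
        for f in s["fields"]:
            out.append("        self.%s = %s" % (f["name"], f["name"]))
    return "\n".join(out)

print(generate(ir))
```

The generator never parses the IDL itself; any language with a JSON parser could consume the same IR over stdin/stdout, which is exactly the appeal.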

On Aug 25, 2008, at 5:11 PM, Chad Walters wrote:

>
> Some quick thoughts:
>
> 1. Somewhat historical - Facebook's language of choice for backend  
> stuff was C++ and they were not using Java very much (although  
> their usage seems to have expanded somewhat, what with their use of  
> Hadoop and Zookeeper and their development of Cassandra).
>
> 2. That would be great. However, the current belief is that there  
> is a lot of special-casing for the specifics of each target  
> language and that it's not clear how much commonality could be  
> found to help here.
>
> 3. The current seqid mechanism guarantees uniqueness and also  
> allows the seqid's to be small, which is better for the  
> DenseProtocol and other compact protocols.
>
> 4. Yep, sounds like a PITA. Does it buy that much? Can it be  
> supported across all the languages we are trying to support?
>
> 5. They are available for use by protocols if desired but the seqid  
> is really the important piece of data -- the names are not actually  
> used in the binary protocol or other compact protocols.
>
> 6 and 7. I'll let someone else speak to these issues.
>
> WRT 1 and 2, I would actually love to see some mechanism to allow  
> for the compiler to be abstracted to the point where we could  
> implement it in a broad choice of languages (C++, Java, Ruby, etc.)  
> and still produce the same target language bindings. This would  
> free non-C++ shops from needing the C++ tool chain. Sounds like a  
> pretty interesting and extensive project in and of itself -- if you  
> can figure out how to make this happen, more power to you.
>
> Chad
>
>
> On 8/25/08 4:36 PM, "Torsten Curdt" <tc...@apache.org> wrote:
>
> <snip/>


Re: why...

Posted by Torsten Curdt <tc...@apache.org>.
On Aug 26, 2008, at 02:11, Chad Walters wrote:

<snip/>

> 2. That would be great. However, the current belief is that there is  
> a lot of special-casing for the specifics of each target language  
> and that it's not clear how much commonality could be found to help  
> here.

Well, there is a good set of languages supported already.
So maybe we could make a guess from those implementations.

> 3. The current seqid mechanism guarantees uniqueness and also allows  
> the seqid's to be small, which is better for the DenseProtocol and  
> other compact protocols.

Of course. But IIRC the sequence id is an integer internally. So is  
the hash code.
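Worth noting, though: Java's String.hashCode is 32-bit, while Thrift field ids are signed 16-bit on the wire, so a name hash would have to be truncated, which raises the collision risk considerably. A sketch (re-implementing Java's hash rule in Python purely for illustration):

```python
def java_hash(s):
    """Java's String.hashCode: h = 31*h + c, in 32-bit signed arithmetic."""
    h = 0
    for c in s:
        h = (31 * h + ord(c)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def field_id(name):
    """Truncate the 32-bit hash to Thrift's signed 16-bit field-id range --
    the step where collisions become much more likely."""
    h = java_hash(name) & 0xFFFF
    return h - 0x10000 if h >= 0x8000 else h

print(java_hash("something"), field_id("something"))
```

So "the hash code is also an integer" holds, but only the low 16 bits of it would survive on the wire.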

> 4. Yep, sounds like a PITA. Does it buy that much? Can it be  
> supported across all the languages we are trying to support?

Does it have to be supported by the language?

I would imagine the structs to be flattened. So that means you still
end up with the same code as today, just with the properties of the
parents "blended in".

That means you cannot cast to the parent. But you could deserialize as  
all the
different types of the hierarchy. This would be more like duck typing.
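The flattening described here can be sketched as a purely compile-time transform: merge the parents' field lists into the child and reject duplicate sequence ids (the field layout below is illustrative, not the compiler's real data model):

```python
def flatten(parents, own_fields):
    """Blend parent fields into a child struct, like a textual include.
    Each field is (seq_id, name); a duplicate seq id is an error."""
    merged = []
    seen = set()
    for fields in parents + [own_fields]:
        for seq_id, name in fields:
            if seq_id in seen:
                raise ValueError("duplicate field id %d (%s)" % (seq_id, name))
            seen.add(seq_id)
            merged.append((seq_id, name))
    return merged

base = [(1, "a"), (2, "b")]
child = flatten([base], [(3, "c")])
```

Since only the merged field list reaches the code generators, no target language would need inheritance support, which is the duck-typing point above.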

> WRT 1 and 2, I would actually love to see some mechanism to allow  
> for the compiler to be abstracted to the point where we could  
> implement it in a broad choice of languages (C++, Java, Ruby, etc.)  
> and still produce the same target language bindings. This would free  
> non-C++ shops from needing the C++ tool chain. Sounds like a pretty  
> interesting and extensive project in and of itself -- if you can  
> figure out how to make this happen, more power to you.

I already have some code :)

cheers
--
Torsten

Re: why...

Posted by Chad Walters <ch...@powerset.com>.
Some quick thoughts:

1. Somewhat historical - Facebook's language of choice for backend stuff was C++ and they were not using Java very much (although their usage seems to have expanded somewhat, what with their use of Hadoop and Zookeeper and their development of Cassandra).

2. That would be great. However, the current belief is that there is a lot of special-casing for the specifics of each target language and that it's not clear how much commonality could be found to help here.

3. The current seqid mechanism guarantees uniqueness and also allows the seqids to be small, which is better for the DenseProtocol and other compact protocols.
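The size argument can be made concrete with a generic base-128 varint (this illustrates the principle, not TDenseProtocol's actual wire format): a small hand-assigned id fits in one byte, while a hash-sized id like 343422352 needs five.

```python
def varint(n):
    """Unsigned LEB128: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

small_id = varint(2)          # typical hand-assigned seq id
hash_id = varint(343422352)   # hash-sized id from the thread's example
```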

4. Yep, sounds like a PITA. Does it buy that much? Can it be supported across all the languages we are trying to support?

5. They are available for use by protocols if desired but the seqid is really the important piece of data -- the names are not actually used in the binary protocol or other compact protocols.

6 and 7. I'll let someone else speak to these issues.

WRT 1 and 2, I would actually love to see some mechanism to allow for the compiler to be abstracted to the point where we could implement it in a broad choice of languages (C++, Java, Ruby, etc.) and still produce the same target language bindings. This would free non-C++ shops from needing the C++ tool chain. Sounds like a pretty interesting and extensive project in and of itself -- if you can figure out how to make this happen, more power to you.

Chad


On 8/25/08 4:36 PM, "Torsten Curdt" <tc...@apache.org> wrote:

<snip/>



Re: why...

Posted by David Reiss <dr...@facebook.com>.
> And requiring specific runtimes for
> code generation each language could cause some frustration.
Yeah.  I think this would be a maintenance nightmare.

> I know some of you have been skeptical of a template approach in the past.
> Perhaps I'm just naïve here, but what exactly is this special-case code?
If you take a look at the generator code, it simply doesn't read
like a template.  Although it might be possible to separate out
some of the structural code from the display code and then use
templates for the latter.

Re: why...

Posted by Carl Byström <ca...@esportnetwork.com>.
On Tue, Aug 26, 2008 at 10:33 AM, David Reiss <dr...@facebook.com> wrote:

> > At least if you expect java to be installed on the system it can't be
> > much easier than that ;)
> It's been my experience that it is easier to get a stable lex/yacc/g++
> working on a Linux system than Java.  And for distributions that use
> binary packages, the runtime requirements for the Thrift compiler are
> minuscule.


That might be true for Linux systems. But getting things up and running on,
let's say, Windows or OS X is usually not that easy.
I might be a bit biased here, but installing the Java runtime and a JAR
containing the Thrift compiler (and its dependencies) is a lot simpler, and
also very cross-platform.

Bryan's idea of having the code generators written in their respective
languages is nice. Not everyone maintaining the Ruby/Erlang/PHP/whatever
libraries is familiar with C++, or would want to be. However, it would
introduce some complexity in the form of splitting the generator code off
into subprojects. And requiring a specific runtime for code generation in
each language could cause some frustration.



>
> > On the
> > other hand I would imagine
> > that implementing new languages would be much easier with a clean
> > templating approach.
> I'm not sure that I agree with this.  If the Thrift data model were
> simpler (which I usually wish it were when working on the compiler),
> then I think this would work well, but my experience is that the
> amount of special-case code required to generate stubs as versatile
> as Thrift's is quite large and could not be succinctly expressed by
> templates.  I would love to be proven wrong on this point, though.
>
>
>
I know some of you have been skeptical of a template approach in the past.
Perhaps I'm just naïve here, but what exactly is this special-case code?
I think templates would be able to handle such problems. For example, take a
look at Enunciate and their code generation for GWT:
http://svn.codehaus.org/enunciate/trunk/enunciate/gwt/src/main/resources/org/codehaus/enunciate/modules/gwt/gwt-endpoint-interface.fmt
(see http://enunciate.codehaus.org/)
They are using FreeMarker for the templates. To me at least, that code gets
closer to the point/domain than the current C++ generator. And still I
consider myself light-years better at C++ than at the FreeMarker language.
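As a minimal illustration of the template approach (using Python's stdlib string.Template as a stand-in for FreeMarker), the shape of the generated code moves out of inline print calls and into one editable template:

```python
from string import Template

STRUCT_TMPL = Template("""\
class $name(object):
$fields""")

FIELD_TMPL = Template("    $name = None  # field id $id\n")

def render(name, fields):
    """Render a struct stub from templates, so the shape of the
    generated code lives in one place instead of generator logic."""
    body = "".join(FIELD_TMPL.substitute(name=n, id=i) for i, n in fields)
    return STRUCT_TMPL.substitute(name=name, fields=body)

print(render("test", [(1, "a"), (2, "b")]))
```

Whether this scales to the structural special-casing David describes is exactly the open question; the sketch only shows the "display code" half.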

Re: why...

Posted by Torsten Curdt <tc...@apache.org>.
On Aug 26, 2008, at 12:30, Ian O'Connell wrote:

> On Tue, Aug 26, 2008 at 10:50 AM, Torsten Curdt <tc...@apache.org>  
> wrote:
>> On Aug 26, 2008, at 10:33, David Reiss wrote:
>>
>>>> At least if you expect java to be installed on the system it  
>>>> can't be
>>>> much easier than that ;)
>>>
>>> It's been my experience that it is easier to get a stable lexx/ 
>>> yacc/g++
>>> working on a Linux system than Java.
>>
>> Ahem ...no offense - but sounds like you are a C guy that just  
>> hates java :)
>
> I doubt he is given facebook do use java too. But there are those of
> us(like me) who don't use java at all, just C/C++ and some scripting
> languages.
> C/C++ compilers are on every dev platform pretty much, I don't have
> java on a single one of mine.

Not that I am one of those but ... do you think Windows users agree? :-p

Anyway, discussing this point will not lead to anything useful. I just
would not have picked C/C++ and wondered why it had been used. Done.


>>> Regarding hash codes, I think they would complicate the data model
>>> and make it more difficult to visually parse binary dumps of
>>> structures.
>>
>> Why? The data model should be the exact same thing. Just that the
>> sequence id would be 343422352 instead of 2 for example.
>>
>>> Also, check out the output of thrift_dump in contrib/.  That would
>>> also be less readable.
>>
>> Again - how is that supposed to be different? 343422352 vs 2.
>>
>> What you really want is to pass in the mapping somehow and get back
>> the attribute name for the sequence id.
>>> Finally, the codes would be larger and would
>>> not lend themselves as well to the variable-length encoding used in
>>> TDenseProtocol.
>>
>> Indeed variable length encoding might not work as well. You would
>> probably have to store the full integer most of the time. On the
>> other hand it could be left as a choice to the user:
>
>
> I fail to see the big benefit here? Using the sequence ids simplifies
> the implementation, which makes it less bug-prone across languages.

Frankly speaking, I haven't looked into the implementation in detail.
But I am not sure the actual cross-language implementation really differs?!

What happens when today I write

struct Test {
   required string something [2324532];
}

Unless the sequence id refers to a dedicated slot (and uses a list/
array underneath) there should be no difference at all.
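That observation can be checked directly: with a map keyed by sequence id, the magnitude of the id is irrelevant to lookup; only a dense slot array sized by the largest id would suffer. A quick sketch (reusing the 2324532 id from the struct above):

```python
# Map-based field dispatch: id magnitude is irrelevant to lookup cost.
fields_small = {2: "something"}
fields_large = {2324532: "something"}
assert fields_small[2] == fields_large[2324532]

# A dense slot array, by contrast, must span 0..max_id:
slots = [None] * (2324532 + 1)  # ~2.3 million slots for a single field
slots[2324532] = "something"
```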

> And
> has been mentioned in the denseprotocol given most people's sequence
> id's remain very low the cost of a hash based method would be pretty
> high.

See above. Especially if you consider an include mechanism there is  
also the maintenance cost.

> Adding it in as a choice to the user just complicates the
> interfaces and means more code/bugs.

But that's something I would like to get some more details on.

>> string something [#1] // sequence number 1 (for those who wants to  
>> maintain
>> it)
>>
>> string something // sequence number "something".hashCode (for those  
>> who
>> don't)
>> string something [somename] // sequence number "somename".hashCode
>>
>>> I think inheritance complicates the data model without adding much  
>>> value.
>>
>> Why? What is the problem? For me this would give tremendous value.
>> If you blend in the attributes it's merely (if anything) more than an
>> include.
>>
>>> If we were going to do it, I'd want to do it like Protocol Buffers  
>>> do,
>>> which is basically a #include.  (We'd have to have better checking  
>>> for
>>> things like duplicate field ids.)  This could be done by the  
>>> parser and
>>> instantly work in all languages.
>>
>> Not sure that really makes a big difference in terms of checking.
>> But either way - something like that would be nice.
>>
>
> If what you want to achieve is doing the inheritance/include for
> structures and then flatten them surely one could just generate the
> sequence numbers in a deterministic way from the includes rather than
> jumping to hashes?

Suggestions? If you start relying on the order you are asking for trouble.
What else but the hash would you use?

>>>> Now you focused more on the optional/required and so on. Indeed all
>>>> correct. But my focus was more on the fact that ab can be derived  
>>>> from
>>>> a and b. That means that even old struct implicitly have ab. So  
>>>> when
>>>> you make the switch you either have support this logic in (every)  
>>>> client
>>>> or you switch over to only rely on ab and can no longer read older
>>>> structs.
>>>>
>>>> See what I mean now?
>>>
>>> My understanding is that you want Thrift to generate code for  
>>> combining
>>> values?  This would basically make it a programming language, and  
>>> we are
>>> not going to be doing that.
>>
>> I am talking about a problem here. And I wanted to discuss how to  
>> solve this
>> best.
>> While support for renaming fields is great, renaming really is only  
>> part of
>> the problem.
>
> What is the problem here? If you want to swap to the alternate method
> of representing your data you should use a language-specific wrapper
> that will take a and b and pass ab on to your function which processes
> ab. It doesn't sound like a Thrift problem?

See my other mail.

>>>> Well, if you only use Thrift for serialization and versioning you
>>>> might not
>>>> always have a need for the service stub generation. While this  
>>>> isn't
>>>> really a big problem I am wondering if these aren't two separate  
>>>> things.
>>>
>>> If you don't declare any services, no stubs will be generated.
>>
>> That was not my point. This is about the code base and the focus of  
>> the
>> project.
>
> What is your point exactly in this regard? I'm not sure how it's a
> problem given it doesn't generate the stubs if you don't need them?
> Most of the code in the platform libraries is well separated, so one
> could rebuild without the network aspects if one wished..

The issue of feature bloat has been raised. Do one thing and do it
well. Currently I see two things in Thrift - that's my point.

>> David, with all due respect. If you guys want grow a community  
>> around this
>> project
>> you might want to consider becoming a little more open to  
>> discussions.
>
> This has come up time and again recently, and I honestly don't really
> see where people are coming from all that much. Thrift was designed as
> a lightweight protocol, and of late people have suggested a lot of
> things which would turn it into a much slower, clunky library which
> would be useless for a lot of us.

I don't see my suggestions to match that description.

> Sure, people have suggested making
> their X suggestion optional, given it will slow things down by a huge
> margin. But after a few revisions layering new features on these
> 'slow' aspects, eventually we'll end up with some overly featured, slow
> RPC library like most of those out there.

Not sure what else has been proposed - but maybe you are exaggerating
a little here?

Will check the archives...

>> I am not criticizing your baby here. I am trying to understand  
>> where it came
>> from and try
>> to make suggestions that might make it work better for others (like  
>> me).
>
> And given the number of people doing this of late, I think it's pretty
> fair to cut them some slack in this regard. There have been lots of
> suggestions of late, and honestly most of them bad... with the
> original developers taking time to answer all of them in full rather
> than shutting them straight off. Maybe abrupt at times, but no one on
> here is a kid who needs to be coddled over their bad idea?

I am not sure what has been proposed before, but if Thrift wants a
healthy community around this:
Big deal - you will have to learn to live with that. Every open source
project has to. And "shutting off" people is not the Apache way. Sorry.

cheers
--
Torsten

Re: why...

Posted by Ian O'Connell <ia...@maths.tcd.ie>.
On Tue, Aug 26, 2008 at 10:50 AM, Torsten Curdt <tc...@apache.org> wrote:
> On Aug 26, 2008, at 10:33, David Reiss wrote:
>
>>> At least if you expect java to be installed on the system it can't be
>>> much easier than that ;)
>>
>> It's been my experience that it is easier to get a stable lexx/yacc/g++
>> working on a Linux system than Java.
>
> Ahem ...no offense - but sounds like you are a C guy that just hates java :)

I doubt he is, given Facebook uses Java too. But there are those of
us (like me) who don't use Java at all, just C/C++ and some scripting
languages.
C/C++ compilers are on pretty much every dev platform; I don't have
Java on a single one of mine.

(I don't hate Java; I just deal with high-performance code/CELL/GPU
programming, so Java would be a no-no.)

>
> <snip/>
>
>> Regarding hash codes, I think they would complicate the data model
>> and make it more difficult to visually parse binary dumps of structures.
>
> Why? The data model should be the exact same thing. Just that the
> sequence id would be 343422352 instead of 2 for example.
>
>> Also, check out the output of thrift_dump in contrib/.  That would
>> also be less readable.
>
> Again - how is that supposed to be different? 343422352 vs 2.
>
> What you really want is to pass in the mapping somehow and get back
> the attribute name for the sequence id.
>>  Finally, the codes would be larger and would
>> not lend themselves as well to the variable-length encoding used in
>> TDenseProtocol.
>
> Indeed variable length encoding might not work as well. You would
> probably have to store the full integer most of the time. On the
> other hand it could be left as a choice to the user:


I fail to see the big benefit here? Using the sequence ids simplifies
the implementation, which makes it less bug-prone across languages. And
as has been mentioned regarding the DenseProtocol, given most people's
sequence ids remain very low, the cost of a hash-based method would be
pretty high. Adding it in as a choice for the user just complicates the
interfaces and means more code/bugs.

>
>  string something [#1] // sequence number 1 (for those who wants to maintain
> it)
>
>  string something // sequence number "something".hashCode (for those who
> don't)
>  string something [somename] // sequence number "somename".hashCode
>
>> I think inheritance complicates the data model without adding much value.
>
> Why? What is the problem. For me this would give tremendous value.
> If you blend in the attributes it's merely (if anything) more than an include.
>
>> If we were going to do it, I'd want to do it like Protocol Buffers do,
>> which is basically a #include.  (We'd have to have better checking for
>> things like duplicate field ids.)  This could be done by the parser and
>> instantly work in all languages.
>
> Not sure that really makes a big difference in terms of checking.
> But either way - something like that would be nice.
>

If what you want to achieve is doing the inheritance/include for
structures and then flatten them surely one could just generate the
sequence numbers in a deterministic way from the includes rather than
jumping to hashes?

>>> Now you focused more on the optional/required and so on. Indeed all
>>> correct. But my focus was more on the fact that ab can be derived from
>>> a and b. That means that even old struct implicitly have ab. So when
>>> you make the switch you either have support this logic in (every) client
>>> or you switch over to only rely on ab and can no longer read older
>>> structs.
>>>
>>> See what I mean now?
>>
>> My understanding is that you want Thrift to generate code for combining
>> values?  This would basically make it a programming language, and we are
>> not going to be doing that.
>
> I am talking about a problem here. And I wanted to discuss how to solve this
> best.
> While support for renaming fields is great, renaming really is only part of
> the problem.

What is the problem here? If you want to swap to the alternate method
of representing your data you should use a language-specific wrapper
that will take a and b and pass ab on to your function which processes
ab. It doesn't sound like a Thrift problem?

>
>>> Well, if you only use Thrift for serialization and versioning you
>>> might not
>>> always have a need for the service stub generation. While this isn't
>>> really a big problem I am wondering if these aren't two separate things.
>>
>> If you don't declare any services, no stubs will be generated.
>
> That was not my point. This is about the code base and the focus of the
> project.

What is your point exactly in this regard? I'm not sure how it's a
problem given it doesn't generate the stubs if you don't need them?
Most of the code in the platform libraries is well separated, so one
could rebuild without the network aspects if one wished..

>
>
> David, with all due respect. If you guys want grow a community around this
> project
> you might want to consider becoming a little more open to discussions.

This has come up time and again recently, and I honestly don't really
see where people are coming from all that much. Thrift was designed as
a lightweight protocol, and of late people have suggested a lot of
things which would turn it into a much slower, clunky library which
would be useless for a lot of us. Sure, people have suggested making
their X suggestion optional given it will slow things down by a huge
margin. But give it a few revisions, layer new features on these
'slow' aspects, and eventually we'll end up with some overly featured,
slow RPC library like most of those out there.


> I am not criticizing your baby here. I am trying to understand where it came
> from and try
> to make suggestions that might make it work better for others (like me).

And given the number of people doing this of late I think it's pretty
fair to cut them some slack in this regard. There have been lots of
suggestions of late, and honestly most of them bad...  with the
original developers taking time to answer all of them in full rather
than shutting them straight off. Maybe abrupt at times, but no one on
here is a kid who needs to be coddled over their bad idea?

Ian.

Re: why...

Posted by Johan Stuyts <j....@zybber.nl>.
> This is (sort of) OK as long you don't think about an include mechanism.

It works for my inheritance mechanism. I must admit that this mechanism  
has a bit more overhead during serialization than an include mechanism,  
because it is serialized in the standard Thrift way as composition.

> ??? Of course you can't change the hashing afterwards if you want to  
> stay compatible.
>
> If you stick with that renaming works the same as before.

What I meant was that the user does not have to specify the hash code when  
he writes the first version because the compiler generates the value. But  
when he renames a field he suddenly has to:
- add the old field name to his IDL, or
- find this 'weird' number to add to his IDL.

This might be the puzzling bit. I agree that this is easy to overcome but  
so is manually specifying IDs from the start.

> Do I hear a "My stuff is working. Let's not change it"?

I am not saying that. What I am saying is that 'it's a minor change' is  
not a good reason to add it. The consequences for all the related code and  
the possible incompatibilities have to be weighed against the advantages  
before a decision is made.

And 'my stuff' is not working. I don't have the need for other clients  
than Java-based ones at this moment, but I really want as many languages  
as possible to be fully compatible so the software I write is open for  
integration in the current and future infrastructure of my clients.

IMHO adding support for hash codes for IDs does not have enough advantages  
to warrant a change in the syntax.

Adding support for inheritance to languages that do support it is  
interesting however, but in this case I would keep the on-the-wire format  
compatible with languages not supporting inheritance, i.e. write the  
inheritance hierarchy out as composition.

> Who has their own compilers? Care to share?

I am working on a compiler which takes a subset of the Thrift syntax and  
generates:
- an HTML snippet with the syntax highlighted IDL used for showing  
documentation about the services exposed on running servers
- metadata about the IDL files that are included by an IDL file (so all  
dependencies can be determined at runtime, which is needed to make sure  
the documentation of all used IDLs is shown)
- interfaces and implementations for structs (including hierarchies)
- interfaces for services
- classes containing constants and enumerations
- helper classes that can serialize the structs and services to and from a  
Thrift protocol
- helper classes for RMI that allow me to switch to RMI instead of Thrift
- metadata telling the other RPC protocols (currently only XML-RPC) which  
structures are allowed to be serialized, and which namespace and name must  
be used for them on the wire

I want to use Thrift IDLs as the definition of services that I want to  
expose using different protocols (Thrift, RMI, XML-RPC, (very simple)  
SOAP, ...). I am hesitant of changes because I fear incompatibilities.  
Incompatibilities would mean somebody would not be able to take my IDL,  
generate code for a language (which may not have been supported when I  
wrote the IDL), and communicate with my services exposed using Thrift. If  
it comes to this I would have to drop support for exposing my services  
using Thrift because it would become a maintenance nightmare. The only  
thing I could do then is keep my own version of the IDL syntax based on an  
early version of the Thrift syntax, and generate code for other protocols  
where incompatibilities are less of an issue.

This is all a work in progress based on a framework for which I had to  
write the structs and services manually. And for which it was cumbersome  
to specify the metadata for each supported protocol. I could have used  
SOAP to do the same but find Thrift so much easier. The ICE IDL, which is  
similar to Thrift, is also an option, but license costs prevent me from  
using it. Cisco Etch might be another option if I could find out more  
about it.

> Please re-read what I wrote. I was not suggesting to have this included  
> into Thrift. But this still is a common problem. Question is whether  
> there is a way to do version migrations like this somewhere central  
> instead of having that in every client.

I am sorry I misunderstood you. What I don't understand is why you think  
evolution requires changes to every client. This is only needed if you  
want to delete the code that handles evolution on the server. If you don't  
mind leaving the code running on the server you do not have to upgrade  
existing clients (or their IDLs when you rebuild them for that matter).


I would like to stress that in my opinion compatibility is critical to the  
success of Thrift. Given that there currently are a number of transports  
and protocols with varying support in the languages that are supported,  
the compatibility is not as high as I would want it to be. There is  
already discussion of a new protocol, TDenseProtocol, I have proposed a  
multiplexing socket transport, there is talk about asynchronous client  
calls for which incomplete code is currently generated and that probably  
requires a new transport, and it is not clear whether or not support for  
reflective protocols is a primary concern.

-- 
Kind regards,

Johan Stuyts

Re: why...

Posted by Torsten Curdt <tc...@apache.org>.
Thanks for the pointer, Pete!!

On Aug 26, 2008, at 20:31, Pete Wyckoff wrote:

>
> Just fyi, the Hadoop Hive project has a module that parses the  
> thrift DDL as
> of about 4 months ago using JavaCC.  It doesn't have a code  
> generator as it
> is more of a runtime serialization/deserialization for database  
> tables, but
> uses Tprotocol and is compatible with thrift serialized data.
>
> https://issues.apache.org/jira/browse/HADOOP-3601
>
> -- pete
>
>
>
> On 8/26/08 11:19 AM, "Kevin Clark" <ke...@gmail.com> wrote:
>
>> Torsten,
>> If you really feel strongly about it, write it. If it isn't accepted
>> into mainline, that's how it goes, but I'm more convinced by code  
>> than
>> talk. Show me that the features you propose are useful, and won't
>> cause usability and performance problems, and you'll have my vote.
>>
>> In the meantime, I feel like neither side is going to agree with the
>> other outright. It's open source. Scratch your own itch, and maybe
>> others will want the same thing. Until something exists, we're  
>> arguing
>> about imaginary code, and the implications of such.
>


Re: why...

Posted by Pete Wyckoff <pw...@facebook.com>.
Just fyi, the Hadoop Hive project has a module that parses the thrift DDL as
of about 4 months ago using JavaCC.  It doesn't have a code generator as it
is more of a runtime serialization/deserialization for database tables, but
uses Tprotocol and is compatible with thrift serialized data.

https://issues.apache.org/jira/browse/HADOOP-3601

-- pete



On 8/26/08 11:19 AM, "Kevin Clark" <ke...@gmail.com> wrote:

> Torsten,
> If you really feel strongly about it, write it. If it isn't accepted
> into mainline, that's how it goes, but I'm more convinced by code than
> talk. Show me that the features you propose are useful, and won't
> cause usability and performance problems, and you'll have my vote.
> 
> In the meantime, I feel like neither side is going to agree with the
> other outright. It's open source. Scratch your own itch, and maybe
> others will want the same thing. Until something exists, we're arguing
> about imaginary code, and the implications of such.


RE: why...

Posted by Aditya Agarwal <ad...@facebook.com>.
Well said.

It might also be worth calling out the core principles and guidelines in
the front page of the new wiki page. 

> -----Original Message-----
> From: Mark Slee [mailto:mslee@facebook.com]
> Sent: Tuesday, September 02, 2008 4:11 PM
> To: thrift-dev@incubator.apache.org
> Subject: RE: why...
> 
> I just wanted to try to bring some closure to this thread since it
> seems
> like there were a lot of different ideas in it, with both disagreement
> and agreement, and no clear resolution.
> 
> Torsten -- we're happy to have you interested in Thrift, and you've
> brought up a number of places for improvement. I think some of your
> specific questions have touched upon specific values of the Thrift
> project that aren't necessarily obvious from the outset -- such as
> simplicity and consistency. We're really trying to ensure that Thrift
> is
> a project that does a few clear things and does them very clearly and
> very well. This leads to pushback on a lot of niche feature additions
> that we don't believe will benefit the project in the long run. Please
> don't interpret this pushback as a disinterest in building community,
> rather we're all just viewing the project from different angles and
> with
> different communication styles.
> 
> Two issues you raised that I think are very well-aligned with the
> values
> and mission of the project were making object composition (or maybe
> inheritance if it can be made portable) easier, and more clearly
> delineating abstracting data serialization from services/RPC (though
> they are currently separate, it's not obvious to new users that Thrift
> might be a good choice just for data serialization needs, or where
> exactly this boundary lies).
> 
> Cheers,
> mcslee
> 
> -----Original Message-----
> From: Torsten Curdt [mailto:tcurdt@apache.org]
> Sent: Tuesday, August 26, 2008 12:29 PM
> To: thrift-dev@incubator.apache.org
> Subject: Re: why...
> 
> Kevin
> 
> > If you really feel strongly about it, write it. If it isn't accepted
> > into mainline, that's how it goes, but I'm more convinced by code
> than
> 
> > talk. Show me that the features you propose are useful, and won't
> > cause usability and performance problems, and you'll have my vote.
> 
> I fear the usefulness rather depends on the use case and therefore
might
> not necessarily convince anyone if you don't see the need for it just
> because the code is in place.
> 
> > In the meantime, I feel like neither side is going to agree with the
> > other outright. It's open source. Scratch your own itch, and maybe
> > others will want the same thing. Until something exists, we're
> arguing
> 
> > about imaginary code, and the implications of such.
> 
> Well, usually it's a good idea to communicate and sync up with the
> developer community first and not just throw code at them. At least
> that's how it usually is known to work at the ASF. And Thrift is still
> in incubation. That means community should be priority number one. If I
> weren't giving a rat's ass about this I wouldn't be on the list but
> rather would just have made the changes myself without this thread.
> 
> cheers
> --
> Torsten

RE: why...

Posted by Mark Slee <ms...@facebook.com>.
I just wanted to try to bring some closure to this thread since it seems
like there were a lot of different ideas in it, with both disagreement
and agreement, and no clear resolution.

Torsten -- we're happy to have you interested in Thrift, and you've
brought up a number of places for improvement. I think some of your
specific questions have touched upon specific values of the Thrift
project that aren't necessarily obvious from the outset -- such as
simplicity and consistency. We're really trying to ensure that Thrift is
a project that does a few clear things and does them very clearly and
very well. This leads to pushback on a lot of niche feature additions
that we don't believe will benefit the project in the long run. Please
don't interpret this pushback as a disinterest in building community,
rather we're all just viewing the project from different angles and with
different communication styles.

Two issues you raised that I think are very well-aligned with the values
and mission of the project were making object composition (or maybe
inheritance if it can be made portable) easier, and more clearly
delineating abstracting data serialization from services/RPC (though
they are currently separate, it's not obvious to new users that Thrift
might be a good choice just for data serialization needs, or where
exactly this boundary lies).

Cheers,
mcslee

-----Original Message-----
From: Torsten Curdt [mailto:tcurdt@apache.org] 
Sent: Tuesday, August 26, 2008 12:29 PM
To: thrift-dev@incubator.apache.org
Subject: Re: why...

Kevin

> If you really feel strongly about it, write it. If it isn't accepted 
> into mainline, that's how it goes, but I'm more convinced by code than

> talk. Show me that the features you propose are useful, and won't 
> cause usability and performance problems, and you'll have my vote.

I fear the usefulness rather depends on the use case and therefore might
not necessarily convince anyone if you don't see the need for it just
because the code is in place.

> In the meantime, I feel like neither side is going to agree with the 
> other outright. It's open source. Scratch your own itch, and maybe 
> others will want the same thing. Until something exists, we're arguing

> about imaginary code, and the implications of such.

Well, usually it's a good idea to communicate and sync up with the
developer community first and not just throw code at them. At least
that's how it usually is known to work at the ASF. And Thrift is still
in incubation. That means community should be priority number one. If I
weren't giving a rat's ass about this I wouldn't be on the list but rather
would just have made the changes myself without this thread.

cheers
--
Torsten

Re: why...

Posted by Torsten Curdt <tc...@apache.org>.
Kevin

> If you really feel strongly about it, write it. If it isn't accepted
> into mainline, that's how it goes, but I'm more convinced by code than
> talk. Show me that the features you propose are useful, and won't
> cause usability and performance problems, and you'll have my vote.

I fear the usefulness rather depends on the use case and therefore  
might not necessarily convince anyone if you don't see the need for it  
just because the code is in place.

> In the meantime, I feel like neither side is going to agree with the
> other outright. It's open source. Scratch your own itch, and maybe
> others will want the same thing. Until something exists, we're arguing
> about imaginary code, and the implications of such.

Well, usually it's a good idea to communicate and sync up with the  
developer community first and not just throw code at them. At least  
that's how it usually is known to work at the ASF. And Thrift is still  
in incubation. That means community should be priority number one. If I  
weren't giving a rat's ass about this I wouldn't be on the list but  
rather would just have made the changes myself without this thread.

cheers
--
Torsten

Re: why...

Posted by Chad Walters <ch...@powerset.com>.
I'm more or less in agreement with Kevin here.

As I said at the outset of this conversation, I think it would be cool to make it possible to have code generators in multiple languages.

The trick is doing it in a way that is maintainable, given all the languages that we support and the fact there is likely to be continued evolution of the IDL and language bindings. If someone brings forward an elegant implementation, I would certainly support its inclusion. In a few months time, I might even have the time and ability to contribute to such an effort if someone else started it and had some solid design ideas.

Chad


On 8/26/08 11:19 AM, "Kevin Clark" <ke...@gmail.com> wrote:

Torsten,
If you really feel strongly about it, write it. If it isn't accepted
into mainline, that's how it goes, but I'm more convinced by code than
talk. Show me that the features you propose are useful, and won't
cause usability and performance problems, and you'll have my vote.

In the meantime, I feel like neither side is going to agree with the
other outright. It's open source. Scratch your own itch, and maybe
others will want the same thing. Until something exists, we're arguing
about imaginary code, and the implications of such.

--
Kevin Clark
http://glu.ttono.us



Re: why...

Posted by Kevin Clark <ke...@gmail.com>.
Torsten,
If you really feel strongly about it, write it. If it isn't accepted
into mainline, that's how it goes, but I'm more convinced by code than
talk. Show me that the features you propose are useful, and won't
cause usability and performance problems, and you'll have my vote.

In the meantime, I feel like neither side is going to agree with the
other outright. It's open source. Scratch your own itch, and maybe
others will want the same thing. Until something exists, we're arguing
about imaginary code, and the implications of such.

-- 
Kevin Clark
http://glu.ttono.us

Re: why...

Posted by Torsten Curdt <tc...@apache.org>.
On Aug 26, 2008, at 14:38, Johan Stuyts wrote:

>> But it puts the burden of managing sequence ids on the user, which  
>> also requires the user to know how Thrift works :)
>
> What burden do you mean? I just have to make sure that each of my  
> sequence IDs is one higher than the previous one. If I remove a  
> field I simply comment it out to indicate that the ID has been used  
> but can no longer be used.

This is (sort of) OK as long you don't think about an include mechanism.

> What if Thrift changed to use the hash method? Thrift lists the  
> ability to rename one of the fields without breaking existing  
> clients as a feature. A user might think that that's useful, so he  
> renames a field, doesn't know how IDs are generated, so does not add  
> the previous hash as the ID, redeploys one side of the system and is  
> disappointed because it does not work. Wouldn't you rather he had  
> learned about the IDs from the beginning so he understands why  
> renaming is possible?

??? Of course you can't change the hashing afterwards if you want to  
stay compatible.

If you stick with that renaming works the same as before.

>> Come on! That's a tiny change in the grammar :)
>
> IMHO it's not about your specific change, but all the changes that  
> have been proposed. Every change can have ripple effects for code  
> generation, transports and protocols for many languages. If these  
> changes are not kept in sync, it will result in lots of  
> incompatibility between languages/library versions. Although I have  
> no experience with Corba, I hate to see Thrift move in that  
> direction: a great idea which didn't catch on because of  
> incompatibilities.

Do I hear a "My stuff is working. Let's not change it"?

> Also keep in mind that people may not like the current compiler  
> (including you because you would like a Java version)

Not because it's not java but rather because changing the code  
generation is rather painful.

> and/or the generated code (including me) and build their own  
> compiler in their preferred language. Changes to the IDL will then  
> also require these people to update their compiler.

Who has their own compilers? Care to share?

>> snip: example for evolution
>
> Where will this end? You propose a simple transformation of two  
> fields. Should Thrift also be able to transform (intentionally  
> exaggerated):
> - coordinates between Cartesian and Euclidean spaces
> - angles between degrees and radians
> - amounts between currencies
>
> Yes, evolution of a service is hard. You should think about it and  
> handle it. A solution has already been proposed. Make 'version' and  
> 'os' optional, add the optional 'versionandos', add code to handle  
> both situations to the server, and redeploy the server. Existing  
> older clients will continue to work without a problem, and once all  
> clients have switched over you can drop 'version' and 'os'.
>
> I do not see how this is a limitation of Thrift. In my opinion this  
> is one of its greatest strengths.

Please re-read what I wrote. I was not suggesting to have this  
included into Thrift. But this still is a common problem. Question is  
whether there is a way to do version migrations like this somewhere  
central instead of having that in every client.

cheers
--
Torsten

Re: why...

Posted by Johan Stuyts <j....@zybber.nl>.
> But it puts the burden of managing sequence ids on the user, which also  
> requires the user to know how Thrift works :)

What burden do you mean? I just have to make sure that each of my sequence  
IDs is one higher than the previous one. If I remove a field I simply  
comment it out to indicate that the ID has been used but can no longer be  
used.

What if Thrift changed to use the hash method? Thrift lists the ability to  
rename one of the fields without breaking existing clients as a feature. A  
user might think that that's useful, so he renames a field, doesn't know  
how IDs are generated, so does not add the previous hash as the ID,  
redeploys one side of the system and is disappointed because it does not  
work. Wouldn't you rather he had learned about the IDs from the  
beginning so he understands why renaming is possible?

> Come on! That's a tiny change in the grammar :)

IMHO it's not about your specific change, but all the changes that have  
been proposed. Every change can have ripple effects for code generation,  
transports and protocols for many languages. If these changes are not kept  
in sync, it will result in lots of incompatibility between  
languages/library versions. Although I have no experience with Corba, I  
hate to see Thrift move in that direction: a great idea which didn't catch  
on because of incompatibilities.

Also keep in mind that people may not like the current compiler (including  
you because you would like a Java version) and/or the generated code  
(including me) and build their own compiler in their preferred language.  
Changes to the IDL will then also require these people to update their  
compiler.

> As the model would be flattened we could also just call it an include  
> mechanism. It's not really inheritance in the OOP way.

My compiler detects a specific field in a struct as a declaration for  
inheritance: it must have '1' as the ID and 'extendedObject' as the name. The  
Java code it generates is an inheritance hierarchy, but the serialization  
to and from protocols will be compatible with the declarations as found in  
the IDL:
struct PingParameters
{
   1: nl.zybber.lib.stdlib.procedures.common.ProcedureBaseParameters  
extendedObject
   ...
}

Generates this class:
class PingParameters extends ProcedureBaseParameters
{
   ...
}

This class will be serialized as struct 'PingParameters' with field 1 of  
type struct 'ProcedureBaseParameters'. Anybody who takes my IDL and  
generates code from it will be able to communicate with my services.
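As a rough Python analogue of this scheme (the struct names are taken from the example above; the dict returned by `to_wire` is just a stand-in for a real Thrift protocol, and the method names are invented):

```python
class ProcedureBaseParameters:
    def __init__(self, request_id):
        self.request_id = request_id

    def base_fields(self):
        # The base struct, serialized as its own struct value.
        return {"requestId": self.request_id}

class PingParameters(ProcedureBaseParameters):
    def __init__(self, request_id, payload):
        super().__init__(request_id)
        self.payload = payload

    def to_wire(self):
        # On the wire this is plain composition: the inherited part
        # goes out as field 1 ('extendedObject'), exactly as declared
        # in the IDL, so non-inheriting languages stay compatible.
        return {1: self.base_fields(), 2: self.payload}
```

In-language callers get a real inheritance hierarchy, while the serialized form matches the composition declared in the IDL.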

> snip: example for evolution

Where will this end? You propose a simple transformation of two fields.  
Should Thrift also be able to transform (intentionally exaggerated):
- coordinates between Cartesian and Euclidean spaces
- angles between degrees and radians
- amounts between currencies

Yes, evolution of a service is hard. You should think about it and handle  
it. A solution has already been proposed. Make 'version' and 'os'  
optional, add the optional 'versionandos', add code to handle both  
situations to the server, and redeploy the server. Existing older clients  
will continue to work without a problem, and once all clients have  
switched over you can drop 'version' and 'os'.

I do not see how this is a limitation of Thrift. In my opinion this is one  
of its greatest strengths.

The only thing that might be useful to do automatically is the widening  
of a smaller datatype to a bigger one. For example: the old  
version used bytes, but the new version accepts 16-bit integers.
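That widening rule could be handled entirely on the reading side, roughly like this (a Python sketch with invented type tags and function names; real Thrift wire handling works differently):

```python
BYTE, I16 = "byte", "i16"

# Wire types a field declared as i16 may safely accept: its own
# type plus any strictly smaller integer type.
WIDENING = {I16: {BYTE, I16}}

def read_int_field(declared, wire_type, raw):
    if wire_type not in WIDENING.get(declared, {declared}):
        raise ValueError("incompatible types: %s -> %s" % (wire_type, declared))
    # Widening a small integer is lossless, so old clients that
    # still send bytes keep working after the field grows to i16.
    return int(raw)

value = read_int_field(I16, BYTE, 42)
```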

-- 
Kind regards,

Johan Stuyts

Re: why...

Posted by Torsten Curdt <tc...@apache.org>.
On Aug 26, 2008, at 12:15, David Reiss wrote:

>> Ahem ...no offense - but sounds like you are a C guy that just hates
>> java :)
> I could say just the opposite of you. :)

Totally not :)

While I did quite a bit java the past years my heritage clearly is C.
It's just that I rather only use it when required these days :)

>>> Regarding hash codes, I think they would complicate the data model
>>
>>> and make it more difficult to visually parse binary dumps of
>>> structures.
>>
>> Why? The data model should be the exact same thing. Just that the
>> sequence id would be 343422352 instead of 2 for example.
> By "data model", I guess I mean everything that is described in the  
> IDL.
> Making the field identifiers implicit would mean one extra thing
> that users would have to learn and remember when dealing with Thrift.
> Making them explicit makes it easier to understand how Thrift works
> and takes approximately two seconds for every new field.

But it puts the burden of managing sequence ids on the user, which  
also requires the user to know how Thrift works :)

In fact if you go with the hash codes you only have to tell the user/ 
developer about the inner workings of Thrift when there is a  
clash ...and that will probably affect far fewer users.
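A sketch of the hash idea in Python (CRC-32 stands in for whatever stable hash Thrift would pick; none of this is actual Thrift behaviour): the compiler derives the id from the field name, refuses to continue on a clash, and the same table can make dumps readable again.

```python
import zlib

def field_id(name):
    # Deterministic hash of the field name. It must never change once
    # structs are in the wild, or wire compatibility breaks.
    return zlib.crc32(name.encode("utf-8"))

def assign_ids(field_names):
    ids = {}
    for name in field_names:
        fid = field_id(name)
        if fid in ids:
            # The one case the user has to learn about: a clash,
            # resolved by giving one field an explicit id.
            raise ValueError("id clash: %s vs %s" % (ids[fid], name))
        ids[fid] = name
    return ids  # id -> name; also lets a dump tool print names

ids = assign_ids(["version", "os"])
```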

>>> Also, check out the output of thrift_dump in contrib/.  That would
>>> also be less readable.
>>
>> Again - how is that supposed to be different? 343422352 vs 2.
> 2 is much easier to read.

Again ...pass in the mapping and you are much better off. Then  
thrift_dump will do the mapping for you, it's even more readable and  
the 343422352 vs 2 argument does not matter anymore :)

>> Indeed variable length encoding might not work as well. You would
>> probably have to store the full integer most of the time. On the
>> other hand it could be left as a choice to the user:
>>
>>  string something [#1] // sequence number 1 (for those who want to
>> maintain it)
>>
>>  string something // sequence number "something".hashCode (for those
>> who don't)
>>  string something [somename] // sequence number "somename".hashCode
> Yeah, this would work.  It just increases the complexity of the IDL.

Come on! That's a tiny change in the grammar :)

>>> I think inheritance complicates the data model without adding much
>>> value.
>>
>> Why? What is the problem. For me this would give tremendous value.
>> If you blend in the attributes it's merely (/no) more than an
>> include.
> I guess that would be fine.  It's just been my experience that
> composition makes more sense for an IDL.  I've only found inheritance
> useful with virtual functions.

I am totally with you in OOP ...can't say I agree in this case though.

As the model would be flattened we could also just call it an include  
mechanism. It's not really inheritance in the OOP way.

>> I am talking about a problem here. And I wanted to discuss how to
>> solve this best.
>> While support for renaming fields is great, renaming really is only
>> part of the problem.
> In that case, I'm still not sure that I fully understand the problem.

OK ...lets assume you have this:

struct Message {
   required string version;
   required string os;
   ...
}

Now we decide that we don't want both, but want to include the OS  
version in the version string.

struct Message {
   required string versionandos;
   ...
}

Now essentially versionandos = version + os. But this basically  
requires a different sequence id. So in order for any client  
to read an older version it needs to know and execute that rule. The  
same is true for the other way around: you could split versionandos  
to get back version and os. All deterministic rules. The problem I see  
- every client needs to be aware of that rule. And your backwards  
compatibility goes down the drain ...unless you keep sending both.

Now of course this is more in the semantic area of the message  
attributes. Still, this is a versioning issue. And since it can't be  
handled directly through Thrift the question is - how would you handle  
it? Would it be useful to have some sort of layer on top?

See what I mean now?

>>> If you don't declare any services, no stubs will be generated.
>>
>> That was not my point. This is about the code base and the focus of
>> the project.
> The focus of the project is a remote procedure call library.  If
> someone only wants to use one component of it, they are more than
> welcome.  The library code is quite modular.

Hm ...I see two different things in Thrift. One being the RPC library  
and the other being the actual serialization and versioning. Not sure  
how easy it would be to split though.

>> David, with all due respect. If you guys want to grow a community around
>> this project
>> you might want to consider becoming a little more open to  
>> discussions.
> I am completely open to discussion.  In fact, we are having one  
> right now. :)

Glad we are :)

> Just because I have a different view from you doesn't mean that I am  
> not
> willing to discuss it.

Sorry, I read the mail a little different. Good then :)

>> I am not criticizing your baby here. I am trying to understand where
>> it came from and try
>> to make suggestions that might make it work better for others (like  
>> me).
> I fully understand.  However, please understand my point of view.   
> Many
> people who are new to this project (myself included!) have tried to  
> add
> features to it to suit their immediate needs without having a good  
> large-
> scale view of the project.  This has resulted in some fairly limited
> features that have increased the complexity of Thrift without  
> providing
> useful improvements, and also features that *have* been useful but  
> have
> been implemented in individual pieces that could have been combined.

Totally agree. You don't want feature bloat. But I guess a fresh view  
on things might also help sometimes :)

I also have some experience in this area. Unfortunately not such a good  
one, which is why I am here now ;)
Hoping that Thrift will help to solve them :)

>  Now
> that Thrift is getting a much bigger audience, I think it is  
> important for
> someone who has been around for a while to help make sure that all of
> the newer users don't make the same mistake that I (and others) did.

Totally! And appreciated.

cheers
--
Torsten

Re: why...

Posted by David Reiss <dr...@facebook.com>.
> Ahem ...no offense - but sounds like you are a C guy that just hates 
> java :)
I could say just the opposite of you. :)

>> Regarding hash codes, I think they would complicate the data model
> 
>> and make it more difficult to visually parse binary dumps of 
>> structures.
> 
> Why? The data model should be the exact same thing. Just that the
> sequence id would be 343422352 instead of 2 for example.
By "data model", I guess I mean everything that is described in the IDL.
Making the field identifiers implicit would mean one extra thing
that users would have to learn and remember when dealing with Thrift.
Making them explicit makes it easier to understand how Thrift works
and takes approximately two seconds for every new field.

>> Also, check out the output of thrift_dump in contrib/.  That would
>> also be less readable.
> 
> Again - how is that supposed to be different? 343422352 vs 2.
2 is much easier to read.

> Indeed variable length encoding might not work as well. You would
> probably have to store the full integer most of the time. On the
> other hand it could be left as a choice to the user:
> 
>   string something [#1] // sequence number 1 (for those who want to maintain it)
> 
>   string something // sequence number "something".hashCode (for those who don't)
>   string something [somename] // sequence number "somename".hashCode
Yeah, this would work.  It just increases the complexity of the IDL.

>> I think inheritance complicates the data model without adding much 
>> value.
> 
> Why? What is the problem? For me this would give tremendous value.
> If you blend in the attributes it's hardly more than an include.
I guess that would be fine.  It's just been my experience that
composition makes more sense for an IDL.  I've only found inheritance
useful with virtual functions.

> I am talking about a problem here. And I wanted to discuss how to 
> solve this best.
> While support for renaming fields is great, renaming really is only 
> part of the problem.
In that case, I'm still not sure that I fully understand the problem.

>> If you don't declare any services, no stubs will be generated.
> 
> That was not my point. This is about the code base and the focus of 
> the project.
The focus of the project is a remote procedure call library.  If
someone only wants to use one component of it, they are more than
welcome.  The library code is quite modular.

> David, with all due respect. If you guys want to grow a community around
> this project
> you might want to consider becoming a little more open to discussions.
I am completely open to discussion.  In fact, we are having one right now. :)
Just because I have a different view from you doesn't mean that I am not
willing to discuss it.

> I am not criticizing your baby here. I am trying to understand where 
> it came from and try
> to make suggestions that might make it work better for others (like me).
I fully understand.  However, please understand my point of view.  Many
people who are new to this project (myself included!) have tried to add
features to it to suit their immediate needs without having a good large-
scale view of the project.  This has resulted in some fairly limited
features that have increased the complexity of Thrift without providing
useful improvements, and also features that *have* been useful but have
been implemented in individual pieces that could have been combined.  Now
that Thrift is getting a much bigger audience, I think it is important for
someone who has been around for a while to help make sure that all of
the newer users don't make the same mistake that I (and others) did.

--David

Re: why...

Posted by Torsten Curdt <tc...@apache.org>.
On Aug 26, 2008, at 10:33, David Reiss wrote:

>> At least if you expect java to be installed on the system it can't be
>> much easier than that ;)
> It's been my experience that it is easier to get a stable lex/yacc/g++
> working on a Linux system than Java.

Ahem ...no offense - but sounds like you are a C guy that just hates  
java :)

<snip/>

> Regarding hash codes, I think they would complicate the data model
> and make it more difficult to visually parse binary dumps of
> structures.

Why? The data model should be the exact same thing. Just that the
sequence id would be 343422352 instead of 2 for example.

> Also, check out the output of thrift_dump in contrib/.  That would
> also be less readable.

Again - how is that supposed to be different? 343422352 vs 2.

What you really want is to pass in the mapping somehow and get back
the attribute name for the sequence id.

>  Finally, the codes would be larger and would
> not lend themselves as well to the variable-length encoding used in
> TDenseProtocol.

Indeed variable length encoding might not work as well. You would
probably have to store the full integer most of the time. On the
other hand it could be left as a choice to the user:

  string something [#1] // sequence number 1 (for those who want to maintain it)

  string something // sequence number "something".hashCode (for those who don't)
  string something [somename] // sequence number "somename".hashCode

> I think inheritance complicates the data model without adding much  
> value.

Why? What is the problem? For me this would give tremendous value.
If you blend in the attributes it's hardly more than an include.

> If we were going to do it, I'd want to do it like Protocol Buffers do,
> which is basically a #include.  (We'd have to have better checking for
> things like duplicate field ids.)  This could be done by the parser  
> and
> instantly work in all languages.

Not sure that really makes a big difference in terms of checking.
But either way - something like that would be nice.

>> Now you focused more on the optional/required and so on. Indeed all
>> correct. But my focus was more on the fact that ab can be derived  
>> from
>> a and b. That means that even old structs implicitly have ab. So when
>> you make the switch you either have to support this logic in (every) client
>> or you switch over to only rely on ab and can no longer read older
>> structs.
>>
>> See what I mean now?
> My understanding is that you want Thrift to generate code for  
> combining
> values?  This would basically make it a programming language, and we  
> are
> not going to be doing that.

I am talking about a problem here. And I wanted to discuss how to  
solve this best.
While support for renaming fields is great, renaming really is only  
part of the problem.

>> Well, if you only use Thrift for serialization and versioning you
>> might not
>> always have a need for the service stub generation. While this isn't
>> really a big problem I am wondering if these aren't two separate  
>> things.
> If you don't declare any services, no stubs will be generated.

That was not my point. This is about the code base and the focus of  
the project.


David, with all due respect. If you guys want to grow a community around
this project
you might want to consider becoming a little more open to discussions.

I am not criticizing your baby here. I am trying to understand where
it came from and to make suggestions that might make it work better
for others (like me).

cheers
--
Torsten

Re: why...

Posted by David Reiss <dr...@facebook.com>.
> At least if you expect java to be installed on the system it can't be 
> much easier than that ;)
It's been my experience that it is easier to get a stable lex/yacc/g++
working on a Linux system than Java.  And for distributions that use
binary packages, the runtime requirements for the Thrift compiler are
minuscule.

> On the 
> other hand I would imagine
> that implementing new languages would be much easier with a clean 
> templating approach.
I'm not sure that I agree with this.  If the Thrift data model were
simpler (which I usually wish it were when working on the compiler),
then I think this would work well, but my experience is that the
amount of special-case code required to generate stubs as versatile
as Thrift's is quite large and could not be succinctly expressed by
templates.  I would love to be proven wrong on this point, though.

Regarding hash codes, I think they would complicate the data model
and make it more difficult to visually parse binary dumps of structures.
Also, check out the output of thrift_dump in contrib/.  That would
also be less readable.  Finally, the codes would be larger and would
not lend themselves as well to the variable-length encoding used in
TDenseProtocol.
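
To put rough numbers on this point, here is a quick sketch of generic
base-128 varint encoding (an illustration only, not TDenseProtocol's
actual code), comparing a small explicit field id with a hash-sized one:

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

print(len(encode_varint(2)))          # explicit field id: 1 byte
print(len(encode_varint(343422352)))  # hash-sized field id: 5 bytes
```

A hash-sized id would pay those extra bytes for every field of every
serialized struct.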

I think inheritance complicates the data model without adding much value.
If we were going to do it, I'd want to do it like Protocol Buffers do,
which is basically a #include.  (We'd have to have better checking for
things like duplicate field ids.)  This could be done by the parser and
instantly work in all languages.

> Now you focused more on the optional/required and so on. Indeed all
> correct. But my focus was more on the fact that ab can be derived from
> a and b. That means that even old structs implicitly have ab. So when
> you make the switch you either have to support this logic in (every) client
> or you switch over to only rely on ab and can no longer read older 
> structs.
> 
> See what I mean now?
My understanding is that you want Thrift to generate code for combining
values?  This would basically make it a programming language, and we are
not going to be doing that.

> Well, if you only use Thrift for serialization and versioning you 
> might not
> always have a need for the service stub generation. While this isn't
> really a big problem I am wondering if these aren't two separate things.
If you don't declare any services, no stubs will be generated.

--David


Re: why...

Posted by Eric Anderson <an...@hpl.hp.com>.
Torsten Curdt writes:
 > Well, if you only use Thrift for serialization and versioning you  
 > might not
 > always have a need for the service stub generation. While this isn't
 > really a big problem I am wondering if these aren't two separate things.

Just using thrift for serialization and versioning works fine.  This
is how the project I'm working on is using it.  We had to extend
thrift to let us specify a base class for structures so that we could
pass around arbitrary "messages" and serialize/deserialize them, but
it wasn't a big change.
	-Eric


Re: why...

Posted by Torsten Curdt <tc...@apache.org>.
On Aug 26, 2008, at 02:08, Mark Slee wrote:

> 1. This decision was made because lex/yacc, despite being C, is still
> one of the most common lexing/parsing toolkits around. It's also the
> easiest to install on a *nix system (almost every linux distro has all
> the libs installed off the shelf). The standard Java release doesn't
> have lexing/parsing tools, so already Java would require 3rd party  
> libs
> which is enough to turn some people off.

Going by Chad's pointer it seems to be more historical than anything else :)
As for standard vs. 3rd party libs and being installed: use antlr and
bundle everything in one jar.
At least if you expect java to be installed on the system it can't be  
much easier than that ;)

> 2. Yes, some of the generator stuff has gotten a bit unwieldy over  
> time
> as we've added more features. I wouldn't mind a templating system
> either, but this is relatively low-leverage work given that the end
> result is the same, just with cleaner code.

"just" ;)

> Generally, I give much
> higher priority to the quality of the language runtime libraries  
> than to
> the quality of the code generator internals. Most Thrift users should
> never need to touch the code generator. We should work on improving it
> to the extent that it'll help us continue to develop faster in the
> future.

Well, true. Hopefully the code generation is minimal and the most
crucial stuff will be handled by the runtime. So I agree with that
perspective. On the other hand I would imagine that implementing new
languages would be much easier with a clean templating approach.

> 3. What if you decide you gave a variable a stupid name and want to
> change it, but you've already deployed production code? Separating  
> names
> in code from transport makes this painless, and saves a lot of
> frustration/confusing legacy naming issues.

Indeed - but using the hash does not make it any different. Thinking
about this:

   required string somename -> sequence id = "somename".hashCode()

   required string somenamenew [somename] -> sequence id = "somename".hashCode()

You can easily change the name. All that really matters is its hash
code. For the API users it would make no difference - except that
maintaining the sequence id would become a magnitude less of a hassle.
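
For illustration, a sketch of what that mapping would look like - this
reimplements Java's String.hashCode() in Python; nothing like it exists
in Thrift today:

```python
def java_string_hash(s):
    """Python port of Java's String.hashCode(): h = 31*h + ord(c), 32-bit signed wrap."""
    h = 0
    for c in s:
        h = (31 * h + ord(c)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

# required string somename                -> id = hash of "somename"
# required string somenamenew [somename]  -> id = hash of "somename"
# (the annotation pins the old name, so the wire id survives a rename)
old_id = java_string_hash("somename")
new_id = java_string_hash("somename")  # annotation keeps hashing the old name
assert old_id == new_id
print(old_id)
```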

> Also, to make the hash
> system provably correct we'd have to have a conflict resolution  
> system,
> which would be quite complex.

Indeed. You would have to check the hash codes for uniqueness.
But being pragmatic here you could

1. Just check and fail if the same hash code is used twice
2. Ask the user to rename one of the clashing fields or
3. Offer a generated unique id that could be used instead in such
(probably rare) cases
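
Option 1 is cheap to sketch. The checker below is purely hypothetical
(it is not part of the Thrift compiler) and reuses the same 31*h + c
rule as Java's String.hashCode():

```python
def field_id(name):
    """Same 31*h + ord(c) rule as Java's String.hashCode(), 32-bit signed."""
    h = 0
    for c in name:
        h = (31 * h + ord(c)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def check_struct(field_names):
    """Option 1: fail if two field names hash to the same sequence id."""
    seen = {}
    for name in field_names:
        fid = field_id(name)
        if fid in seen:
            raise ValueError("field id clash: %r and %r both hash to %d "
                             "- rename one of them" % (seen[fid], name, fid))
        seen[fid] = name

check_struct(["a", "b", "ab"])   # fine, all ids distinct
# check_struct(["Aa", "BB"])     # would raise: both hash to 2112
```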

> 4. Inheritance can be problematic due to the use of unique field
> identifiers. If developer A owns struct A and developer B subclasses  
> it
> with struct B, problems ensue. If B chooses to use field identifiers
> that A later adds to struct A, downstream breakage happens.

Of course you have to check for uniqueness. If you flatten the structs
as a last step that should be fairly straightforward to do. The pain
comes only from the way the sequence ids are currently maintained :)
..which is why I was thinking about the hash based approach.

> 5. This is optional, for readable protocols which would like to  
> include
> them. The TProtocol abstraction supports sending names, but if you  
> look
> at the TBinaryProtocol implementation it actually doesn't send the  
> names
> over the wire. You're correct, the sequence ids are good enough.

Ah ...OK. Sorry - missed that :) Thanks for clarification.

> 6. I would do this in 2 steps. First, move from "required a,  
> required b"
> to "optional a, optional b, optional ab." These changes can be rolled
> out without any breakage. Then, you can switch your client side to
> "required ab" and finally switch the server side to "required ab,"
> dropping the individual fields. I cannot think of any way to do this  
> in
> only one switch without breakage.

Now you focused more on the optional/required and so on. Indeed all
correct. But my focus was more on the fact that ab can be derived from
a and b. That means that even old structs implicitly have ab. So when
you make the switch you either have to support this logic in (every)
client or you switch over to only rely on ab and can no longer read
older structs.

See what I mean now?
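
Concretely, this is the kind of logic every client would have to carry
(sketched in plain Python with hypothetical field names, not generated
Thrift code):

```python
# Hypothetical deserialized struct: old writers set a and b, new writers set ab.
def get_ab(fields):
    """Return ab, deriving it from a and b when reading an old-style struct."""
    if "ab" in fields:
        return fields["ab"]
    # the ab = a + b rule, duplicated in every client until old structs die out
    return fields["a"] + fields["b"]

print(get_ab({"a": "foo", "b": "bar"}))  # old struct -> "foobar"
print(get_ab({"ab": "foobar"}))          # new struct -> "foobar"
```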

> 7. I'm not sure exactly what you mean here. Which parts do you feel  
> are
> not separated? The versioning and encoding is all isolated into the
> TProtocol abstraction, transfer lives exclusively in TTransport, and  
> the
> generated TProcessors deal only with actual message dispatching.

Well, if you only use Thrift for serialization and versioning you  
might not
always have a need for the service stub generation. While this isn't
really a big problem I am wondering if these aren't two separate things.

cheers
--
Torsten


RE: why...

Posted by Mark Slee <ms...@facebook.com>.
1. This decision was made because lex/yacc, despite being C, is still
one of the most common lexing/parsing toolkits around. It's also the
easiest to install on a *nix system (almost every linux distro has all
the libs installed off the shelf). The standard Java release doesn't
have lexing/parsing tools, so already Java would require 3rd party libs
which is enough to turn some people off.

2. Yes, some of the generator stuff has gotten a bit unwieldy over time
as we've added more features. I wouldn't mind a templating system
either, but this is relatively low-leverage work given that the end
result is the same, just with cleaner code. Generally, I give much
higher priority to the quality of the language runtime libraries than to
the quality of the code generator internals. Most Thrift users should
never need to touch the code generator. We should work on improving it
to the extent that it'll help us continue to develop faster in the
future.

3. What if you decide you gave a variable a stupid name and want to
change it, but you've already deployed production code? Separating names
in code from transport makes this painless, and saves a lot of
frustration/confusing legacy naming issues. Also, to make the hash
system provably correct we'd have to have a conflict resolution system,
which would be quite complex.

4. Inheritance can be problematic due to the use of unique field
identifiers. If developer A owns struct A and developer B subclasses it
with struct B, problems ensue. If B chooses to use field identifiers
that A later adds to struct A, downstream breakage happens. Composition
is free from these problems, albeit less convenient in some instances.
I'd definitely endorse development work on tools to make composition
easier.

5. This is optional, for readable protocols which would like to include
them. The TProtocol abstraction supports sending names, but if you look
at the TBinaryProtocol implementation it actually doesn't send the names
over the wire. You're correct, the sequence ids are good enough.

6. I would do this in 2 steps. First, move from "required a, required b"
to "optional a, optional b, optional ab." These changes can be rolled
out without any breakage. Then, you can switch your client side to
"required ab" and finally switch the server side to "required ab,"
dropping the individual fields. I cannot think of any way to do this in
only one switch without breakage.

7. I'm not sure exactly what you mean here. Which parts do you feel are
not separated? The versioning and encoding is all isolated into the
TProtocol abstraction, transfer lives exclusively in TTransport, and the
generated TProcessors deal only with actual message dispatching.

Cheers,
Mark

-----Original Message-----
From: Torsten Curdt [mailto:tcurdt@apache.org] 
Sent: Monday, August 25, 2008 4:36 PM
To: thrift-dev@incubator.apache.org
Subject: why...

Hey guys,

I've looked into Thrift recently and a few questions came up:

1. Why a native compiler? Would it be a little bit simpler to have the
compiler/code generator written in java? No language debate - just a
curious question for the reason :)

2. Wouldn't it make sense to have a bit of better separation than having
all code mixed up in the t_*_generator.cc files? Maybe more a template
approach so adjusting the code that gets generated becomes a little bit
easier?

3. Why not use the hash code of the attribute names as the sequence id?

4. Why only composition? Even a flattening model of multiple inheritance
should be quite easy to implement (if overloading is forbidden). While
in OOP I am a big fan of composition over inheritance, it makes the
generated API kind of ugly. Maybe an include mechanism would be another
way of simplifying composed structures. (Although I do realize that with
the current model of sequence ids that might be a PITA to maintain)

5. If I noticed correctly the names of the attributes are included when
serialized. Why is that? Shouldn't knowing the sequence id be good
enough?

6. How do you guys suggest to deal with deterministic semantic
changes. Let's say you have

struct test {
   required string a;
   required string b;
}

and then you want to combine those values into one attribute

struct test {
   required string ab; // = a + b
}

There are a couple of problems I see here. For one, ab will have to have
a different sequence id. And I guess then the 'required' will become a
problem for the sequence ids of a and b(?). And finally the conversion of
ab = a+b needs to be handled on the application level, while the rule is
very straightforward and deterministic and *could* be expressed in a more
generic manner.

7. Wouldn't it make sense to separate out the service and exception
stuff from the actual message versioning/serialization code?

cheers
--
Torsten