Posted to user@thrift.apache.org by "John R. Frank" <jr...@mit.edu> on 2013/03/14 11:46:52 UTC

updating to longer ints

Thrift Experts,

Is there any way to make a Thrift struct with a longer integer field read an 
older message that was serialized with a smaller integer type?

struct Thing_original {
   1:  i16 idx,
}

struct Thing_newer {
   1:  i32 idx,
}


I'd like to be able to have Thing_newer read *both* kinds of idx.  Simple 
tests show that this does not work.

Is there a way to tell Thing_newer to fall back to reading idx as i16 
if it cannot read it as i32?


John

Re: updating to longer ints

Posted by "John R. Frank" <jr...@mit.edu>.
> The Thrift "way" would be to create a new field for the i32 field, say 
> 101.
>
> In other words, the logic that you propose would have to be handled at 
> the application layer.

To enable that application layer, there would have to be two differently 
named properties on Thing, like idx and idx32.

This raises a bigger question:  versioning.

The real form of our Thing is this public interface, which has a 
"Versions" struct that is used today just for debugging.  I wish I could 
use it for more.

https://github.com/trec-kba/streamcorpus/blob/master/if/streamcorpus.thrift

We use these thrift interfaces to store this large public data set of 
text:

         http://s3.amazonaws.com/aws-publicdatasets/trec/index.html

Right now, any program that uses the streamcorpus.thrift interfaces must 
be told by the user whether it is processing v0_1_0 or v0_2_0.  Soon there 
will be v0_3_0 to cope with our picking i16 when we needed i32 for 
MentionID.  We do re-spin the entire corpus occasionally, but we prefer 
to save up changes and spend a CPU-decade on several changes at once.

It would be really helpful/powerful for thrift to handle such versioning 
for us.  Most simply, there could be a thrift primitive at the level of a 
whole message that says "I am version X", where the engineer gets to 
specify X in the .thrift file.

At a more granular level, versioning could be used to enable falling back 
to different meanings for properties with the same name -- probably with 
different slot numbers and types.

Does anything like this exist yet?

Would the Thrift community be open to considering such a feature?  Maybe 
this has already been discussed?

Thanks for listening,
jrf




Re: updating to longer ints

Posted by George Chung <ge...@glympse.com>.
I'm pretty sure the answer would be "no".

The Thrift "way" would be to create a new field for the i32 field, say 101.

The backwards compatible service would be able to read from field 1 or
field 101. Newer services would only produce an integer value at field 101.
Older services would provide the integer value at field 1.

In other words, the logic that you propose would have to be handled at the
application layer.

On Thu, Mar 14, 2013 at 3:46 AM, John R. Frank <jr...@mit.edu> wrote:

> Thrift Experts,
>
> Is there any way to make a Thrift struct with a longer integer field read an
> older message that was serialized with a smaller integer type?
>
> struct Thing_original {
>   1:  i16 idx,
> }
>
> struct Thing_newer {
>   1:  i32 idx,
> }
>
>
> I'd like to be able to have Thing_newer read *both* kinds of idx.  Simple
> tests show that this does not work.
>
> Is there a way to tell Thing_newer to fall back to reading idx as i16
> if it cannot read it as i32?
>
>
> John
>

Re: updating to longer ints

Posted by Jens Geyer <je...@hotmail.com>.
Hi Henrique,

> Since the i16 side wouldn't be able to read the i32's
> that would come either as an input or output of your
> function. Therefore, to make it work we would have
> to be able to read larger numbers too, and cast them
> down.

Agree. The size of the internal data field is the limit that counts, as the 
data have to be stored somewhere. So the last case would then become 
something like:

        case ::apache::thrift::protocol::T_I64:
            int64_t  tmp4712;
            xfer += iprot->readI64(tmp4712);  // read the full i64, then range-check
            if ( (tmp4712 < std::numeric_limits<int32_t>::min()) ||
                  (tmp4712 > std::numeric_limits<int32_t>::max())) {
              throw TProtocolException( TProtocolException::SIZE_LIMIT);
            }
            this->thing = (int32_t)tmp4712;
            this->__isset.thing = true;
            break;

All of this should be achievable by means of a specialized protocol 
wrapper, so we don't even have to bloat (and modify) the generated code. 
Looking for a good name ... do you have one?

Jens


-----Original Message----- 
From: Henrique Mendonça
Sent: Tuesday, March 26, 2013 11:03 PM
To: user@thrift.apache.org ; dev@thrift.apache.org
Subject: Re: updating to longer ints

Hi Jens,

That's an interesting proposition (sorry for the delayed answer), but if I'm
not mistaken, your example would raise exceptions regardless of which side
you update first.

The i16 side wouldn't be able to read the i32's that would come either as
an input or output of your function. Therefore, to make it work, we would
also have to be able to read larger numbers and cast them down. This could
make the results a little tricky for a user to understand, and it would
increase the complexity of the compiler quite a bit.

However, it's theoretically possible and I don't think it would affect the
performance that much.

- Henrique


On 15 March 2013 19:57, Jens Geyer <je...@hotmail.com> wrote:

> I think TCompactProtocol uses var length ints, achieving
>> something similar to what you mentioned.
>>
>
> Yep, but good point anyway.
>
>
>  Not that this would help w/ the original issue.
>>
>
> Of course, because it is only "something similar" and suffers from the
> same limitations. What the OP suggests and/or what I was thinking of is
> something like this (this would be generated C++ code for one field):
>
>      case 1:
>        switch( ftype) {
>        case ::apache::thrift::protocol::T_BYTE:
>            int8_t   tmp4710;
>            xfer += iprot->readByte(tmp4710);
>            this->thing = tmp4710;
>            this->__isset.thing = true;
>            break;
>        case ::apache::thrift::protocol::T_I16:
>            int16_t  tmp4711;
>            xfer += iprot->readI16(tmp4711);
>            this->thing = tmp4711;
>            this->__isset.thing = true;
>            break;
>
>        case ::apache::thrift::protocol::T_I32:
>            xfer += iprot->readI32(this->thing);
>            this->__isset.thing = true;
>            break;
>        case ::apache::thrift::protocol::T_I64:
>            throw TProtocolException( TProtocolException::SIZE_LIMIT);
>        default:
>            xfer += iprot->skip(ftype);
>            break;
>        }
>        break;
>
> and similarly with the generated write() code. This way all protocols
> would profit from it (except for those where it makes no difference with
> regard to serialized data, such as JSON).
>
> The two biggest problems that I see are a) it breaks compatibility, so we
> have to increase the VERSION number, and b) it might affect overall
> serialization/deserialization performance.
>
> Jens


Re: updating to longer ints

Posted by Henrique Mendonça <he...@apache.org>.
Hi Jens,

That's an interesting proposition (sorry for the delayed answer), but if I'm
not mistaken, your example would raise exceptions regardless of which side
you update first.

The i16 side wouldn't be able to read the i32's that would come either as
an input or output of your function. Therefore, to make it work, we would
also have to be able to read larger numbers and cast them down. This could
make the results a little tricky for a user to understand, and it would
increase the complexity of the compiler quite a bit.

However, it's theoretically possible and I don't think it would affect the
performance that much.

- Henrique


On 15 March 2013 19:57, Jens Geyer <je...@hotmail.com> wrote:

> I think TCompactProtocol uses var length ints, achieving
>> something similar to what you mentioned.
>>
>
> Yep, but good point anyway.
>
>
>  Not that this would help w/ the original issue.
>>
>
> Of course, because it is only "something similar" and suffers from the
> same limitations. What the OP suggests and/or what I was thinking of is
> something like this (this would be generated C++ code for one field):
>
>      case 1:
>        switch( ftype) {
>        case ::apache::thrift::protocol::T_BYTE:
>            int8_t   tmp4710;
>            xfer += iprot->readByte(tmp4710);
>            this->thing = tmp4710;
>            this->__isset.thing = true;
>            break;
>        case ::apache::thrift::protocol::T_I16:
>            int16_t  tmp4711;
>            xfer += iprot->readI16(tmp4711);
>            this->thing = tmp4711;
>            this->__isset.thing = true;
>            break;
>
>        case ::apache::thrift::protocol::T_I32:
>            xfer += iprot->readI32(this->thing);
>            this->__isset.thing = true;
>            break;
>        case ::apache::thrift::protocol::T_I64:
>            throw TProtocolException(TProtocolException::SIZE_LIMIT);
>        default:
>            xfer += iprot->skip(ftype);
>            break;
>        }
>        break;
>
> and similarly with the generated write() code. This way all protocols
> would profit from it (except for those where it makes no difference with
> regard to serialized data, such as JSON).
>
> The two biggest problems that I see are a) it breaks compatibility, so we
> have to increase the VERSION number, and b) it might affect overall
> serialization/deserialization performance.
>
> Jens

Re: updating to longer ints

Posted by Jens Geyer <je...@hotmail.com>.
> I think TCompactProtocol uses var length ints, achieving
> something similar to what you mentioned.

Yep, but good point anyway.

> Not that this would help w/ the original issue.

Of course, because it is only "something similar" and suffers from the same 
limitations. What the OP suggests and/or what I was thinking of is something 
like this (this would be generated C++ code for one field):

      case 1:
        switch( ftype) {
        case ::apache::thrift::protocol::T_BYTE:
            int8_t   tmp4710;
            xfer += iprot->readByte(tmp4710);
            this->thing = tmp4710;
            this->__isset.thing = true;
            break;
        case ::apache::thrift::protocol::T_I16:
            int16_t  tmp4711;
            xfer += iprot->readI16(tmp4711);
            this->thing = tmp4711;
            this->__isset.thing = true;
            break;

        case ::apache::thrift::protocol::T_I32:
            xfer += iprot->readI32(this->thing);
            this->__isset.thing = true;
            break;
        case ::apache::thrift::protocol::T_I64:
            throw TProtocolException(TProtocolException::SIZE_LIMIT);
        default:
            xfer += iprot->skip(ftype);
            break;
        }
        break;

and similarly with the generated write() code. This way all protocols would 
profit from it (except for those where it makes no difference with regard to 
serialized data, such as JSON).

The two biggest problems that I see are a) it breaks compatibility, so we 
have to increase the VERSION number, and b) it might affect overall 
serialization/deserialization performance.

Jens










Re: updating to longer ints

Posted by Keith Turner <ke...@deenlo.com>.
On Fri, Mar 15, 2013 at 4:07 AM, Jens Geyer <je...@hotmail.com> wrote:
> Hi,
>
> I like the idea of the data types getting promoted to bigger ones on read,
> but I'm still afraid it won't work. Let's assume a service like this:
>
>    struct foo {
>      1: i16 bignumber,
>      2: i16 anotherone,
>      3: i16 third
>    }
>
>    service bar {
>      foo Something( 1: foo input)
>    }
>
> Now, after some time, you find out that i16 is way too small, so you change
> foo{} to hold all i32's.
>
> First, the real issue is that when not all clients and servers are under your
> control and can be updated at once, it will break. The generated code holds
> declarations for all variables, so values can be promoted on the fly only as
> far as the current data structures allow.
>
> But OTOH the idea has some good aspects, since it brings in the idea that
> the size of the variable must not necessarily be equal to the number of
> bytes on the wire, e.g. an int32 = 0x00001234 could be read/written as i16
> without losing information. The only thing that may be affected here is the
> performance, because the code has to check for appropriate write and read
> sizes.

I think TCompactProtocol uses var length ints, achieving something
similar to what you mentioned.  Not that this would help w/ the
original issue.

>
> Jens
>
>
> -----Original Message----- From: Will Pierce
> Sent: Thursday, March 14, 2013 10:47 PM
> To: user@thrift.apache.org
> Subject: Re: updating to longer ints
>
>
> Definitely not an easy problem.  On the wire these will both encode as the
> same field ID (1) and a type ID of 6 for the i16 or a type ID of 8 for the
> i32.  Most (all?) thrift language bindings check to ensure the type ID for
> a field matches the expected type ID for that particular field, silently
> skipping past any field with mismatched type IDs.
>
> George Chung's suggestion is the safest/cleanest approach.
>
> I had exactly this same problem where I needed to expand from i32 to i64.
> It was a "flag day" cutover for me, since I had control of all the clients
> and servers.  But I can see how you frequently can't upgrade all the client
> code out in the wild.  (Or data files, if these are persisted.)
>
> Rather than try to solve the broader problem of versioning, maybe we should
> consider adding optional behavior to the decoder to permit *type-promotion*
> when the field's on-wire type can be safely decoded into the expected type?
>
> Provided that the expected type can faithfully represent all the values of
> the wire type, it seems like a way to follow the "TCP" pattern of being
> permissive in what you receive, but strict in what you send.
>
> In terms of thrift TTypes, we could allow a decoder to promote:
> bool > i08 > i16 > i32 > (double or i64)
>
> This wouldn't solve every problem - like sending a response struct with an
> i32 field to a client that expects i16.  No go there.  This would only be
> useful in cases of having to read client input from a mixed bag of old and
> new clients, where the server response struct doesn't need any wider
> fields.  I'm not sure if that is your use-case...
>
> Those using thrift for serialization to/from disk, I would imagine
> type-promotion could save a lot of effort in re-encoding any old files that
> use narrower numeric types than newer files...
>
> Thoughts?

Re: updating to longer ints

Posted by Jens Geyer <je...@hotmail.com>.
Hi,

I like the idea of the data types getting promoted to bigger ones on read, 
but I'm still afraid it won't work. Let's assume a service like this:

    struct foo {
      1: i16 bignumber,
      2: i16 anotherone,
      3: i16 third
    }

    service bar {
      foo Something( 1: foo input)
    }

Now, after some time, you find out that i16 is way too small, so you change 
foo{} to hold all i32's.

First, the real issue is that when not all clients and servers are under your 
control and can be updated at once, it will break. The generated code holds 
declarations for all variables, so values can be promoted on the fly only as 
far as the current data structures allow.

But OTOH the idea has some good aspects, since it brings in the idea that 
the size of the variable must not necessarily be equal to the number of 
bytes on the wire, e.g. an int32 = 0x00001234 could be read/written as i16 
without losing information. The only thing that may be affected here is the 
performance, because the code has to check for appropriate write and read 
sizes.

Jens


-----Original Message----- 
From: Will Pierce
Sent: Thursday, March 14, 2013 10:47 PM
To: user@thrift.apache.org
Subject: Re: updating to longer ints

Definitely not an easy problem.  On the wire these will both encode as the
same field ID (1) and a type ID of 6 for the i16 or a type ID of 8 for the
i32.  Most (all?) thrift language bindings check to ensure the type ID for
a field matches the expected type ID for that particular field, silently
skipping past any field with mismatched type IDs.

George Chung's suggestion is the safest/cleanest approach.

I had exactly this same problem where I needed to expand from i32 to i64.
It was a "flag day" cutover for me, since I had control of all the clients
and servers.  But I can see how you frequently can't upgrade all the client
code out in the wild.  (Or data files, if these are persisted.)

Rather than try to solve the broader problem of versioning, maybe we should
consider adding optional behavior to the decoder to permit *type-promotion*
when the field's on-wire type can be safely decoded into the expected type?

Provided that the expected type can faithfully represent all the values of
the wire type, it seems like a way to follow the "TCP" pattern of being
permissive in what you receive, but strict in what you send.

In terms of thrift TTypes, we could allow a decoder to promote:
bool > i08 > i16 > i32 > (double or i64)

This wouldn't solve every problem - like sending a response struct with an
i32 field to a client that expects i16.  No go there.  This would only be
useful in cases of having to read client input from a mixed bag of old and
new clients, where the server response struct doesn't need any wider
fields.  I'm not sure if that is your use-case...

Those using thrift for serialization to/from disk, I would imagine
type-promotion could save a lot of effort in re-encoding any old files that
use narrower numeric types than newer files...

Thoughts? 


Re: updating to longer ints

Posted by Will Pierce <wi...@nuclei.com>.
Definitely not an easy problem.  On the wire these will both encode as the
same field ID (1) and a type ID of 6 for the i16 or a type ID of 8 for the
i32.  Most (all?) thrift language bindings check to ensure the type ID for
a field matches the expected type ID for that particular field, silently
skipping past any field with mismatched type IDs.

George Chung's suggestion is the safest/cleanest approach.

I had exactly this same problem where I needed to expand from i32 to i64.
 It was a "flag day" cutover for me, since I had control of all the clients
and servers.  But I can see how you frequently can't upgrade all the client
code out in the wild.  (Or data files, if these are persisted.)

Rather than try to solve the broader problem of versioning, maybe we should
consider adding optional behavior to the decoder to permit *type-promotion*
when the field's on-wire type can be safely decoded into the expected type?

Provided that the expected type can faithfully represent all the values of
the wire type, it seems like a way to follow the "TCP" pattern of being
permissive in what you receive, but strict in what you send.

In terms of thrift TTypes, we could allow a decoder to promote:
 bool > i08 > i16 > i32 > (double or i64)

This wouldn't solve every problem - like sending a response struct with an
i32 field to a client that expects i16.  No go there.  This would only be
useful in cases of having to read client input from a mixed bag of old and
new clients, where the server response struct doesn't need any wider
fields.  I'm not sure if that is your use-case...

Those using thrift for serialization to/from disk, I would imagine
type-promotion could save a lot of effort in re-encoding any old files that
use narrower numeric types than newer files...

Thoughts?