Posted to dev@geode.apache.org by Jacob Barrett <jb...@pivotal.io> on 2017/12/01 21:13:38 UTC

DISCUSS: Deprecating and replacing current serializable string encodings

The current data serializer supports 4 different encodings (in addition to
the 3 or 4 other encodings I have found elsewhere in Geode). It supports
ASCII prefixed with a uint16 length (64K), ASCII prefixed with a uint32
length, Java modified UTF-8 with a uint16 length (64K), and finally UTF-16
with a uint32 length (a count of UTF-16 code units, not bytes). When
serializing a string it is first scanned for any non-ASCII characters and
the number of bytes a modified UTF-8 encoding would require is calculated.
If the encoded length equals the original length then the string is pure
ASCII: the 16-bit length version is used if the length is less than 2^16,
otherwise the 32-bit length version. If the encoded length is greater than
the original but less than 2^16 then modified UTF-8 with a 16-bit length is
used, otherwise UTF-16 with a 32-bit length.
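
To make that selection logic concrete, here is a minimal sketch of the
decision tree (writeString and modifiedUtf8Length are illustrative names,
not Geode's actual internals; the length rule matches what
java.io.DataOutputStream.writeUTF uses):

    // Sketch of the current selection logic; names are illustrative.
    static void writeString(String s, DataOutput out) throws IOException {
      int utfLen = modifiedUtf8Length(s);
      boolean ascii = (utfLen == s.length());  // equal => every char was 0x01-0x7F
      if (ascii && utfLen < 0x10000) {
        // write ASCII bytes with a uint16 length
      } else if (ascii) {
        // write ASCII bytes with a uint32 length
      } else if (utfLen < 0x10000) {
        // write modified UTF-8 bytes with a uint16 length
      } else {
        // write UTF-16 code units with a uint32 code unit count
      }
    }

    // Same length rule as java.io.DataOutputStream.writeUTF.
    static int modifiedUtf8Length(String s) {
      int len = 0;
      for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c >= 0x0001 && c <= 0x007F) len += 1;  // one byte (NUL excluded)
        else if (c <= 0x07FF)           len += 2;  // two bytes, incl. NUL as 0xC0 0x80
        else                            len += 3;  // three bytes
      }
      return len;
    }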

When working with non-Java clients, dealing with modified UTF-8 requires
conversion between it and standard UTF-8, as modified UTF-8 encodes the NULL
character in two bytes. Only Java's DataInput/Output and JNI use this
encoding, and it was intended for internal and Java serialization only. The
StreamReader/Writer use standard UTF-8. Since our serialization expects
modified UTF-8 strings to be prefixed with a 16-bit length, care has to be
taken to calculate the modified length up front (or seek the buffer and
rewrite the length if NULLs are encountered). Since the modified length may
vary from the standard length, care must then be taken to make sure that,
if the string must be truncated to fit within the 16-bit limit, it is not
cut in the middle of a multi-byte sequence.
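
The incompatibility is easy to demonstrate with plain JDK APIs; nothing
here is Geode-specific:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.nio.charset.StandardCharsets;

    public class NulDemo {
      public static void main(String[] args) throws Exception {
        String s = "a\u0000b";
        // Standard UTF-8: NULL is the single byte 0x00.
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);   // 61 00 62
        // Modified UTF-8 (DataOutput.writeUTF): NULL is the overlong pair 0xC0 0x80.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeUTF(s);                  // uint16 len, then 61 C0 80 62
        System.out.println(standard.length);  // 3
        System.out.println(buf.size());       // 6 = 2 (length prefix) + 4 (payload)
      }
    }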

Encoding in UTF-16 isn't all that bad, except that it is mostly wasted
space when strings are ASCII. There are no real encoding issues between
languages since most support it or use it as their internal representation,
like Java does. But we are talking about serialization here, and typically
space is the constraint. Most Latin characters are low enough in the Basic
Multilingual Plane to be encoded in at most 2 bytes of UTF-8, taking up no
more space than the UTF-16 encoded version. Other characters will take up
more space.
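
A quick payload-size comparison illustrates the trade-off (UTF_16BE is used
so a BOM isn't counted):

    import java.nio.charset.StandardCharsets;

    public class SizeDemo {
      public static void main(String[] args) {
        for (String s : new String[] { "hello", "héllo", "\u4f60\u597d" }) {
          System.out.printf("%s  UTF-8: %d bytes  UTF-16: %d bytes%n", s,
              s.getBytes(StandardCharsets.UTF_8).length,
              s.getBytes(StandardCharsets.UTF_16BE).length);
        }
        // hello  UTF-8: 5  UTF-16: 10  (ASCII: UTF-16 doubles the size)
        // héllo  UTF-8: 6  UTF-16: 10  (Latin: UTF-8 is still no worse)
        // 你好   UTF-8: 6  UTF-16: 4   (CJK: UTF-16 wins)
      }
    }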

Since we took the care to optimize ASCII, one can assume we figured out
that ASCII was our most common character set. Regardless of the correctness
of this assertion, it makes no sense to treat ASCII and UTF-8 streams
differently, as ASCII can be fully encoded byte for byte in UTF-8 without
any overhead.

So what I would like to propose is that we deprecate all these methods and
replace them with standard UTF-8 prefixed with a uint64 length. It is
preferable that the length be variable-length encoded to reduce the
overhead of encoding small strings. Why such a large length? Consider that
different languages have different limits, and that Java stores strings
internally as UTF-16.
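
As a sketch only, the writer could look something like the following,
assuming an unsigned-LEB128-style varint for the length (the exact varint
scheme would be settled as part of the proposal; names are illustrative):

    import java.io.DataOutput;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    // Illustrative only: standard UTF-8 bytes prefixed with a varint uint64 byte count.
    public final class Utf8Strings {
      static void writeUtf8String(String s, DataOutput out) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        long len = utf8.length;
        // Unsigned LEB128: 7 bits per byte, high bit set on all but the last byte.
        // Strings under 128 bytes pay only a single length byte.
        do {
          int b = (int) (len & 0x7F);
          len >>>= 7;
          out.writeByte(len != 0 ? (b | 0x80) : b);
        } while (len != 0);
        out.write(utf8);
      }
    }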

A Java UTF-16 string has a max length of 2^31-1 code units; encoded in
UTF-8 it would have a maximum, though highly improbable, length of 2^33-1
bytes. Serializing as UTF-8 with a uint32 length limits the max Java string
length to 2^29-1, or 536870911, UTF-16 code points. This is probably a
reasonable limitation, but we have the technology to do better. ;) Since
the server is Java it is reasonable to keep the max string length we
serialize consistent with Java's; therefore we need at least 33 bits of
length.

For reference, a C++11 std::basic_string has a max capacity that is
platform dependent, but on 64-bit Linux with GCC it is 2^63-1. The
basic_string can be UTF-8, UTF-16, or UTF-32.

The important part of this proposal is to convert everything to using
standard UTF-8 and deprecate all the other methods. I would ask that we
drop the other methods completely at the next major release. Not having to
implement 4 encodings in each of our clients will help development of new
clients. Not having to translate between standard and non-standard string
types will help performance and reduce coding errors. All the other string
encodings I have found should be handled in the new protocol we are working
on, which is now using standard UTF-8, and are therefore outside the scope
of this proposal and discussion.

-Jake

Re: DISCUSS: Deprecating and replacing current serializable string encodings

Posted by Jacob Barrett <jb...@pivotal.io>.
On Fri, Dec 1, 2017 at 1:25 PM Bruce Schuchardt <bs...@pivotal.io>
wrote:

> I'm wondering how we would maintain backward compatibility with this
> change?  Geode accepts serialized data from a client and keeps it in
> serialized form and might transmit this serialized data to an older
> version peer or client.  An older peer or client wouldn't be able to
> handle a new string encoding if it tried to access and deserialize the
> data.
>

While not efficient for the long term, it would be sufficient for rolling
upgrades to have the client/server layer transcode the serialized form for
older clients.
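
As a sketch, assuming the varint-plus-UTF-8 format proposed above
(readVarint and the choice of fallback encoding are illustrative, not
existing Geode APIs):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    // Illustrative only: rewrite a new-format string for an old-format client.
    final class StringTranscoder {
      static void transcodeForOldClient(DataInput in, DataOutput out) throws IOException {
        long len = readVarint(in);            // new format: varint byte count
        byte[] utf8 = new byte[(int) len];
        in.readFully(utf8);
        String s = new String(utf8, StandardCharsets.UTF_8);
        // Old clients understand java.io's modified UTF-8 with a uint16 length;
        // strings too long for that would need the UTF-16/uint32 form instead.
        out.writeUTF(s);
      }

      static long readVarint(DataInput in) throws IOException {
        long value = 0;
        int shift = 0, b;
        do {
          b = in.readUnsignedByte();
          value |= (long) (b & 0x7F) << shift;
          shift += 7;
        } while ((b & 0x80) != 0);
        return value;
      }
    }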

Re: DISCUSS: Deprecating and replacing current serializable string encodings

Posted by Michael Stolz <ms...@pivotal.io>.
It could always WRITE UTF-16 strings, but it will still need to be able to
read the others.

--
Mike Stolz
Principal Engineer, GemFire Product Lead
Mobile: +1-631-835-4771

On Fri, Dec 1, 2017 at 7:58 PM, Dan Smith <ds...@pivotal.io> wrote:

> I think I'm kinda with Mike on this one. The existing string format does
> seem pretty gnarly. But the complexity of implementing and testing all of
> the backwards compatibility transcoding that would be required in order to
> move to the new proposed format seems to be way more work with much more
> possibility for errors. Do we really expect people to be writing new
> clients that use DataSerializable? It hasn't happened yet, and we're
> working on a new protocol that uses protobuf right now.
>
> If the issue is really the complexity of serialization from the C++ client,
> maybe the C++ client could always write UTF-16 strings?
>
> -Dan
>
> On Fri, Dec 1, 2017 at 4:17 PM, Michael Stolz <ms...@pivotal.io> wrote:
>
> > My opinion is that risk/reward on this one is not worth it
> >
> > --
> > Mike Stolz
> > Principal Engineer - Gemfire Product Manager
> > Mobile: 631-835-4771
> >
> > On Dec 1, 2017 5:19 PM, "Jacob Barrett" <jb...@pivotal.io> wrote:
> >
> > > On Fri, Dec 1, 2017 at 1:48 PM Michael Stolz <ms...@pivotal.io> wrote:
> > >
> > > > There also would likely be Disk Stores that would need to be converted.
> > > > That would be real ugly too.
> > >
> > > Disk store could transcode each entry on demand or all at once on first
> > > load. Not saying it will be easy but progress rarely is.
> > >
> > > -Jake

Re: DISCUSS: Deprecating and replacing current serializable string encodings

Posted by Michael William Dodge <md...@pivotal.io>.
I think there is value in having a single string encoding.

Sarge

> On 1 Dec, 2017, at 17:35, Jacob Barrett <jb...@pivotal.io> wrote:
> 
> On Fri, Dec 1, 2017 at 4:59 PM Dan Smith <ds...@pivotal.io> wrote:
> 
>> I think I'm kinda with Mike on this one. The existing string format does
>> seem pretty gnarly. But the complexity of implementing and testing all of
>> the backwards compatibility transcoding that would be required in order to
>> move to the new proposed format seems to be way more work with much more
>> possibility for errors. Do we really expect people to be writing new
>> clients that use DataSerializable? It hasn't happened yet, and we're
>> working on a new protocol that uses protobuf right now.
>> 
> 
> Consider that any new clients written would have to implement all these
> encodings. This is going to make writing new clients using the upcoming new
> protocol laborious. The new protocol does not define object encoding, it
> strictly defines message encoding. Objects sent over the protocol will have
> to be serialized in some format, like PDX or data serializer. We could
> always develop a better serialization format than what we have now. If we
> don't develop something new then we have to use the old. Wouldn't it be
> nice if the new clients didn't have to deal with legacy encodings?
> 
>> If the issue is really the complexity of serialization from the C++ client,
>> maybe the C++ client could always write UTF-16 strings?
>> 
> 
> You can't assume that a client in one language will only be serializing
> strings for its own consumption. We have many people using strings in PDX
> to transform between C++, .NET and Java.
> 
> The risk of not removing this debt is high. If I am developing a new Ruby
> client I am forced to deal with all 4 of these encodings. Am I really going
> to want to build a Ruby client for Geode? Am I going to get these encodings
> correct? I can tell you that getting them correct may be a challenge: if the
> current C++ client is any indication, it has a few incorrect assumptions in
> its encoding of ASCII and modified UTF-8.
> 
> I am fine with a compromise that deprecates but doesn't remove the old
> encodings for a few releases. This would give time for users to update. New
> clients written would not be able to read this old data but could read
> and write new data.
> 
> 
> 
> -Jake


Re: DISCUSS: Deprecating and replacing current serializable string encodings

Posted by Jacob Barrett <jb...@pivotal.io>.
On Mon, Dec 4, 2017 at 1:52 PM Dan Smith <ds...@pivotal.io> wrote:

> The new protocol is currently translating from PDX->JSON before sending
> results to the clients so the client doesn't have to understand PDX or
> DataSerializable.
>

For now it does, but we all know that isn't a viable long-term solution.
We will either have to support PDX, at which time all clients will have to
implement the nightmare that is PDX, or replace it with something better.


> There is a lot more to DataSerializable than just how a String is
> serialized. And it's not even documented that I am aware of. Just tweaking
> the string format is not going to make that much better. Your hypothetical
> Ruby developer is in trouble with or without this proposed change.
>

Very true, but just because you can't fix everything doesn't mean you
shouldn't try to fix something.


> Breaking compatibility is a huge PITA for our users. We should do that when
> we are actually giving them real benefits. In this case if we were
> switching to some newer PDX format that was actually easy to implement
> deserialization logic I could see the argument for breaking compatibility.
> Just changing the string format without fixing the rest of the issues
> around DataSerializable isn't providing real benefits.
>

Great, let's prioritize some new PDX format that doesn't have 4 different
string encodings, 2 of which are incorrect and one of which isn't a standard.


> > You can't assume that a client in one language will only be serializing
> > strings for its own consumption.
> >
>
> I wasn't making that assumption. The suggestion is that the C++ client
> would have to deserialize all 4 valid formats, but it could just always
> serialize data using the valid UTF-16 format. All other clients should be
> able to deserialize that.
>

Unfortunately it's the Java modified UTF-8 to anything that is the PITA:
since it doesn't comply with any standard, it must be implemented by hand.
Reading it but not writing it only makes it 1/2 a PITA.
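
For a sense of what "by hand" means, here is a rough sketch of reading
modified UTF-8 into UTF-16 code units, error handling elided (illustrative
only, not the actual client code):

    // Modified UTF-8 is always 1-3 bytes per UTF-16 code unit; supplementary
    // characters arrive as two separately-encoded surrogate halves.
    static String readModifiedUtf8(byte[] b) {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < b.length; ) {
        int c = b[i] & 0xFF;
        if (c < 0x80) {                   // 1 byte; 0x00 never appears on the wire
          sb.append((char) c);
          i += 1;
        } else if ((c & 0xE0) == 0xC0) {  // 2 bytes; 0xC0 0x80 decodes to NUL
          sb.append((char) (((c & 0x1F) << 6) | (b[i + 1] & 0x3F)));
          i += 2;
        } else {                          // 3 bytes, incl. each surrogate half
          sb.append((char) (((c & 0x0F) << 12)
              | ((b[i + 1] & 0x3F) << 6) | (b[i + 2] & 0x3F)));
          i += 3;
        }
      }
      return sb.toString();
    }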


-Jake

Re: DISCUSS: Deprecating and replacing current serializable string encodings

Posted by Jacob Barrett <jb...@pivotal.io>.
On Mon, Dec 4, 2017 at 8:45 PM Michael Stolz <ms...@pivotal.io> wrote:

> Anything that breaks data on disk is also a big PITA. This change would
> break data on disk.


Changing the on-wire serialization format doesn't necessitate changing the
on-disk format. While it would be easier or more performant to have them
be the same, it isn't necessary. It also isn't necessary that all entries
be stored in the same format.

-Jake

Re: DISCUSS: Deprecating and replacing current serializable string encodings

Posted by Michael Stolz <ms...@pivotal.io>.
Anything that breaks data on disk is also a big PITA. This change would
break data on disk.

--
Mike Stolz
Principal Engineer, GemFire Product Lead
Mobile: +1-631-835-4771

On Mon, Dec 4, 2017 at 1:52 PM, Dan Smith <ds...@pivotal.io> wrote:

> The new protocol is currently translating from PDX->JSON before sending
> results to the clients so the client doesn't have to understand PDX or
> DataSerializable.
>
> There is a lot more to DataSerializable than just how a String is
> serialized. And it's not even documented that I am aware of. Just tweaking
> the string format is not going to make that much better. Your hypothetical
> Ruby developer is in trouble with or without this proposed change.
>
> Breaking compatibility is a huge PITA for our users. We should do that when
> we are actually giving them real benefits. In this case if we were
> switching to some newer PDX format that was actually easy to implement
> deserialization logic I could see the argument for breaking compatibility.
> Just changing the string format without fixing the rest of the issues
> around DataSerializable isn't providing real benefits.
>
> > You can't assume that a client in one language will only be serializing
> > strings for its own consumption.
> >
>
> I wasn't making that assumption. The suggestion is that the C++ client
> would have to deserialize all 4 valid formats, but it could just always
> serialize data using the valid UTF-16 format. All other clients should be
> able to deserialize that.
>
> -Dan
>
> On Fri, Dec 1, 2017 at 5:35 PM, Jacob Barrett <jb...@pivotal.io> wrote:
>
> > On Fri, Dec 1, 2017 at 4:59 PM Dan Smith <ds...@pivotal.io> wrote:
> >
> > > I think I'm kinda with Mike on this one. The existing string format does
> > > seem pretty gnarly. But the complexity of implementing and testing all of
> > > the backwards compatibility transcoding that would be required in order to
> > > move to the new proposed format seems to be way more work with much more
> > > possibility for errors. Do we really expect people to be writing new
> > > clients that use DataSerializable? It hasn't happened yet, and we're
> > > working on a new protocol that uses protobuf right now.
> > >
> >
> > Consider that any new clients written would have to implement all these
> > encodings. This is going to make writing new clients using the upcoming new
> > protocol laborious. The new protocol does not define object encoding, it
> > strictly defines message encoding. Objects sent over the protocol will have
> > to be serialized in some format, like PDX or data serializer. We could
> > always develop a better serialization format than what we have now. If we
> > don't develop something new then we have to use the old. Wouldn't it be
> > nice if the new clients didn't have to deal with legacy encodings?
> >
> > > If the issue is really the complexity of serialization from the C++ client,
> > > maybe the C++ client could always write UTF-16 strings?
> > >
> >
> > You can't assume that a client in one language will only be serializing
> > strings for its own consumption. We have many people using strings in PDX
> > to transform between C++, .NET and Java.
> >
> > The risk of not removing this debt is high. If I am developing a new Ruby
> > client I am forced to deal with all 4 of these encodings. Am I really going
> > to want to build a Ruby client for Geode? Am I going to get these encodings
> > correct? I can tell you that getting them correct may be a challenge: if the
> > current C++ client is any indication, it has a few incorrect assumptions in
> > its encoding of ASCII and modified UTF-8.
> >
> > I am fine with a compromise that deprecates but doesn't remove the old
> > encodings for a few releases. This would give time for users to update. New
> > clients written would not be able to read this old data but could read
> > and write new data.
> >
> > -Jake
> >
>

Re: DISCUSS: Deprecating and replacing current serializable string encodings

Posted by Dan Smith <ds...@pivotal.io>.
The new protocol is currently translating from PDX->JSON before sending
results to the clients so the client doesn't have to understand PDX or
DataSerializable.

There is a lot more to DataSerializable than just how a String is
serialized. And it's not even documented that I am aware of. Just tweaking
the string format is not going to make that much better. Your hypothetical
Ruby developer is in trouble with or without this proposed change.

Breaking compatibility is a huge PITA for our users. We should do that when
we are actually giving them real benefits. In this case if we were
switching to some newer PDX format that was actually easy to implement
deserialization logic I could see the argument for breaking compatibility.
Just changing the string format without fixing the rest of the issues
around DataSerializable isn't providing real benefits.

> You can't assume that a client in one language will only be serializing
> strings for its own consumption.
>

I wasn't making that assumption. The suggestion is that the C++ client
would have to deserialize all 4 valid formats, but it could just always
serialize data using the valid UTF-16 format. All other clients should be
able to deserialize that.

-Dan

On Fri, Dec 1, 2017 at 5:35 PM, Jacob Barrett <jb...@pivotal.io> wrote:

> On Fri, Dec 1, 2017 at 4:59 PM Dan Smith <ds...@pivotal.io> wrote:
>
> > I think I'm kinda with Mike on this one. The existing string format does
> > seem pretty gnarly. But the complexity of implementing and testing all of
> > the backwards compatibility transcoding that would be required in order to
> > move to the new proposed format seems to be way more work with much more
> > possibility for errors. Do we really expect people to be writing new
> > clients that use DataSerializable? It hasn't happened yet, and we're
> > working on a new protocol that uses protobuf right now.
> >
>
> Consider that any new clients written would have to implement all these
> encodings. This is going to make writing new clients using the upcoming new
> protocol laborious. The new protocol does not define object encoding, it
> strictly defines message encoding. Objects sent over the protocol will have
> to be serialized in some format, like PDX or data serializer. We could
> always develop a better serialization format than what we have now. If we
> don't develop something new then we have to use the old. Wouldn't it be
> nice if the new clients didn't have to deal with legacy encodings?
>
> > If the issue is really the complexity of serialization from the C++ client,
> > maybe the C++ client could always write UTF-16 strings?
> >
>
> You can't assume that a client in one language will only be serializing
> strings for its own consumption. We have many people using strings in PDX
> to transform between C++, .NET and Java.
>
> The risk of not removing this debt is high. If I am developing a new Ruby
> client I am forced to deal with all 4 of these encodings. Am I really going
> to want to build a Ruby client for Geode? Am I going to get these encodings
> correct? I can tell you that getting them correct may be a challenge: if the
> current C++ client is any indication, it has a few incorrect assumptions in
> its encoding of ASCII and modified UTF-8.
>
> I am fine with a compromise that deprecates but doesn't remove the old
> encodings for a few releases. This would give time for users to update. New
> clients written would not be able to read this old data but could read
> and write new data.
>
> -Jake
>

Re: DISCUSS: Deprecating and replacing current serializable string encodings

Posted by Jacob Barrett <jb...@pivotal.io>.
On Fri, Dec 1, 2017 at 4:59 PM Dan Smith <ds...@pivotal.io> wrote:

> I think I'm kinda with Mike on this one. The existing string format does
> seem pretty gnarly. But the complexity of implementing and testing all of
> the backwards compatibility transcoding that would be required in order to
> move to the new proposed format seems to be way more work with much more
> possibility for errors. Do we really expect people to be writing new
> clients that use DataSerializable? It hasn't happened yet, and we're
> working on a new protocol that uses protobuf right now.
>

Consider that any new clients written would have to implement all these
encodings. This is going to make writing new clients using the upcoming new
protocol laborious. The new protocol does not define object encoding, it
strictly defines message encoding. Objects sent over the protocol will have
to be serialized in some format, like PDX or data serializer. We could
always develop a better serialization format than what we have now. If we
don't develop something new then we have to use the old. Wouldn't it be
nice if the new clients didn't have to deal with legacy encodings?

> If the issue is really the complexity of serialization from the C++ client,
> maybe the C++ client could always write UTF-16 strings?
>

You can't assume that a client in one language will only be serializing
strings for its own consumption. We have many people using strings in PDX
to transform between C++, .NET and Java.

The risk of not removing this debt is high. If I am developing a new Ruby
client I am forced to deal with all 4 of these encodings. Am I really going
to want to build a Ruby client for Geode? Am I going to get these encodings
correct? I can tell you that getting them correct may be a challenge: if the
current C++ client is any indication, it has a few incorrect assumptions in
its encoding of ASCII and modified UTF-8.

I am fine with a compromise that deprecates but doesn't remove the old
encodings for a few releases. This would give time for users to update. New
clients written would not be able to read this old data but could read
and write new data.



-Jake

Re: DISCUSS: Deprecating and replacing current serializable string encodings

Posted by Dan Smith <ds...@pivotal.io>.
I think I'm kinda with Mike on this one. The existing string format does
seem pretty gnarly. But the complexity of implementing and testing all of
the backwards compatibility transcoding that would be required in order to
move to the new proposed format seems to be way more work with much more
possibility for errors. Do we really expect people to be writing new
clients that use DataSerializable? It hasn't happened yet, and we're
working on a new protocol that uses protobuf right now.

If the issue is really the complexity of serialization from the C++ client,
maybe the C++ client could always write UTF-16 strings?

-Dan

On Fri, Dec 1, 2017 at 4:17 PM, Michael Stolz <ms...@pivotal.io> wrote:

> My opinion is that risk/reward on this one is not worth it
>
> --
> Mike Stolz
> Principal Engineer - Gemfire Product Manager
> Mobile: 631-835-4771
>
> On Dec 1, 2017 5:19 PM, "Jacob Barrett" <jb...@pivotal.io> wrote:
>
> > On Fri, Dec 1, 2017 at 1:48 PM Michael Stolz <ms...@pivotal.io> wrote:
> >
> > > There also would likely be Disk Stores that would need to be converted.
> > > That would be real ugly too.
> >
> >
> > Disk store could transcode each entry on demand or all at once on first
> > load. Not saying it will be easy but progress rarely is.
> >
> > -Jake
> >
>

Re: DISCUSS: Deprecating and replacing current serializable string encodings

Posted by Michael Stolz <ms...@pivotal.io>.
My opinion is that risk/reward on this one is not worth it

--
Mike Stolz
Principal Engineer - Gemfire Product Manager
Mobile: 631-835-4771

On Dec 1, 2017 5:19 PM, "Jacob Barrett" <jb...@pivotal.io> wrote:

> On Fri, Dec 1, 2017 at 1:48 PM Michael Stolz <ms...@pivotal.io> wrote:
>
> > There also would likely be Disk Stores that would need to be converted.
> > That would be real ugly too.
>
>
> Disk store could transcode each entry on demand or all at once on first
> load. Not saying it will be easy but progress rarely is.
>
> -Jake
>

Re: DISCUSS: Deprecating and replacing current serializable string encodings

Posted by Jacob Barrett <jb...@pivotal.io>.
On Fri, Dec 1, 2017 at 1:48 PM Michael Stolz <ms...@pivotal.io> wrote:

> There also would likely be Disk Stores that would need to be converted.
> That would be real ugly too.


Disk store could transcode each entry on demand or all at once on first
load. Not saying it will be easy but progress rarely is.

-Jake

Re: DISCUSS: Deprecating and replacing current serializable string encodings

Posted by Michael Stolz <ms...@pivotal.io>.
There also would likely be Disk Stores that would need to be converted.
That would be real ugly too.

--
Mike Stolz
Principal Engineer - Gemfire Product Manager
Mobile: 631-835-4771

On Dec 1, 2017 4:25 PM, "Bruce Schuchardt" <bs...@pivotal.io> wrote:

> I'm wondering how we would maintain backward compatibility with this
> change?  Geode accepts serialized data from a client and keeps it in
> serialized form and might transmit this serialized data to an older version
> peer or client.  An older peer or client wouldn't be able to handle a new
> string encoding if it tried to access and deserialize the data.
>
> On 12/1/17 1:13 PM, Jacob Barrett wrote:
>
>> So what I would like to propose is that we deprecate all these methods and
>> replace them with standard UTF-8 prefixed with a uint64 length. It is
>> preferable that the length be variable-length encoded to reduce the
>> overhead of encoding small strings. Why such a large length? Consider that
>> different languages have different limits, and that Java stores strings
>> internally as UTF-16.
>>
>
>

Re: DISCUSS: Deprecating and replacing current serializable string encodings

Posted by Bruce Schuchardt <bs...@pivotal.io>.
I'm wondering how we would maintain backward compatibility with this 
change?  Geode accepts serialized data from a client and keeps it in 
serialized form and might transmit this serialized data to an older 
version peer or client.  An older peer or client wouldn't be able to 
handle a new string encoding if it tried to access and deserialize the data.

On 12/1/17 1:13 PM, Jacob Barrett wrote:
> So what I would like to propose is that we deprecate all these methods and
> replace them with standard UTF-8 prefixed with a uint64 length. It is
> preferable that the length be variable-length encoded to reduce the
> overhead of encoding small strings. Why such a large length? Consider that
> different languages have different limits, and that Java stores strings
> internally as UTF-16.