You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by Almog Gavra <al...@confluent.io> on 2019/08/06 20:40:31 UTC

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Hello Everyone,

Summarizing an in-person discussion with Randall (this is copied from the
KIP):

The original KIP suggested supporting an additional representation - base10
encoded text (e.g. `{"asText":"10.2345"}`). This causes issues because it
is impossible to disambiguate between TEXT and BINARY without an additional
config - furthermore, this makes the migration from one to the other nearly
impossible because it would require that all consumers stop consuming and
producers stop producing and atomically updating the config on all of them
after deploying the new code, or waiting for the full retention period to
pass - neither option is viable. The suggestion in the KIP is strictly an
improvement over the existing behavior, even if it doesn't support all
combinations.

It seems that since most real-world use cases actually use the numeric
representation (not string) we can consider this an improvement. With the
new suggestion, we don't need a deserialization configuration (only a
serialization option) and the consumers will be able to always
automatically determine the serialization format.

Based on this, I'll be opening up the simplified version of the KIP to a
vote.

Almog

On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <al...@confluent.io> wrote:

> I'm mostly happy with your current suggestion (two configs, one for
> serialization and one for deserialization) and your implementation
> suggestion. One thing to note:
>
> > We should _always_ be able to deserialize a standard JSON
> > number as a decimal
>
> I was doing some research into decimals and JSON, and I can imagine a
> compelling reason to require string representations to avoid losing
> precision and to be certain that whomever is sending the data isn't losing
> precision (e.g. https://stackoverflow.com/a/38357877/2258040).
>
> I'm okay with always allowing numerics, but thought it's worth raising the
> thought.
>
> On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <an...@confluent.io> wrote:
>
>> The way I see it, we need to control two seperate things:
>>
>> 1. How do we _deserialize_ a decimal type if we encounter a text node in
>> the JSON?    (We should _always_ be able to deserialize a standard JSON
>> number as a decimal).
>> 2. How do we chose how we want decimals to be _serialized_.
>>
>> This looks to fits well with your second suggestion of slightly different
>> configs names for serialization vs deserialization.
>> a, For deserialization we only care about how to handle text nodes: `
>> deserialization.decimal.*text*.format`, which should only have two valid
>> values BINARY | TEXT.
>> b. For serialization we need all three: `serialization.decimal.format`,
>> which should support all three options: BINARY | TEXT | NUMERIC.
>>
>> Implementation wise, I think these should be two separate enums, rather
>> than one shared enum and throwing an error if the deserializer is set to
>> NUMERIC.  Mainly as this means the enums reflect the options available,
>> rather than this being hidden in config checking code.  But that's a minor
>> implementation detail.
>>
>> Personally, I'd be tempted to have the BINARY value named something like
>> `LEGACY` or `LEGACY_BINARY` as a way of encouraging users to move away
>> from
>> it.
>>
>> It's a real shame that both of these settings require a default of BINARY
>> for backwards compatibility, but I agree that discussions / plans around
>> switching the defaults should not block this KIP.
>>
>> Andy
>>
>>
>> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <al...@confluent.io> wrote:
>>
>> > Thanks for the replies Andy and Andrew (2x Andy?)!
>> >
>> > > Is the text decimal a base16 encoded number, or is it base16 encoded
>> > binary
>> > > form of the number?
>> >
>> > The conversion happens as decimal.unscaledValue().toByteArray() and then
>> > the byte array is converted to a hex string, so it's definitely the
>> binary
>> > form of the number converted to base16. Whether or not that's the same
>> as
>> > the base16 encoded number is a good question (toByteArray returns a byte
>> > array containing a signed, big-endian, two's complement representation
>> of
>> > the big integer).
>> >
>> > > One suggestion I have is to change the proposed new config to only
>> affect
>> > > decimals stored as text, i.e. to switch between the current base16 and
>> > the
>> > > more common base10.   Then add another config to the serializer only
>> that
>> > > controls if decimals should be serialized as text or numeric.
>> >
>> > I think we need to be able to handle all mappings from serialization
>> format
>> > to deserialization format (e.g. read in BINARY and output TEXT), which I
>> > think would be impossible with the alternative suggestion. I agree that
>> > automatically deserializing numerics is valuable. I see two other ways
>> to
>> > get this, both keeping the serialization.format config the same:
>> >
>> > - have json.decimal.deserialization.format accept all three formats. if
>> set
>> > to BINARY/TEXT, numerics would be automatically supported. If set to
>> > NUMERIC, then any string coming in would result in deserialization error
>> > (defaults to BINARY for backwards compatibility)
>> > - change json.decimal.deserialization.format to
>> > json.decimal.deserialization.string.format which accepts only
>> BINARY/TEXT
>> > (defaults to BINARY for backwards compatibility)
>> >
>> > > would be a breaking change in that things that previously failed would
>> > > suddenly start deserializing.  This is a price I'm willing to pay.
>> >
>> > I agree. I'm willing to pay this price too.
>> >
>> > > IMHO, we should then plan to switch the default of decimal
>> serialization
>> > to
>> > > numeric, and text serialization to base 10 in the next major release.
>> >
>> > I think that can be a separate discussion, I don't want to block this
>> KIP
>> > on it.
>> >
>> > Thoughts?
>> >
>> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <ot...@wikimedia.org> wrote:
>> >
>> > > This is a bit orthogonal, but in JsonSchemaConverter I use
>> JSONSchemas to
>> > > indicate whether a JSON number should be deserialized as an integer
>> or a
>> > > decimal
>> > > <
>> > >
>> >
>> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
>> > > >.
>> > > Not everyone is going to have JSONSchemas available when converting,
>> but
>> > if
>> > > you do, it is an easy way to support JSON numbers as decimals.
>> > >
>> > > Carry on! :)
>> > >
>> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <an...@confluent.io>
>> wrote:
>> > >
>> > > > Hi Almog,
>> > > >
>> > > > Like the KIP - I think being able to support decimals in JSON in the
>> > same
>> > > > way most other systems do is a great improvement.
>> > > >
>> > > > It's not 100% clear to me from the KIP what the current format is.
>> Is
>> > > the
>> > > > text decimal a base16 encoded number, or is it base16 encoded binary
>> > form
>> > > > of the number? (I've not tried to get my head around if these two
>> are
>> > > even
>> > > > different!)
>> > > >
>> > > > One suggestion I have is to change the proposed new config to only
>> > affect
>> > > > decimals stored as text, i.e. to switch between the current base16
>> and
>> > > the
>> > > > more common base10.   Then add another config to the serialzier only
>> > that
>> > > > controls if decimals should be serialized as text or numeric.  The
>> > > benefit
>> > > > of this approach is it allows us to enhance the deserializer to
>> > > > automatically handle numeric decimals even without any config
>> having to
>> > > be
>> > > > set, i.e. default config in the deserializer would be able to handle
>> > > > numeric decimals.  Of course, this is a two edged sword: this would
>> > make
>> > > > the deserializer work out of the box with numeric decimals, (yay!),
>> but
>> > > > would be a breaking change in that things that previously failed
>> would
>> > > > suddenly start deserializing.  This is a price I'm willing to pay.
>> > > >
>> > > > IMHO, we should then plan to switch the default of decimal
>> > serialization
>> > > to
>> > > > numeric, and text serialization to base 10 in the next major
>> release.
>> > > > (With upgrade notes to match). Though I know this is more
>> contentious,
>> > I
>> > > > think it moves us forward in a much more standard way that the
>> current
>> > > > encoding of decimals.
>> > > >
>> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <al...@confluent.io>
>> wrote:
>> > > >
>> > > > > Hi Everyone!
>> > > > >
>> > > > > Kicking off discussion for a new KIP:
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
>> > > > >
>> > > > > For those who are interested, I have a prototype implementation
>> that
>> > > > helped
>> > > > > guide my design: https://github.com/agavra/kafka/pull/1
>> > > > >
>> > > > > Cheers,
>> > > > > Almog
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Posted by Randall Hauch <rh...@gmail.com>.
Thanks for all the changes, Almog. The current KIP looks good to me.

Randall

On Fri, Aug 30, 2019 at 11:28 AM Almog Gavra <al...@confluent.io> wrote:

> Thanks again Randall! Here are the changes I made:
>
> - Defaults. The KIP had mentioned that the default would be BASE64 in the
> "Public Interfaces" section. I have also added your suggestion in "Proposed
> Changes".
> - I've added a bullet point to change the internal converter to use
> decimal.format=NUMERIC in Proposed Changes. If I see that this is not
> possible or cumbersome during implementation I will amend the KIP to keep
> this change minimally scoped.
> - Adopted your third suggestion
> - Adopted your migration plan proposal
> - Added rejected alternative for creating a new converter
>
> > calling `NumericNode.decimalValue()` will always return a
> java.math.BigDecimal even
> > if the underlying Number was not a BigDecimal
>
> I'm not sure I understand your comment here. NumericNode#decimalValue will
> always return a BigDecimal, regardless of what the underlying JSON data is
> - that hasn't changed with this KIP. The decimalValue() will only be called
> when converting to a DECIMAL logical type.
>
> What has changed, is that the NumericNode will store a BigDecimal instead
> of a double whenever the JSON value is a number with a decimal point in it
> (this should answer your fourth point, about deserializing a BigDecimal for
> all floating points).
>
> > Shouldn't the `convertToConnect(Schema, JsonNode) instead use the type of
> Number value as
> > parsed and returned by the JsonDeserializer to determine the proper
> schema
> > type
>
> There is no way to infer the "proper schema type" in the JsonDeserializer.
> If my JSON value is {"foo": 1.234} I have no idea whether "foo" is a
> decimal or a double - that's the reason we need a configuration value in
> first place. This means that in order to avoid precision loss, we must
> deserialize any floating point number first as a BigDecimal.
>
> > and then get the proper value type using that schema type's converter?
>
> That's exactly the proposal. I think this will be clear in the code.
>
> Almog
>
> On Fri, Aug 30, 2019 at 8:00 AM Randall Hauch <rh...@gmail.com> wrote:
>
> > Thanks for the updates, Almog. This looks really good, but I have a few
> > more comments (most wording, though one potentially thorny issue):
> >
> > First, the KIP should define the default value for the `decimal.format`
> > property. IIUC, it will be BASE64, and to gain the new behavior users
> will
> > have to explicitly set this to NUMERIC. In fact, I'd recommend changing
> > this bullet item from:
> >
> >    - JsonConverter will accept the decimal.format configuration to
> >    determine the serialization format. If the value is BASE64, the
> behavior
> >    remains unchanged (i.e. it serializes it as a base64 text). If the
> > value is
> >    NUMERIC, the JSON node will be a number representing that decimal
> (e.g.
> >    10.2345 instead of "D3J5").
> >
> > to something like:
> >
> >    - Define a new `decimal.format` configuration property on
> JsonConverter
> >    to specify the serialization format for Connect DECIMAL logical type
> >    values, with two allowed literal values for the configuration
> property:
> >    - `BASE64` specifies the existing behavior of serializing DECIMAL
> >       logical types as base64 encoded binary data (e.g., "D3J5" in the
> > example
> >       above), and will be the default; and
> >       - `NUMERIC` will serialize Connect DECIMAL logical type values in
> >       JSON as a number representing that decimal
> >
> > Second, since the default will be the existing BASE64 representation,
> what
> > do you think about changing the JsonConverter instances used by the
> Connect
> > worker for internal topics to enable `decimal.format=NUMERIC`? I don't
> > think we actually use decimals in the internal messages, but if we do at
> > some point then the converters will store them with the improved natural
> > representation.
> >
> > Third, the following bullet item could be more clear:
> >
> >    - JsonConverter will automatically handle deserialization of either
> >    serialization format given a Decimal logical type schema, i.e. it will
> >    accept both a deserialized BinaryNode and NumericNode. If the value
> is a
> >    BinaryNode, it will construct a java BigDecimal from the binaryValue()
> >    (which is a btye[]). If the value is a NumericNode, it will simply
> pass
> >    through the decimalValue() deserialized by the JsonDeserializer.
> >
> > such as maybe:
> >
> >    - The JsonConverter deserialization method currently expects only a
> >    BinaryNode, but will be changed to also handle NumericNode by calling
> >    NumericNode.decimalValue().
> >
> > This brings up an interesting potential issue: if `schemas.enable=false`,
> > then there will be no schema in the record, and calling
> > `NumericNode.decimalValue()` will always return a java.math.BigDecimal
> even
> > if the underlying Number was not a BigDecimal. Shouldn't the
> > `convertToConnect(Schema, JsonNode) instead use the type of Number value
> as
> > parsed and returned by the JsonDeserializer to determine the proper
> schema
> > type, and then get the proper value type using that schema type's
> > converter?
> >
> > Fourth, I'm not sure I understand the following bullet:
> >
> >    - JsonDeserializer will now default floating point deserialization to
> >    BigDecimal to avoid losing precision. This may impact performance when
> >    deserializing doubles - a JMH microbenchmark on my local MBP, this
> >    estimated about 3x degradation for deserializing JSON floating points.
> > If
> >    the connect schema is not the decimal logical type, the JsonConverter
> > will
> >    convert this BigDecimal value into the corresponding floating point
> java
> >    object.
> >
> > Fifth, I think the migration plan is not quite accurate. Step 3 mentions
> > changing the Connect worker config's key and value converters to use the
> > new setting, and a restart (step 4) is necessary after this step. Perhaps
> > step 3 should be "If the Connect worker uses the JsonConverter for the
> key
> > and/or value converters, optionally set the `decimal.format=NUMERIC` for
> > the key and/or value converter and restart the workers." followed by Step
> > 4: "If desired, update any source connector configs that use the
> > JsonConverter for key and/or value converters to use
> > `decimal.format=NUMERIC`."
> >
> > Finally, should we add a short discussion in the Rejected Alternatives
> > about the option of leaving JsonConverter untouched and creating a
> > different converter implementation?
> >
> > Thanks!
> >
> > Randall
> >
> >
> >
> > On Mon, Aug 26, 2019 at 11:51 AM Almog Gavra <al...@confluent.io> wrote:
> >
> > > Thanks for the feedback Randall! I have updated the KIP with the
> > following
> > > edits:
> > >
> > > * Updated the reference from "producer" to "source" (I had missed that
> > > one!)
> > > * Changed the config from "json.decimal.serialization.format" to
> > > "decimal.format"
> > > * Clarified case sensitivity
> > > * Clarified the proposed changes to note that deserialization is not
> > > affected by the config
> > > * Clarified the changes in JsonConverter to handle deserialization (see
> > my
> > > third bullet below)
> > > * Added a clear migration plan and simplified compatibility
> > >
> > > Here are also some clarifications based on your comments.
> > >
> > > * I think "json" has limited value in the configuration name. If put
> in a
> > > top-level worker config, it clarifies that it only affects connectors
> > using
> > > the JsonConverter. I have opted for your suggestion and dropped it.
> > > * I think "serialization" has limited value in the configuration name.
> If
> > > we ever want to introduce "deserialization" configurations, there will
> be
> > > asymmetry in the configuration names. I have opted for your suggestion
> > and
> > > dropped it.
> > > * The JsonConverter will not "always look for numbers". The converter
> > will
> > > receive from the Jackson Object Mapper either a NumericNode containing
> a
> > > big decimal or a BinaryNode containing a btye[]. Based on the type of
> > this
> > > node, it will convert the value to a BigDecimal appropriately (or any
> > other
> > > Connect java type based on the schema).
> > > * "the ... JsonDeserializer are not affected" is not exactly true, but
> > > semantically correct. See the note in the KIP about defaulting floating
> > > points to BigDecimal to avoid precision loss.
> > > * "The resulting application, however, may need to handle a wider range
> > of
> > > numeric values." Unless I misunderstand what you're saying, I don't
> think
> > > this is correct. The resulting application will still receive exactly
> the
> > > same Connect data object from the JsonConverter as it was before - only
> > the
> > > SerDe layer is affected.
> > >
> > > Cheers,
> > > Almog
> > >
> > > On Sun, Aug 25, 2019 at 4:28 PM Randall Hauch <rh...@gmail.com>
> wrote:
> > >
> > > > Thanks for all the work, Almog.
> > > >
> > > > For the most part, I think this KIP will be a great improvement, and
> > IMO
> > > is
> > > > almost ready to go. However, I do have a few suggestions that affect
> > the
> > > > wording more than the intent.
> > > >
> > > > First, the name of the `json.decimal.serialization.format` property
> is
> > > > pretty long, especially when it is prefixed in the Worker config or
> in
> > a
> > > > connector config as `key.converer.json.decimal.serialization.format`
> or
> > > > `value.converter.json.decimal.serialization.format` . Have you
> > > considered a
> > > > shorter config property name, such as maybe `decimal.format`? Is
> there
> > > any
> > > > benefit to include "json" and "serialization" in the property name?
> > Also,
> > > > we should be clear that the value will not be case sensitive (e.g.,
> > > > "numeric" and "NUMERIC" would be equivalent), to keep in alignment
> with
> > > > other enumeration literals in Connect configurations. The goal should
> > be
> > > > simple
> > > >
> > > > Second, the Motivation section, has the following sentence:
> > > >
> > > > "A new configuration for producers json.decimal.serialization.format
> > will
> > > > be introduced to the JsonConverter configuration to help control
> > whether
> > > > source converters will serialize decimals in numeric or binary
> > formats."
> > > >
> > > >
> > > > I agree with an earlier comment from Konstantine that "producers"
> here
> > is
> > > > distracting and does not mirror the normal definition of "producers"
> > > within
> > > > the Kafka context. I suggest rephrasing this to something like
> > > >
> > > > "Introduce to the JsonConverter a new configuration property named
> > > > json.decimal.serialization.format to control whether source
> converters
> > > will
> > > > serialize decimals in numeric or binary formats."
> > > >
> > > >
> > > > Third, the KIP should be more clear about whether the
> > > > `json.decimal.serialization.format` setting does or does not affect
> > > > deserialization? IIUC, the deserialization logic will always look for
> > > JSON
> > > > numbers, and will always use the Schema to define whether it should
> > > convert
> > > > the value to a different number type. Is that a fair statement?
> > > >
> > > > Fourth, the JsonSerializer and JsonDeserializer are not affected, yet
> > are
> > > > still compatible with the old and new behavior. Because the primary
> > > purpose
> > > > of this new setting is to define how Connect DECIMAL logical type
> > values
> > > > are serialized in JSON documents, the JsonDeserializer will still be
> > able
> > > > to deserialize the JSON document correctly. The resulting
> application,
> > > > however, may need to handle a wider range of numeric values.
> > > >
> > > > Fifth, the Compatibility section seems more complicated than perhaps
> it
> > > > needs to be, maybe because it seems to distinguish between upgrading
> > and
> > > > setting the decimal serialization format. Maybe it would be
> sufficient
> > to
> > > > simply emphasize that all of the sink connectors (or consumer
> > > applications)
> > > > using the JsonConverter with
> > > > the `json.decimal.serialization.format=NUMERIC` setting consuming
> > records
> > > > from a set of topics be upgraded and changed *before* any of the
> source
> > > > connectors (or other producer applications) using the JsonConverter
> to
> > > > serialize records are changed to use
> > > > the `json.decimal.serialization.format=NUMERIC` setting? It may also
> > > > warrant giving more concrete advice on upgrade procedures. For
> example,
> > > how
> > > > does a user upgrade a set of Connect workers to use this new
> property?
> > Do
> > > > they upgrade first and restart to ensure everything runs as-is, and
> > then
> > > > upgrade their source connectors to set
> > > > `json.decimal.serialization.format=NUMERIC`via connector
> configurations
> > > or
> > > > worker configs?
> > > >
> > > > Anyway, great job so far!
> > > >
> > > > Best regards,
> > > >
> > > > Randall
> > > >
> > > > On Thu, Aug 15, 2019 at 12:00 PM Konstantine Karantasis <
> > > > konstantine@confluent.io> wrote:
> > > >
> > > > > Thanks! KIP reads even better for me now.
> > > > > Just voted. +1 non-binding
> > > > >
> > > > > Konstantine
> > > > >
> > > > > On Wed, Aug 14, 2019 at 7:00 PM Almog Gavra <al...@confluent.io>
> > > wrote:
> > > > >
> > > > > > Thanks for the review Konstantine!
> > > > > >
> > > > > > I think the terminology suggestion definitely makes things
> clearer
> > -
> > > I
> > > > > will
> > > > > > update the documentation based on your suggestion (e.g.
> > > s/Consumer/Sink
> > > > > > Converter/g and s/Producer/Source Converter/g).
> > > > > >
> > > > > > Cheers,
> > > > > > Almog
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 8:13 AM Konstantine Karantasis <
> > > > > > konstantine@confluent.io> wrote:
> > > > > >
> > > > > > > Thanks Almog for preparing this KIP!
> > > > > > > I think it will improve usability and troubleshooting with JSON
> > > data
> > > > a
> > > > > > lot.
> > > > > > >
> > > > > > > The finalized plan seems quite concrete now. I also liked that
> > some
> > > > > > > implementation specific implications (such as setting the
> > > > ObjectMapper
> > > > > to
> > > > > > > deserialize floating point as BigDecimal) are highlighted in
> the
> > > KIP.
> > > > > > >
> > > > > > > Still, as I was reading the KIP, the main obstacle I
> encountered
> > > was
> > > > > > around
> > > > > > > terminology. I couldn't get used to reading "producer" and
> > > "consumer"
> > > > > and
> > > > > > > not thinking in terms of Kafka producers and consumers - which
> > are
> > > > not
> > > > > > > relevant to what this KIP proposes. Thus, I'd suggest replacing
> > > > > > > "Producer(s)" with "Source Converter(s)" and "Consumer(s)" with
> > > "Sink
> > > > > > > Converter(s)" (even if "Converter used by Source Connector" or
> > > > > "Converter
> > > > > > > used by Sink Connector" would be even more accurate - maybe
> this
> > > > could
> > > > > be
> > > > > > > an explanation in a footnote). Terminology around converters
> has
> > > been
> > > > > > > tricky in the past and adding producers/consumers in the mix
> > might
> > > > add
> > > > > to
> > > > > > > the confusion.
> > > > > > >
> > > > > > > Another example where I'd apply this different terminology
> would
> > be
> > > > to
> > > > > a
> > > > > > > phrase such as the following:
> > > > > > > "Because of this, users must take care to first ensure that all
> > > > > consumers
> > > > > > > have upgraded to the new code before upgrading producers to
> make
> > > use
> > > > of
> > > > > > the
> > > > > > > NUMERIC serialization format."
> > > > > > > which I'd write
> > > > > > > "Because of this, users must take care to first ensure that all
> > > sink
> > > > > > > connectors have upgraded to the new converter code before
> > upgrading
> > > > > > source
> > > > > > > connectors to make use of the NUMERIC serialization format in
> > > > > > > JsonConverter."
> > > > > > >
> > > > > > > Let me know if you think this suggestion makes the KIP easier
> to
> > > > > follow.
> > > > > > > Otherwise I think it's a solid proposal.
> > > > > > >
> > > > > > > I'm concluding with a couple of nits:
> > > > > > >
> > > > > > > - "Upgraded Producer with BASE64 serialization, Legacy
> Consumer:
> > > this
> > > > > > > scenario is okay as the upgraded ~producer~ consumer will be
> able
> > > to
> > > > > read
> > > > > > > binary as today" (again according to my suggestion above, it
> > could
> > > be
> > > > > as
> > > > > > > the upgraded source converter ...)
> > > > > > >
> > > > > > > - "consumers cannot consumer NUMERIC data. " -> "consumers
> cannot
> > > > read
> > > > > > > NUMERIC data"
> > > > > > >
> > > > > > > Best,
> > > > > > > Konstantine
> > > > > > >
> > > > > > > On Fri, Aug 9, 2019 at 6:37 PM Almog Gavra <almog@confluent.io
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Good catches! Fixed :)
> > > > > > > >
> > > > > > > > On Thu, Aug 8, 2019 at 10:36 PM Arjun Satish <
> > > > arjun.satish@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Cool!
> > > > > > > > >
> > > > > > > > > Couple of nits:
> > > > > > > > >
> > > > > > > > > - In public interfaces, typo:
> > > *json.decimal.serialization.fromat*
> > > > > > > > > - In public interfaces, you use the term "HEX" instead of
> > > > "BASE64".
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Aug 7, 2019 at 9:51 AM Almog Gavra <
> > almog@confluent.io
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > EDIT: everywhere I've been using "HEX" I meant to be
> using
> > > > > > "BASE64".
> > > > > > > I
> > > > > > > > > will
> > > > > > > > > > update the KIP to reflect this.
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 7, 2019 at 9:44 AM Almog Gavra <
> > > almog@confluent.io
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks for the feedback Arjun! I'm happy changing the
> > > default
> > > > > > > config
> > > > > > > > to
> > > > > > > > > > > HEX instead of BINARY, no strong feelings there.
> > > > > > > > > > >
> > > > > > > > > > > I'll also clarify the example in the KIP to be clearer:
> > > > > > > > > > >
> > > > > > > > > > > - serialize the decimal field "foo" with value
> "10.2345"
> > > with
> > > > > the
> > > > > > > HEX
> > > > > > > > > > > setting: {"foo": "D3J5"}
> > > > > > > > > > > - serialize the decimal field "foo" with value
> "10.2345"
> > > with
> > > > > the
> > > > > > > > > NUMERIC
> > > > > > > > > > > setting: {"foo": 10.2345}
> > > > > > > > > > >
> > > > > > > > > > > With regards to the precision issue, that was my
> original
> > > > > concern
> > > > > > > as
> > > > > > > > > well
> > > > > > > > > > > (and why I originally suggested a TEXT format). Many
> JSON
> > > > > > > > deserializers
> > > > > > > > > > > (e.g. Jackson with
> > > > > > > > DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS),
> > > > > > > > > > > however, have the ability to deserialize decimals
> > correctly
> > > > so
> > > > > I
> > > > > > > will
> > > > > > > > > > > configure that as the default for Connect's
> > > JsonDeserializer.
> > > > > > It's
> > > > > > > > > > probably
> > > > > > > > > > > a good idea to call out that using other deserializers
> > must
> > > > be
> > > > > > done
> > > > > > > > > with
> > > > > > > > > > > care - I will add that documentation to the
> serialization
> > > > > config.
> > > > > > > > > > >
> > > > > > > > > > > Note that there would not be an issue on the
> > > _serialization_
> > > > > side
> > > > > > > of
> > > > > > > > > > > things as Jackson respects BigDecimal.
> > > > > > > > > > >
> > > > > > > > > > > Almog
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish <
> > > > > > > arjun.satish@gmail.com
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> hey Almog, nice work! couple of thoughts (hope I'm not
> > > late
> > > > > > since
> > > > > > > > you
> > > > > > > > > > >> started the voting thread already):
> > > > > > > > > > >>
> > > > > > > > > > >> can you please add some examples to show the changes
> > that
> > > > you
> > > > > > are
> > > > > > > > > > >> proposing. makes me think that for a given decimal
> > number,
> > > > we
> > > > > > will
> > > > > > > > > have
> > > > > > > > > > >> two
> > > > > > > > > > >> encodings: “asHex” and “asNumber”.
> > > > > > > > > > >>
> > > > > > > > > > >> should we call the default config value “HEX” instead
> of
> > > > > > “BINARY”?
> > > > > > > > > > >>
> > > > > > > > > > >> Should we call out the fact that JS systems might be
> > > > > susceptible
> > > > > > > to
> > > > > > > > > > double
> > > > > > > > > > >> precision round offs with the new numeric format? here
> > are
> > > > > some
> > > > > > > > people
> > > > > > > > > > >> discussing a similar problem
> > > > > > > > > > >> https://github.com/EventStore/EventStore/issues/1541
> > > > > > > > > > >>
> > > > > > > > > > >> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <
> > > > > almog@confluent.io>
> > > > > > > > > wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > Hello Everyone,
> > > > > > > > > > >> >
> > > > > > > > > > >> > Summarizing an in-person discussion with Randall
> (this
> > > is
> > > > > > copied
> > > > > > > > > from
> > > > > > > > > > >> the
> > > > > > > > > > >> > KIP):
> > > > > > > > > > >> >
> > > > > > > > > > >> > The original KIP suggested supporting an additional
> > > > > > > > representation -
> > > > > > > > > > >> base10
> > > > > > > > > > >> > encoded text (e.g. `{"asText":"10.2345"}`). This
> > causes
> > > > > issues
> > > > > > > > > because
> > > > > > > > > > >> it
> > > > > > > > > > >> > is impossible to disambiguate between TEXT and
> BINARY
> > > > > without
> > > > > > an
> > > > > > > > > > >> additional
> > > > > > > > > > >> > config - furthermore, this makes the migration from
> > one
> > > to
> > > > > the
> > > > > > > > other
> > > > > > > > > > >> nearly
> > > > > > > > > > >> > impossible because it would require that all
> consumers
> > > > stop
> > > > > > > > > consuming
> > > > > > > > > > >> and
> > > > > > > > > > >> > producers stop producing and atomically updating the
> > > > config
> > > > > on
> > > > > > > all
> > > > > > > > > of
> > > > > > > > > > >> them
> > > > > > > > > > >> > after deploying the new code, or waiting for the
> full
> > > > > > retention
> > > > > > > > > period
> > > > > > > > > > >> to
> > > > > > > > > > >> > pass - neither option is viable. The suggestion in
> the
> > > KIP
> > > > > is
> > > > > > > > > strictly
> > > > > > > > > > >> an
> > > > > > > > > > >> > improvement over the existing behavior, even if it
> > > doesn't
> > > > > > > support
> > > > > > > > > all
> > > > > > > > > > >> > combinations.
> > > > > > > > > > >> >
> > > > > > > > > > >> > It seems that since most real-world use cases
> actually
> > > use
> > > > > the
> > > > > > > > > numeric
> > > > > > > > > > >> > representation (not string) we can consider this an
> > > > > > improvement.
> > > > > > > > > With
> > > > > > > > > > >> the
> > > > > > > > > > >> > new suggestion, we don't need a deserialization
> > > > > configuration
> > > > > > > > (only
> > > > > > > > > a
> > > > > > > > > > >> > serialization option) and the consumers will be able
> > to
> > > > > always
> > > > > > > > > > >> > automatically determine the serialization format.
> > > > > > > > > > >> >
> > > > > > > > > > >> > Based on this, I'll be opening up the simplified
> > version
> > > > of
> > > > > > the
> > > > > > > > KIP
> > > > > > > > > > to a
> > > > > > > > > > >> > vote.
> > > > > > > > > > >> >
> > > > > > > > > > >> > Almog
> > > > > > > > > > >> >
> > > > > > > > > > >> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <
> > > > > > almog@confluent.io
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> > > I'm mostly happy with your current suggestion (two
> > > > > configs,
> > > > > > > one
> > > > > > > > > for
> > > > > > > > > > >> > > serialization and one for deserialization) and
> your
> > > > > > > > implementation
> > > > > > > > > > >> > > suggestion. One thing to note:
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > We should _always_ be able to deserialize a
> > standard
> > > > > JSON
> > > > > > > > > > >> > > > number as a decimal
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > I was doing some research into decimals and JSON,
> > and
> > > I
> > > > > can
> > > > > > > > > imagine
> > > > > > > > > > a
> > > > > > > > > > >> > > compelling reason to require string
> representations
> > to
> > > > > avoid
> > > > > > > > > losing
> > > > > > > > > > >> > > precision and to be certain that whomever is
> sending
> > > the
> > > > > > data
> > > > > > > > > isn't
> > > > > > > > > > >> > losing
> > > > > > > > > > >> > > precision (e.g.
> > > > > > https://stackoverflow.com/a/38357877/2258040
> > > > > > > ).
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > I'm okay with always allowing numerics, but
> thought
> > > it's
> > > > > > worth
> > > > > > > > > > raising
> > > > > > > > > > >> > the
> > > > > > > > > > >> > > thought.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <
> > > > > > > andy@confluent.io>
> > > > > > > > > > >> wrote:
> > > > > > > > > > >> > >
> > > > > > > > > > >> > >> The way I see it, we need to control two seperate
> > > > things:
> > > > > > > > > > >> > >>
> > > > > > > > > > >> > >> 1. How do we _deserialize_ a decimal type if we
> > > > > encounter a
> > > > > > > > text
> > > > > > > > > > >> node in
> > > > > > > > > > >> > >> the JSON?    (We should _always_ be able to
> > > > deserialize a
> > > > > > > > > standard
> > > > > > > > > > >> JSON
> > > > > > > > > > >> > >> number as a decimal).
> > > > > > > > > > >> > >> 2. How do we chose how we want decimals to be
> > > > > _serialized_.
> > > > > > > > > > >> > >>
> > > > > > > > > > >> > >> This looks to fits well with your second
> suggestion
> > > of
> > > > > > > slightly
> > > > > > > > > > >> > different
> > > > > > > > > > >> > >> configs names for serialization vs
> deserialization.
> > > > > > > > > > >> > >> a, For deserialization we only care about how to
> > > handle
> > > > > > text
> > > > > > > > > > nodes: `
> > > > > > > > > > >> > >> deserialization.decimal.*text*.format`, which
> > should
> > > > only
> > > > > > > have
> > > > > > > > > two
> > > > > > > > > > >> valid
> > > > > > > > > > >> > >> values BINARY | TEXT.
> > > > > > > > > > >> > >> b. For serialization we need all three:
> > > > > > > > > > >> `serialization.decimal.format`,
> > > > > > > > > > >> > >> which should support all three options: BINARY |
> > > TEXT |
> > > > > > > > NUMERIC.
> > > > > > > > > > >> > >>
> > > > > > > > > > >> > >> Implementation wise, I think these should be two
> > > > separate
> > > > > > > > enums,
> > > > > > > > > > >> rather
> > > > > > > > > > >> > >> than one shared enum and throwing an error if the
> > > > > > > deserializer
> > > > > > > > is
> > > > > > > > > > >> set to
> > > > > > > > > > >> > >> NUMERIC.  Mainly as this means the enums reflect
> > the
> > > > > > options
> > > > > > > > > > >> available,
> > > > > > > > > > >> > >> rather than this being hidden in config checking
> > > code.
> > > > > But
> > > > > > > > > that's
> > > > > > > > > > a
> > > > > > > > > > >> > minor
> > > > > > > > > > >> > >> implementation detail.
> > > > > > > > > > >> > >>
> > > > > > > > > > >> > >> Personally, I'd be tempted to have the BINARY
> value
> > > > named
> > > > > > > > > something
> > > > > > > > > > >> like
> > > > > > > > > > >> > >> `LEGACY` or `LEGACY_BINARY` as a way of
> encouraging
> > > > users
> > > > > > to
> > > > > > > > move
> > > > > > > > > > >> away
> > > > > > > > > > >> > >> from
> > > > > > > > > > >> > >> it.
> > > > > > > > > > >> > >>
> > > > > > > > > > >> > >> It's a real shame that both of these settings
> > > require a
> > > > > > > default
> > > > > > > > > of
> > > > > > > > > > >> > BINARY
> > > > > > > > > > >> > >> for backwards compatibility, but I agree that
> > > > > discussions /
> > > > > > > > plans
> > > > > > > > > > >> around
> > > > > > > > > > >> > >> switching the defaults should not block this KIP.
> > > > > > > > > > >> > >>
> > > > > > > > > > >> > >> Andy
> > > > > > > > > > >> > >>
> > > > > > > > > > >> > >>
> > > > > > > > > > >> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <
> > > > > > > almog@confluent.io>
> > > > > > > > > > >> wrote:
> > > > > > > > > > >> > >>
> > > > > > > > > > >> > >> > Thanks for the replies Andy and Andrew (2x
> > Andy?)!
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >> > > Is the text decimal a base16 encoded number,
> or
> > > is
> > > > it
> > > > > > > > base16
> > > > > > > > > > >> encoded
> > > > > > > > > > >> > >> > binary
> > > > > > > > > > >> > >> > > form of the number?
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >> > The conversion happens as
> > > > > > > > decimal.unscaledValue().toByteArray()
> > > > > > > > > > and
> > > > > > > > > > >> > then
> > > > > > > > > > >> > >> > the byte array is converted to a hex string, so
> > > it's
> > > > > > > > definitely
> > > > > > > > > > the
> > > > > > > > > > >> > >> binary
> > > > > > > > > > >> > >> > form of the number converted to base16. Whether
> > or
> > > > not
> > > > > > > that's
> > > > > > > > > the
> > > > > > > > > > >> same
> > > > > > > > > > >> > >> as
> > > > > > > > > > >> > >> > the base16 encoded number is a good question
> > > > > (toByteArray
> > > > > > > > > > returns a
> > > > > > > > > > >> > byte
> > > > > > > > > > >> > >> > array containing a signed, big-endian, two's
> > > > complement
> > > > > > > > > > >> representation
> > > > > > > > > > >> > >> of
> > > > > > > > > > >> > >> > the big integer).
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >> > > One suggestion I have is to change the
> proposed
> > > new
> > > > > > > config
> > > > > > > > to
> > > > > > > > > > >> only
> > > > > > > > > > >> > >> affect
> > > > > > > > > > >> > >> > > decimals stored as text, i.e. to switch
> between
> > > the
> > > > > > > current
> > > > > > > > > > >> base16
> > > > > > > > > > >> > and
> > > > > > > > > > >> > >> > the
> > > > > > > > > > >> > >> > > more common base10.   Then add another config
> > to
> > > > the
> > > > > > > > > serializer
> > > > > > > > > > >> only
> > > > > > > > > > >> > >> that
> > > > > > > > > > >> > >> > > controls if decimals should be serialized as
> > text
> > > > or
> > > > > > > > numeric.
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >> > I think we need to be able to handle all
> mappings
> > > > from
> > > > > > > > > > >> serialization
> > > > > > > > > > >> > >> format
> > > > > > > > > > >> > >> > to deserialization format (e.g. read in BINARY
> > and
> > > > > output
> > > > > > > > > TEXT),
> > > > > > > > > > >> > which I
> > > > > > > > > > >> > >> > think would be impossible with the alternative
> > > > > > suggestion.
> > > > > > > I
> > > > > > > > > > agree
> > > > > > > > > > >> > that
> > > > > > > > > > >> > >> > automatically deserializing numerics is
> > valuable. I
> > > > see
> > > > > > two
> > > > > > > > > other
> > > > > > > > > > >> ways
> > > > > > > > > > >> > >> to
> > > > > > > > > > >> > >> > get this, both keeping the serialization.format
> > > > config
> > > > > > the
> > > > > > > > > same:
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >> > - have json.decimal.deserialization.format
> accept
> > > all
> > > > > > three
> > > > > > > > > > >> formats.
> > > > > > > > > > >> > if
> > > > > > > > > > >> > >> set
> > > > > > > > > > >> > >> > to BINARY/TEXT, numerics would be automatically
> > > > > > supported.
> > > > > > > If
> > > > > > > > > set
> > > > > > > > > > >> to
> > > > > > > > > > >> > >> > NUMERIC, then any string coming in would result
> > in
> > > > > > > > > > deserialization
> > > > > > > > > > >> > error
> > > > > > > > > > >> > >> > (defaults to BINARY for backwards
> compatibility)
> > > > > > > > > > >> > >> > - change json.decimal.deserialization.format to
> > > > > > > > > > >> > >> > json.decimal.deserialization.string.format
> which
> > > > > accepts
> > > > > > > only
> > > > > > > > > > >> > >> BINARY/TEXT
> > > > > > > > > > >> > >> > (defaults to BINARY for backwards
> compatibility)
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >> > > would be a breaking change in that things
> that
> > > > > > previously
> > > > > > > > > > failed
> > > > > > > > > > >> > would
> > > > > > > > > > >> > >> > > suddenly start deserializing.  This is a
> price
> > > I'm
> > > > > > > willing
> > > > > > > > to
> > > > > > > > > > >> pay.
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >> > I agree. I'm willing to pay this price too.
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >> > > IMHO, we should then plan to switch the
> default
> > > of
> > > > > > > decimal
> > > > > > > > > > >> > >> serialization
> > > > > > > > > > >> > >> > to
> > > > > > > > > > >> > >> > > numeric, and text serialization to base 10 in
> > the
> > > > > next
> > > > > > > > major
> > > > > > > > > > >> > release.
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >> > I think that can be a separate discussion, I
> > don't
> > > > want
> > > > > > to
> > > > > > > > > block
> > > > > > > > > > >> this
> > > > > > > > > > >> > >> KIP
> > > > > > > > > > >> > >> > on it.
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >> > Thoughts?
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <
> > > > > > > > > otto@wikimedia.org>
> > > > > > > > > > >> > wrote:
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >> > > This is a bit orthogonal, but in
> > > > JsonSchemaConverter
> > > > > I
> > > > > > > use
> > > > > > > > > > >> > >> JSONSchemas to
> > > > > > > > > > >> > >> > > indicate whether a JSON number should be
> > > > deserialized
> > > > > > as
> > > > > > > an
> > > > > > > > > > >> integer
> > > > > > > > > > >> > >> or a
> > > > > > > > > > >> > >> > > decimal
> > > > > > > > > > >> > >> > > <
> > > > > > > > > > >> > >> > >
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >>
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
> > > > > > > > > > >> > >> > > >.
> > > > > > > > > > >> > >> > > Not everyone is going to have JSONSchemas
> > > available
> > > > > > when
> > > > > > > > > > >> converting,
> > > > > > > > > > >> > >> but
> > > > > > > > > > >> > >> > if
> > > > > > > > > > >> > >> > > you do, it is an easy way to support JSON
> > numbers
> > > > as
> > > > > > > > > decimals.
> > > > > > > > > > >> > >> > >
> > > > > > > > > > >> > >> > > Carry on! :)
> > > > > > > > > > >> > >> > >
> > > > > > > > > > >> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <
> > > > > > > > > andy@confluent.io
> > > > > > > > > > >
> > > > > > > > > > >> > >> wrote:
> > > > > > > > > > >> > >> > >
> > > > > > > > > > >> > >> > > > Hi Almog,
> > > > > > > > > > >> > >> > > >
> > > > > > > > > > >> > >> > > > Like the KIP - I think being able to
> support
> > > > > decimals
> > > > > > > in
> > > > > > > > > JSON
> > > > > > > > > > >> in
> > > > > > > > > > >> > the
> > > > > > > > > > >> > >> > same
> > > > > > > > > > >> > >> > > > way most other systems do is a great
> > > improvement.
> > > > > > > > > > >> > >> > > >
> > > > > > > > > > >> > >> > > > It's not 100% clear to me from the KIP what
> > the
> > > > > > current
> > > > > > > > > > format
> > > > > > > > > > >> is.
> > > > > > > > > > >> > >> Is
> > > > > > > > > > >> > >> > > the
> > > > > > > > > > >> > >> > > > text decimal a base16 encoded number, or is
> > it
> > > > > base16
> > > > > > > > > encoded
> > > > > > > > > > >> > binary
> > > > > > > > > > >> > >> > form
> > > > > > > > > > >> > >> > > > of the number? (I've not tried to get my
> head
> > > > > around
> > > > > > if
> > > > > > > > > these
> > > > > > > > > > >> two
> > > > > > > > > > >> > >> are
> > > > > > > > > > >> > >> > > even
> > > > > > > > > > >> > >> > > > different!)
> > > > > > > > > > >> > >> > > >
> > > > > > > > > > >> > >> > > > One suggestion I have is to change the
> > proposed
> > > > new
> > > > > > > > config
> > > > > > > > > to
> > > > > > > > > > >> only
> > > > > > > > > > >> > >> > affect
> > > > > > > > > > >> > >> > > > decimals stored as text, i.e. to switch
> > between
> > > > the
> > > > > > > > current
> > > > > > > > > > >> base16
> > > > > > > > > > >> > >> and
> > > > > > > > > > >> > >> > > the
> > > > > > > > > > >> > >> > > > more common base10.   Then add another
> config
> > > to
> > > > > the
> > > > > > > > > > serialzier
> > > > > > > > > > >> > only
> > > > > > > > > > >> > >> > that
> > > > > > > > > > >> > >> > > > controls if decimals should be serialized
> as
> > > text
> > > > > or
> > > > > > > > > numeric.
> > > > > > > > > > >> The
> > > > > > > > > > >> > >> > > benefit
> > > > > > > > > > >> > >> > > > of this approach is it allows us to enhance
> > the
> > > > > > > > > deserializer
> > > > > > > > > > to
> > > > > > > > > > >> > >> > > > automatically handle numeric decimals even
> > > > without
> > > > > > any
> > > > > > > > > config
> > > > > > > > > > >> > >> having to
> > > > > > > > > > >> > >> > > be
> > > > > > > > > > >> > >> > > > set, i.e. default config in the
> deserializer
> > > > would
> > > > > be
> > > > > > > > able
> > > > > > > > > to
> > > > > > > > > > >> > handle
> > > > > > > > > > >> > >> > > > numeric decimals.  Of course, this is a two
> > > edged
> > > > > > > sword:
> > > > > > > > > this
> > > > > > > > > > >> > would
> > > > > > > > > > >> > >> > make
> > > > > > > > > > >> > >> > > > the deserializer work out of the box with
> > > numeric
> > > > > > > > decimals,
> > > > > > > > > > >> > (yay!),
> > > > > > > > > > >> > >> but
> > > > > > > > > > >> > >> > > > would be a breaking change in that things
> > that
> > > > > > > previously
> > > > > > > > > > >> failed
> > > > > > > > > > >> > >> would
> > > > > > > > > > >> > >> > > > suddenly start deserializing.  This is a
> > price
> > > > I'm
> > > > > > > > willing
> > > > > > > > > to
> > > > > > > > > > >> pay.
> > > > > > > > > > >> > >> > > >
> > > > > > > > > > >> > >> > > > IMHO, we should then plan to switch the
> > default
> > > > of
> > > > > > > > decimal
> > > > > > > > > > >> > >> > serialization
> > > > > > > > > > >> > >> > > to
> > > > > > > > > > >> > >> > > > numeric, and text serialization to base 10
> in
> > > the
> > > > > > next
> > > > > > > > > major
> > > > > > > > > > >> > >> release.
> > > > > > > > > > >> > >> > > > (With upgrade notes to match). Though I
> know
> > > this
> > > > > is
> > > > > > > more
> > > > > > > > > > >> > >> contentious,
> > > > > > > > > > >> > >> > I
> > > > > > > > > > >> > >> > > > think it moves us forward in a much more
> > > standard
> > > > > way
> > > > > > > > that
> > > > > > > > > > the
> > > > > > > > > > >> > >> current
> > > > > > > > > > >> > >> > > > encoding of decimals.
> > > > > > > > > > >> > >> > > >
> > > > > > > > > > >> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <
> > > > > > > > > > almog@confluent.io>
> > > > > > > > > > >> > >> wrote:
> > > > > > > > > > >> > >> > > >
> > > > > > > > > > >> > >> > > > > Hi Everyone!
> > > > > > > > > > >> > >> > > > >
> > > > > > > > > > >> > >> > > > > Kicking off discussion for a new KIP:
> > > > > > > > > > >> > >> > > > >
> > > > > > > > > > >> > >> > > > >
> > > > > > > > > > >> > >> > > >
> > > > > > > > > > >> > >> > >
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >>
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
> > > > > > > > > > >> > >> > > > >
> > > > > > > > > > >> > >> > > > > For those who are interested, I have a
> > > > prototype
> > > > > > > > > > >> implementation
> > > > > > > > > > >> > >> that
> > > > > > > > > > >> > >> > > > helped
> > > > > > > > > > >> > >> > > > > guide my design:
> > > > > > > > https://github.com/agavra/kafka/pull/1
> > > > > > > > > > >> > >> > > > >
> > > > > > > > > > >> > >> > > > > Cheers,
> > > > > > > > > > >> > >> > > > > Almog
> > > > > > > > > > >> > >> > > > >
> > > > > > > > > > >> > >> > > >
> > > > > > > > > > >> > >> > >
> > > > > > > > > > >> > >> >
> > > > > > > > > > >> > >>
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Posted by Almog Gavra <al...@confluent.io>.
Thanks again Randall! Here are the changes I made:

- Defaults. The KIP had mentioned that the default would be BASE64 in the
"Public Interfaces" section. I have also added your suggestion in "Proposed
Changes".
- I've added a bullet point to change the internal converter to use
decimal.format=NUMERIC in Proposed Changes. If I see that this is not
possible or cumbersome during implementation I will amend the KIP to keep
this change minimally scoped.
- Adopted your third suggestion
- Adopted your migration plan proposal
- Added rejected alternative for creating a new converter

> calling `NumericNode.decimalValue()` will always return a
java.math.BigDecimal even
> if the underlying Number was not a BigDecimal

I'm not sure I understand your comment here. NumericNode#decimalValue will
always return a BigDecimal, regardless of what the underlying JSON data is
- that hasn't changed with this KIP. The decimalValue() will only be called
when converting to a DECIMAL logical type.

What has changed, is that the NumericNode will store a BigDecimal instead
of a double whenever the JSON value is a number with a decimal point in it
(this should answer your fourth point, about deserializing a BigDecimal for
all floating points).

> Shouldn't the `convertToConnect(Schema, JsonNode) instead use the type of
Number value as
> parsed and returned by the JsonDeserializer to determine the proper schema
> type

There is no way to infer the "proper schema type" in the JsonDeserializer.
If my JSON value is {"foo": 1.234} I have no idea whether "foo" is a
decimal or a double - that's the reason we need a configuration value in
first place. This means that in order to avoid precision loss, we must
deserialize any floating point number first as a BigDecimal.

> and then get the proper value type using that schema type's converter?

That's exactly the proposal. I think this will be clear in the code.

Almog

On Fri, Aug 30, 2019 at 8:00 AM Randall Hauch <rh...@gmail.com> wrote:

> Thanks for the updates, Almog. This looks really good, but I have a few
> more comments (most wording, though one potentially thorny issue):
>
> First, the KIP should define the default value for the `decimal.format`
> property. IIUC, it will be BASE64, and to gain the new behavior users will
> have to explicitly set this to NUMERIC. In fact, I'd recommend changing
> this bullet item from:
>
>    - JsonConverter will accept the decimal.format configuration to
>    determine the serialization format. If the value is BASE64, the behavior
>    remains unchanged (i.e. it serializes it as a base64 text). If the
> value is
>    NUMERIC, the JSON node will be a number representing that decimal (e.g.
>    10.2345 instead of "D3J5").
>
> to something like:
>
>    - Define a new `decimal.format` configuration property on JsonConverter
>    to specify the serialization format for Connect DECIMAL logical type
>    values, with two allowed literal values for the configuration property:
>    - `BASE64` specifies the existing behavior of serializing DECIMAL
>       logical types as base64 encoded binary data (e.g., "D3J5" in the
> example
>       above), and will be the default; and
>       - `NUMERIC` will serialize Connect DECIMAL logical type values in
>       JSON as a number representing that decimal
>
> Second, since the default will be the existing BASE64 representation, what
> do you think about changing the JsonConverter instances used by the Connect
> worker for internal topics to enable `decimal.format=NUMERIC`? I don't
> think we actually use decimals in the internal messages, but if we do at
> some point then the converters will store them with the improved natural
> representation.
>
> Third, the following bullet item could be more clear:
>
>    - JsonConverter will automatically handle deserialization of either
>    serialization format given a Decimal logical type schema, i.e. it will
>    accept both a deserialized BinaryNode and NumericNode. If the value is a
>    BinaryNode, it will construct a java BigDecimal from the binaryValue()
>    (which is a btye[]). If the value is a NumericNode, it will simply pass
>    through the decimalValue() deserialized by the JsonDeserializer.
>
> such as maybe:
>
>    - The JsonConverter deserialization method currently expects only a
>    BinaryNode, but will be changed to also handle NumericNode by calling
>    NumericNode.decimalValue().
>
> This brings up an interesting potential issue: if `schemas.enable=false`,
> then there will be no schema in the record, and calling
> `NumericNode.decimalValue()` will always return a java.math.BigDecimal even
> if the underlying Number was not a BigDecimal. Shouldn't the
> `convertToConnect(Schema, JsonNode) instead use the type of Number value as
> parsed and returned by the JsonDeserializer to determine the proper schema
> type, and then get the proper value type using that schema type's
> converter?
>
> Fourth, I'm not sure I understand the following bullet:
>
>    - JsonDeserializer will now default floating point deserialization to
>    BigDecimal to avoid losing precision. This may impact performance when
>    deserializing doubles - a JMH microbenchmark on my local MBP, this
>    estimated about 3x degradation for deserializing JSON floating points.
> If
>    the connect schema is not the decimal logical type, the JsonConverter
> will
>    convert this BigDecimal value into the corresponding floating point java
>    object.
>
> Fifth, I think the migration plan is not quite accurate. Step 3 mentions
> changing the Connect worker config's key and value converters to use the
> new setting, and a restart (step 4) is necessary after this step. Perhaps
> step 3 should be "If the Connect worker uses the JsonConverter for the key
> and/or value converters, optionally set the `decimal.format=NUMERIC` for
> the key and/or value converter and restart the workers." followed by Step
> 4: "If desired, update any source connector configs that use the
> JsonConverter for key and/or value converters to use
> `decimal.format=NUMERIC`."
>
> Finally, should we add a short discussion in the Rejected Alternatives
> about the option of leaving JsonConverter untouched and creating a
> different converter implementation?
>
> Thanks!
>
> Randall
>
>
>
> On Mon, Aug 26, 2019 at 11:51 AM Almog Gavra <al...@confluent.io> wrote:
>
> > Thanks for the feedback Randall! I have updated the KIP with the
> following
> > edits:
> >
> > * Updated the reference from "producer" to "source" (I had missed that
> > one!)
> > * Changed the config from "json.decimal.serialization.format" to
> > "decimal.format"
> > * Clarified case sensitivity
> > * Clarified the proposed changes to note that deserialization is not
> > affected by the config
> > * Clarified the changes in JsonConverter to handle deserialization (see
> my
> > third bullet below)
> > * Added a clear migration plan and simplified compatibility
> >
> > Here are also some clarifications based on your comments.
> >
> > * I think "json" has limited value in the configuration name. If put in a
> > top-level worker config, it clarifies that it only affects connectors
> using
> > the JsonConverter. I have opted for your suggestion and dropped it.
> > * I think "serialization" has limited value in the configuration name. If
> > we ever want to introduce "deserialization" configurations, there will be
> > asymmetry in the configuration names. I have opted for your suggestion
> and
> > dropped it.
> > * The JsonConverter will not "always look for numbers". The converter
> will
> > receive from the Jackson Object Mapper either a NumericNode containing a
> > big decimal or a BinaryNode containing a btye[]. Based on the type of
> this
> > node, it will convert the value to a BigDecimal appropriately (or any
> other
> > Connect java type based on the schema).
> > * "the ... JsonDeserializer are not affected" is not exactly true, but
> > semantically correct. See the note in the KIP about defaulting floating
> > points to BigDecimal to avoid precision loss.
> > * "The resulting application, however, may need to handle a wider range
> of
> > numeric values." Unless I misunderstand what you're saying, I don't think
> > this is correct. The resulting application will still receive exactly the
> > same Connect data object from the JsonConverter as it was before - only
> the
> > SerDe layer is affected.
> >
> > Cheers,
> > Almog
> >
> > On Sun, Aug 25, 2019 at 4:28 PM Randall Hauch <rh...@gmail.com> wrote:
> >
> > > Thanks for all the work, Almog.
> > >
> > > For the most part, I think this KIP will be a great improvement, and
> IMO
> > is
> > > almost ready to go. However, I do have a few suggestions that affect
> the
> > > wording more than the intent.
> > >
> > > First, the name of the `json.decimal.serialization.format` property is
> > > pretty long, especially when it is prefixed in the Worker config or in
> a
> > > connector config as `key.converer.json.decimal.serialization.format` or
> > > `value.converter.json.decimal.serialization.format` . Have you
> > considered a
> > > shorter config property name, such as maybe `decimal.format`? Is there
> > any
> > > benefit to include "json" and "serialization" in the property name?
> Also,
> > > we should be clear that the value will not be case sensitive (e.g.,
> > > "numeric" and "NUMERIC" would be equivalent), to keep in alignment with
> > > other enumeration literals in Connect configurations. The goal should
> be
> > > simple
> > >
> > > Second, the Motivation section, has the following sentence:
> > >
> > > "A new configuration for producers json.decimal.serialization.format
> will
> > > be introduced to the JsonConverter configuration to help control
> whether
> > > source converters will serialize decimals in numeric or binary
> formats."
> > >
> > >
> > > I agree with an earlier comment from Konstantine that "producers" here
> is
> > > distracting and does not mirror the normal definition of "producers"
> > within
> > > the Kafka context. I suggest rephrasing this to something like
> > >
> > > "Introduce to the JsonConverter a new configuration property named
> > > json.decimal.serialization.format to control whether source converters
> > will
> > > serialize decimals in numeric or binary formats."
> > >
> > >
> > > Third, the KIP should be more clear about whether the
> > > `json.decimal.serialization.format` setting does or does not affect
> > > deserialization? IIUC, the deserialization logic will always look for
> > JSON
> > > numbers, and will always use the Schema to define whether it should
> > convert
> > > the value to a different number type. Is that a fair statement?
> > >
> > > Fourth, the JsonSerializer and JsonDeserializer are not affected, yet
> are
> > > still compatible with the old and new behavior. Because the primary
> > purpose
> > > of this new setting is to define how Connect DECIMAL logical type
> values
> > > are serialized in JSON documents, the JsonDeserializer will still be
> able
> > > to deserialize the JSON document correctly. The resulting application,
> > > however, may need to handle a wider range of numeric values.
> > >
> > > Fifth, the Compatibility section seems more complicated than perhaps it
> > > needs to be, maybe because it seems to distinguish between upgrading
> and
> > > setting the decimal serialization format. Maybe it would be sufficient
> to
> > > simply emphasize that all of the sink connectors (or consumer
> > applications)
> > > using the JsonConverter with
> > > the `json.decimal.serialization.format=NUMERIC` setting consuming
> records
> > > from a set of topics be upgraded and changed *before* any of the source
> > > connectors (or other producer applications) using the JsonConverter to
> > > serialize records are changed to use
> > > the `json.decimal.serialization.format=NUMERIC` setting? It may also
> > > warrant giving more concrete advice on upgrade procedures. For example,
> > how
> > > does a user upgrade a set of Connect workers to use this new property?
> Do
> > > they upgrade first and restart to ensure everything runs as-is, and
> then
> > > upgrade their source connectors to set
> > > `json.decimal.serialization.format=NUMERIC`via connector configurations
> > or
> > > worker configs?
> > >
> > > Anyway, great job so far!
> > >
> > > Best regards,
> > >
> > > Randall
> > >
> > > On Thu, Aug 15, 2019 at 12:00 PM Konstantine Karantasis <
> > > konstantine@confluent.io> wrote:
> > >
> > > > Thanks! KIP reads even better for me now.
> > > > Just voted. +1 non-binding
> > > >
> > > > Konstantine
> > > >
> > > > On Wed, Aug 14, 2019 at 7:00 PM Almog Gavra <al...@confluent.io>
> > wrote:
> > > >
> > > > > Thanks for the review Konstantine!
> > > > >
> > > > > I think the terminology suggestion definitely makes things clearer
> -
> > I
> > > > will
> > > > > update the documentation based on your suggestion (e.g.
> > s/Consumer/Sink
> > > > > Converter/g and s/Producer/Source Converter/g).
> > > > >
> > > > > Cheers,
> > > > > Almog
> > > > >
> > > > > On Wed, Aug 14, 2019 at 8:13 AM Konstantine Karantasis <
> > > > > konstantine@confluent.io> wrote:
> > > > >
> > > > > > Thanks Almog for preparing this KIP!
> > > > > > I think it will improve usability and troubleshooting with JSON
> > data
> > > a
> > > > > lot.
> > > > > >
> > > > > > The finalized plan seems quite concrete now. I also liked that
> some
> > > > > > implementation specific implications (such as setting the
> > > ObjectMapper
> > > > to
> > > > > > deserialize floating point as BigDecimal) are highlighted in the
> > KIP.
> > > > > >
> > > > > > Still, as I was reading the KIP, the main obstacle I encountered
> > was
> > > > > around
> > > > > > terminology. I couldn't get used to reading "producer" and
> > "consumer"
> > > > and
> > > > > > not thinking in terms of Kafka producers and consumers - which
> are
> > > not
> > > > > > relevant to what this KIP proposes. Thus, I'd suggest replacing
> > > > > > "Producer(s)" with "Source Converter(s)" and "Consumer(s)" with
> > "Sink
> > > > > > Converter(s)" (even if "Converter used by Source Connector" or
> > > > "Converter
> > > > > > used by Sink Connector" would be even more accurate - maybe this
> > > could
> > > > be
> > > > > > an explanation in a footnote). Terminology around converters has
> > been
> > > > > > tricky in the past and adding producers/consumers in the mix
> might
> > > add
> > > > to
> > > > > > the confusion.
> > > > > >
> > > > > > Another example where I'd apply this different terminology would
> be
> > > to
> > > > a
> > > > > > phrase such as the following:
> > > > > > "Because of this, users must take care to first ensure that all
> > > > consumers
> > > > > > have upgraded to the new code before upgrading producers to make
> > use
> > > of
> > > > > the
> > > > > > NUMERIC serialization format."
> > > > > > which I'd write
> > > > > > "Because of this, users must take care to first ensure that all
> > sink
> > > > > > connectors have upgraded to the new converter code before
> upgrading
> > > > > source
> > > > > > connectors to make use of the NUMERIC serialization format in
> > > > > > JsonConverter."
> > > > > >
> > > > > > Let me know if you think this suggestion makes the KIP easier to
> > > > follow.
> > > > > > Otherwise I think it's a solid proposal.
> > > > > >
> > > > > > I'm concluding with a couple of nits:
> > > > > >
> > > > > > - "Upgraded Producer with BASE64 serialization, Legacy Consumer:
> > this
> > > > > > scenario is okay as the upgraded ~producer~ consumer will be able
> > to
> > > > read
> > > > > > binary as today" (again according to my suggestion above, it
> could
> > be
> > > > as
> > > > > > the upgraded source converter ...)
> > > > > >
> > > > > > - "consumers cannot consumer NUMERIC data. " -> "consumers cannot
> > > read
> > > > > > NUMERIC data"
> > > > > >
> > > > > > Best,
> > > > > > Konstantine
> > > > > >
> > > > > > On Fri, Aug 9, 2019 at 6:37 PM Almog Gavra <al...@confluent.io>
> > > wrote:
> > > > > >
> > > > > > > Good catches! Fixed :)
> > > > > > >
> > > > > > > On Thu, Aug 8, 2019 at 10:36 PM Arjun Satish <
> > > arjun.satish@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Cool!
> > > > > > > >
> > > > > > > > Couple of nits:
> > > > > > > >
> > > > > > > > - In public interfaces, typo:
> > *json.decimal.serialization.fromat*
> > > > > > > > - In public interfaces, you use the term "HEX" instead of
> > > "BASE64".
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Aug 7, 2019 at 9:51 AM Almog Gavra <
> almog@confluent.io
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > EDIT: everywhere I've been using "HEX" I meant to be using
> > > > > "BASE64".
> > > > > > I
> > > > > > > > will
> > > > > > > > > update the KIP to reflect this.
> > > > > > > > >
> > > > > > > > > On Wed, Aug 7, 2019 at 9:44 AM Almog Gavra <
> > almog@confluent.io
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the feedback Arjun! I'm happy changing the
> > default
> > > > > > config
> > > > > > > to
> > > > > > > > > > HEX instead of BINARY, no strong feelings there.
> > > > > > > > > >
> > > > > > > > > > I'll also clarify the example in the KIP to be clearer:
> > > > > > > > > >
> > > > > > > > > > - serialize the decimal field "foo" with value "10.2345"
> > with
> > > > the
> > > > > > HEX
> > > > > > > > > > setting: {"foo": "D3J5"}
> > > > > > > > > > - serialize the decimal field "foo" with value "10.2345"
> > with
> > > > the
> > > > > > > > NUMERIC
> > > > > > > > > > setting: {"foo": 10.2345}
> > > > > > > > > >
> > > > > > > > > > With regards to the precision issue, that was my original
> > > > concern
> > > > > > as
> > > > > > > > well
> > > > > > > > > > (and why I originally suggested a TEXT format). Many JSON
> > > > > > > deserializers
> > > > > > > > > > (e.g. Jackson with
> > > > > > > DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS),
> > > > > > > > > > however, have the ability to deserialize decimals
> correctly
> > > so
> > > > I
> > > > > > will
> > > > > > > > > > configure that as the default for Connect's
> > JsonDeserializer.
> > > > > It's
> > > > > > > > > probably
> > > > > > > > > > a good idea to call out that using other deserializers
> must
> > > be
> > > > > done
> > > > > > > > with
> > > > > > > > > > care - I will add that documentation to the serialization
> > > > config.
> > > > > > > > > >
> > > > > > > > > > Note that there would not be an issue on the
> > _serialization_
> > > > side
> > > > > > of
> > > > > > > > > > things as Jackson respects BigDecimal.
> > > > > > > > > >
> > > > > > > > > > Almog
> > > > > > > > > >
> > > > > > > > > > On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish <
> > > > > > arjun.satish@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> hey Almog, nice work! couple of thoughts (hope I'm not
> > late
> > > > > since
> > > > > > > you
> > > > > > > > > >> started the voting thread already):
> > > > > > > > > >>
> > > > > > > > > >> can you please add some examples to show the changes
> that
> > > you
> > > > > are
> > > > > > > > > >> proposing. makes me think that for a given decimal
> number,
> > > we
> > > > > will
> > > > > > > > have
> > > > > > > > > >> two
> > > > > > > > > >> encodings: “asHex” and “asNumber”.
> > > > > > > > > >>
> > > > > > > > > >> should we call the default config value “HEX” instead of
> > > > > “BINARY”?
> > > > > > > > > >>
> > > > > > > > > >> Should we call out the fact that JS systems might be
> > > > susceptible
> > > > > > to
> > > > > > > > > double
> > > > > > > > > >> precision round offs with the new numeric format? here
> are
> > > > some
> > > > > > > people
> > > > > > > > > >> discussing a similar problem
> > > > > > > > > >> https://github.com/EventStore/EventStore/issues/1541
> > > > > > > > > >>
> > > > > > > > > >> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <
> > > > almog@confluent.io>
> > > > > > > > wrote:
> > > > > > > > > >>
> > > > > > > > > >> > Hello Everyone,
> > > > > > > > > >> >
> > > > > > > > > >> > Summarizing an in-person discussion with Randall (this
> > is
> > > > > copied
> > > > > > > > from
> > > > > > > > > >> the
> > > > > > > > > >> > KIP):
> > > > > > > > > >> >
> > > > > > > > > >> > The original KIP suggested supporting an additional
> > > > > > > representation -
> > > > > > > > > >> base10
> > > > > > > > > >> > encoded text (e.g. `{"asText":"10.2345"}`). This
> causes
> > > > issues
> > > > > > > > because
> > > > > > > > > >> it
> > > > > > > > > >> > is impossible to disambiguate between TEXT and BINARY
> > > > without
> > > > > an
> > > > > > > > > >> additional
> > > > > > > > > >> > config - furthermore, this makes the migration from
> one
> > to
> > > > the
> > > > > > > other
> > > > > > > > > >> nearly
> > > > > > > > > >> > impossible because it would require that all consumers
> > > stop
> > > > > > > > consuming
> > > > > > > > > >> and
> > > > > > > > > >> > producers stop producing and atomically updating the
> > > config
> > > > on
> > > > > > all
> > > > > > > > of
> > > > > > > > > >> them
> > > > > > > > > >> > after deploying the new code, or waiting for the full
> > > > > retention
> > > > > > > > period
> > > > > > > > > >> to
> > > > > > > > > >> > pass - neither option is viable. The suggestion in the
> > KIP
> > > > is
> > > > > > > > strictly
> > > > > > > > > >> an
> > > > > > > > > >> > improvement over the existing behavior, even if it
> > doesn't
> > > > > > support
> > > > > > > > all
> > > > > > > > > >> > combinations.
> > > > > > > > > >> >
> > > > > > > > > >> > It seems that since most real-world use cases actually
> > use
> > > > the
> > > > > > > > numeric
> > > > > > > > > >> > representation (not string) we can consider this an
> > > > > improvement.
> > > > > > > > With
> > > > > > > > > >> the
> > > > > > > > > >> > new suggestion, we don't need a deserialization
> > > > configuration
> > > > > > > (only
> > > > > > > > a
> > > > > > > > > >> > serialization option) and the consumers will be able
> to
> > > > always
> > > > > > > > > >> > automatically determine the serialization format.
> > > > > > > > > >> >
> > > > > > > > > >> > Based on this, I'll be opening up the simplified
> version
> > > of
> > > > > the
> > > > > > > KIP
> > > > > > > > > to a
> > > > > > > > > >> > vote.
> > > > > > > > > >> >
> > > > > > > > > >> > Almog
> > > > > > > > > >> >
> > > > > > > > > >> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <
> > > > > almog@confluent.io
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >> >
> > > > > > > > > >> > > I'm mostly happy with your current suggestion (two
> > > > configs,
> > > > > > one
> > > > > > > > for
> > > > > > > > > >> > > serialization and one for deserialization) and your
> > > > > > > implementation
> > > > > > > > > >> > > suggestion. One thing to note:
> > > > > > > > > >> > >
> > > > > > > > > >> > > > We should _always_ be able to deserialize a
> standard
> > > > JSON
> > > > > > > > > >> > > > number as a decimal
> > > > > > > > > >> > >
> > > > > > > > > >> > > I was doing some research into decimals and JSON,
> and
> > I
> > > > can
> > > > > > > > imagine
> > > > > > > > > a
> > > > > > > > > >> > > compelling reason to require string representations
> to
> > > > avoid
> > > > > > > > losing
> > > > > > > > > >> > > precision and to be certain that whomever is sending
> > the
> > > > > data
> > > > > > > > isn't
> > > > > > > > > >> > losing
> > > > > > > > > >> > > precision (e.g.
> > > > > https://stackoverflow.com/a/38357877/2258040
> > > > > > ).
> > > > > > > > > >> > >
> > > > > > > > > >> > > I'm okay with always allowing numerics, but thought
> > it's
> > > > > worth
> > > > > > > > > raising
> > > > > > > > > >> > the
> > > > > > > > > >> > > thought.
> > > > > > > > > >> > >
> > > > > > > > > >> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <
> > > > > > andy@confluent.io>
> > > > > > > > > >> wrote:
> > > > > > > > > >> > >
> > > > > > > > > >> > >> The way I see it, we need to control two seperate
> > > things:
> > > > > > > > > >> > >>
> > > > > > > > > >> > >> 1. How do we _deserialize_ a decimal type if we
> > > > encounter a
> > > > > > > text
> > > > > > > > > >> node in
> > > > > > > > > >> > >> the JSON?    (We should _always_ be able to
> > > deserialize a
> > > > > > > > standard
> > > > > > > > > >> JSON
> > > > > > > > > >> > >> number as a decimal).
> > > > > > > > > >> > >> 2. How do we chose how we want decimals to be
> > > > _serialized_.
> > > > > > > > > >> > >>
> > > > > > > > > >> > >> This looks to fits well with your second suggestion
> > of
> > > > > > slightly
> > > > > > > > > >> > different
> > > > > > > > > >> > >> configs names for serialization vs deserialization.
> > > > > > > > > >> > >> a, For deserialization we only care about how to
> > handle
> > > > > text
> > > > > > > > > nodes: `
> > > > > > > > > >> > >> deserialization.decimal.*text*.format`, which
> should
> > > only
> > > > > > have
> > > > > > > > two
> > > > > > > > > >> valid
> > > > > > > > > >> > >> values BINARY | TEXT.
> > > > > > > > > >> > >> b. For serialization we need all three:
> > > > > > > > > >> `serialization.decimal.format`,
> > > > > > > > > >> > >> which should support all three options: BINARY |
> > TEXT |
> > > > > > > NUMERIC.
> > > > > > > > > >> > >>
> > > > > > > > > >> > >> Implementation wise, I think these should be two
> > > separate
> > > > > > > enums,
> > > > > > > > > >> rather
> > > > > > > > > >> > >> than one shared enum and throwing an error if the
> > > > > > deserializer
> > > > > > > is
> > > > > > > > > >> set to
> > > > > > > > > >> > >> NUMERIC.  Mainly as this means the enums reflect
> the
> > > > > options
> > > > > > > > > >> available,
> > > > > > > > > >> > >> rather than this being hidden in config checking
> > code.
> > > > But
> > > > > > > > that's
> > > > > > > > > a
> > > > > > > > > >> > minor
> > > > > > > > > >> > >> implementation detail.
> > > > > > > > > >> > >>
> > > > > > > > > >> > >> Personally, I'd be tempted to have the BINARY value
> > > named
> > > > > > > > something
> > > > > > > > > >> like
> > > > > > > > > >> > >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging
> > > users
> > > > > to
> > > > > > > move
> > > > > > > > > >> away
> > > > > > > > > >> > >> from
> > > > > > > > > >> > >> it.
> > > > > > > > > >> > >>
> > > > > > > > > >> > >> It's a real shame that both of these settings
> > require a
> > > > > > default
> > > > > > > > of
> > > > > > > > > >> > BINARY
> > > > > > > > > >> > >> for backwards compatibility, but I agree that
> > > > discussions /
> > > > > > > plans
> > > > > > > > > >> around
> > > > > > > > > >> > >> switching the defaults should not block this KIP.
> > > > > > > > > >> > >>
> > > > > > > > > >> > >> Andy
> > > > > > > > > >> > >>
> > > > > > > > > >> > >>
> > > > > > > > > >> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <
> > > > > > almog@confluent.io>
> > > > > > > > > >> wrote:
> > > > > > > > > >> > >>
> > > > > > > > > >> > >> > Thanks for the replies Andy and Andrew (2x
> Andy?)!
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > > Is the text decimal a base16 encoded number, or
> > is
> > > it
> > > > > > > base16
> > > > > > > > > >> encoded
> > > > > > > > > >> > >> > binary
> > > > > > > > > >> > >> > > form of the number?
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > The conversion happens as
> > > > > > > decimal.unscaledValue().toByteArray()
> > > > > > > > > and
> > > > > > > > > >> > then
> > > > > > > > > >> > >> > the byte array is converted to a hex string, so
> > it's
> > > > > > > definitely
> > > > > > > > > the
> > > > > > > > > >> > >> binary
> > > > > > > > > >> > >> > form of the number converted to base16. Whether
> or
> > > not
> > > > > > that's
> > > > > > > > the
> > > > > > > > > >> same
> > > > > > > > > >> > >> as
> > > > > > > > > >> > >> > the base16 encoded number is a good question
> > > > (toByteArray
> > > > > > > > > returns a
> > > > > > > > > >> > byte
> > > > > > > > > >> > >> > array containing a signed, big-endian, two's
> > > complement
> > > > > > > > > >> representation
> > > > > > > > > >> > >> of
> > > > > > > > > >> > >> > the big integer).
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > > One suggestion I have is to change the proposed
> > new
> > > > > > config
> > > > > > > to
> > > > > > > > > >> only
> > > > > > > > > >> > >> affect
> > > > > > > > > >> > >> > > decimals stored as text, i.e. to switch between
> > the
> > > > > > current
> > > > > > > > > >> base16
> > > > > > > > > >> > and
> > > > > > > > > >> > >> > the
> > > > > > > > > >> > >> > > more common base10.   Then add another config
> to
> > > the
> > > > > > > > serializer
> > > > > > > > > >> only
> > > > > > > > > >> > >> that
> > > > > > > > > >> > >> > > controls if decimals should be serialized as
> text
> > > or
> > > > > > > numeric.
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > I think we need to be able to handle all mappings
> > > from
> > > > > > > > > >> serialization
> > > > > > > > > >> > >> format
> > > > > > > > > >> > >> > to deserialization format (e.g. read in BINARY
> and
> > > > output
> > > > > > > > TEXT),
> > > > > > > > > >> > which I
> > > > > > > > > >> > >> > think would be impossible with the alternative
> > > > > suggestion.
> > > > > > I
> > > > > > > > > agree
> > > > > > > > > >> > that
> > > > > > > > > >> > >> > automatically deserializing numerics is
> valuable. I
> > > see
> > > > > two
> > > > > > > > other
> > > > > > > > > >> ways
> > > > > > > > > >> > >> to
> > > > > > > > > >> > >> > get this, both keeping the serialization.format
> > > config
> > > > > the
> > > > > > > > same:
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > - have json.decimal.deserialization.format accept
> > all
> > > > > three
> > > > > > > > > >> formats.
> > > > > > > > > >> > if
> > > > > > > > > >> > >> set
> > > > > > > > > >> > >> > to BINARY/TEXT, numerics would be automatically
> > > > > supported.
> > > > > > If
> > > > > > > > set
> > > > > > > > > >> to
> > > > > > > > > >> > >> > NUMERIC, then any string coming in would result
> in
> > > > > > > > > deserialization
> > > > > > > > > >> > error
> > > > > > > > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > > > > > > > >> > >> > - change json.decimal.deserialization.format to
> > > > > > > > > >> > >> > json.decimal.deserialization.string.format which
> > > > accepts
> > > > > > only
> > > > > > > > > >> > >> BINARY/TEXT
> > > > > > > > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > > would be a breaking change in that things that
> > > > > previously
> > > > > > > > > failed
> > > > > > > > > >> > would
> > > > > > > > > >> > >> > > suddenly start deserializing.  This is a price
> > I'm
> > > > > > willing
> > > > > > > to
> > > > > > > > > >> pay.
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > I agree. I'm willing to pay this price too.
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > > IMHO, we should then plan to switch the default
> > of
> > > > > > decimal
> > > > > > > > > >> > >> serialization
> > > > > > > > > >> > >> > to
> > > > > > > > > >> > >> > > numeric, and text serialization to base 10 in
> the
> > > > next
> > > > > > > major
> > > > > > > > > >> > release.
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > I think that can be a separate discussion, I
> don't
> > > want
> > > > > to
> > > > > > > > block
> > > > > > > > > >> this
> > > > > > > > > >> > >> KIP
> > > > > > > > > >> > >> > on it.
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > Thoughts?
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <
> > > > > > > > otto@wikimedia.org>
> > > > > > > > > >> > wrote:
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >> > > This is a bit orthogonal, but in
> > > JsonSchemaConverter
> > > > I
> > > > > > use
> > > > > > > > > >> > >> JSONSchemas to
> > > > > > > > > >> > >> > > indicate whether a JSON number should be
> > > deserialized
> > > > > as
> > > > > > an
> > > > > > > > > >> integer
> > > > > > > > > >> > >> or a
> > > > > > > > > >> > >> > > decimal
> > > > > > > > > >> > >> > > <
> > > > > > > > > >> > >> > >
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >>
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
> > > > > > > > > >> > >> > > >.
> > > > > > > > > >> > >> > > Not everyone is going to have JSONSchemas
> > available
> > > > > when
> > > > > > > > > >> converting,
> > > > > > > > > >> > >> but
> > > > > > > > > >> > >> > if
> > > > > > > > > >> > >> > > you do, it is an easy way to support JSON
> numbers
> > > as
> > > > > > > > decimals.
> > > > > > > > > >> > >> > >
> > > > > > > > > >> > >> > > Carry on! :)
> > > > > > > > > >> > >> > >
> > > > > > > > > >> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <
> > > > > > > > andy@confluent.io
> > > > > > > > > >
> > > > > > > > > >> > >> wrote:
> > > > > > > > > >> > >> > >
> > > > > > > > > >> > >> > > > Hi Almog,
> > > > > > > > > >> > >> > > >
> > > > > > > > > >> > >> > > > Like the KIP - I think being able to support
> > > > decimals
> > > > > > in
> > > > > > > > JSON
> > > > > > > > > >> in
> > > > > > > > > >> > the
> > > > > > > > > >> > >> > same
> > > > > > > > > >> > >> > > > way most other systems do is a great
> > improvement.
> > > > > > > > > >> > >> > > >
> > > > > > > > > >> > >> > > > It's not 100% clear to me from the KIP what
> the
> > > > > current
> > > > > > > > > format
> > > > > > > > > >> is.
> > > > > > > > > >> > >> Is
> > > > > > > > > >> > >> > > the
> > > > > > > > > >> > >> > > > text decimal a base16 encoded number, or is
> it
> > > > base16
> > > > > > > > encoded
> > > > > > > > > >> > binary
> > > > > > > > > >> > >> > form
> > > > > > > > > >> > >> > > > of the number? (I've not tried to get my head
> > > > around
> > > > > if
> > > > > > > > these
> > > > > > > > > >> two
> > > > > > > > > >> > >> are
> > > > > > > > > >> > >> > > even
> > > > > > > > > >> > >> > > > different!)
> > > > > > > > > >> > >> > > >
> > > > > > > > > >> > >> > > > One suggestion I have is to change the
> proposed
> > > new
> > > > > > > config
> > > > > > > > to
> > > > > > > > > >> only
> > > > > > > > > >> > >> > affect
> > > > > > > > > >> > >> > > > decimals stored as text, i.e. to switch
> between
> > > the
> > > > > > > current
> > > > > > > > > >> base16
> > > > > > > > > >> > >> and
> > > > > > > > > >> > >> > > the
> > > > > > > > > >> > >> > > > more common base10.   Then add another config
> > to
> > > > the
> > > > > > > > > serialzier
> > > > > > > > > >> > only
> > > > > > > > > >> > >> > that
> > > > > > > > > >> > >> > > > controls if decimals should be serialized as
> > text
> > > > or
> > > > > > > > numeric.
> > > > > > > > > >> The
> > > > > > > > > >> > >> > > benefit
> > > > > > > > > >> > >> > > > of this approach is it allows us to enhance
> the
> > > > > > > > deserializer
> > > > > > > > > to
> > > > > > > > > >> > >> > > > automatically handle numeric decimals even
> > > without
> > > > > any
> > > > > > > > config
> > > > > > > > > >> > >> having to
> > > > > > > > > >> > >> > > be
> > > > > > > > > >> > >> > > > set, i.e. default config in the deserializer
> > > would
> > > > be
> > > > > > > able
> > > > > > > > to
> > > > > > > > > >> > handle
> > > > > > > > > >> > >> > > > numeric decimals.  Of course, this is a two
> > edged
> > > > > > sword:
> > > > > > > > this
> > > > > > > > > >> > would
> > > > > > > > > >> > >> > make
> > > > > > > > > >> > >> > > > the deserializer work out of the box with
> > numeric
> > > > > > > decimals,
> > > > > > > > > >> > (yay!),
> > > > > > > > > >> > >> but
> > > > > > > > > >> > >> > > > would be a breaking change in that things
> that
> > > > > > previously
> > > > > > > > > >> failed
> > > > > > > > > >> > >> would
> > > > > > > > > >> > >> > > > suddenly start deserializing.  This is a
> price
> > > I'm
> > > > > > > willing
> > > > > > > > to
> > > > > > > > > >> pay.
> > > > > > > > > >> > >> > > >
> > > > > > > > > >> > >> > > > IMHO, we should then plan to switch the
> default
> > > of
> > > > > > > decimal
> > > > > > > > > >> > >> > serialization
> > > > > > > > > >> > >> > > to
> > > > > > > > > >> > >> > > > numeric, and text serialization to base 10 in
> > the
> > > > > next
> > > > > > > > major
> > > > > > > > > >> > >> release.
> > > > > > > > > >> > >> > > > (With upgrade notes to match). Though I know
> > this
> > > > is
> > > > > > more
> > > > > > > > > >> > >> contentious,
> > > > > > > > > >> > >> > I
> > > > > > > > > >> > >> > > > think it moves us forward in a much more
> > standard
> > > > way
> > > > > > > that
> > > > > > > > > the
> > > > > > > > > >> > >> current
> > > > > > > > > >> > >> > > > encoding of decimals.
> > > > > > > > > >> > >> > > >
> > > > > > > > > >> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <
> > > > > > > > > almog@confluent.io>
> > > > > > > > > >> > >> wrote:
> > > > > > > > > >> > >> > > >
> > > > > > > > > >> > >> > > > > Hi Everyone!
> > > > > > > > > >> > >> > > > >
> > > > > > > > > >> > >> > > > > Kicking off discussion for a new KIP:
> > > > > > > > > >> > >> > > > >
> > > > > > > > > >> > >> > > > >
> > > > > > > > > >> > >> > > >
> > > > > > > > > >> > >> > >
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >>
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
> > > > > > > > > >> > >> > > > >
> > > > > > > > > >> > >> > > > > For those who are interested, I have a
> > > prototype
> > > > > > > > > >> implementation
> > > > > > > > > >> > >> that
> > > > > > > > > >> > >> > > > helped
> > > > > > > > > >> > >> > > > > guide my design:
> > > > > > > https://github.com/agavra/kafka/pull/1
> > > > > > > > > >> > >> > > > >
> > > > > > > > > >> > >> > > > > Cheers,
> > > > > > > > > >> > >> > > > > Almog
> > > > > > > > > >> > >> > > > >
> > > > > > > > > >> > >> > > >
> > > > > > > > > >> > >> > >
> > > > > > > > > >> > >> >
> > > > > > > > > >> > >>
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Posted by Randall Hauch <rh...@gmail.com>.
Thanks for the updates, Almog. This looks really good, but I have a few
more comments (most wording, though one potentially thorny issue):

First, the KIP should define the default value for the `decimal.format`
property. IIUC, it will be BASE64, and to gain the new behavior users will
have to explicitly set this to NUMERIC. In fact, I'd recommend changing
this bullet item from:

   - JsonConverter will accept the decimal.format configuration to
   determine the serialization format. If the value is BASE64, the behavior
   remains unchanged (i.e. it serializes it as a base64 text). If the value is
   NUMERIC, the JSON node will be a number representing that decimal (e.g.
   10.2345 instead of "D3J5").

to something like:

   - Define a new `decimal.format` configuration property on JsonConverter
   to specify the serialization format for Connect DECIMAL logical type
   values, with two allowed literal values for the configuration property:
   - `BASE64` specifies the existing behavior of serializing DECIMAL
      logical types as base64 encoded binary data (e.g., "D3J5" in the example
      above), and will be the default; and
      - `NUMERIC` will serialize Connect DECIMAL logical type values in
      JSON as a number representing that decimal

Second, since the default will be the existing BASE64 representation, what
do you think about changing the JsonConverter instances used by the Connect
worker for internal topics to enable `decimal.format=NUMERIC`? I don't
think we actually use decimals in the internal messages, but if we do at
some point then the converters will store them with the improved natural
representation.

Third, the following bullet item could be more clear:

   - JsonConverter will automatically handle deserialization of either
   serialization format given a Decimal logical type schema, i.e. it will
   accept both a deserialized BinaryNode and NumericNode. If the value is a
   BinaryNode, it will construct a java BigDecimal from the binaryValue()
   (which is a btye[]). If the value is a NumericNode, it will simply pass
   through the decimalValue() deserialized by the JsonDeserializer.

such as maybe:

   - The JsonConverter deserialization method currently expects only a
   BinaryNode, but will be changed to also handle NumericNode by calling
   NumericNode.decimalValue().

This brings up an interesting potential issue: if `schemas.enable=false`,
then there will be no schema in the record, and calling
`NumericNode.decimalValue()` will always return a java.math.BigDecimal even
if the underlying Number was not a BigDecimal. Shouldn't the
`convertToConnect(Schema, JsonNode) instead use the type of Number value as
parsed and returned by the JsonDeserializer to determine the proper schema
type, and then get the proper value type using that schema type's converter?

Fourth, I'm not sure I understand the following bullet:

   - JsonDeserializer will now default floating point deserialization to
   BigDecimal to avoid losing precision. This may impact performance when
   deserializing doubles - a JMH microbenchmark on my local MBP, this
   estimated about 3x degradation for deserializing JSON floating points. If
   the connect schema is not the decimal logical type, the JsonConverter will
   convert this BigDecimal value into the corresponding floating point java
   object.

Fifth, I think the migration plan is not quite accurate. Step 3 mentions
changing the Connect worker config's key and value converters to use the
new setting, and a restart (step 4) is necessary after this step. Perhaps
step 3 should be "If the Connect worker uses the JsonConverter for the key
and/or value converters, optionally set the `decimal.format=NUMERIC` for
the key and/or value converter and restart the workers." followed by Step
4: "If desired, update any source connector configs that use the
JsonConverter for key and/or value converters to use
`decimal.format=NUMERIC`."

Finally, should we add a short discussion in the Rejected Alternatives
about the option of leaving JsonConverter untouched and creating a
different converter implementation?

Thanks!

Randall



On Mon, Aug 26, 2019 at 11:51 AM Almog Gavra <al...@confluent.io> wrote:

> Thanks for the feedback Randall! I have updated the KIP with the following
> edits:
>
> * Updated the reference from "producer" to "source" (I had missed that
> one!)
> * Changed the config from "json.decimal.serialization.format" to
> "decimal.format"
> * Clarified case sensitivity
> * Clarified the proposed changes to note that deserialization is not
> affected by the config
> * Clarified the changes in JsonConverter to handle deserialization (see my
> third bullet below)
> * Added a clear migration plan and simplified compatibility
>
> Here are also some clarifications based on your comments.
>
> * I think "json" has limited value in the configuration name. If put in a
> top-level worker config, it clarifies that it only affects connectors using
> the JsonConverter. I have opted for your suggestion and dropped it.
> * I think "serialization" has limited value in the configuration name. If
> we ever want to introduce "deserialization" configurations, there will be
> asymmetry in the configuration names. I have opted for your suggestion and
> dropped it.
> * The JsonConverter will not "always look for numbers". The converter will
> receive from the Jackson Object Mapper either a NumericNode containing a
> big decimal or a BinaryNode containing a btye[]. Based on the type of this
> node, it will convert the value to a BigDecimal appropriately (or any other
> Connect java type based on the schema).
> * "the ... JsonDeserializer are not affected" is not exactly true, but
> semantically correct. See the note in the KIP about defaulting floating
> points to BigDecimal to avoid precision loss.
> * "The resulting application, however, may need to handle a wider range of
> numeric values." Unless I misunderstand what you're saying, I don't think
> this is correct. The resulting application will still receive exactly the
> same Connect data object from the JsonConverter as it was before - only the
> SerDe layer is affected.
>
> Cheers,
> Almog
>
> On Sun, Aug 25, 2019 at 4:28 PM Randall Hauch <rh...@gmail.com> wrote:
>
> > Thanks for all the work, Almog.
> >
> > For the most part, I think this KIP will be a great improvement, and IMO
> is
> > almost ready to go. However, I do have a few suggestions that affect the
> > wording more than the intent.
> >
> > First, the name of the `json.decimal.serialization.format` property is
> > pretty long, especially when it is prefixed in the Worker config or in a
> > connector config as `key.converer.json.decimal.serialization.format` or
> > `value.converter.json.decimal.serialization.format` . Have you
> considered a
> > shorter config property name, such as maybe `decimal.format`? Is there
> any
> > benefit to include "json" and "serialization" in the property name? Also,
> > we should be clear that the value will not be case sensitive (e.g.,
> > "numeric" and "NUMERIC" would be equivalent), to keep in alignment with
> > other enumeration literals in Connect configurations. The goal should be
> > simple
> >
> > Second, the Motivation section, has the following sentence:
> >
> > "A new configuration for producers json.decimal.serialization.format will
> > be introduced to the JsonConverter configuration to help control whether
> > source converters will serialize decimals in numeric or binary formats."
> >
> >
> > I agree with an earlier comment from Konstantine that "producers" here is
> > distracting and does not mirror the normal definition of "producers"
> within
> > the Kafka context. I suggest rephrasing this to something like
> >
> > "Introduce to the JsonConverter a new configuration property named
> > json.decimal.serialization.format to control whether source converters
> will
> > serialize decimals in numeric or binary formats."
> >
> >
> > Third, the KIP should be more clear about whether the
> > `json.decimal.serialization.format` setting does or does not affect
> > deserialization? IIUC, the deserialization logic will always look for
> JSON
> > numbers, and will always use the Schema to define whether it should
> convert
> > the value to a different number type. Is that a fair statement?
> >
> > Fourth, the JsonSerializer and JsonDeserializer are not affected, yet are
> > still compatible with the old and new behavior. Because the primary
> purpose
> > of this new setting is to define how Connect DECIMAL logical type values
> > are serialized in JSON documents, the JsonDeserializer will still be able
> > to deserialize the JSON document correctly. The resulting application,
> > however, may need to handle a wider range of numeric values.
> >
> > Fifth, the Compatibility section seems more complicated than perhaps it
> > needs to be, maybe because it seems to distinguish between upgrading and
> > setting the decimal serialization format. Maybe it would be sufficient to
> > simply emphasize that all of the sink connectors (or consumer
> applications)
> > using the JsonConverter with
> > the `json.decimal.serialization.format=NUMERIC` setting consuming records
> > from a set of topics be upgraded and changed *before* any of the source
> > connectors (or other producer applications) using the JsonConverter to
> > serialize records are changed to use
> > the `json.decimal.serialization.format=NUMERIC` setting? It may also
> > warrant giving more concrete advice on upgrade procedures. For example,
> how
> > does a user upgrade a set of Connect workers to use this new property? Do
> > they upgrade first and restart to ensure everything runs as-is, and then
> > upgrade their source connectors to set
> > `json.decimal.serialization.format=NUMERIC`via connector configurations
> or
> > worker configs?
> >
> > Anyway, great job so far!
> >
> > Best regards,
> >
> > Randall
> >
> > On Thu, Aug 15, 2019 at 12:00 PM Konstantine Karantasis <
> > konstantine@confluent.io> wrote:
> >
> > > Thanks! KIP reads even better for me now.
> > > Just voted. +1 non-binding
> > >
> > > Konstantine
> > >
> > > On Wed, Aug 14, 2019 at 7:00 PM Almog Gavra <al...@confluent.io>
> wrote:
> > >
> > > > Thanks for the review Konstantine!
> > > >
> > > > I think the terminology suggestion definitely makes things clearer -
> I
> > > will
> > > > update the documentation based on your suggestion (e.g.
> s/Consumer/Sink
> > > > Converter/g and s/Producer/Source Converter/g).
> > > >
> > > > Cheers,
> > > > Almog
> > > >
> > > > On Wed, Aug 14, 2019 at 8:13 AM Konstantine Karantasis <
> > > > konstantine@confluent.io> wrote:
> > > >
> > > > > Thanks Almog for preparing this KIP!
> > > > > I think it will improve usability and troubleshooting with JSON
> data
> > a
> > > > lot.
> > > > >
> > > > > The finalized plan seems quite concrete now. I also liked that some
> > > > > implementation specific implications (such as setting the
> > ObjectMapper
> > > to
> > > > > deserialize floating point as BigDecimal) are highlighted in the
> KIP.
> > > > >
> > > > > Still, as I was reading the KIP, the main obstacle I encountered
> was
> > > > around
> > > > > terminology. I couldn't get used to reading "producer" and
> "consumer"
> > > and
> > > > > not thinking in terms of Kafka producers and consumers - which are
> > not
> > > > > relevant to what this KIP proposes. Thus, I'd suggest replacing
> > > > > "Producer(s)" with "Source Converter(s)" and "Consumer(s)" with
> "Sink
> > > > > Converter(s)" (even if "Converter used by Source Connector" or
> > > "Converter
> > > > > used by Sink Connector" would be even more accurate - maybe this
> > could
> > > be
> > > > > an explanation in a footnote). Terminology around converters has
> been
> > > > > tricky in the past and adding producers/consumers in the mix might
> > add
> > > to
> > > > > the confusion.
> > > > >
> > > > > Another example where I'd apply this different terminology would be
> > to
> > > a
> > > > > phrase such as the following:
> > > > > "Because of this, users must take care to first ensure that all
> > > consumers
> > > > > have upgraded to the new code before upgrading producers to make
> use
> > of
> > > > the
> > > > > NUMERIC serialization format."
> > > > > which I'd write
> > > > > "Because of this, users must take care to first ensure that all
> sink
> > > > > connectors have upgraded to the new converter code before upgrading
> > > > source
> > > > > connectors to make use of the NUMERIC serialization format in
> > > > > JsonConverter."
> > > > >
> > > > > Let me know if you think this suggestion makes the KIP easier to
> > > follow.
> > > > > Otherwise I think it's a solid proposal.
> > > > >
> > > > > I'm concluding with a couple of nits:
> > > > >
> > > > > - "Upgraded Producer with BASE64 serialization, Legacy Consumer:
> this
> > > > > scenario is okay as the upgraded ~producer~ consumer will be able
> to
> > > read
> > > > > binary as today" (again according to my suggestion above, it could
> be
> > > as
> > > > > the upgraded source converter ...)
> > > > >
> > > > > - "consumers cannot consumer NUMERIC data. " -> "consumers cannot
> > read
> > > > > NUMERIC data"
> > > > >
> > > > > Best,
> > > > > Konstantine
> > > > >
> > > > > On Fri, Aug 9, 2019 at 6:37 PM Almog Gavra <al...@confluent.io>
> > wrote:
> > > > >
> > > > > > Good catches! Fixed :)
> > > > > >
> > > > > > On Thu, Aug 8, 2019 at 10:36 PM Arjun Satish <
> > arjun.satish@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Cool!
> > > > > > >
> > > > > > > Couple of nits:
> > > > > > >
> > > > > > > - In public interfaces, typo:
> *json.decimal.serialization.fromat*
> > > > > > > - In public interfaces, you use the term "HEX" instead of
> > "BASE64".
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Aug 7, 2019 at 9:51 AM Almog Gavra <almog@confluent.io
> >
> > > > wrote:
> > > > > > >
> > > > > > > > EDIT: everywhere I've been using "HEX" I meant to be using
> > > > "BASE64".
> > > > > I
> > > > > > > will
> > > > > > > > update the KIP to reflect this.
> > > > > > > >
> > > > > > > > On Wed, Aug 7, 2019 at 9:44 AM Almog Gavra <
> almog@confluent.io
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for the feedback Arjun! I'm happy changing the
> default
> > > > > config
> > > > > > to
> > > > > > > > > HEX instead of BINARY, no strong feelings there.
> > > > > > > > >
> > > > > > > > > I'll also clarify the example in the KIP to be clearer:
> > > > > > > > >
> > > > > > > > > - serialize the decimal field "foo" with value "10.2345"
> with
> > > the
> > > > > HEX
> > > > > > > > > setting: {"foo": "D3J5"}
> > > > > > > > > - serialize the decimal field "foo" with value "10.2345"
> with
> > > the
> > > > > > > NUMERIC
> > > > > > > > > setting: {"foo": 10.2345}
> > > > > > > > >
> > > > > > > > > With regards to the precision issue, that was my original
> > > concern
> > > > > as
> > > > > > > well
> > > > > > > > > (and why I originally suggested a TEXT format). Many JSON
> > > > > > deserializers
> > > > > > > > > (e.g. Jackson with
> > > > > > DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS),
> > > > > > > > > however, have the ability to deserialize decimals correctly
> > so
> > > I
> > > > > will
> > > > > > > > > configure that as the default for Connect's
> JsonDeserializer.
> > > > It's
> > > > > > > > probably
> > > > > > > > > a good idea to call out that using other deserializers must
> > be
> > > > done
> > > > > > > with
> > > > > > > > > care - I will add that documentation to the serialization
> > > config.
> > > > > > > > >
> > > > > > > > > Note that there would not be an issue on the
> _serialization_
> > > side
> > > > > of
> > > > > > > > > things as Jackson respects BigDecimal.
> > > > > > > > >
> > > > > > > > > Almog
> > > > > > > > >
> > > > > > > > > On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish <
> > > > > arjun.satish@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> hey Almog, nice work! couple of thoughts (hope I'm not
> late
> > > > since
> > > > > > you
> > > > > > > > >> started the voting thread already):
> > > > > > > > >>
> > > > > > > > >> can you please add some examples to show the changes that
> > you
> > > > are
> > > > > > > > >> proposing. makes me think that for a given decimal number,
> > we
> > > > will
> > > > > > > have
> > > > > > > > >> two
> > > > > > > > >> encodings: “asHex” and “asNumber”.
> > > > > > > > >>
> > > > > > > > >> should we call the default config value “HEX” instead of
> > > > “BINARY”?
> > > > > > > > >>
> > > > > > > > >> Should we call out the fact that JS systems might be
> > > susceptible
> > > > > to
> > > > > > > > double
> > > > > > > > >> precision round offs with the new numeric format? here are
> > > some
> > > > > > people
> > > > > > > > >> discussing a similar problem
> > > > > > > > >> https://github.com/EventStore/EventStore/issues/1541
> > > > > > > > >>
> > > > > > > > >> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <
> > > almog@confluent.io>
> > > > > > > wrote:
> > > > > > > > >>
> > > > > > > > >> > Hello Everyone,
> > > > > > > > >> >
> > > > > > > > >> > Summarizing an in-person discussion with Randall (this
> is
> > > > copied
> > > > > > > from
> > > > > > > > >> the
> > > > > > > > >> > KIP):
> > > > > > > > >> >
> > > > > > > > >> > The original KIP suggested supporting an additional
> > > > > > representation -
> > > > > > > > >> base10
> > > > > > > > >> > encoded text (e.g. `{"asText":"10.2345"}`). This causes
> > > issues
> > > > > > > because
> > > > > > > > >> it
> > > > > > > > >> > is impossible to disambiguate between TEXT and BINARY
> > > without
> > > > an
> > > > > > > > >> additional
> > > > > > > > >> > config - furthermore, this makes the migration from one
> to
> > > the
> > > > > > other
> > > > > > > > >> nearly
> > > > > > > > >> > impossible because it would require that all consumers
> > stop
> > > > > > > consuming
> > > > > > > > >> and
> > > > > > > > >> > producers stop producing and atomically updating the
> > config
> > > on
> > > > > all
> > > > > > > of
> > > > > > > > >> them
> > > > > > > > >> > after deploying the new code, or waiting for the full
> > > > retention
> > > > > > > period
> > > > > > > > >> to
> > > > > > > > >> > pass - neither option is viable. The suggestion in the
> KIP
> > > is
> > > > > > > strictly
> > > > > > > > >> an
> > > > > > > > >> > improvement over the existing behavior, even if it
> doesn't
> > > > > support
> > > > > > > all
> > > > > > > > >> > combinations.
> > > > > > > > >> >
> > > > > > > > >> > It seems that since most real-world use cases actually
> use
> > > the
> > > > > > > numeric
> > > > > > > > >> > representation (not string) we can consider this an
> > > > improvement.
> > > > > > > With
> > > > > > > > >> the
> > > > > > > > >> > new suggestion, we don't need a deserialization
> > > configuration
> > > > > > (only
> > > > > > > a
> > > > > > > > >> > serialization option) and the consumers will be able to
> > > always
> > > > > > > > >> > automatically determine the serialization format.
> > > > > > > > >> >
> > > > > > > > >> > Based on this, I'll be opening up the simplified version
> > of
> > > > the
> > > > > > KIP
> > > > > > > > to a
> > > > > > > > >> > vote.
> > > > > > > > >> >
> > > > > > > > >> > Almog
> > > > > > > > >> >
> > > > > > > > >> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <
> > > > almog@confluent.io
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >> >
> > > > > > > > >> > > I'm mostly happy with your current suggestion (two
> > > configs,
> > > > > one
> > > > > > > for
> > > > > > > > >> > > serialization and one for deserialization) and your
> > > > > > implementation
> > > > > > > > >> > > suggestion. One thing to note:
> > > > > > > > >> > >
> > > > > > > > >> > > > We should _always_ be able to deserialize a standard
> > > JSON
> > > > > > > > >> > > > number as a decimal
> > > > > > > > >> > >
> > > > > > > > >> > > I was doing some research into decimals and JSON, and
> I
> > > can
> > > > > > > imagine
> > > > > > > > a
> > > > > > > > >> > > compelling reason to require string representations to
> > > avoid
> > > > > > > losing
> > > > > > > > >> > > precision and to be certain that whomever is sending
> the
> > > > data
> > > > > > > isn't
> > > > > > > > >> > losing
> > > > > > > > >> > > precision (e.g.
> > > > https://stackoverflow.com/a/38357877/2258040
> > > > > ).
> > > > > > > > >> > >
> > > > > > > > >> > > I'm okay with always allowing numerics, but thought
> it's
> > > > worth
> > > > > > > > raising
> > > > > > > > >> > the
> > > > > > > > >> > > thought.
> > > > > > > > >> > >
> > > > > > > > >> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <
> > > > > andy@confluent.io>
> > > > > > > > >> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > >> The way I see it, we need to control two seperate
> > things:
> > > > > > > > >> > >>
> > > > > > > > >> > >> 1. How do we _deserialize_ a decimal type if we
> > > encounter a
> > > > > > text
> > > > > > > > >> node in
> > > > > > > > >> > >> the JSON?    (We should _always_ be able to
> > deserialize a
> > > > > > > standard
> > > > > > > > >> JSON
> > > > > > > > >> > >> number as a decimal).
> > > > > > > > >> > >> 2. How do we chose how we want decimals to be
> > > _serialized_.
> > > > > > > > >> > >>
> > > > > > > > >> > >> This looks to fits well with your second suggestion
> of
> > > > > slightly
> > > > > > > > >> > different
> > > > > > > > >> > >> configs names for serialization vs deserialization.
> > > > > > > > >> > >> a, For deserialization we only care about how to
> handle
> > > > text
> > > > > > > > nodes: `
> > > > > > > > >> > >> deserialization.decimal.*text*.format`, which should
> > only
> > > > > have
> > > > > > > two
> > > > > > > > >> valid
> > > > > > > > >> > >> values BINARY | TEXT.
> > > > > > > > >> > >> b. For serialization we need all three:
> > > > > > > > >> `serialization.decimal.format`,
> > > > > > > > >> > >> which should support all three options: BINARY |
> TEXT |
> > > > > > NUMERIC.
> > > > > > > > >> > >>
> > > > > > > > >> > >> Implementation wise, I think these should be two
> > separate
> > > > > > enums,
> > > > > > > > >> rather
> > > > > > > > >> > >> than one shared enum and throwing an error if the
> > > > > deserializer
> > > > > > is
> > > > > > > > >> set to
> > > > > > > > >> > >> NUMERIC.  Mainly as this means the enums reflect the
> > > > options
> > > > > > > > >> available,
> > > > > > > > >> > >> rather than this being hidden in config checking
> code.
> > > But
> > > > > > > that's
> > > > > > > > a
> > > > > > > > >> > minor
> > > > > > > > >> > >> implementation detail.
> > > > > > > > >> > >>
> > > > > > > > >> > >> Personally, I'd be tempted to have the BINARY value
> > named
> > > > > > > something
> > > > > > > > >> like
> > > > > > > > >> > >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging
> > users
> > > > to
> > > > > > move
> > > > > > > > >> away
> > > > > > > > >> > >> from
> > > > > > > > >> > >> it.
> > > > > > > > >> > >>
> > > > > > > > >> > >> It's a real shame that both of these settings
> require a
> > > > > default
> > > > > > > of
> > > > > > > > >> > BINARY
> > > > > > > > >> > >> for backwards compatibility, but I agree that
> > > discussions /
> > > > > > plans
> > > > > > > > >> around
> > > > > > > > >> > >> switching the defaults should not block this KIP.
> > > > > > > > >> > >>
> > > > > > > > >> > >> Andy
> > > > > > > > >> > >>
> > > > > > > > >> > >>
> > > > > > > > >> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <
> > > > > almog@confluent.io>
> > > > > > > > >> wrote:
> > > > > > > > >> > >>
> > > > > > > > >> > >> > Thanks for the replies Andy and Andrew (2x Andy?)!
> > > > > > > > >> > >> >
> > > > > > > > >> > >> > > Is the text decimal a base16 encoded number, or
> is
> > it
> > > > > > base16
> > > > > > > > >> encoded
> > > > > > > > >> > >> > binary
> > > > > > > > >> > >> > > form of the number?
> > > > > > > > >> > >> >
> > > > > > > > >> > >> > The conversion happens as
> > > > > > decimal.unscaledValue().toByteArray()
> > > > > > > > and
> > > > > > > > >> > then
> > > > > > > > >> > >> > the byte array is converted to a hex string, so
> it's
> > > > > > definitely
> > > > > > > > the
> > > > > > > > >> > >> binary
> > > > > > > > >> > >> > form of the number converted to base16. Whether or
> > not
> > > > > that's
> > > > > > > the
> > > > > > > > >> same
> > > > > > > > >> > >> as
> > > > > > > > >> > >> > the base16 encoded number is a good question
> > > (toByteArray
> > > > > > > > returns a
> > > > > > > > >> > byte
> > > > > > > > >> > >> > array containing a signed, big-endian, two's
> > complement
> > > > > > > > >> representation
> > > > > > > > >> > >> of
> > > > > > > > >> > >> > the big integer).
> > > > > > > > >> > >> >
> > > > > > > > >> > >> > > One suggestion I have is to change the proposed
> new
> > > > > config
> > > > > > to
> > > > > > > > >> only
> > > > > > > > >> > >> affect
> > > > > > > > >> > >> > > decimals stored as text, i.e. to switch between
> the
> > > > > current
> > > > > > > > >> base16
> > > > > > > > >> > and
> > > > > > > > >> > >> > the
> > > > > > > > >> > >> > > more common base10.   Then add another config to
> > the
> > > > > > > serializer
> > > > > > > > >> only
> > > > > > > > >> > >> that
> > > > > > > > >> > >> > > controls if decimals should be serialized as text
> > or
> > > > > > numeric.
> > > > > > > > >> > >> >
> > > > > > > > >> > >> > I think we need to be able to handle all mappings
> > from
> > > > > > > > >> serialization
> > > > > > > > >> > >> format
> > > > > > > > >> > >> > to deserialization format (e.g. read in BINARY and
> > > output
> > > > > > > TEXT),
> > > > > > > > >> > which I
> > > > > > > > >> > >> > think would be impossible with the alternative
> > > > suggestion.
> > > > > I
> > > > > > > > agree
> > > > > > > > >> > that
> > > > > > > > >> > >> > automatically deserializing numerics is valuable. I
> > see
> > > > two
> > > > > > > other
> > > > > > > > >> ways
> > > > > > > > >> > >> to
> > > > > > > > >> > >> > get this, both keeping the serialization.format
> > config
> > > > the
> > > > > > > same:
> > > > > > > > >> > >> >
> > > > > > > > >> > >> > - have json.decimal.deserialization.format accept
> all
> > > > three
> > > > > > > > >> formats.
> > > > > > > > >> > if
> > > > > > > > >> > >> set
> > > > > > > > >> > >> > to BINARY/TEXT, numerics would be automatically
> > > > supported.
> > > > > If
> > > > > > > set
> > > > > > > > >> to
> > > > > > > > >> > >> > NUMERIC, then any string coming in would result in
> > > > > > > > deserialization
> > > > > > > > >> > error
> > > > > > > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > > > > > > >> > >> > - change json.decimal.deserialization.format to
> > > > > > > > >> > >> > json.decimal.deserialization.string.format which
> > > accepts
> > > > > only
> > > > > > > > >> > >> BINARY/TEXT
> > > > > > > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > > > > > > >> > >> >
> > > > > > > > >> > >> > > would be a breaking change in that things that
> > > > previously
> > > > > > > > failed
> > > > > > > > >> > would
> > > > > > > > >> > >> > > suddenly start deserializing.  This is a price
> I'm
> > > > > willing
> > > > > > to
> > > > > > > > >> pay.
> > > > > > > > >> > >> >
> > > > > > > > >> > >> > I agree. I'm willing to pay this price too.
> > > > > > > > >> > >> >
> > > > > > > > >> > >> > > IMHO, we should then plan to switch the default
> of
> > > > > decimal
> > > > > > > > >> > >> serialization
> > > > > > > > >> > >> > to
> > > > > > > > >> > >> > > numeric, and text serialization to base 10 in the
> > > next
> > > > > > major
> > > > > > > > >> > release.
> > > > > > > > >> > >> >
> > > > > > > > >> > >> > I think that can be a separate discussion, I don't
> > want
> > > > to
> > > > > > > block
> > > > > > > > >> this
> > > > > > > > >> > >> KIP
> > > > > > > > >> > >> > on it.
> > > > > > > > >> > >> >
> > > > > > > > >> > >> > Thoughts?
> > > > > > > > >> > >> >
> > > > > > > > >> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <
> > > > > > > otto@wikimedia.org>
> > > > > > > > >> > wrote:
> > > > > > > > >> > >> >
> > > > > > > > >> > >> > > This is a bit orthogonal, but in
> > JsonSchemaConverter
> > > I
> > > > > use
> > > > > > > > >> > >> JSONSchemas to
> > > > > > > > >> > >> > > indicate whether a JSON number should be
> > deserialized
> > > > as
> > > > > an
> > > > > > > > >> integer
> > > > > > > > >> > >> or a
> > > > > > > > >> > >> > > decimal
> > > > > > > > >> > >> > > <
> > > > > > > > >> > >> > >
> > > > > > > > >> > >> >
> > > > > > > > >> > >>
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
> > > > > > > > >> > >> > > >.
> > > > > > > > >> > >> > > Not everyone is going to have JSONSchemas
> available
> > > > when
> > > > > > > > >> converting,
> > > > > > > > >> > >> but
> > > > > > > > >> > >> > if
> > > > > > > > >> > >> > > you do, it is an easy way to support JSON numbers
> > as
> > > > > > > decimals.
> > > > > > > > >> > >> > >
> > > > > > > > >> > >> > > Carry on! :)
> > > > > > > > >> > >> > >
> > > > > > > > >> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <
> > > > > > > andy@confluent.io
> > > > > > > > >
> > > > > > > > >> > >> wrote:
> > > > > > > > >> > >> > >
> > > > > > > > >> > >> > > > Hi Almog,
> > > > > > > > >> > >> > > >
> > > > > > > > >> > >> > > > Like the KIP - I think being able to support
> > > decimals
> > > > > in
> > > > > > > JSON
> > > > > > > > >> in
> > > > > > > > >> > the
> > > > > > > > >> > >> > same
> > > > > > > > >> > >> > > > way most other systems do is a great
> improvement.
> > > > > > > > >> > >> > > >
> > > > > > > > >> > >> > > > It's not 100% clear to me from the KIP what the
> > > > current
> > > > > > > > format
> > > > > > > > >> is.
> > > > > > > > >> > >> Is
> > > > > > > > >> > >> > > the
> > > > > > > > >> > >> > > > text decimal a base16 encoded number, or is it
> > > base16
> > > > > > > encoded
> > > > > > > > >> > binary
> > > > > > > > >> > >> > form
> > > > > > > > >> > >> > > > of the number? (I've not tried to get my head
> > > around
> > > > if
> > > > > > > these
> > > > > > > > >> two
> > > > > > > > >> > >> are
> > > > > > > > >> > >> > > even
> > > > > > > > >> > >> > > > different!)
> > > > > > > > >> > >> > > >
> > > > > > > > >> > >> > > > One suggestion I have is to change the proposed
> > new
> > > > > > config
> > > > > > > to
> > > > > > > > >> only
> > > > > > > > >> > >> > affect
> > > > > > > > >> > >> > > > decimals stored as text, i.e. to switch between
> > the
> > > > > > current
> > > > > > > > >> base16
> > > > > > > > >> > >> and
> > > > > > > > >> > >> > > the
> > > > > > > > >> > >> > > > more common base10.   Then add another config
> to
> > > the
> > > > > > > > serialzier
> > > > > > > > >> > only
> > > > > > > > >> > >> > that
> > > > > > > > >> > >> > > > controls if decimals should be serialized as
> text
> > > or
> > > > > > > numeric.
> > > > > > > > >> The
> > > > > > > > >> > >> > > benefit
> > > > > > > > >> > >> > > > of this approach is it allows us to enhance the
> > > > > > > deserializer
> > > > > > > > to
> > > > > > > > >> > >> > > > automatically handle numeric decimals even
> > without
> > > > any
> > > > > > > config
> > > > > > > > >> > >> having to
> > > > > > > > >> > >> > > be
> > > > > > > > >> > >> > > > set, i.e. default config in the deserializer
> > would
> > > be
> > > > > > able
> > > > > > > to
> > > > > > > > >> > handle
> > > > > > > > >> > >> > > > numeric decimals.  Of course, this is a two
> edged
> > > > > sword:
> > > > > > > this
> > > > > > > > >> > would
> > > > > > > > >> > >> > make
> > > > > > > > >> > >> > > > the deserializer work out of the box with
> numeric
> > > > > > decimals,
> > > > > > > > >> > (yay!),
> > > > > > > > >> > >> but
> > > > > > > > >> > >> > > > would be a breaking change in that things that
> > > > > previously
> > > > > > > > >> failed
> > > > > > > > >> > >> would
> > > > > > > > >> > >> > > > suddenly start deserializing.  This is a price
> > I'm
> > > > > > willing
> > > > > > > to
> > > > > > > > >> pay.
> > > > > > > > >> > >> > > >
> > > > > > > > >> > >> > > > IMHO, we should then plan to switch the default
> > of
> > > > > > decimal
> > > > > > > > >> > >> > serialization
> > > > > > > > >> > >> > > to
> > > > > > > > >> > >> > > > numeric, and text serialization to base 10 in
> the
> > > > next
> > > > > > > major
> > > > > > > > >> > >> release.
> > > > > > > > >> > >> > > > (With upgrade notes to match). Though I know
> this
> > > is
> > > > > more
> > > > > > > > >> > >> contentious,
> > > > > > > > >> > >> > I
> > > > > > > > >> > >> > > > think it moves us forward in a much more
> standard
> > > way
> > > > > > that
> > > > > > > > the
> > > > > > > > >> > >> current
> > > > > > > > >> > >> > > > encoding of decimals.
> > > > > > > > >> > >> > > >
> > > > > > > > >> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <
> > > > > > > > almog@confluent.io>
> > > > > > > > >> > >> wrote:
> > > > > > > > >> > >> > > >
> > > > > > > > >> > >> > > > > Hi Everyone!
> > > > > > > > >> > >> > > > >
> > > > > > > > >> > >> > > > > Kicking off discussion for a new KIP:
> > > > > > > > >> > >> > > > >
> > > > > > > > >> > >> > > > >
> > > > > > > > >> > >> > > >
> > > > > > > > >> > >> > >
> > > > > > > > >> > >> >
> > > > > > > > >> > >>
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
> > > > > > > > >> > >> > > > >
> > > > > > > > >> > >> > > > > For those who are interested, I have a
> > prototype
> > > > > > > > >> implementation
> > > > > > > > >> > >> that
> > > > > > > > >> > >> > > > helped
> > > > > > > > >> > >> > > > > guide my design:
> > > > > > https://github.com/agavra/kafka/pull/1
> > > > > > > > >> > >> > > > >
> > > > > > > > >> > >> > > > > Cheers,
> > > > > > > > >> > >> > > > > Almog
> > > > > > > > >> > >> > > > >
> > > > > > > > >> > >> > > >
> > > > > > > > >> > >> > >
> > > > > > > > >> > >> >
> > > > > > > > >> > >>
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Posted by Almog Gavra <al...@confluent.io>.
Thanks for the feedback Randall! I have updated the KIP with the following
edits:

* Updated the reference from "producer" to "source" (I had missed that one!)
* Changed the config from "json.decimal.serialization.format" to
"decimal.format"
* Clarified case sensitivity
* Clarified the proposed changes to note that deserialization is not
affected by the config
* Clarified the changes in JsonConverter to handle deserialization (see my
third bullet below)
* Added a clear migration plan and simplified compatibility

Here are also some clarifications based on your comments.

* I think "json" has limited value in the configuration name. If put in a
top-level worker config, it clarifies that it only affects connectors using
the JsonConverter. I have opted for your suggestion and dropped it.
* I think "serialization" has limited value in the configuration name. If
we ever want to introduce "deserialization" configurations, there will be
asymmetry in the configuration names. I have opted for your suggestion and
dropped it.
* The JsonConverter will not "always look for numbers". The converter will
receive from the Jackson Object Mapper either a NumericNode containing a
big decimal or a BinaryNode containing a btye[]. Based on the type of this
node, it will convert the value to a BigDecimal appropriately (or any other
Connect java type based on the schema).
* "the ... JsonDeserializer are not affected" is not exactly true, but
semantically correct. See the note in the KIP about defaulting floating
points to BigDecimal to avoid precision loss.
* "The resulting application, however, may need to handle a wider range of
numeric values." Unless I misunderstand what you're saying, I don't think
this is correct. The resulting application will still receive exactly the
same Connect data object from the JsonConverter as it was before - only the
SerDe layer is affected.

Cheers,
Almog

On Sun, Aug 25, 2019 at 4:28 PM Randall Hauch <rh...@gmail.com> wrote:

> Thanks for all the work, Almog.
>
> For the most part, I think this KIP will be a great improvement, and IMO is
> almost ready to go. However, I do have a few suggestions that affect the
> wording more than the intent.
>
> First, the name of the `json.decimal.serialization.format` property is
> pretty long, especially when it is prefixed in the Worker config or in a
> connector config as `key.converer.json.decimal.serialization.format` or
> `value.converter.json.decimal.serialization.format` . Have you considered a
> shorter config property name, such as maybe `decimal.format`? Is there any
> benefit to include "json" and "serialization" in the property name? Also,
> we should be clear that the value will not be case sensitive (e.g.,
> "numeric" and "NUMERIC" would be equivalent), to keep in alignment with
> other enumeration literals in Connect configurations. The goal should be
> simple
>
> Second, the Motivation section, has the following sentence:
>
> "A new configuration for producers json.decimal.serialization.format will
> be introduced to the JsonConverter configuration to help control whether
> source converters will serialize decimals in numeric or binary formats."
>
>
> I agree with an earlier comment from Konstantine that "producers" here is
> distracting and does not mirror the normal definition of "producers" within
> the Kafka context. I suggest rephrasing this to something like
>
> "Introduce to the JsonConverter a new configuration property named
> json.decimal.serialization.format to control whether source converters will
> serialize decimals in numeric or binary formats."
>
>
> Third, the KIP should be more clear about whether the
> `json.decimal.serialization.format` setting does or does not affect
> deserialization? IIUC, the deserialization logic will always look for JSON
> numbers, and will always use the Schema to define whether it should convert
> the value to a different number type. Is that a fair statement?
>
> Fourth, the JsonSerializer and JsonDeserializer are not affected, yet are
> still compatible with the old and new behavior. Because the primary purpose
> of this new setting is to define how Connect DECIMAL logical type values
> are serialized in JSON documents, the JsonDeserializer will still be able
> to deserialize the JSON document correctly. The resulting application,
> however, may need to handle a wider range of numeric values.
>
> Fifth, the Compatibility section seems more complicated than perhaps it
> needs to be, maybe because it seems to distinguish between upgrading and
> setting the decimal serialization format. Maybe it would be sufficient to
> simply emphasize that all of the sink connectors (or consumer applications)
> using the JsonConverter with
> the `json.decimal.serialization.format=NUMERIC` setting consuming records
> from a set of topics be upgraded and changed *before* any of the source
> connectors (or other producer applications) using the JsonConverter to
> serialize records are changed to use
> the `json.decimal.serialization.format=NUMERIC` setting? It may also
> warrant giving more concrete advice on upgrade procedures. For example, how
> does a user upgrade a set of Connect workers to use this new property? Do
> they upgrade first and restart to ensure everything runs as-is, and then
> upgrade their source connectors to set
> `json.decimal.serialization.format=NUMERIC`via connector configurations or
> worker configs?
>
> Anyway, great job so far!
>
> Best regards,
>
> Randall
>
> On Thu, Aug 15, 2019 at 12:00 PM Konstantine Karantasis <
> konstantine@confluent.io> wrote:
>
> > Thanks! KIP reads even better for me now.
> > Just voted. +1 non-binding
> >
> > Konstantine
> >
> > On Wed, Aug 14, 2019 at 7:00 PM Almog Gavra <al...@confluent.io> wrote:
> >
> > > Thanks for the review Konstantine!
> > >
> > > I think the terminology suggestion definitely makes things clearer - I
> > will
> > > update the documentation based on your suggestion (e.g. s/Consumer/Sink
> > > Converter/g and s/Producer/Source Converter/g).
> > >
> > > Cheers,
> > > Almog
> > >
> > > On Wed, Aug 14, 2019 at 8:13 AM Konstantine Karantasis <
> > > konstantine@confluent.io> wrote:
> > >
> > > > Thanks Almog for preparing this KIP!
> > > > I think it will improve usability and troubleshooting with JSON data
> a
> > > lot.
> > > >
> > > > The finalized plan seems quite concrete now. I also liked that some
> > > > implementation specific implications (such as setting the
> ObjectMapper
> > to
> > > > deserialize floating point as BigDecimal) are highlighted in the KIP.
> > > >
> > > > Still, as I was reading the KIP, the main obstacle I encountered was
> > > around
> > > > terminology. I couldn't get used to reading "producer" and "consumer"
> > and
> > > > not thinking in terms of Kafka producers and consumers - which are
> not
> > > > relevant to what this KIP proposes. Thus, I'd suggest replacing
> > > > "Producer(s)" with "Source Converter(s)" and "Consumer(s)" with "Sink
> > > > Converter(s)" (even if "Converter used by Source Connector" or
> > "Converter
> > > > used by Sink Connector" would be even more accurate - maybe this
> could
> > be
> > > > an explanation in a footnote). Terminology around converters has been
> > > > tricky in the past and adding producers/consumers in the mix might
> add
> > to
> > > > the confusion.
> > > >
> > > > Another example where I'd apply this different terminology would be
> to
> > a
> > > > phrase such as the following:
> > > > "Because of this, users must take care to first ensure that all
> > consumers
> > > > have upgraded to the new code before upgrading producers to make use
> of
> > > the
> > > > NUMERIC serialization format."
> > > > which I'd write
> > > > "Because of this, users must take care to first ensure that all sink
> > > > connectors have upgraded to the new converter code before upgrading
> > > source
> > > > connectors to make use of the NUMERIC serialization format in
> > > > JsonConverter."
> > > >
> > > > Let me know if you think this suggestion makes the KIP easier to
> > follow.
> > > > Otherwise I think it's a solid proposal.
> > > >
> > > > I'm concluding with a couple of nits:
> > > >
> > > > - "Upgraded Producer with BASE64 serialization, Legacy Consumer: this
> > > > scenario is okay as the upgraded ~producer~ consumer will be able to
> > read
> > > > binary as today" (again according to my suggestion above, it could be
> > as
> > > > the upgraded source converter ...)
> > > >
> > > > - "consumers cannot consumer NUMERIC data. " -> "consumers cannot
> read
> > > > NUMERIC data"
> > > >
> > > > Best,
> > > > Konstantine
> > > >
> > > > On Fri, Aug 9, 2019 at 6:37 PM Almog Gavra <al...@confluent.io>
> wrote:
> > > >
> > > > > Good catches! Fixed :)
> > > > >
> > > > > On Thu, Aug 8, 2019 at 10:36 PM Arjun Satish <
> arjun.satish@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Cool!
> > > > > >
> > > > > > Couple of nits:
> > > > > >
> > > > > > - In public interfaces, typo: *json.decimal.serialization.fromat*
> > > > > > - In public interfaces, you use the term "HEX" instead of
> "BASE64".
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Aug 7, 2019 at 9:51 AM Almog Gavra <al...@confluent.io>
> > > wrote:
> > > > > >
> > > > > > > EDIT: everywhere I've been using "HEX" I meant to be using
> > > "BASE64".
> > > > I
> > > > > > will
> > > > > > > update the KIP to reflect this.
> > > > > > >
> > > > > > > On Wed, Aug 7, 2019 at 9:44 AM Almog Gavra <almog@confluent.io
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the feedback Arjun! I'm happy changing the default
> > > > config
> > > > > to
> > > > > > > > HEX instead of BINARY, no strong feelings there.
> > > > > > > >
> > > > > > > > I'll also clarify the example in the KIP to be clearer:
> > > > > > > >
> > > > > > > > - serialize the decimal field "foo" with value "10.2345" with
> > the
> > > > HEX
> > > > > > > > setting: {"foo": "D3J5"}
> > > > > > > > - serialize the decimal field "foo" with value "10.2345" with
> > the
> > > > > > NUMERIC
> > > > > > > > setting: {"foo": 10.2345}
> > > > > > > >
> > > > > > > > With regards to the precision issue, that was my original
> > concern
> > > > as
> > > > > > well
> > > > > > > > (and why I originally suggested a TEXT format). Many JSON
> > > > > deserializers
> > > > > > > > (e.g. Jackson with
> > > > > DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS),
> > > > > > > > however, have the ability to deserialize decimals correctly
> so
> > I
> > > > will
> > > > > > > > configure that as the default for Connect's JsonDeserializer.
> > > It's
> > > > > > > probably
> > > > > > > > a good idea to call out that using other deserializers must
> be
> > > done
> > > > > > with
> > > > > > > > care - I will add that documentation to the serialization
> > config.
> > > > > > > >
> > > > > > > > Note that there would not be an issue on the _serialization_
> > side
> > > > of
> > > > > > > > things as Jackson respects BigDecimal.
> > > > > > > >
> > > > > > > > Almog
> > > > > > > >
> > > > > > > > On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish <
> > > > arjun.satish@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> hey Almog, nice work! couple of thoughts (hope I'm not late
> > > since
> > > > > you
> > > > > > > >> started the voting thread already):
> > > > > > > >>
> > > > > > > >> can you please add some examples to show the changes that
> you
> > > are
> > > > > > > >> proposing. makes me think that for a given decimal number,
> we
> > > will
> > > > > > have
> > > > > > > >> two
> > > > > > > >> encodings: “asHex” and “asNumber”.
> > > > > > > >>
> > > > > > > >> should we call the default config value “HEX” instead of
> > > “BINARY”?
> > > > > > > >>
> > > > > > > >> Should we call out the fact that JS systems might be
> > susceptible
> > > > to
> > > > > > > double
> > > > > > > >> precision round offs with the new numeric format? here are
> > some
> > > > > people
> > > > > > > >> discussing a similar problem
> > > > > > > >> https://github.com/EventStore/EventStore/issues/1541
> > > > > > > >>
> > > > > > > >> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <
> > almog@confluent.io>
> > > > > > wrote:
> > > > > > > >>
> > > > > > > >> > Hello Everyone,
> > > > > > > >> >
> > > > > > > >> > Summarizing an in-person discussion with Randall (this is
> > > copied
> > > > > > from
> > > > > > > >> the
> > > > > > > >> > KIP):
> > > > > > > >> >
> > > > > > > >> > The original KIP suggested supporting an additional
> > > > > representation -
> > > > > > > >> base10
> > > > > > > >> > encoded text (e.g. `{"asText":"10.2345"}`). This causes
> > issues
> > > > > > because
> > > > > > > >> it
> > > > > > > >> > is impossible to disambiguate between TEXT and BINARY
> > without
> > > an
> > > > > > > >> additional
> > > > > > > >> > config - furthermore, this makes the migration from one to
> > the
> > > > > other
> > > > > > > >> nearly
> > > > > > > >> > impossible because it would require that all consumers
> stop
> > > > > > consuming
> > > > > > > >> and
> > > > > > > >> > producers stop producing and atomically updating the
> config
> > on
> > > > all
> > > > > > of
> > > > > > > >> them
> > > > > > > >> > after deploying the new code, or waiting for the full
> > > retention
> > > > > > period
> > > > > > > >> to
> > > > > > > >> > pass - neither option is viable. The suggestion in the KIP
> > is
> > > > > > strictly
> > > > > > > >> an
> > > > > > > >> > improvement over the existing behavior, even if it doesn't
> > > > support
> > > > > > all
> > > > > > > >> > combinations.
> > > > > > > >> >
> > > > > > > >> > It seems that since most real-world use cases actually use
> > the
> > > > > > numeric
> > > > > > > >> > representation (not string) we can consider this an
> > > improvement.
> > > > > > With
> > > > > > > >> the
> > > > > > > >> > new suggestion, we don't need a deserialization
> > configuration
> > > > > (only
> > > > > > a
> > > > > > > >> > serialization option) and the consumers will be able to
> > always
> > > > > > > >> > automatically determine the serialization format.
> > > > > > > >> >
> > > > > > > >> > Based on this, I'll be opening up the simplified version
> of
> > > the
> > > > > KIP
> > > > > > > to a
> > > > > > > >> > vote.
> > > > > > > >> >
> > > > > > > >> > Almog
> > > > > > > >> >
> > > > > > > >> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <
> > > almog@confluent.io
> > > > >
> > > > > > > wrote:
> > > > > > > >> >
> > > > > > > >> > > I'm mostly happy with your current suggestion (two
> > configs,
> > > > one
> > > > > > for
> > > > > > > >> > > serialization and one for deserialization) and your
> > > > > implementation
> > > > > > > >> > > suggestion. One thing to note:
> > > > > > > >> > >
> > > > > > > >> > > > We should _always_ be able to deserialize a standard
> > JSON
> > > > > > > >> > > > number as a decimal
> > > > > > > >> > >
> > > > > > > >> > > I was doing some research into decimals and JSON, and I
> > can
> > > > > > imagine
> > > > > > > a
> > > > > > > >> > > compelling reason to require string representations to
> > avoid
> > > > > > losing
> > > > > > > >> > > precision and to be certain that whomever is sending the
> > > data
> > > > > > isn't
> > > > > > > >> > losing
> > > > > > > >> > > precision (e.g.
> > > https://stackoverflow.com/a/38357877/2258040
> > > > ).
> > > > > > > >> > >
> > > > > > > >> > > I'm okay with always allowing numerics, but thought it's
> > > worth
> > > > > > > raising
> > > > > > > >> > the
> > > > > > > >> > > thought.
> > > > > > > >> > >
> > > > > > > >> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <
> > > > andy@confluent.io>
> > > > > > > >> wrote:
> > > > > > > >> > >
> > > > > > > >> > >> The way I see it, we need to control two seperate
> things:
> > > > > > > >> > >>
> > > > > > > >> > >> 1. How do we _deserialize_ a decimal type if we
> > encounter a
> > > > > text
> > > > > > > >> node in
> > > > > > > >> > >> the JSON?    (We should _always_ be able to
> deserialize a
> > > > > > standard
> > > > > > > >> JSON
> > > > > > > >> > >> number as a decimal).
> > > > > > > >> > >> 2. How do we chose how we want decimals to be
> > _serialized_.
> > > > > > > >> > >>
> > > > > > > >> > >> This looks to fits well with your second suggestion of
> > > > slightly
> > > > > > > >> > different
> > > > > > > >> > >> configs names for serialization vs deserialization.
> > > > > > > >> > >> a, For deserialization we only care about how to handle
> > > text
> > > > > > > nodes: `
> > > > > > > >> > >> deserialization.decimal.*text*.format`, which should
> only
> > > > have
> > > > > > two
> > > > > > > >> valid
> > > > > > > >> > >> values BINARY | TEXT.
> > > > > > > >> > >> b. For serialization we need all three:
> > > > > > > >> `serialization.decimal.format`,
> > > > > > > >> > >> which should support all three options: BINARY | TEXT |
> > > > > NUMERIC.
> > > > > > > >> > >>
> > > > > > > >> > >> Implementation wise, I think these should be two
> separate
> > > > > enums,
> > > > > > > >> rather
> > > > > > > >> > >> than one shared enum and throwing an error if the
> > > > deserializer
> > > > > is
> > > > > > > >> set to
> > > > > > > >> > >> NUMERIC.  Mainly as this means the enums reflect the
> > > options
> > > > > > > >> available,
> > > > > > > >> > >> rather than this being hidden in config checking code.
> > But
> > > > > > that's
> > > > > > > a
> > > > > > > >> > minor
> > > > > > > >> > >> implementation detail.
> > > > > > > >> > >>
> > > > > > > >> > >> Personally, I'd be tempted to have the BINARY value
> named
> > > > > > something
> > > > > > > >> like
> > > > > > > >> > >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging
> users
> > > to
> > > > > move
> > > > > > > >> away
> > > > > > > >> > >> from
> > > > > > > >> > >> it.
> > > > > > > >> > >>
> > > > > > > >> > >> It's a real shame that both of these settings require a
> > > > default
> > > > > > of
> > > > > > > >> > BINARY
> > > > > > > >> > >> for backwards compatibility, but I agree that
> > discussions /
> > > > > plans
> > > > > > > >> around
> > > > > > > >> > >> switching the defaults should not block this KIP.
> > > > > > > >> > >>
> > > > > > > >> > >> Andy
> > > > > > > >> > >>
> > > > > > > >> > >>
> > > > > > > >> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <
> > > > almog@confluent.io>
> > > > > > > >> wrote:
> > > > > > > >> > >>
> > > > > > > >> > >> > Thanks for the replies Andy and Andrew (2x Andy?)!
> > > > > > > >> > >> >
> > > > > > > >> > >> > > Is the text decimal a base16 encoded number, or is
> it
> > > > > base16
> > > > > > > >> encoded
> > > > > > > >> > >> > binary
> > > > > > > >> > >> > > form of the number?
> > > > > > > >> > >> >
> > > > > > > >> > >> > The conversion happens as
> > > > > decimal.unscaledValue().toByteArray()
> > > > > > > and
> > > > > > > >> > then
> > > > > > > >> > >> > the byte array is converted to a hex string, so it's
> > > > > definitely
> > > > > > > the
> > > > > > > >> > >> binary
> > > > > > > >> > >> > form of the number converted to base16. Whether or
> not
> > > > that's
> > > > > > the
> > > > > > > >> same
> > > > > > > >> > >> as
> > > > > > > >> > >> > the base16 encoded number is a good question
> > (toByteArray
> > > > > > > returns a
> > > > > > > >> > byte
> > > > > > > >> > >> > array containing a signed, big-endian, two's
> complement
> > > > > > > >> representation
> > > > > > > >> > >> of
> > > > > > > >> > >> > the big integer).
> > > > > > > >> > >> >
> > > > > > > >> > >> > > One suggestion I have is to change the proposed new
> > > > config
> > > > > to
> > > > > > > >> only
> > > > > > > >> > >> affect
> > > > > > > >> > >> > > decimals stored as text, i.e. to switch between the
> > > > current
> > > > > > > >> base16
> > > > > > > >> > and
> > > > > > > >> > >> > the
> > > > > > > >> > >> > > more common base10.   Then add another config to
> the
> > > > > > serializer
> > > > > > > >> only
> > > > > > > >> > >> that
> > > > > > > >> > >> > > controls if decimals should be serialized as text
> or
> > > > > numeric.
> > > > > > > >> > >> >
> > > > > > > >> > >> > I think we need to be able to handle all mappings
> from
> > > > > > > >> serialization
> > > > > > > >> > >> format
> > > > > > > >> > >> > to deserialization format (e.g. read in BINARY and
> > output
> > > > > > TEXT),
> > > > > > > >> > which I
> > > > > > > >> > >> > think would be impossible with the alternative
> > > suggestion.
> > > > I
> > > > > > > agree
> > > > > > > >> > that
> > > > > > > >> > >> > automatically deserializing numerics is valuable. I
> see
> > > two
> > > > > > other
> > > > > > > >> ways
> > > > > > > >> > >> to
> > > > > > > >> > >> > get this, both keeping the serialization.format
> config
> > > the
> > > > > > same:
> > > > > > > >> > >> >
> > > > > > > >> > >> > - have json.decimal.deserialization.format accept all
> > > three
> > > > > > > >> formats.
> > > > > > > >> > if
> > > > > > > >> > >> set
> > > > > > > >> > >> > to BINARY/TEXT, numerics would be automatically
> > > supported.
> > > > If
> > > > > > set
> > > > > > > >> to
> > > > > > > >> > >> > NUMERIC, then any string coming in would result in
> > > > > > > deserialization
> > > > > > > >> > error
> > > > > > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > > > > > >> > >> > - change json.decimal.deserialization.format to
> > > > > > > >> > >> > json.decimal.deserialization.string.format which
> > accepts
> > > > only
> > > > > > > >> > >> BINARY/TEXT
> > > > > > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > > > > > >> > >> >
> > > > > > > >> > >> > > would be a breaking change in that things that
> > > previously
> > > > > > > failed
> > > > > > > >> > would
> > > > > > > >> > >> > > suddenly start deserializing.  This is a price I'm
> > > > willing
> > > > > to
> > > > > > > >> pay.
> > > > > > > >> > >> >
> > > > > > > >> > >> > I agree. I'm willing to pay this price too.
> > > > > > > >> > >> >
> > > > > > > >> > >> > > IMHO, we should then plan to switch the default of
> > > > decimal
> > > > > > > >> > >> serialization
> > > > > > > >> > >> > to
> > > > > > > >> > >> > > numeric, and text serialization to base 10 in the
> > next
> > > > > major
> > > > > > > >> > release.
> > > > > > > >> > >> >
> > > > > > > >> > >> > I think that can be a separate discussion, I don't
> want
> > > to
> > > > > > block
> > > > > > > >> this
> > > > > > > >> > >> KIP
> > > > > > > >> > >> > on it.
> > > > > > > >> > >> >
> > > > > > > >> > >> > Thoughts?
> > > > > > > >> > >> >
> > > > > > > >> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <
> > > > > > otto@wikimedia.org>
> > > > > > > >> > wrote:
> > > > > > > >> > >> >
> > > > > > > >> > >> > > This is a bit orthogonal, but in
> JsonSchemaConverter
> > I
> > > > use
> > > > > > > >> > >> JSONSchemas to
> > > > > > > >> > >> > > indicate whether a JSON number should be
> deserialized
> > > as
> > > > an
> > > > > > > >> integer
> > > > > > > >> > >> or a
> > > > > > > >> > >> > > decimal
> > > > > > > >> > >> > > <
> > > > > > > >> > >> > >
> > > > > > > >> > >> >
> > > > > > > >> > >>
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
> > > > > > > >> > >> > > >.
> > > > > > > >> > >> > > Not everyone is going to have JSONSchemas available
> > > when
> > > > > > > >> converting,
> > > > > > > >> > >> but
> > > > > > > >> > >> > if
> > > > > > > >> > >> > > you do, it is an easy way to support JSON numbers
> as
> > > > > > decimals.
> > > > > > > >> > >> > >
> > > > > > > >> > >> > > Carry on! :)
> > > > > > > >> > >> > >
> > > > > > > >> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <
> > > > > > andy@confluent.io
> > > > > > > >
> > > > > > > >> > >> wrote:
> > > > > > > >> > >> > >
> > > > > > > >> > >> > > > Hi Almog,
> > > > > > > >> > >> > > >
> > > > > > > >> > >> > > > Like the KIP - I think being able to support
> > decimals
> > > > in
> > > > > > JSON
> > > > > > > >> in
> > > > > > > >> > the
> > > > > > > >> > >> > same
> > > > > > > >> > >> > > > way most other systems do is a great improvement.
> > > > > > > >> > >> > > >
> > > > > > > >> > >> > > > It's not 100% clear to me from the KIP what the
> > > current
> > > > > > > format
> > > > > > > >> is.
> > > > > > > >> > >> Is
> > > > > > > >> > >> > > the
> > > > > > > >> > >> > > > text decimal a base16 encoded number, or is it
> > base16
> > > > > > encoded
> > > > > > > >> > binary
> > > > > > > >> > >> > form
> > > > > > > >> > >> > > > of the number? (I've not tried to get my head
> > around
> > > if
> > > > > > these
> > > > > > > >> two
> > > > > > > >> > >> are
> > > > > > > >> > >> > > even
> > > > > > > >> > >> > > > different!)
> > > > > > > >> > >> > > >
> > > > > > > >> > >> > > > One suggestion I have is to change the proposed
> new
> > > > > config
> > > > > > to
> > > > > > > >> only
> > > > > > > >> > >> > affect
> > > > > > > >> > >> > > > decimals stored as text, i.e. to switch between
> the
> > > > > current
> > > > > > > >> base16
> > > > > > > >> > >> and
> > > > > > > >> > >> > > the
> > > > > > > >> > >> > > > more common base10.   Then add another config to
> > the
> > > > > > > serialzier
> > > > > > > >> > only
> > > > > > > >> > >> > that
> > > > > > > >> > >> > > > controls if decimals should be serialized as text
> > or
> > > > > > numeric.
> > > > > > > >> The
> > > > > > > >> > >> > > benefit
> > > > > > > >> > >> > > > of this approach is it allows us to enhance the
> > > > > > deserializer
> > > > > > > to
> > > > > > > >> > >> > > > automatically handle numeric decimals even
> without
> > > any
> > > > > > config
> > > > > > > >> > >> having to
> > > > > > > >> > >> > > be
> > > > > > > >> > >> > > > set, i.e. default config in the deserializer
> would
> > be
> > > > > able
> > > > > > to
> > > > > > > >> > handle
> > > > > > > >> > >> > > > numeric decimals.  Of course, this is a two edged
> > > > sword:
> > > > > > this
> > > > > > > >> > would
> > > > > > > >> > >> > make
> > > > > > > >> > >> > > > the deserializer work out of the box with numeric
> > > > > decimals,
> > > > > > > >> > (yay!),
> > > > > > > >> > >> but
> > > > > > > >> > >> > > > would be a breaking change in that things that
> > > > previously
> > > > > > > >> failed
> > > > > > > >> > >> would
> > > > > > > >> > >> > > > suddenly start deserializing.  This is a price
> I'm
> > > > > willing
> > > > > > to
> > > > > > > >> pay.
> > > > > > > >> > >> > > >
> > > > > > > >> > >> > > > IMHO, we should then plan to switch the default
> of
> > > > > decimal
> > > > > > > >> > >> > serialization
> > > > > > > >> > >> > > to
> > > > > > > >> > >> > > > numeric, and text serialization to base 10 in the
> > > next
> > > > > > major
> > > > > > > >> > >> release.
> > > > > > > >> > >> > > > (With upgrade notes to match). Though I know this
> > is
> > > > more
> > > > > > > >> > >> contentious,
> > > > > > > >> > >> > I
> > > > > > > >> > >> > > > think it moves us forward in a much more standard
> > way
> > > > > that
> > > > > > > the
> > > > > > > >> > >> current
> > > > > > > >> > >> > > > encoding of decimals.
> > > > > > > >> > >> > > >
> > > > > > > >> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <
> > > > > > > almog@confluent.io>
> > > > > > > >> > >> wrote:
> > > > > > > >> > >> > > >
> > > > > > > >> > >> > > > > Hi Everyone!
> > > > > > > >> > >> > > > >
> > > > > > > >> > >> > > > > Kicking off discussion for a new KIP:
> > > > > > > >> > >> > > > >
> > > > > > > >> > >> > > > >
> > > > > > > >> > >> > > >
> > > > > > > >> > >> > >
> > > > > > > >> > >> >
> > > > > > > >> > >>
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
> > > > > > > >> > >> > > > >
> > > > > > > >> > >> > > > > For those who are interested, I have a
> prototype
> > > > > > > >> implementation
> > > > > > > >> > >> that
> > > > > > > >> > >> > > > helped
> > > > > > > >> > >> > > > > guide my design:
> > > > > https://github.com/agavra/kafka/pull/1
> > > > > > > >> > >> > > > >
> > > > > > > >> > >> > > > > Cheers,
> > > > > > > >> > >> > > > > Almog
> > > > > > > >> > >> > > > >
> > > > > > > >> > >> > > >
> > > > > > > >> > >> > >
> > > > > > > >> > >> >
> > > > > > > >> > >>
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Posted by Randall Hauch <rh...@gmail.com>.
Thanks for all the work, Almog.

For the most part, I think this KIP will be a great improvement, and IMO is
almost ready to go. However, I do have a few suggestions that affect the
wording more than the intent.

First, the name of the `json.decimal.serialization.format` property is
pretty long, especially when it is prefixed in the Worker config or in a
connector config as `key.converer.json.decimal.serialization.format` or
`value.converter.json.decimal.serialization.format` . Have you considered a
shorter config property name, such as maybe `decimal.format`? Is there any
benefit to include "json" and "serialization" in the property name? Also,
we should be clear that the value will not be case sensitive (e.g.,
"numeric" and "NUMERIC" would be equivalent), to keep in alignment with
other enumeration literals in Connect configurations. The goal should be
simple

Second, the Motivation section, has the following sentence:

"A new configuration for producers json.decimal.serialization.format will
be introduced to the JsonConverter configuration to help control whether
source converters will serialize decimals in numeric or binary formats."


I agree with an earlier comment from Konstantine that "producers" here is
distracting and does not mirror the normal definition of "producers" within
the Kafka context. I suggest rephrasing this to something like

"Introduce to the JsonConverter a new configuration property named
json.decimal.serialization.format to control whether source converters will
serialize decimals in numeric or binary formats."


Third, the KIP should be more clear about whether the
`json.decimal.serialization.format` setting does or does not affect
deserialization? IIUC, the deserialization logic will always look for JSON
numbers, and will always use the Schema to define whether it should convert
the value to a different number type. Is that a fair statement?

Fourth, the JsonSerializer and JsonDeserializer are not affected, yet are
still compatible with the old and new behavior. Because the primary purpose
of this new setting is to define how Connect DECIMAL logical type values
are serialized in JSON documents, the JsonDeserializer will still be able
to deserialize the JSON document correctly. The resulting application,
however, may need to handle a wider range of numeric values.

Fifth, the Compatibility section seems more complicated than perhaps it
needs to be, maybe because it seems to distinguish between upgrading and
setting the decimal serialization format. Maybe it would be sufficient to
simply emphasize that all of the sink connectors (or consumer applications)
using the JsonConverter with
the `json.decimal.serialization.format=NUMERIC` setting consuming records
from a set of topics be upgraded and changed *before* any of the source
connectors (or other producer applications) using the JsonConverter to
serialize records are changed to use
the `json.decimal.serialization.format=NUMERIC` setting? It may also
warrant giving more concrete advice on upgrade procedures. For example, how
does a user upgrade a set of Connect workers to use this new property? Do
they upgrade first and restart to ensure everything runs as-is, and then
upgrade their source connectors to set
`json.decimal.serialization.format=NUMERIC`via connector configurations or
worker configs?

Anyway, great job so far!

Best regards,

Randall

On Thu, Aug 15, 2019 at 12:00 PM Konstantine Karantasis <
konstantine@confluent.io> wrote:

> Thanks! KIP reads even better for me now.
> Just voted. +1 non-binding
>
> Konstantine
>
> On Wed, Aug 14, 2019 at 7:00 PM Almog Gavra <al...@confluent.io> wrote:
>
> > Thanks for the review Konstantine!
> >
> > I think the terminology suggestion definitely makes things clearer - I
> will
> > update the documentation based on your suggestion (e.g. s/Consumer/Sink
> > Converter/g and s/Producer/Source Converter/g).
> >
> > Cheers,
> > Almog
> >
> > On Wed, Aug 14, 2019 at 8:13 AM Konstantine Karantasis <
> > konstantine@confluent.io> wrote:
> >
> > > Thanks Almog for preparing this KIP!
> > > I think it will improve usability and troubleshooting with JSON data a
> > lot.
> > >
> > > The finalized plan seems quite concrete now. I also liked that some
> > > implementation specific implications (such as setting the ObjectMapper
> to
> > > deserialize floating point as BigDecimal) are highlighted in the KIP.
> > >
> > > Still, as I was reading the KIP, the main obstacle I encountered was
> > around
> > > terminology. I couldn't get used to reading "producer" and "consumer"
> and
> > > not thinking in terms of Kafka producers and consumers - which are not
> > > relevant to what this KIP proposes. Thus, I'd suggest replacing
> > > "Producer(s)" with "Source Converter(s)" and "Consumer(s)" with "Sink
> > > Converter(s)" (even if "Converter used by Source Connector" or
> "Converter
> > > used by Sink Connector" would be even more accurate - maybe this could
> be
> > > an explanation in a footnote). Terminology around converters has been
> > > tricky in the past and adding producers/consumers in the mix might add
> to
> > > the confusion.
> > >
> > > Another example where I'd apply this different terminology would be to
> a
> > > phrase such as the following:
> > > "Because of this, users must take care to first ensure that all
> consumers
> > > have upgraded to the new code before upgrading producers to make use of
> > the
> > > NUMERIC serialization format."
> > > which I'd write
> > > "Because of this, users must take care to first ensure that all sink
> > > connectors have upgraded to the new converter code before upgrading
> > source
> > > connectors to make use of the NUMERIC serialization format in
> > > JsonConverter."
> > >
> > > Let me know if you think this suggestion makes the KIP easier to
> follow.
> > > Otherwise I think it's a solid proposal.
> > >
> > > I'm concluding with a couple of nits:
> > >
> > > - "Upgraded Producer with BASE64 serialization, Legacy Consumer: this
> > > scenario is okay as the upgraded ~producer~ consumer will be able to
> read
> > > binary as today" (again according to my suggestion above, it could be
> as
> > > the upgraded source converter ...)
> > >
> > > - "consumers cannot consumer NUMERIC data. " -> "consumers cannot read
> > > NUMERIC data"
> > >
> > > Best,
> > > Konstantine
> > >
> > > On Fri, Aug 9, 2019 at 6:37 PM Almog Gavra <al...@confluent.io> wrote:
> > >
> > > > Good catches! Fixed :)
> > > >
> > > > On Thu, Aug 8, 2019 at 10:36 PM Arjun Satish <arjun.satish@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Cool!
> > > > >
> > > > > Couple of nits:
> > > > >
> > > > > - In public interfaces, typo: *json.decimal.serialization.fromat*
> > > > > - In public interfaces, you use the term "HEX" instead of "BASE64".
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Aug 7, 2019 at 9:51 AM Almog Gavra <al...@confluent.io>
> > wrote:
> > > > >
> > > > > > EDIT: everywhere I've been using "HEX" I meant to be using
> > "BASE64".
> > > I
> > > > > will
> > > > > > update the KIP to reflect this.
> > > > > >
> > > > > > On Wed, Aug 7, 2019 at 9:44 AM Almog Gavra <al...@confluent.io>
> > > wrote:
> > > > > >
> > > > > > > Thanks for the feedback Arjun! I'm happy changing the default
> > > config
> > > > to
> > > > > > > HEX instead of BINARY, no strong feelings there.
> > > > > > >
> > > > > > > I'll also clarify the example in the KIP to be clearer:
> > > > > > >
> > > > > > > - serialize the decimal field "foo" with value "10.2345" with
> the
> > > HEX
> > > > > > > setting: {"foo": "D3J5"}
> > > > > > > - serialize the decimal field "foo" with value "10.2345" with
> the
> > > > > NUMERIC
> > > > > > > setting: {"foo": 10.2345}
> > > > > > >
> > > > > > > With regards to the precision issue, that was my original
> concern
> > > as
> > > > > well
> > > > > > > (and why I originally suggested a TEXT format). Many JSON
> > > > deserializers
> > > > > > > (e.g. Jackson with
> > > > DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS),
> > > > > > > however, have the ability to deserialize decimals correctly so
> I
> > > will
> > > > > > > configure that as the default for Connect's JsonDeserializer.
> > It's
> > > > > > probably
> > > > > > > a good idea to call out that using other deserializers must be
> > done
> > > > > with
> > > > > > > care - I will add that documentation to the serialization
> config.
> > > > > > >
> > > > > > > Note that there would not be an issue on the _serialization_
> side
> > > of
> > > > > > > things as Jackson respects BigDecimal.
> > > > > > >
> > > > > > > Almog
> > > > > > >
> > > > > > > On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish <
> > > arjun.satish@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> hey Almog, nice work! couple of thoughts (hope I'm not late
> > since
> > > > you
> > > > > > >> started the voting thread already):
> > > > > > >>
> > > > > > >> can you please add some examples to show the changes that you
> > are
> > > > > > >> proposing. makes me think that for a given decimal number, we
> > will
> > > > > have
> > > > > > >> two
> > > > > > >> encodings: “asHex” and “asNumber”.
> > > > > > >>
> > > > > > >> should we call the default config value “HEX” instead of
> > “BINARY”?
> > > > > > >>
> > > > > > >> Should we call out the fact that JS systems might be
> susceptible
> > > to
> > > > > > double
> > > > > > >> precision round offs with the new numeric format? here are
> some
> > > > people
> > > > > > >> discussing a similar problem
> > > > > > >> https://github.com/EventStore/EventStore/issues/1541
> > > > > > >>
> > > > > > >> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <
> almog@confluent.io>
> > > > > wrote:
> > > > > > >>
> > > > > > >> > Hello Everyone,
> > > > > > >> >
> > > > > > >> > Summarizing an in-person discussion with Randall (this is
> > copied
> > > > > from
> > > > > > >> the
> > > > > > >> > KIP):
> > > > > > >> >
> > > > > > >> > The original KIP suggested supporting an additional
> > > > representation -
> > > > > > >> base10
> > > > > > >> > encoded text (e.g. `{"asText":"10.2345"}`). This causes
> issues
> > > > > because
> > > > > > >> it
> > > > > > >> > is impossible to disambiguate between TEXT and BINARY
> without
> > an
> > > > > > >> additional
> > > > > > >> > config - furthermore, this makes the migration from one to
> the
> > > > other
> > > > > > >> nearly
> > > > > > >> > impossible because it would require that all consumers stop
> > > > > consuming
> > > > > > >> and
> > > > > > >> > producers stop producing and atomically updating the config
> on
> > > all
> > > > > of
> > > > > > >> them
> > > > > > >> > after deploying the new code, or waiting for the full
> > retention
> > > > > period
> > > > > > >> to
> > > > > > >> > pass - neither option is viable. The suggestion in the KIP
> is
> > > > > strictly
> > > > > > >> an
> > > > > > >> > improvement over the existing behavior, even if it doesn't
> > > support
> > > > > all
> > > > > > >> > combinations.
> > > > > > >> >
> > > > > > >> > It seems that since most real-world use cases actually use
> the
> > > > > numeric
> > > > > > >> > representation (not string) we can consider this an
> > improvement.
> > > > > With
> > > > > > >> the
> > > > > > >> > new suggestion, we don't need a deserialization
> configuration
> > > > (only
> > > > > a
> > > > > > >> > serialization option) and the consumers will be able to
> always
> > > > > > >> > automatically determine the serialization format.
> > > > > > >> >
> > > > > > >> > Based on this, I'll be opening up the simplified version of
> > the
> > > > KIP
> > > > > > to a
> > > > > > >> > vote.
> > > > > > >> >
> > > > > > >> > Almog
> > > > > > >> >
> > > > > > >> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <
> > almog@confluent.io
> > > >
> > > > > > wrote:
> > > > > > >> >
> > > > > > >> > > I'm mostly happy with your current suggestion (two
> configs,
> > > one
> > > > > for
> > > > > > >> > > serialization and one for deserialization) and your
> > > > implementation
> > > > > > >> > > suggestion. One thing to note:
> > > > > > >> > >
> > > > > > >> > > > We should _always_ be able to deserialize a standard
> JSON
> > > > > > >> > > > number as a decimal
> > > > > > >> > >
> > > > > > >> > > I was doing some research into decimals and JSON, and I
> can
> > > > > imagine
> > > > > > a
> > > > > > >> > > compelling reason to require string representations to
> avoid
> > > > > losing
> > > > > > >> > > precision and to be certain that whomever is sending the
> > data
> > > > > isn't
> > > > > > >> > losing
> > > > > > >> > > precision (e.g.
> > https://stackoverflow.com/a/38357877/2258040
> > > ).
> > > > > > >> > >
> > > > > > >> > > I'm okay with always allowing numerics, but thought it's
> > worth
> > > > > > raising
> > > > > > >> > the
> > > > > > >> > > thought.
> > > > > > >> > >
> > > > > > >> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <
> > > andy@confluent.io>
> > > > > > >> wrote:
> > > > > > >> > >
> > > > > > >> > >> The way I see it, we need to control two seperate things:
> > > > > > >> > >>
> > > > > > >> > >> 1. How do we _deserialize_ a decimal type if we
> encounter a
> > > > text
> > > > > > >> node in
> > > > > > >> > >> the JSON?    (We should _always_ be able to deserialize a
> > > > > standard
> > > > > > >> JSON
> > > > > > >> > >> number as a decimal).
> > > > > > >> > >> 2. How do we chose how we want decimals to be
> _serialized_.
> > > > > > >> > >>
> > > > > > >> > >> This looks to fits well with your second suggestion of
> > > slightly
> > > > > > >> > different
> > > > > > >> > >> configs names for serialization vs deserialization.
> > > > > > >> > >> a, For deserialization we only care about how to handle
> > text
> > > > > > nodes: `
> > > > > > >> > >> deserialization.decimal.*text*.format`, which should only
> > > have
> > > > > two
> > > > > > >> valid
> > > > > > >> > >> values BINARY | TEXT.
> > > > > > >> > >> b. For serialization we need all three:
> > > > > > >> `serialization.decimal.format`,
> > > > > > >> > >> which should support all three options: BINARY | TEXT |
> > > > NUMERIC.
> > > > > > >> > >>
> > > > > > >> > >> Implementation wise, I think these should be two separate
> > > > enums,
> > > > > > >> rather
> > > > > > >> > >> than one shared enum and throwing an error if the
> > > deserializer
> > > > is
> > > > > > >> set to
> > > > > > >> > >> NUMERIC.  Mainly as this means the enums reflect the
> > options
> > > > > > >> available,
> > > > > > >> > >> rather than this being hidden in config checking code.
> But
> > > > > that's
> > > > > > a
> > > > > > >> > minor
> > > > > > >> > >> implementation detail.
> > > > > > >> > >>
> > > > > > >> > >> Personally, I'd be tempted to have the BINARY value named
> > > > > something
> > > > > > >> like
> > > > > > >> > >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging users
> > to
> > > > move
> > > > > > >> away
> > > > > > >> > >> from
> > > > > > >> > >> it.
> > > > > > >> > >>
> > > > > > >> > >> It's a real shame that both of these settings require a
> > > default
> > > > > of
> > > > > > >> > BINARY
> > > > > > >> > >> for backwards compatibility, but I agree that
> discussions /
> > > > plans
> > > > > > >> around
> > > > > > >> > >> switching the defaults should not block this KIP.
> > > > > > >> > >>
> > > > > > >> > >> Andy
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <
> > > almog@confluent.io>
> > > > > > >> wrote:
> > > > > > >> > >>
> > > > > > >> > >> > Thanks for the replies Andy and Andrew (2x Andy?)!
> > > > > > >> > >> >
> > > > > > >> > >> > > Is the text decimal a base16 encoded number, or is it
> > > > base16
> > > > > > >> encoded
> > > > > > >> > >> > binary
> > > > > > >> > >> > > form of the number?
> > > > > > >> > >> >
> > > > > > >> > >> > The conversion happens as
> > > > decimal.unscaledValue().toByteArray()
> > > > > > and
> > > > > > >> > then
> > > > > > >> > >> > the byte array is converted to a hex string, so it's
> > > > definitely
> > > > > > the
> > > > > > >> > >> binary
> > > > > > >> > >> > form of the number converted to base16. Whether or not
> > > that's
> > > > > the
> > > > > > >> same
> > > > > > >> > >> as
> > > > > > >> > >> > the base16 encoded number is a good question
> (toByteArray
> > > > > > returns a
> > > > > > >> > byte
> > > > > > >> > >> > array containing a signed, big-endian, two's complement
> > > > > > >> representation
> > > > > > >> > >> of
> > > > > > >> > >> > the big integer).
> > > > > > >> > >> >
> > > > > > >> > >> > > One suggestion I have is to change the proposed new
> > > config
> > > > to
> > > > > > >> only
> > > > > > >> > >> affect
> > > > > > >> > >> > > decimals stored as text, i.e. to switch between the
> > > current
> > > > > > >> base16
> > > > > > >> > and
> > > > > > >> > >> > the
> > > > > > >> > >> > > more common base10.   Then add another config to the
> > > > > serializer
> > > > > > >> only
> > > > > > >> > >> that
> > > > > > >> > >> > > controls if decimals should be serialized as text or
> > > > numeric.
> > > > > > >> > >> >
> > > > > > >> > >> > I think we need to be able to handle all mappings from
> > > > > > >> serialization
> > > > > > >> > >> format
> > > > > > >> > >> > to deserialization format (e.g. read in BINARY and
> output
> > > > > TEXT),
> > > > > > >> > which I
> > > > > > >> > >> > think would be impossible with the alternative
> > suggestion.
> > > I
> > > > > > agree
> > > > > > >> > that
> > > > > > >> > >> > automatically deserializing numerics is valuable. I see
> > two
> > > > > other
> > > > > > >> ways
> > > > > > >> > >> to
> > > > > > >> > >> > get this, both keeping the serialization.format config
> > the
> > > > > same:
> > > > > > >> > >> >
> > > > > > >> > >> > - have json.decimal.deserialization.format accept all
> > three
> > > > > > >> formats.
> > > > > > >> > if
> > > > > > >> > >> set
> > > > > > >> > >> > to BINARY/TEXT, numerics would be automatically
> > supported.
> > > If
> > > > > set
> > > > > > >> to
> > > > > > >> > >> > NUMERIC, then any string coming in would result in
> > > > > > deserialization
> > > > > > >> > error
> > > > > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > > > > >> > >> > - change json.decimal.deserialization.format to
> > > > > > >> > >> > json.decimal.deserialization.string.format which
> accepts
> > > only
> > > > > > >> > >> BINARY/TEXT
> > > > > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > > > > >> > >> >
> > > > > > >> > >> > > would be a breaking change in that things that
> > previously
> > > > > > failed
> > > > > > >> > would
> > > > > > >> > >> > > suddenly start deserializing.  This is a price I'm
> > > willing
> > > > to
> > > > > > >> pay.
> > > > > > >> > >> >
> > > > > > >> > >> > I agree. I'm willing to pay this price too.
> > > > > > >> > >> >
> > > > > > >> > >> > > IMHO, we should then plan to switch the default of
> > > decimal
> > > > > > >> > >> serialization
> > > > > > >> > >> > to
> > > > > > >> > >> > > numeric, and text serialization to base 10 in the
> next
> > > > major
> > > > > > >> > release.
> > > > > > >> > >> >
> > > > > > >> > >> > I think that can be a separate discussion, I don't want
> > to
> > > > > block
> > > > > > >> this
> > > > > > >> > >> KIP
> > > > > > >> > >> > on it.
> > > > > > >> > >> >
> > > > > > >> > >> > Thoughts?
> > > > > > >> > >> >
> > > > > > >> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <
> > > > > otto@wikimedia.org>
> > > > > > >> > wrote:
> > > > > > >> > >> >
> > > > > > >> > >> > > This is a bit orthogonal, but in JsonSchemaConverter
> I
> > > use
> > > > > > >> > >> JSONSchemas to
> > > > > > >> > >> > > indicate whether a JSON number should be deserialized
> > as
> > > an
> > > > > > >> integer
> > > > > > >> > >> or a
> > > > > > >> > >> > > decimal
> > > > > > >> > >> > > <
> > > > > > >> > >> > >
> > > > > > >> > >> >
> > > > > > >> > >>
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
> > > > > > >> > >> > > >.
> > > > > > >> > >> > > Not everyone is going to have JSONSchemas available
> > when
> > > > > > >> converting,
> > > > > > >> > >> but
> > > > > > >> > >> > if
> > > > > > >> > >> > > you do, it is an easy way to support JSON numbers as
> > > > > decimals.
> > > > > > >> > >> > >
> > > > > > >> > >> > > Carry on! :)
> > > > > > >> > >> > >
> > > > > > >> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <
> > > > > andy@confluent.io
> > > > > > >
> > > > > > >> > >> wrote:
> > > > > > >> > >> > >
> > > > > > >> > >> > > > Hi Almog,
> > > > > > >> > >> > > >
> > > > > > >> > >> > > > Like the KIP - I think being able to support
> decimals
> > > in
> > > > > JSON
> > > > > > >> in
> > > > > > >> > the
> > > > > > >> > >> > same
> > > > > > >> > >> > > > way most other systems do is a great improvement.
> > > > > > >> > >> > > >
> > > > > > >> > >> > > > It's not 100% clear to me from the KIP what the
> > current
> > > > > > format
> > > > > > >> is.
> > > > > > >> > >> Is
> > > > > > >> > >> > > the
> > > > > > >> > >> > > > text decimal a base16 encoded number, or is it
> base16
> > > > > encoded
> > > > > > >> > binary
> > > > > > >> > >> > form
> > > > > > >> > >> > > > of the number? (I've not tried to get my head
> around
> > if
> > > > > these
> > > > > > >> two
> > > > > > >> > >> are
> > > > > > >> > >> > > even
> > > > > > >> > >> > > > different!)
> > > > > > >> > >> > > >
> > > > > > >> > >> > > > One suggestion I have is to change the proposed new
> > > > config
> > > > > to
> > > > > > >> only
> > > > > > >> > >> > affect
> > > > > > >> > >> > > > decimals stored as text, i.e. to switch between the
> > > > current
> > > > > > >> base16
> > > > > > >> > >> and
> > > > > > >> > >> > > the
> > > > > > >> > >> > > > more common base10.   Then add another config to
> the
> > > > > > serialzier
> > > > > > >> > only
> > > > > > >> > >> > that
> > > > > > >> > >> > > > controls if decimals should be serialized as text
> or
> > > > > numeric.
> > > > > > >> The
> > > > > > >> > >> > > benefit
> > > > > > >> > >> > > > of this approach is it allows us to enhance the
> > > > > deserializer
> > > > > > to
> > > > > > >> > >> > > > automatically handle numeric decimals even without
> > any
> > > > > config
> > > > > > >> > >> having to
> > > > > > >> > >> > > be
> > > > > > >> > >> > > > set, i.e. default config in the deserializer would
> be
> > > > able
> > > > > to
> > > > > > >> > handle
> > > > > > >> > >> > > > numeric decimals.  Of course, this is a two edged
> > > sword:
> > > > > this
> > > > > > >> > would
> > > > > > >> > >> > make
> > > > > > >> > >> > > > the deserializer work out of the box with numeric
> > > > decimals,
> > > > > > >> > (yay!),
> > > > > > >> > >> but
> > > > > > >> > >> > > > would be a breaking change in that things that
> > > previously
> > > > > > >> failed
> > > > > > >> > >> would
> > > > > > >> > >> > > > suddenly start deserializing.  This is a price I'm
> > > > willing
> > > > > to
> > > > > > >> pay.
> > > > > > >> > >> > > >
> > > > > > >> > >> > > > IMHO, we should then plan to switch the default of
> > > > decimal
> > > > > > >> > >> > serialization
> > > > > > >> > >> > > to
> > > > > > >> > >> > > > numeric, and text serialization to base 10 in the
> > next
> > > > > major
> > > > > > >> > >> release.
> > > > > > >> > >> > > > (With upgrade notes to match). Though I know this
> is
> > > more
> > > > > > >> > >> contentious,
> > > > > > >> > >> > I
> > > > > > >> > >> > > > think it moves us forward in a much more standard
> way
> > > > that
> > > > > > the
> > > > > > >> > >> current
> > > > > > >> > >> > > > encoding of decimals.
> > > > > > >> > >> > > >
> > > > > > >> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <
> > > > > > almog@confluent.io>
> > > > > > >> > >> wrote:
> > > > > > >> > >> > > >
> > > > > > >> > >> > > > > Hi Everyone!
> > > > > > >> > >> > > > >
> > > > > > >> > >> > > > > Kicking off discussion for a new KIP:
> > > > > > >> > >> > > > >
> > > > > > >> > >> > > > >
> > > > > > >> > >> > > >
> > > > > > >> > >> > >
> > > > > > >> > >> >
> > > > > > >> > >>
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
> > > > > > >> > >> > > > >
> > > > > > >> > >> > > > > For those who are interested, I have a prototype
> > > > > > >> implementation
> > > > > > >> > >> that
> > > > > > >> > >> > > > helped
> > > > > > >> > >> > > > > guide my design:
> > > > https://github.com/agavra/kafka/pull/1
> > > > > > >> > >> > > > >
> > > > > > >> > >> > > > > Cheers,
> > > > > > >> > >> > > > > Almog
> > > > > > >> > >> > > > >
> > > > > > >> > >> > > >
> > > > > > >> > >> > >
> > > > > > >> > >> >
> > > > > > >> > >>
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Posted by Konstantine Karantasis <ko...@confluent.io>.
Thanks! KIP reads even better for me now.
Just voted. +1 non-binding

Konstantine

On Wed, Aug 14, 2019 at 7:00 PM Almog Gavra <al...@confluent.io> wrote:

> Thanks for the review Konstantine!
>
> I think the terminology suggestion definitely makes things clearer - I will
> update the documentation based on your suggestion (e.g. s/Consumer/Sink
> Converter/g and s/Producer/Source Converter/g).
>
> Cheers,
> Almog
>
> On Wed, Aug 14, 2019 at 8:13 AM Konstantine Karantasis <
> konstantine@confluent.io> wrote:
>
> > Thanks Almog for preparing this KIP!
> > I think it will improve usability and troubleshooting with JSON data a
> lot.
> >
> > The finalized plan seems quite concrete now. I also liked that some
> > implementation specific implications (such as setting the ObjectMapper to
> > deserialize floating point as BigDecimal) are highlighted in the KIP.
> >
> > Still, as I was reading the KIP, the main obstacle I encountered was
> around
> > terminology. I couldn't get used to reading "producer" and "consumer" and
> > not thinking in terms of Kafka producers and consumers - which are not
> > relevant to what this KIP proposes. Thus, I'd suggest replacing
> > "Producer(s)" with "Source Converter(s)" and "Consumer(s)" with "Sink
> > Converter(s)" (even if "Converter used by Source Connector" or "Converter
> > used by Sink Connector" would be even more accurate - maybe this could be
> > an explanation in a footnote). Terminology around converters has been
> > tricky in the past and adding producers/consumers in the mix might add to
> > the confusion.
> >
> > Another example where I'd apply this different terminology would be to a
> > phrase such as the following:
> > "Because of this, users must take care to first ensure that all consumers
> > have upgraded to the new code before upgrading producers to make use of
> the
> > NUMERIC serialization format."
> > which I'd write
> > "Because of this, users must take care to first ensure that all sink
> > connectors have upgraded to the new converter code before upgrading
> source
> > connectors to make use of the NUMERIC serialization format in
> > JsonConverter."
> >
> > Let me know if you think this suggestion makes the KIP easier to follow.
> > Otherwise I think it's a solid proposal.
> >
> > I'm concluding with a couple of nits:
> >
> > - "Upgraded Producer with BASE64 serialization, Legacy Consumer: this
> > scenario is okay as the upgraded ~producer~ consumer will be able to read
> > binary as today" (again according to my suggestion above, it could be as
> > the upgraded source converter ...)
> >
> > - "consumers cannot consumer NUMERIC data. " -> "consumers cannot read
> > NUMERIC data"
> >
> > Best,
> > Konstantine
> >
> > On Fri, Aug 9, 2019 at 6:37 PM Almog Gavra <al...@confluent.io> wrote:
> >
> > > Good catches! Fixed :)
> > >
> > > On Thu, Aug 8, 2019 at 10:36 PM Arjun Satish <ar...@gmail.com>
> > > wrote:
> > >
> > > > Cool!
> > > >
> > > > Couple of nits:
> > > >
> > > > - In public interfaces, typo: *json.decimal.serialization.fromat*
> > > > - In public interfaces, you use the term "HEX" instead of "BASE64".
> > > >
> > > >
> > > >
> > > > On Wed, Aug 7, 2019 at 9:51 AM Almog Gavra <al...@confluent.io>
> wrote:
> > > >
> > > > > EDIT: everywhere I've been using "HEX" I meant to be using
> "BASE64".
> > I
> > > > will
> > > > > update the KIP to reflect this.
> > > > >
> > > > > On Wed, Aug 7, 2019 at 9:44 AM Almog Gavra <al...@confluent.io>
> > wrote:
> > > > >
> > > > > > Thanks for the feedback Arjun! I'm happy changing the default
> > config
> > > to
> > > > > > HEX instead of BINARY, no strong feelings there.
> > > > > >
> > > > > > I'll also clarify the example in the KIP to be clearer:
> > > > > >
> > > > > > - serialize the decimal field "foo" with value "10.2345" with the
> > HEX
> > > > > > setting: {"foo": "D3J5"}
> > > > > > - serialize the decimal field "foo" with value "10.2345" with the
> > > > NUMERIC
> > > > > > setting: {"foo": 10.2345}
> > > > > >
> > > > > > With regards to the precision issue, that was my original concern
> > as
> > > > well
> > > > > > (and why I originally suggested a TEXT format). Many JSON
> > > deserializers
> > > > > > (e.g. Jackson with
> > > DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS),
> > > > > > however, have the ability to deserialize decimals correctly so I
> > will
> > > > > > configure that as the default for Connect's JsonDeserializer.
> It's
> > > > > probably
> > > > > > a good idea to call out that using other deserializers must be
> done
> > > > with
> > > > > > care - I will add that documentation to the serialization config.
> > > > > >
> > > > > > Note that there would not be an issue on the _serialization_ side
> > of
> > > > > > things as Jackson respects BigDecimal.
> > > > > >
> > > > > > Almog
> > > > > >
> > > > > > On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish <
> > arjun.satish@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > >> hey Almog, nice work! couple of thoughts (hope I'm not late
> since
> > > you
> > > > > >> started the voting thread already):
> > > > > >>
> > > > > >> can you please add some examples to show the changes that you
> are
> > > > > >> proposing. makes me think that for a given decimal number, we
> will
> > > > have
> > > > > >> two
> > > > > >> encodings: “asHex” and “asNumber”.
> > > > > >>
> > > > > >> should we call the default config value “HEX” instead of
> “BINARY”?
> > > > > >>
> > > > > >> Should we call out the fact that JS systems might be susceptible
> > to
> > > > > double
> > > > > >> precision round offs with the new numeric format? here are some
> > > people
> > > > > >> discussing a similar problem
> > > > > >> https://github.com/EventStore/EventStore/issues/1541
> > > > > >>
> > > > > >> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <al...@confluent.io>
> > > > wrote:
> > > > > >>
> > > > > >> > Hello Everyone,
> > > > > >> >
> > > > > >> > Summarizing an in-person discussion with Randall (this is
> copied
> > > > from
> > > > > >> the
> > > > > >> > KIP):
> > > > > >> >
> > > > > >> > The original KIP suggested supporting an additional
> > > representation -
> > > > > >> base10
> > > > > >> > encoded text (e.g. `{"asText":"10.2345"}`). This causes issues
> > > > because
> > > > > >> it
> > > > > >> > is impossible to disambiguate between TEXT and BINARY without
> an
> > > > > >> additional
> > > > > >> > config - furthermore, this makes the migration from one to the
> > > other
> > > > > >> nearly
> > > > > >> > impossible because it would require that all consumers stop
> > > > consuming
> > > > > >> and
> > > > > >> > producers stop producing and atomically updating the config on
> > all
> > > > of
> > > > > >> them
> > > > > >> > after deploying the new code, or waiting for the full
> retention
> > > > period
> > > > > >> to
> > > > > >> > pass - neither option is viable. The suggestion in the KIP is
> > > > strictly
> > > > > >> an
> > > > > >> > improvement over the existing behavior, even if it doesn't
> > support
> > > > all
> > > > > >> > combinations.
> > > > > >> >
> > > > > >> > It seems that since most real-world use cases actually use the
> > > > numeric
> > > > > >> > representation (not string) we can consider this an
> improvement.
> > > > With
> > > > > >> the
> > > > > >> > new suggestion, we don't need a deserialization configuration
> > > (only
> > > > a
> > > > > >> > serialization option) and the consumers will be able to always
> > > > > >> > automatically determine the serialization format.
> > > > > >> >
> > > > > >> > Based on this, I'll be opening up the simplified version of
> the
> > > KIP
> > > > > to a
> > > > > >> > vote.
> > > > > >> >
> > > > > >> > Almog
> > > > > >> >
> > > > > >> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <
> almog@confluent.io
> > >
> > > > > wrote:
> > > > > >> >
> > > > > >> > > I'm mostly happy with your current suggestion (two configs,
> > one
> > > > for
> > > > > >> > > serialization and one for deserialization) and your
> > > implementation
> > > > > >> > > suggestion. One thing to note:
> > > > > >> > >
> > > > > >> > > > We should _always_ be able to deserialize a standard JSON
> > > > > >> > > > number as a decimal
> > > > > >> > >
> > > > > >> > > I was doing some research into decimals and JSON, and I can
> > > > imagine
> > > > > a
> > > > > >> > > compelling reason to require string representations to avoid
> > > > losing
> > > > > >> > > precision and to be certain that whomever is sending the
> data
> > > > isn't
> > > > > >> > losing
> > > > > >> > > precision (e.g.
> https://stackoverflow.com/a/38357877/2258040
> > ).
> > > > > >> > >
> > > > > >> > > I'm okay with always allowing numerics, but thought it's
> worth
> > > > > raising
> > > > > >> > the
> > > > > >> > > thought.
> > > > > >> > >
> > > > > >> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <
> > andy@confluent.io>
> > > > > >> wrote:
> > > > > >> > >
> > > > > >> > >> The way I see it, we need to control two seperate things:
> > > > > >> > >>
> > > > > >> > >> 1. How do we _deserialize_ a decimal type if we encounter a
> > > text
> > > > > >> node in
> > > > > >> > >> the JSON?    (We should _always_ be able to deserialize a
> > > > standard
> > > > > >> JSON
> > > > > >> > >> number as a decimal).
> > > > > >> > >> 2. How do we chose how we want decimals to be _serialized_.
> > > > > >> > >>
> > > > > >> > >> This looks to fits well with your second suggestion of
> > slightly
> > > > > >> > different
> > > > > >> > >> configs names for serialization vs deserialization.
> > > > > >> > >> a, For deserialization we only care about how to handle
> text
> > > > > nodes: `
> > > > > >> > >> deserialization.decimal.*text*.format`, which should only
> > have
> > > > two
> > > > > >> valid
> > > > > >> > >> values BINARY | TEXT.
> > > > > >> > >> b. For serialization we need all three:
> > > > > >> `serialization.decimal.format`,
> > > > > >> > >> which should support all three options: BINARY | TEXT |
> > > NUMERIC.
> > > > > >> > >>
> > > > > >> > >> Implementation wise, I think these should be two separate
> > > enums,
> > > > > >> rather
> > > > > >> > >> than one shared enum and throwing an error if the
> > deserializer
> > > is
> > > > > >> set to
> > > > > >> > >> NUMERIC.  Mainly as this means the enums reflect the
> options
> > > > > >> available,
> > > > > >> > >> rather than this being hidden in config checking code.  But
> > > > that's
> > > > > a
> > > > > >> > minor
> > > > > >> > >> implementation detail.
> > > > > >> > >>
> > > > > >> > >> Personally, I'd be tempted to have the BINARY value named
> > > > something
> > > > > >> like
> > > > > >> > >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging users
> to
> > > move
> > > > > >> away
> > > > > >> > >> from
> > > > > >> > >> it.
> > > > > >> > >>
> > > > > >> > >> It's a real shame that both of these settings require a
> > default
> > > > of
> > > > > >> > BINARY
> > > > > >> > >> for backwards compatibility, but I agree that discussions /
> > > plans
> > > > > >> around
> > > > > >> > >> switching the defaults should not block this KIP.
> > > > > >> > >>
> > > > > >> > >> Andy
> > > > > >> > >>
> > > > > >> > >>
> > > > > >> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <
> > almog@confluent.io>
> > > > > >> wrote:
> > > > > >> > >>
> > > > > >> > >> > Thanks for the replies Andy and Andrew (2x Andy?)!
> > > > > >> > >> >
> > > > > >> > >> > > Is the text decimal a base16 encoded number, or is it
> > > base16
> > > > > >> encoded
> > > > > >> > >> > binary
> > > > > >> > >> > > form of the number?
> > > > > >> > >> >
> > > > > >> > >> > The conversion happens as
> > > decimal.unscaledValue().toByteArray()
> > > > > and
> > > > > >> > then
> > > > > >> > >> > the byte array is converted to a hex string, so it's
> > > definitely
> > > > > the
> > > > > >> > >> binary
> > > > > >> > >> > form of the number converted to base16. Whether or not
> > that's
> > > > the
> > > > > >> same
> > > > > >> > >> as
> > > > > >> > >> > the base16 encoded number is a good question (toByteArray
> > > > > returns a
> > > > > >> > byte
> > > > > >> > >> > array containing a signed, big-endian, two's complement
> > > > > >> representation
> > > > > >> > >> of
> > > > > >> > >> > the big integer).
> > > > > >> > >> >
> > > > > >> > >> > > One suggestion I have is to change the proposed new
> > config
> > > to
> > > > > >> only
> > > > > >> > >> affect
> > > > > >> > >> > > decimals stored as text, i.e. to switch between the
> > current
> > > > > >> base16
> > > > > >> > and
> > > > > >> > >> > the
> > > > > >> > >> > > more common base10.   Then add another config to the
> > > > serializer
> > > > > >> only
> > > > > >> > >> that
> > > > > >> > >> > > controls if decimals should be serialized as text or
> > > numeric.
> > > > > >> > >> >
> > > > > >> > >> > I think we need to be able to handle all mappings from
> > > > > >> serialization
> > > > > >> > >> format
> > > > > >> > >> > to deserialization format (e.g. read in BINARY and output
> > > > TEXT),
> > > > > >> > which I
> > > > > >> > >> > think would be impossible with the alternative
> suggestion.
> > I
> > > > > agree
> > > > > >> > that
> > > > > >> > >> > automatically deserializing numerics is valuable. I see
> two
> > > > other
> > > > > >> ways
> > > > > >> > >> to
> > > > > >> > >> > get this, both keeping the serialization.format config
> the
> > > > same:
> > > > > >> > >> >
> > > > > >> > >> > - have json.decimal.deserialization.format accept all
> three
> > > > > >> formats.
> > > > > >> > if
> > > > > >> > >> set
> > > > > >> > >> > to BINARY/TEXT, numerics would be automatically
> supported.
> > If
> > > > set
> > > > > >> to
> > > > > >> > >> > NUMERIC, then any string coming in would result in
> > > > > deserialization
> > > > > >> > error
> > > > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > > > >> > >> > - change json.decimal.deserialization.format to
> > > > > >> > >> > json.decimal.deserialization.string.format which accepts
> > only
> > > > > >> > >> BINARY/TEXT
> > > > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > > > >> > >> >
> > > > > >> > >> > > would be a breaking change in that things that
> previously
> > > > > failed
> > > > > >> > would
> > > > > >> > >> > > suddenly start deserializing.  This is a price I'm
> > willing
> > > to
> > > > > >> pay.
> > > > > >> > >> >
> > > > > >> > >> > I agree. I'm willing to pay this price too.
> > > > > >> > >> >
> > > > > >> > >> > > IMHO, we should then plan to switch the default of
> > decimal
> > > > > >> > >> serialization
> > > > > >> > >> > to
> > > > > >> > >> > > numeric, and text serialization to base 10 in the next
> > > major
> > > > > >> > release.
> > > > > >> > >> >
> > > > > >> > >> > I think that can be a separate discussion, I don't want
> to
> > > > block
> > > > > >> this
> > > > > >> > >> KIP
> > > > > >> > >> > on it.
> > > > > >> > >> >
> > > > > >> > >> > Thoughts?
> > > > > >> > >> >
> > > > > >> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <
> > > > otto@wikimedia.org>
> > > > > >> > wrote:
> > > > > >> > >> >
> > > > > >> > >> > > This is a bit orthogonal, but in JsonSchemaConverter I
> > use
> > > > > >> > >> JSONSchemas to
> > > > > >> > >> > > indicate whether a JSON number should be deserialized
> as
> > an
> > > > > >> integer
> > > > > >> > >> or a
> > > > > >> > >> > > decimal
> > > > > >> > >> > > <
> > > > > >> > >> > >
> > > > > >> > >> >
> > > > > >> > >>
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
> > > > > >> > >> > > >.
> > > > > >> > >> > > Not everyone is going to have JSONSchemas available
> when
> > > > > >> converting,
> > > > > >> > >> but
> > > > > >> > >> > if
> > > > > >> > >> > > you do, it is an easy way to support JSON numbers as
> > > > decimals.
> > > > > >> > >> > >
> > > > > >> > >> > > Carry on! :)
> > > > > >> > >> > >
> > > > > >> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <
> > > > andy@confluent.io
> > > > > >
> > > > > >> > >> wrote:
> > > > > >> > >> > >
> > > > > >> > >> > > > Hi Almog,
> > > > > >> > >> > > >
> > > > > >> > >> > > > Like the KIP - I think being able to support decimals
> > in
> > > > JSON
> > > > > >> in
> > > > > >> > the
> > > > > >> > >> > same
> > > > > >> > >> > > > way most other systems do is a great improvement.
> > > > > >> > >> > > >
> > > > > >> > >> > > > It's not 100% clear to me from the KIP what the
> current
> > > > > format
> > > > > >> is.
> > > > > >> > >> Is
> > > > > >> > >> > > the
> > > > > >> > >> > > > text decimal a base16 encoded number, or is it base16
> > > > encoded
> > > > > >> > binary
> > > > > >> > >> > form
> > > > > >> > >> > > > of the number? (I've not tried to get my head around
> if
> > > > these
> > > > > >> two
> > > > > >> > >> are
> > > > > >> > >> > > even
> > > > > >> > >> > > > different!)
> > > > > >> > >> > > >
> > > > > >> > >> > > > One suggestion I have is to change the proposed new
> > > config
> > > > to
> > > > > >> only
> > > > > >> > >> > affect
> > > > > >> > >> > > > decimals stored as text, i.e. to switch between the
> > > current
> > > > > >> base16
> > > > > >> > >> and
> > > > > >> > >> > > the
> > > > > >> > >> > > > more common base10.   Then add another config to the
> > > > > serialzier
> > > > > >> > only
> > > > > >> > >> > that
> > > > > >> > >> > > > controls if decimals should be serialized as text or
> > > > numeric.
> > > > > >> The
> > > > > >> > >> > > benefit
> > > > > >> > >> > > > of this approach is it allows us to enhance the
> > > > deserializer
> > > > > to
> > > > > >> > >> > > > automatically handle numeric decimals even without
> any
> > > > config
> > > > > >> > >> having to
> > > > > >> > >> > > be
> > > > > >> > >> > > > set, i.e. default config in the deserializer would be
> > > able
> > > > to
> > > > > >> > handle
> > > > > >> > >> > > > numeric decimals.  Of course, this is a two edged
> > sword:
> > > > this
> > > > > >> > would
> > > > > >> > >> > make
> > > > > >> > >> > > > the deserializer work out of the box with numeric
> > > decimals,
> > > > > >> > (yay!),
> > > > > >> > >> but
> > > > > >> > >> > > > would be a breaking change in that things that
> > previously
> > > > > >> failed
> > > > > >> > >> would
> > > > > >> > >> > > > suddenly start deserializing.  This is a price I'm
> > > willing
> > > > to
> > > > > >> pay.
> > > > > >> > >> > > >
> > > > > >> > >> > > > IMHO, we should then plan to switch the default of
> > > decimal
> > > > > >> > >> > serialization
> > > > > >> > >> > > to
> > > > > >> > >> > > > numeric, and text serialization to base 10 in the
> next
> > > > major
> > > > > >> > >> release.
> > > > > >> > >> > > > (With upgrade notes to match). Though I know this is
> > more
> > > > > >> > >> contentious,
> > > > > >> > >> > I
> > > > > >> > >> > > > think it moves us forward in a much more standard way
> > > that
> > > > > the
> > > > > >> > >> current
> > > > > >> > >> > > > encoding of decimals.
> > > > > >> > >> > > >
> > > > > >> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <
> > > > > almog@confluent.io>
> > > > > >> > >> wrote:
> > > > > >> > >> > > >
> > > > > >> > >> > > > > Hi Everyone!
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > Kicking off discussion for a new KIP:
> > > > > >> > >> > > > >
> > > > > >> > >> > > > >
> > > > > >> > >> > > >
> > > > > >> > >> > >
> > > > > >> > >> >
> > > > > >> > >>
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > For those who are interested, I have a prototype
> > > > > >> implementation
> > > > > >> > >> that
> > > > > >> > >> > > > helped
> > > > > >> > >> > > > > guide my design:
> > > https://github.com/agavra/kafka/pull/1
> > > > > >> > >> > > > >
> > > > > >> > >> > > > > Cheers,
> > > > > >> > >> > > > > Almog
> > > > > >> > >> > > > >
> > > > > >> > >> > > >
> > > > > >> > >> > >
> > > > > >> > >> >
> > > > > >> > >>
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Posted by Almog Gavra <al...@confluent.io>.
Thanks for the review Konstantine!

I think the terminology suggestion definitely makes things clearer - I will
update the documentation based on your suggestion (e.g. s/Consumer/Sink
Converter/g and s/Producer/Source Converter/g).

Cheers,
Almog

On Wed, Aug 14, 2019 at 8:13 AM Konstantine Karantasis <
konstantine@confluent.io> wrote:

> Thanks Almog for preparing this KIP!
> I think it will improve usability and troubleshooting with JSON data a lot.
>
> The finalized plan seems quite concrete now. I also liked that some
> implementation specific implications (such as setting the ObjectMapper to
> deserialize floating point as BigDecimal) are highlighted in the KIP.
>
> Still, as I was reading the KIP, the main obstacle I encountered was around
> terminology. I couldn't get used to reading "producer" and "consumer" and
> not thinking in terms of Kafka producers and consumers - which are not
> relevant to what this KIP proposes. Thus, I'd suggest replacing
> "Producer(s)" with "Source Converter(s)" and "Consumer(s)" with "Sink
> Converter(s)" (even if "Converter used by Source Connector" or "Converter
> used by Sink Connector" would be even more accurate - maybe this could be
> an explanation in a footnote). Terminology around converters has been
> tricky in the past and adding producers/consumers in the mix might add to
> the confusion.
>
> Another example where I'd apply this different terminology would be to a
> phrase such as the following:
> "Because of this, users must take care to first ensure that all consumers
> have upgraded to the new code before upgrading producers to make use of the
> NUMERIC serialization format."
> which I'd write
> "Because of this, users must take care to first ensure that all sink
> connectors have upgraded to the new converter code before upgrading source
> connectors to make use of the NUMERIC serialization format in
> JsonConverter."
>
> Let me know if you think this suggestion makes the KIP easier to follow.
> Otherwise I think it's a solid proposal.
>
> I'm concluding with a couple of nits:
>
> - "Upgraded Producer with BASE64 serialization, Legacy Consumer: this
> scenario is okay as the upgraded ~producer~ consumer will be able to read
> binary as today" (again according to my suggestion above, it could be as
> the upgraded source converter ...)
>
> - "consumers cannot consumer NUMERIC data. " -> "consumers cannot read
> NUMERIC data"
>
> Best,
> Konstantine
>
> On Fri, Aug 9, 2019 at 6:37 PM Almog Gavra <al...@confluent.io> wrote:
>
> > Good catches! Fixed :)
> >
> > On Thu, Aug 8, 2019 at 10:36 PM Arjun Satish <ar...@gmail.com>
> > wrote:
> >
> > > Cool!
> > >
> > > Couple of nits:
> > >
> > > - In public interfaces, typo: *json.decimal.serialization.fromat*
> > > - In public interfaces, you use the term "HEX" instead of "BASE64".
> > >
> > >
> > >
> > > On Wed, Aug 7, 2019 at 9:51 AM Almog Gavra <al...@confluent.io> wrote:
> > >
> > > > EDIT: everywhere I've been using "HEX" I meant to be using "BASE64".
> I
> > > will
> > > > update the KIP to reflect this.
> > > >
> > > > On Wed, Aug 7, 2019 at 9:44 AM Almog Gavra <al...@confluent.io>
> wrote:
> > > >
> > > > > Thanks for the feedback Arjun! I'm happy changing the default
> config
> > to
> > > > > HEX instead of BINARY, no strong feelings there.
> > > > >
> > > > > I'll also clarify the example in the KIP to be clearer:
> > > > >
> > > > > - serialize the decimal field "foo" with value "10.2345" with the
> HEX
> > > > > setting: {"foo": "D3J5"}
> > > > > - serialize the decimal field "foo" with value "10.2345" with the
> > > NUMERIC
> > > > > setting: {"foo": 10.2345}
> > > > >
> > > > > With regards to the precision issue, that was my original concern
> as
> > > well
> > > > > (and why I originally suggested a TEXT format). Many JSON
> > deserializers
> > > > > (e.g. Jackson with
> > DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS),
> > > > > however, have the ability to deserialize decimals correctly so I
> will
> > > > > configure that as the default for Connect's JsonDeserializer. It's
> > > > probably
> > > > > a good idea to call out that using other deserializers must be done
> > > with
> > > > > care - I will add that documentation to the serialization config.
> > > > >
> > > > > Note that there would not be an issue on the _serialization_ side
> of
> > > > > things as Jackson respects BigDecimal.
> > > > >
> > > > > Almog
> > > > >
> > > > > On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish <
> arjun.satish@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > >> hey Almog, nice work! couple of thoughts (hope I'm not late since
> > you
> > > > >> started the voting thread already):
> > > > >>
> > > > >> can you please add some examples to show the changes that you are
> > > > >> proposing. makes me think that for a given decimal number, we will
> > > have
> > > > >> two
> > > > >> encodings: “asHex” and “asNumber”.
> > > > >>
> > > > >> should we call the default config value “HEX” instead of “BINARY”?
> > > > >>
> > > > >> Should we call out the fact that JS systems might be susceptible
> to
> > > > double
> > > > >> precision round offs with the new numeric format? here are some
> > people
> > > > >> discussing a similar problem
> > > > >> https://github.com/EventStore/EventStore/issues/1541
> > > > >>
> > > > >> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <al...@confluent.io>
> > > wrote:
> > > > >>
> > > > >> > Hello Everyone,
> > > > >> >
> > > > >> > Summarizing an in-person discussion with Randall (this is copied
> > > from
> > > > >> the
> > > > >> > KIP):
> > > > >> >
> > > > >> > The original KIP suggested supporting an additional
> > representation -
> > > > >> base10
> > > > >> > encoded text (e.g. `{"asText":"10.2345"}`). This causes issues
> > > because
> > > > >> it
> > > > >> > is impossible to disambiguate between TEXT and BINARY without an
> > > > >> additional
> > > > >> > config - furthermore, this makes the migration from one to the
> > other
> > > > >> nearly
> > > > >> > impossible because it would require that all consumers stop
> > > consuming
> > > > >> and
> > > > >> > producers stop producing and atomically updating the config on
> all
> > > of
> > > > >> them
> > > > >> > after deploying the new code, or waiting for the full retention
> > > period
> > > > >> to
> > > > >> > pass - neither option is viable. The suggestion in the KIP is
> > > strictly
> > > > >> an
> > > > >> > improvement over the existing behavior, even if it doesn't
> support
> > > all
> > > > >> > combinations.
> > > > >> >
> > > > >> > It seems that since most real-world use cases actually use the
> > > numeric
> > > > >> > representation (not string) we can consider this an improvement.
> > > With
> > > > >> the
> > > > >> > new suggestion, we don't need a deserialization configuration
> > (only
> > > a
> > > > >> > serialization option) and the consumers will be able to always
> > > > >> > automatically determine the serialization format.
> > > > >> >
> > > > >> > Based on this, I'll be opening up the simplified version of the
> > KIP
> > > > to a
> > > > >> > vote.
> > > > >> >
> > > > >> > Almog
> > > > >> >
> > > > >> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <almog@confluent.io
> >
> > > > wrote:
> > > > >> >
> > > > >> > > I'm mostly happy with your current suggestion (two configs,
> one
> > > for
> > > > >> > > serialization and one for deserialization) and your
> > implementation
> > > > >> > > suggestion. One thing to note:
> > > > >> > >
> > > > >> > > > We should _always_ be able to deserialize a standard JSON
> > > > >> > > > number as a decimal
> > > > >> > >
> > > > >> > > I was doing some research into decimals and JSON, and I can
> > > imagine
> > > > a
> > > > >> > > compelling reason to require string representations to avoid
> > > losing
> > > > >> > > precision and to be certain that whomever is sending the data
> > > isn't
> > > > >> > losing
> > > > >> > > precision (e.g. https://stackoverflow.com/a/38357877/2258040
> ).
> > > > >> > >
> > > > >> > > I'm okay with always allowing numerics, but thought it's worth
> > > > raising
> > > > >> > the
> > > > >> > > thought.
> > > > >> > >
> > > > >> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <
> andy@confluent.io>
> > > > >> wrote:
> > > > >> > >
> > > > >> > >> The way I see it, we need to control two seperate things:
> > > > >> > >>
> > > > >> > >> 1. How do we _deserialize_ a decimal type if we encounter a
> > text
> > > > >> node in
> > > > >> > >> the JSON?    (We should _always_ be able to deserialize a
> > > standard
> > > > >> JSON
> > > > >> > >> number as a decimal).
> > > > >> > >> 2. How do we chose how we want decimals to be _serialized_.
> > > > >> > >>
> > > > >> > >> This looks to fits well with your second suggestion of
> slightly
> > > > >> > different
> > > > >> > >> configs names for serialization vs deserialization.
> > > > >> > >> a, For deserialization we only care about how to handle text
> > > > nodes: `
> > > > >> > >> deserialization.decimal.*text*.format`, which should only
> have
> > > two
> > > > >> valid
> > > > >> > >> values BINARY | TEXT.
> > > > >> > >> b. For serialization we need all three:
> > > > >> `serialization.decimal.format`,
> > > > >> > >> which should support all three options: BINARY | TEXT |
> > NUMERIC.
> > > > >> > >>
> > > > >> > >> Implementation wise, I think these should be two separate
> > enums,
> > > > >> rather
> > > > >> > >> than one shared enum and throwing an error if the
> deserializer
> > is
> > > > >> set to
> > > > >> > >> NUMERIC.  Mainly as this means the enums reflect the options
> > > > >> available,
> > > > >> > >> rather than this being hidden in config checking code.  But
> > > that's
> > > > a
> > > > >> > minor
> > > > >> > >> implementation detail.
> > > > >> > >>
> > > > >> > >> Personally, I'd be tempted to have the BINARY value named
> > > something
> > > > >> like
> > > > >> > >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging users to
> > move
> > > > >> away
> > > > >> > >> from
> > > > >> > >> it.
> > > > >> > >>
> > > > >> > >> It's a real shame that both of these settings require a
> default
> > > of
> > > > >> > BINARY
> > > > >> > >> for backwards compatibility, but I agree that discussions /
> > plans
> > > > >> around
> > > > >> > >> switching the defaults should not block this KIP.
> > > > >> > >>
> > > > >> > >> Andy
> > > > >> > >>
> > > > >> > >>
> > > > >> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <
> almog@confluent.io>
> > > > >> wrote:
> > > > >> > >>
> > > > >> > >> > Thanks for the replies Andy and Andrew (2x Andy?)!
> > > > >> > >> >
> > > > >> > >> > > Is the text decimal a base16 encoded number, or is it
> > base16
> > > > >> encoded
> > > > >> > >> > binary
> > > > >> > >> > > form of the number?
> > > > >> > >> >
> > > > >> > >> > The conversion happens as
> > decimal.unscaledValue().toByteArray()
> > > > and
> > > > >> > then
> > > > >> > >> > the byte array is converted to a hex string, so it's
> > definitely
> > > > the
> > > > >> > >> binary
> > > > >> > >> > form of the number converted to base16. Whether or not
> that's
> > > the
> > > > >> same
> > > > >> > >> as
> > > > >> > >> > the base16 encoded number is a good question (toByteArray
> > > > returns a
> > > > >> > byte
> > > > >> > >> > array containing a signed, big-endian, two's complement
> > > > >> representation
> > > > >> > >> of
> > > > >> > >> > the big integer).
> > > > >> > >> >
> > > > >> > >> > > One suggestion I have is to change the proposed new
> config
> > to
> > > > >> only
> > > > >> > >> affect
> > > > >> > >> > > decimals stored as text, i.e. to switch between the
> current
> > > > >> base16
> > > > >> > and
> > > > >> > >> > the
> > > > >> > >> > > more common base10.   Then add another config to the
> > > serializer
> > > > >> only
> > > > >> > >> that
> > > > >> > >> > > controls if decimals should be serialized as text or
> > numeric.
> > > > >> > >> >
> > > > >> > >> > I think we need to be able to handle all mappings from
> > > > >> serialization
> > > > >> > >> format
> > > > >> > >> > to deserialization format (e.g. read in BINARY and output
> > > TEXT),
> > > > >> > which I
> > > > >> > >> > think would be impossible with the alternative suggestion.
> I
> > > > agree
> > > > >> > that
> > > > >> > >> > automatically deserializing numerics is valuable. I see two
> > > other
> > > > >> ways
> > > > >> > >> to
> > > > >> > >> > get this, both keeping the serialization.format config the
> > > same:
> > > > >> > >> >
> > > > >> > >> > - have json.decimal.deserialization.format accept all three
> > > > >> formats.
> > > > >> > if
> > > > >> > >> set
> > > > >> > >> > to BINARY/TEXT, numerics would be automatically supported.
> If
> > > set
> > > > >> to
> > > > >> > >> > NUMERIC, then any string coming in would result in
> > > > deserialization
> > > > >> > error
> > > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > > >> > >> > - change json.decimal.deserialization.format to
> > > > >> > >> > json.decimal.deserialization.string.format which accepts
> only
> > > > >> > >> BINARY/TEXT
> > > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > > >> > >> >
> > > > >> > >> > > would be a breaking change in that things that previously
> > > > failed
> > > > >> > would
> > > > >> > >> > > suddenly start deserializing.  This is a price I'm
> willing
> > to
> > > > >> pay.
> > > > >> > >> >
> > > > >> > >> > I agree. I'm willing to pay this price too.
> > > > >> > >> >
> > > > >> > >> > > IMHO, we should then plan to switch the default of
> decimal
> > > > >> > >> serialization
> > > > >> > >> > to
> > > > >> > >> > > numeric, and text serialization to base 10 in the next
> > major
> > > > >> > release.
> > > > >> > >> >
> > > > >> > >> > I think that can be a separate discussion, I don't want to
> > > block
> > > > >> this
> > > > >> > >> KIP
> > > > >> > >> > on it.
> > > > >> > >> >
> > > > >> > >> > Thoughts?
> > > > >> > >> >
> > > > >> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <
> > > otto@wikimedia.org>
> > > > >> > wrote:
> > > > >> > >> >
> > > > >> > >> > > This is a bit orthogonal, but in JsonSchemaConverter I
> use
> > > > >> > >> JSONSchemas to
> > > > >> > >> > > indicate whether a JSON number should be deserialized as
> an
> > > > >> integer
> > > > >> > >> or a
> > > > >> > >> > > decimal
> > > > >> > >> > > <
> > > > >> > >> > >
> > > > >> > >> >
> > > > >> > >>
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
> > > > >> > >> > > >.
> > > > >> > >> > > Not everyone is going to have JSONSchemas available when
> > > > >> converting,
> > > > >> > >> but
> > > > >> > >> > if
> > > > >> > >> > > you do, it is an easy way to support JSON numbers as
> > > decimals.
> > > > >> > >> > >
> > > > >> > >> > > Carry on! :)
> > > > >> > >> > >
> > > > >> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <
> > > andy@confluent.io
> > > > >
> > > > >> > >> wrote:
> > > > >> > >> > >
> > > > >> > >> > > > Hi Almog,
> > > > >> > >> > > >
> > > > >> > >> > > > Like the KIP - I think being able to support decimals
> in
> > > JSON
> > > > >> in
> > > > >> > the
> > > > >> > >> > same
> > > > >> > >> > > > way most other systems do is a great improvement.
> > > > >> > >> > > >
> > > > >> > >> > > > It's not 100% clear to me from the KIP what the current
> > > > format
> > > > >> is.
> > > > >> > >> Is
> > > > >> > >> > > the
> > > > >> > >> > > > text decimal a base16 encoded number, or is it base16
> > > encoded
> > > > >> > binary
> > > > >> > >> > form
> > > > >> > >> > > > of the number? (I've not tried to get my head around if
> > > these
> > > > >> two
> > > > >> > >> are
> > > > >> > >> > > even
> > > > >> > >> > > > different!)
> > > > >> > >> > > >
> > > > >> > >> > > > One suggestion I have is to change the proposed new
> > config
> > > to
> > > > >> only
> > > > >> > >> > affect
> > > > >> > >> > > > decimals stored as text, i.e. to switch between the
> > current
> > > > >> base16
> > > > >> > >> and
> > > > >> > >> > > the
> > > > >> > >> > > > more common base10.   Then add another config to the
> > > > serialzier
> > > > >> > only
> > > > >> > >> > that
> > > > >> > >> > > > controls if decimals should be serialized as text or
> > > numeric.
> > > > >> The
> > > > >> > >> > > benefit
> > > > >> > >> > > > of this approach is it allows us to enhance the
> > > deserializer
> > > > to
> > > > >> > >> > > > automatically handle numeric decimals even without any
> > > config
> > > > >> > >> having to
> > > > >> > >> > > be
> > > > >> > >> > > > set, i.e. default config in the deserializer would be
> > able
> > > to
> > > > >> > handle
> > > > >> > >> > > > numeric decimals.  Of course, this is a two edged
> sword:
> > > this
> > > > >> > would
> > > > >> > >> > make
> > > > >> > >> > > > the deserializer work out of the box with numeric
> > decimals,
> > > > >> > (yay!),
> > > > >> > >> but
> > > > >> > >> > > > would be a breaking change in that things that
> previously
> > > > >> failed
> > > > >> > >> would
> > > > >> > >> > > > suddenly start deserializing.  This is a price I'm
> > willing
> > > to
> > > > >> pay.
> > > > >> > >> > > >
> > > > >> > >> > > > IMHO, we should then plan to switch the default of
> > decimal
> > > > >> > >> > serialization
> > > > >> > >> > > to
> > > > >> > >> > > > numeric, and text serialization to base 10 in the next
> > > major
> > > > >> > >> release.
> > > > >> > >> > > > (With upgrade notes to match). Though I know this is
> more
> > > > >> > >> contentious,
> > > > >> > >> > I
> > > > >> > >> > > > think it moves us forward in a much more standard way
> > that
> > > > the
> > > > >> > >> current
> > > > >> > >> > > > encoding of decimals.
> > > > >> > >> > > >
> > > > >> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <
> > > > almog@confluent.io>
> > > > >> > >> wrote:
> > > > >> > >> > > >
> > > > >> > >> > > > > Hi Everyone!
> > > > >> > >> > > > >
> > > > >> > >> > > > > Kicking off discussion for a new KIP:
> > > > >> > >> > > > >
> > > > >> > >> > > > >
> > > > >> > >> > > >
> > > > >> > >> > >
> > > > >> > >> >
> > > > >> > >>
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
> > > > >> > >> > > > >
> > > > >> > >> > > > > For those who are interested, I have a prototype
> > > > >> implementation
> > > > >> > >> that
> > > > >> > >> > > > helped
> > > > >> > >> > > > > guide my design:
> > https://github.com/agavra/kafka/pull/1
> > > > >> > >> > > > >
> > > > >> > >> > > > > Cheers,
> > > > >> > >> > > > > Almog
> > > > >> > >> > > > >
> > > > >> > >> > > >
> > > > >> > >> > >
> > > > >> > >> >
> > > > >> > >>
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Posted by Konstantine Karantasis <ko...@confluent.io>.
Thanks Almog for preparing this KIP!
I think it will improve usability and troubleshooting with JSON data a lot.

The finalized plan seems quite concrete now. I also liked that some
implementation specific implications (such as setting the ObjectMapper to
deserialize floating point as BigDecimal) are highlighted in the KIP.

Still, as I was reading the KIP, the main obstacle I encountered was around
terminology. I couldn't get used to reading "producer" and "consumer" and
not thinking in terms of Kafka producers and consumers - which are not
relevant to what this KIP proposes. Thus, I'd suggest replacing
"Producer(s)" with "Source Converter(s)" and "Consumer(s)" with "Sink
Converter(s)" (even if "Converter used by Source Connector" or "Converter
used by Sink Connector" would be even more accurate - maybe this could be
an explanation in a footnote). Terminology around converters has been
tricky in the past and adding producers/consumers in the mix might add to
the confusion.

Another example where I'd apply this different terminology would be to a
phrase such as the following:
"Because of this, users must take care to first ensure that all consumers
have upgraded to the new code before upgrading producers to make use of the
NUMERIC serialization format."
which I'd write
"Because of this, users must take care to first ensure that all sink
connectors have upgraded to the new converter code before upgrading source
connectors to make use of the NUMERIC serialization format in
JsonConverter."

Let me know if you think this suggestion makes the KIP easier to follow.
Otherwise I think it's a solid proposal.

I'm concluding with a couple of nits:

- "Upgraded Producer with BASE64 serialization, Legacy Consumer: this
scenario is okay as the upgraded ~producer~ consumer will be able to read
binary as today" (again according to my suggestion above, it could be as
the upgraded source converter ...)

- "consumers cannot consumer NUMERIC data. " -> "consumers cannot read
NUMERIC data"

Best,
Konstantine

On Fri, Aug 9, 2019 at 6:37 PM Almog Gavra <al...@confluent.io> wrote:

> Good catches! Fixed :)
>
> On Thu, Aug 8, 2019 at 10:36 PM Arjun Satish <ar...@gmail.com>
> wrote:
>
> > Cool!
> >
> > Couple of nits:
> >
> > - In public interfaces, typo: *json.decimal.serialization.fromat*
> > - In public interfaces, you use the term "HEX" instead of "BASE64".
> >
> >
> >
> > On Wed, Aug 7, 2019 at 9:51 AM Almog Gavra <al...@confluent.io> wrote:
> >
> > > EDIT: everywhere I've been using "HEX" I meant to be using "BASE64". I
> > will
> > > update the KIP to reflect this.
> > >
> > > On Wed, Aug 7, 2019 at 9:44 AM Almog Gavra <al...@confluent.io> wrote:
> > >
> > > > Thanks for the feedback Arjun! I'm happy changing the default config
> to
> > > > HEX instead of BINARY, no strong feelings there.
> > > >
> > > > I'll also clarify the example in the KIP to be clearer:
> > > >
> > > > - serialize the decimal field "foo" with value "10.2345" with the HEX
> > > > setting: {"foo": "D3J5"}
> > > > - serialize the decimal field "foo" with value "10.2345" with the
> > NUMERIC
> > > > setting: {"foo": 10.2345}
> > > >
> > > > With regards to the precision issue, that was my original concern as
> > well
> > > > (and why I originally suggested a TEXT format). Many JSON
> deserializers
> > > > (e.g. Jackson with
> DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS),
> > > > however, have the ability to deserialize decimals correctly so I will
> > > > configure that as the default for Connect's JsonDeserializer. It's
> > > probably
> > > > a good idea to call out that using other deserializers must be done
> > with
> > > > care - I will add that documentation to the serialization config.
> > > >
> > > > Note that there would not be an issue on the _serialization_ side of
> > > > things as Jackson respects BigDecimal.
> > > >
> > > > Almog
> > > >
> > > > On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish <arjun.satish@gmail.com
> >
> > > > wrote:
> > > >
> > > >> hey Almog, nice work! couple of thoughts (hope I'm not late since
> you
> > > >> started the voting thread already):
> > > >>
> > > >> can you please add some examples to show the changes that you are
> > > >> proposing. makes me think that for a given decimal number, we will
> > have
> > > >> two
> > > >> encodings: “asHex” and “asNumber”.
> > > >>
> > > >> should we call the default config value “HEX” instead of “BINARY”?
> > > >>
> > > >> Should we call out the fact that JS systems might be susceptible to
> > > double
> > > >> precision round offs with the new numeric format? here are some
> people
> > > >> discussing a similar problem
> > > >> https://github.com/EventStore/EventStore/issues/1541
> > > >>
> > > >> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <al...@confluent.io>
> > wrote:
> > > >>
> > > >> > Hello Everyone,
> > > >> >
> > > >> > Summarizing an in-person discussion with Randall (this is copied
> > from
> > > >> the
> > > >> > KIP):
> > > >> >
> > > >> > The original KIP suggested supporting an additional
> representation -
> > > >> base10
> > > >> > encoded text (e.g. `{"asText":"10.2345"}`). This causes issues
> > because
> > > >> it
> > > >> > is impossible to disambiguate between TEXT and BINARY without an
> > > >> additional
> > > >> > config - furthermore, this makes the migration from one to the
> other
> > > >> nearly
> > > >> > impossible because it would require that all consumers stop
> > consuming
> > > >> and
> > > >> > producers stop producing and atomically updating the config on all
> > of
> > > >> them
> > > >> > after deploying the new code, or waiting for the full retention
> > period
> > > >> to
> > > >> > pass - neither option is viable. The suggestion in the KIP is
> > strictly
> > > >> an
> > > >> > improvement over the existing behavior, even if it doesn't support
> > all
> > > >> > combinations.
> > > >> >
> > > >> > It seems that since most real-world use cases actually use the
> > numeric
> > > >> > representation (not string) we can consider this an improvement.
> > With
> > > >> the
> > > >> > new suggestion, we don't need a deserialization configuration
> (only
> > a
> > > >> > serialization option) and the consumers will be able to always
> > > >> > automatically determine the serialization format.
> > > >> >
> > > >> > Based on this, I'll be opening up the simplified version of the
> KIP
> > > to a
> > > >> > vote.
> > > >> >
> > > >> > Almog
> > > >> >
> > > >> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <al...@confluent.io>
> > > wrote:
> > > >> >
> > > >> > > I'm mostly happy with your current suggestion (two configs, one
> > for
> > > >> > > serialization and one for deserialization) and your
> implementation
> > > >> > > suggestion. One thing to note:
> > > >> > >
> > > >> > > > We should _always_ be able to deserialize a standard JSON
> > > >> > > > number as a decimal
> > > >> > >
> > > >> > > I was doing some research into decimals and JSON, and I can
> > imagine
> > > a
> > > >> > > compelling reason to require string representations to avoid
> > losing
> > > >> > > precision and to be certain that whomever is sending the data
> > isn't
> > > >> > losing
> > > >> > > precision (e.g. https://stackoverflow.com/a/38357877/2258040).
> > > >> > >
> > > >> > > I'm okay with always allowing numerics, but thought it's worth
> > > raising
> > > >> > the
> > > >> > > thought.
> > > >> > >
> > > >> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <an...@confluent.io>
> > > >> wrote:
> > > >> > >
> > > >> > >> The way I see it, we need to control two seperate things:
> > > >> > >>
> > > >> > >> 1. How do we _deserialize_ a decimal type if we encounter a
> text
> > > >> node in
> > > >> > >> the JSON?    (We should _always_ be able to deserialize a
> > standard
> > > >> JSON
> > > >> > >> number as a decimal).
> > > >> > >> 2. How do we chose how we want decimals to be _serialized_.
> > > >> > >>
> > > >> > >> This looks to fits well with your second suggestion of slightly
> > > >> > different
> > > >> > >> configs names for serialization vs deserialization.
> > > >> > >> a, For deserialization we only care about how to handle text
> > > nodes: `
> > > >> > >> deserialization.decimal.*text*.format`, which should only have
> > two
> > > >> valid
> > > >> > >> values BINARY | TEXT.
> > > >> > >> b. For serialization we need all three:
> > > >> `serialization.decimal.format`,
> > > >> > >> which should support all three options: BINARY | TEXT |
> NUMERIC.
> > > >> > >>
> > > >> > >> Implementation wise, I think these should be two separate
> enums,
> > > >> rather
> > > >> > >> than one shared enum and throwing an error if the deserializer
> is
> > > >> set to
> > > >> > >> NUMERIC.  Mainly as this means the enums reflect the options
> > > >> available,
> > > >> > >> rather than this being hidden in config checking code.  But
> > that's
> > > a
> > > >> > minor
> > > >> > >> implementation detail.
> > > >> > >>
> > > >> > >> Personally, I'd be tempted to have the BINARY value named
> > something
> > > >> like
> > > >> > >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging users to
> move
> > > >> away
> > > >> > >> from
> > > >> > >> it.
> > > >> > >>
> > > >> > >> It's a real shame that both of these settings require a default
> > of
> > > >> > BINARY
> > > >> > >> for backwards compatibility, but I agree that discussions /
> plans
> > > >> around
> > > >> > >> switching the defaults should not block this KIP.
> > > >> > >>
> > > >> > >> Andy
> > > >> > >>
> > > >> > >>
> > > >> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <al...@confluent.io>
> > > >> wrote:
> > > >> > >>
> > > >> > >> > Thanks for the replies Andy and Andrew (2x Andy?)!
> > > >> > >> >
> > > >> > >> > > Is the text decimal a base16 encoded number, or is it
> base16
> > > >> encoded
> > > >> > >> > binary
> > > >> > >> > > form of the number?
> > > >> > >> >
> > > >> > >> > The conversion happens as
> decimal.unscaledValue().toByteArray()
> > > and
> > > >> > then
> > > >> > >> > the byte array is converted to a hex string, so it's
> definitely
> > > the
> > > >> > >> binary
> > > >> > >> > form of the number converted to base16. Whether or not that's
> > the
> > > >> same
> > > >> > >> as
> > > >> > >> > the base16 encoded number is a good question (toByteArray
> > > returns a
> > > >> > byte
> > > >> > >> > array containing a signed, big-endian, two's complement
> > > >> representation
> > > >> > >> of
> > > >> > >> > the big integer).
> > > >> > >> >
> > > >> > >> > > One suggestion I have is to change the proposed new config
> to
> > > >> only
> > > >> > >> affect
> > > >> > >> > > decimals stored as text, i.e. to switch between the current
> > > >> base16
> > > >> > and
> > > >> > >> > the
> > > >> > >> > > more common base10.   Then add another config to the
> > serializer
> > > >> only
> > > >> > >> that
> > > >> > >> > > controls if decimals should be serialized as text or
> numeric.
> > > >> > >> >
> > > >> > >> > I think we need to be able to handle all mappings from
> > > >> serialization
> > > >> > >> format
> > > >> > >> > to deserialization format (e.g. read in BINARY and output
> > TEXT),
> > > >> > which I
> > > >> > >> > think would be impossible with the alternative suggestion. I
> > > agree
> > > >> > that
> > > >> > >> > automatically deserializing numerics is valuable. I see two
> > other
> > > >> ways
> > > >> > >> to
> > > >> > >> > get this, both keeping the serialization.format config the
> > same:
> > > >> > >> >
> > > >> > >> > - have json.decimal.deserialization.format accept all three
> > > >> formats.
> > > >> > if
> > > >> > >> set
> > > >> > >> > to BINARY/TEXT, numerics would be automatically supported. If
> > set
> > > >> to
> > > >> > >> > NUMERIC, then any string coming in would result in
> > > deserialization
> > > >> > error
> > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > >> > >> > - change json.decimal.deserialization.format to
> > > >> > >> > json.decimal.deserialization.string.format which accepts only
> > > >> > >> BINARY/TEXT
> > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > >> > >> >
> > > >> > >> > > would be a breaking change in that things that previously
> > > failed
> > > >> > would
> > > >> > >> > > suddenly start deserializing.  This is a price I'm willing
> to
> > > >> pay.
> > > >> > >> >
> > > >> > >> > I agree. I'm willing to pay this price too.
> > > >> > >> >
> > > >> > >> > > IMHO, we should then plan to switch the default of decimal
> > > >> > >> serialization
> > > >> > >> > to
> > > >> > >> > > numeric, and text serialization to base 10 in the next
> major
> > > >> > release.
> > > >> > >> >
> > > >> > >> > I think that can be a separate discussion, I don't want to
> > block
> > > >> this
> > > >> > >> KIP
> > > >> > >> > on it.
> > > >> > >> >
> > > >> > >> > Thoughts?
> > > >> > >> >
> > > >> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <
> > otto@wikimedia.org>
> > > >> > wrote:
> > > >> > >> >
> > > >> > >> > > This is a bit orthogonal, but in JsonSchemaConverter I use
> > > >> > >> JSONSchemas to
> > > >> > >> > > indicate whether a JSON number should be deserialized as an
> > > >> integer
> > > >> > >> or a
> > > >> > >> > > decimal
> > > >> > >> > > <
> > > >> > >> > >
> > > >> > >> >
> > > >> > >>
> > > >> >
> > > >>
> > >
> >
> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
> > > >> > >> > > >.
> > > >> > >> > > Not everyone is going to have JSONSchemas available when
> > > >> converting,
> > > >> > >> but
> > > >> > >> > if
> > > >> > >> > > you do, it is an easy way to support JSON numbers as
> > decimals.
> > > >> > >> > >
> > > >> > >> > > Carry on! :)
> > > >> > >> > >
> > > >> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <
> > andy@confluent.io
> > > >
> > > >> > >> wrote:
> > > >> > >> > >
> > > >> > >> > > > Hi Almog,
> > > >> > >> > > >
> > > >> > >> > > > Like the KIP - I think being able to support decimals in
> > JSON
> > > >> in
> > > >> > the
> > > >> > >> > same
> > > >> > >> > > > way most other systems do is a great improvement.
> > > >> > >> > > >
> > > >> > >> > > > It's not 100% clear to me from the KIP what the current
> > > format
> > > >> is.
> > > >> > >> Is
> > > >> > >> > > the
> > > >> > >> > > > text decimal a base16 encoded number, or is it base16
> > encoded
> > > >> > binary
> > > >> > >> > form
> > > >> > >> > > > of the number? (I've not tried to get my head around if
> > these
> > > >> two
> > > >> > >> are
> > > >> > >> > > even
> > > >> > >> > > > different!)
> > > >> > >> > > >
> > > >> > >> > > > One suggestion I have is to change the proposed new
> config
> > to
> > > >> only
> > > >> > >> > affect
> > > >> > >> > > > decimals stored as text, i.e. to switch between the
> current
> > > >> base16
> > > >> > >> and
> > > >> > >> > > the
> > > >> > >> > > > more common base10.   Then add another config to the
> > > serialzier
> > > >> > only
> > > >> > >> > that
> > > >> > >> > > > controls if decimals should be serialized as text or
> > numeric.
> > > >> The
> > > >> > >> > > benefit
> > > >> > >> > > > of this approach is it allows us to enhance the
> > deserializer
> > > to
> > > >> > >> > > > automatically handle numeric decimals even without any
> > config
> > > >> > >> having to
> > > >> > >> > > be
> > > >> > >> > > > set, i.e. default config in the deserializer would be
> able
> > to
> > > >> > handle
> > > >> > >> > > > numeric decimals.  Of course, this is a two edged sword:
> > this
> > > >> > would
> > > >> > >> > make
> > > >> > >> > > > the deserializer work out of the box with numeric
> decimals,
> > > >> > (yay!),
> > > >> > >> but
> > > >> > >> > > > would be a breaking change in that things that previously
> > > >> failed
> > > >> > >> would
> > > >> > >> > > > suddenly start deserializing.  This is a price I'm
> willing
> > to
> > > >> pay.
> > > >> > >> > > >
> > > >> > >> > > > IMHO, we should then plan to switch the default of
> decimal
> > > >> > >> > serialization
> > > >> > >> > > to
> > > >> > >> > > > numeric, and text serialization to base 10 in the next
> > major
> > > >> > >> release.
> > > >> > >> > > > (With upgrade notes to match). Though I know this is more
> > > >> > >> contentious,
> > > >> > >> > I
> > > >> > >> > > > think it moves us forward in a much more standard way
> that
> > > the
> > > >> > >> current
> > > >> > >> > > > encoding of decimals.
> > > >> > >> > > >
> > > >> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <
> > > almog@confluent.io>
> > > >> > >> wrote:
> > > >> > >> > > >
> > > >> > >> > > > > Hi Everyone!
> > > >> > >> > > > >
> > > >> > >> > > > > Kicking off discussion for a new KIP:
> > > >> > >> > > > >
> > > >> > >> > > > >
> > > >> > >> > > >
> > > >> > >> > >
> > > >> > >> >
> > > >> > >>
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
> > > >> > >> > > > >
> > > >> > >> > > > > For those who are interested, I have a prototype
> > > >> implementation
> > > >> > >> that
> > > >> > >> > > > helped
> > > >> > >> > > > > guide my design:
> https://github.com/agavra/kafka/pull/1
> > > >> > >> > > > >
> > > >> > >> > > > > Cheers,
> > > >> > >> > > > > Almog
> > > >> > >> > > > >
> > > >> > >> > > >
> > > >> > >> > >
> > > >> > >> >
> > > >> > >>
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Posted by Almog Gavra <al...@confluent.io>.
Good catches! Fixed :)

On Thu, Aug 8, 2019 at 10:36 PM Arjun Satish <ar...@gmail.com> wrote:

> Cool!
>
> Couple of nits:
>
> - In public interfaces, typo: *json.decimal.serialization.fromat*
> - In public interfaces, you use the term "HEX" instead of "BASE64".
>
>
>
> On Wed, Aug 7, 2019 at 9:51 AM Almog Gavra <al...@confluent.io> wrote:
>
> > EDIT: everywhere I've been using "HEX" I meant to be using "BASE64". I
> will
> > update the KIP to reflect this.
> >
> > On Wed, Aug 7, 2019 at 9:44 AM Almog Gavra <al...@confluent.io> wrote:
> >
> > > Thanks for the feedback Arjun! I'm happy changing the default config to
> > > HEX instead of BINARY, no strong feelings there.
> > >
> > > I'll also clarify the example in the KIP to be clearer:
> > >
> > > - serialize the decimal field "foo" with value "10.2345" with the HEX
> > > setting: {"foo": "D3J5"}
> > > - serialize the decimal field "foo" with value "10.2345" with the
> NUMERIC
> > > setting: {"foo": 10.2345}
> > >
> > > With regards to the precision issue, that was my original concern as
> well
> > > (and why I originally suggested a TEXT format). Many JSON deserializers
> > > (e.g. Jackson with DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS),
> > > however, have the ability to deserialize decimals correctly so I will
> > > configure that as the default for Connect's JsonDeserializer. It's
> > probably
> > > a good idea to call out that using other deserializers must be done
> with
> > > care - I will add that documentation to the serialization config.
> > >
> > > Note that there would not be an issue on the _serialization_ side of
> > > things as Jackson respects BigDecimal.
> > >
> > > Almog
> > >
> > > On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish <ar...@gmail.com>
> > > wrote:
> > >
> > >> hey Almog, nice work! couple of thoughts (hope I'm not late since you
> > >> started the voting thread already):
> > >>
> > >> can you please add some examples to show the changes that you are
> > >> proposing. makes me think that for a given decimal number, we will
> have
> > >> two
> > >> encodings: “asHex” and “asNumber”.
> > >>
> > >> should we call the default config value “HEX” instead of “BINARY”?
> > >>
> > >> Should we call out the fact that JS systems might be susceptible to
> > double
> > >> precision round offs with the new numeric format? here are some people
> > >> discussing a similar problem
> > >> https://github.com/EventStore/EventStore/issues/1541
> > >>
> > >> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <al...@confluent.io>
> wrote:
> > >>
> > >> > Hello Everyone,
> > >> >
> > >> > Summarizing an in-person discussion with Randall (this is copied
> from
> > >> the
> > >> > KIP):
> > >> >
> > >> > The original KIP suggested supporting an additional representation -
> > >> base10
> > >> > encoded text (e.g. `{"asText":"10.2345"}`). This causes issues
> because
> > >> it
> > >> > is impossible to disambiguate between TEXT and BINARY without an
> > >> additional
> > >> > config - furthermore, this makes the migration from one to the other
> > >> nearly
> > >> > impossible because it would require that all consumers stop
> consuming
> > >> and
> > >> > producers stop producing and atomically updating the config on all
> of
> > >> them
> > >> > after deploying the new code, or waiting for the full retention
> period
> > >> to
> > >> > pass - neither option is viable. The suggestion in the KIP is
> strictly
> > >> an
> > >> > improvement over the existing behavior, even if it doesn't support
> all
> > >> > combinations.
> > >> >
> > >> > It seems that since most real-world use cases actually use the
> numeric
> > >> > representation (not string) we can consider this an improvement.
> With
> > >> the
> > >> > new suggestion, we don't need a deserialization configuration (only
> a
> > >> > serialization option) and the consumers will be able to always
> > >> > automatically determine the serialization format.
> > >> >
> > >> > Based on this, I'll be opening up the simplified version of the KIP
> > to a
> > >> > vote.
> > >> >
> > >> > Almog
> > >> >
> > >> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <al...@confluent.io>
> > wrote:
> > >> >
> > >> > > I'm mostly happy with your current suggestion (two configs, one
> for
> > >> > > serialization and one for deserialization) and your implementation
> > >> > > suggestion. One thing to note:
> > >> > >
> > >> > > > We should _always_ be able to deserialize a standard JSON
> > >> > > > number as a decimal
> > >> > >
> > >> > > I was doing some research into decimals and JSON, and I can
> imagine
> > a
> > >> > > compelling reason to require string representations to avoid
> losing
> > >> > > precision and to be certain that whomever is sending the data
> isn't
> > >> > losing
> > >> > > precision (e.g. https://stackoverflow.com/a/38357877/2258040).
> > >> > >
> > >> > > I'm okay with always allowing numerics, but thought it's worth
> > raising
> > >> > the
> > >> > > thought.
> > >> > >
> > >> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <an...@confluent.io>
> > >> wrote:
> > >> > >
> > >> > >> The way I see it, we need to control two seperate things:
> > >> > >>
> > >> > >> 1. How do we _deserialize_ a decimal type if we encounter a text
> > >> node in
> > >> > >> the JSON?    (We should _always_ be able to deserialize a
> standard
> > >> JSON
> > >> > >> number as a decimal).
> > >> > >> 2. How do we chose how we want decimals to be _serialized_.
> > >> > >>
> > >> > >> This looks to fits well with your second suggestion of slightly
> > >> > different
> > >> > >> configs names for serialization vs deserialization.
> > >> > >> a, For deserialization we only care about how to handle text
> > nodes: `
> > >> > >> deserialization.decimal.*text*.format`, which should only have
> two
> > >> valid
> > >> > >> values BINARY | TEXT.
> > >> > >> b. For serialization we need all three:
> > >> `serialization.decimal.format`,
> > >> > >> which should support all three options: BINARY | TEXT | NUMERIC.
> > >> > >>
> > >> > >> Implementation wise, I think these should be two separate enums,
> > >> rather
> > >> > >> than one shared enum and throwing an error if the deserializer is
> > >> set to
> > >> > >> NUMERIC.  Mainly as this means the enums reflect the options
> > >> available,
> > >> > >> rather than this being hidden in config checking code.  But
> that's
> > a
> > >> > minor
> > >> > >> implementation detail.
> > >> > >>
> > >> > >> Personally, I'd be tempted to have the BINARY value named
> something
> > >> like
> > >> > >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging users to move
> > >> away
> > >> > >> from
> > >> > >> it.
> > >> > >>
> > >> > >> It's a real shame that both of these settings require a default
> of
> > >> > BINARY
> > >> > >> for backwards compatibility, but I agree that discussions / plans
> > >> around
> > >> > >> switching the defaults should not block this KIP.
> > >> > >>
> > >> > >> Andy
> > >> > >>
> > >> > >>
> > >> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <al...@confluent.io>
> > >> wrote:
> > >> > >>
> > >> > >> > Thanks for the replies Andy and Andrew (2x Andy?)!
> > >> > >> >
> > >> > >> > > Is the text decimal a base16 encoded number, or is it base16
> > >> encoded
> > >> > >> > binary
> > >> > >> > > form of the number?
> > >> > >> >
> > >> > >> > The conversion happens as decimal.unscaledValue().toByteArray()
> > and
> > >> > then
> > >> > >> > the byte array is converted to a hex string, so it's definitely
> > the
> > >> > >> binary
> > >> > >> > form of the number converted to base16. Whether or not that's
> the
> > >> same
> > >> > >> as
> > >> > >> > the base16 encoded number is a good question (toByteArray
> > returns a
> > >> > byte
> > >> > >> > array containing a signed, big-endian, two's complement
> > >> representation
> > >> > >> of
> > >> > >> > the big integer).
> > >> > >> >
> > >> > >> > > One suggestion I have is to change the proposed new config to
> > >> only
> > >> > >> affect
> > >> > >> > > decimals stored as text, i.e. to switch between the current
> > >> base16
> > >> > and
> > >> > >> > the
> > >> > >> > > more common base10.   Then add another config to the
> serializer
> > >> only
> > >> > >> that
> > >> > >> > > controls if decimals should be serialized as text or numeric.
> > >> > >> >
> > >> > >> > I think we need to be able to handle all mappings from
> > >> serialization
> > >> > >> format
> > >> > >> > to deserialization format (e.g. read in BINARY and output
> TEXT),
> > >> > which I
> > >> > >> > think would be impossible with the alternative suggestion. I
> > agree
> > >> > that
> > >> > >> > automatically deserializing numerics is valuable. I see two
> other
> > >> ways
> > >> > >> to
> > >> > >> > get this, both keeping the serialization.format config the
> same:
> > >> > >> >
> > >> > >> > - have json.decimal.deserialization.format accept all three
> > >> formats.
> > >> > if
> > >> > >> set
> > >> > >> > to BINARY/TEXT, numerics would be automatically supported. If
> set
> > >> to
> > >> > >> > NUMERIC, then any string coming in would result in
> > deserialization
> > >> > error
> > >> > >> > (defaults to BINARY for backwards compatibility)
> > >> > >> > - change json.decimal.deserialization.format to
> > >> > >> > json.decimal.deserialization.string.format which accepts only
> > >> > >> BINARY/TEXT
> > >> > >> > (defaults to BINARY for backwards compatibility)
> > >> > >> >
> > >> > >> > > would be a breaking change in that things that previously
> > failed
> > >> > would
> > >> > >> > > suddenly start deserializing.  This is a price I'm willing to
> > >> pay.
> > >> > >> >
> > >> > >> > I agree. I'm willing to pay this price too.
> > >> > >> >
> > >> > >> > > IMHO, we should then plan to switch the default of decimal
> > >> > >> serialization
> > >> > >> > to
> > >> > >> > > numeric, and text serialization to base 10 in the next major
> > >> > release.
> > >> > >> >
> > >> > >> > I think that can be a separate discussion, I don't want to
> block
> > >> this
> > >> > >> KIP
> > >> > >> > on it.
> > >> > >> >
> > >> > >> > Thoughts?
> > >> > >> >
> > >> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <
> otto@wikimedia.org>
> > >> > wrote:
> > >> > >> >
> > >> > >> > > This is a bit orthogonal, but in JsonSchemaConverter I use
> > >> > >> JSONSchemas to
> > >> > >> > > indicate whether a JSON number should be deserialized as an
> > >> integer
> > >> > >> or a
> > >> > >> > > decimal
> > >> > >> > > <
> > >> > >> > >
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
> > >> > >> > > >.
> > >> > >> > > Not everyone is going to have JSONSchemas available when
> > >> converting,
> > >> > >> but
> > >> > >> > if
> > >> > >> > > you do, it is an easy way to support JSON numbers as
> decimals.
> > >> > >> > >
> > >> > >> > > Carry on! :)
> > >> > >> > >
> > >> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <
> andy@confluent.io
> > >
> > >> > >> wrote:
> > >> > >> > >
> > >> > >> > > > Hi Almog,
> > >> > >> > > >
> > >> > >> > > > Like the KIP - I think being able to support decimals in
> JSON
> > >> in
> > >> > the
> > >> > >> > same
> > >> > >> > > > way most other systems do is a great improvement.
> > >> > >> > > >
> > >> > >> > > > It's not 100% clear to me from the KIP what the current
> > format
> > >> is.
> > >> > >> Is
> > >> > >> > > the
> > >> > >> > > > text decimal a base16 encoded number, or is it base16
> encoded
> > >> > binary
> > >> > >> > form
> > >> > >> > > > of the number? (I've not tried to get my head around if
> these
> > >> two
> > >> > >> are
> > >> > >> > > even
> > >> > >> > > > different!)
> > >> > >> > > >
> > >> > >> > > > One suggestion I have is to change the proposed new config
> to
> > >> only
> > >> > >> > affect
> > >> > >> > > > decimals stored as text, i.e. to switch between the current
> > >> base16
> > >> > >> and
> > >> > >> > > the
> > >> > >> > > > more common base10.   Then add another config to the
> > serialzier
> > >> > only
> > >> > >> > that
> > >> > >> > > > controls if decimals should be serialized as text or
> numeric.
> > >> The
> > >> > >> > > benefit
> > >> > >> > > > of this approach is it allows us to enhance the
> deserializer
> > to
> > >> > >> > > > automatically handle numeric decimals even without any
> config
> > >> > >> having to
> > >> > >> > > be
> > >> > >> > > > set, i.e. default config in the deserializer would be able
> to
> > >> > handle
> > >> > >> > > > numeric decimals.  Of course, this is a two edged sword:
> this
> > >> > would
> > >> > >> > make
> > >> > >> > > > the deserializer work out of the box with numeric decimals,
> > >> > (yay!),
> > >> > >> but
> > >> > >> > > > would be a breaking change in that things that previously
> > >> failed
> > >> > >> would
> > >> > >> > > > suddenly start deserializing.  This is a price I'm willing
> to
> > >> pay.
> > >> > >> > > >
> > >> > >> > > > IMHO, we should then plan to switch the default of decimal
> > >> > >> > serialization
> > >> > >> > > to
> > >> > >> > > > numeric, and text serialization to base 10 in the next
> major
> > >> > >> release.
> > >> > >> > > > (With upgrade notes to match). Though I know this is more
> > >> > >> contentious,
> > >> > >> > I
> > >> > >> > > > think it moves us forward in a much more standard way that
> > the
> > >> > >> current
> > >> > >> > > > encoding of decimals.
> > >> > >> > > >
> > >> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <
> > almog@confluent.io>
> > >> > >> wrote:
> > >> > >> > > >
> > >> > >> > > > > Hi Everyone!
> > >> > >> > > > >
> > >> > >> > > > > Kicking off discussion for a new KIP:
> > >> > >> > > > >
> > >> > >> > > > >
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
> > >> > >> > > > >
> > >> > >> > > > > For those who are interested, I have a prototype
> > >> implementation
> > >> > >> that
> > >> > >> > > > helped
> > >> > >> > > > > guide my design: https://github.com/agavra/kafka/pull/1
> > >> > >> > > > >
> > >> > >> > > > > Cheers,
> > >> > >> > > > > Almog
> > >> > >> > > > >
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> > >>
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Posted by Arjun Satish <ar...@gmail.com>.
Cool!

Couple of nits:

- In public interfaces, typo: *json.decimal.serialization.fromat*
- In public interfaces, you use the term "HEX" instead of "BASE64".



On Wed, Aug 7, 2019 at 9:51 AM Almog Gavra <al...@confluent.io> wrote:

> EDIT: everywhere I've been using "HEX" I meant to be using "BASE64". I will
> update the KIP to reflect this.
>
> On Wed, Aug 7, 2019 at 9:44 AM Almog Gavra <al...@confluent.io> wrote:
>
> > Thanks for the feedback Arjun! I'm happy changing the default config to
> > HEX instead of BINARY, no strong feelings there.
> >
> > I'll also clarify the example in the KIP to be clearer:
> >
> > - serialize the decimal field "foo" with value "10.2345" with the HEX
> > setting: {"foo": "D3J5"}
> > - serialize the decimal field "foo" with value "10.2345" with the NUMERIC
> > setting: {"foo": 10.2345}
> >
> > With regards to the precision issue, that was my original concern as well
> > (and why I originally suggested a TEXT format). Many JSON deserializers
> > (e.g. Jackson with DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS),
> > however, have the ability to deserialize decimals correctly so I will
> > configure that as the default for Connect's JsonDeserializer. It's
> probably
> > a good idea to call out that using other deserializers must be done with
> > care - I will add that documentation to the serialization config.
> >
> > Note that there would not be an issue on the _serialization_ side of
> > things as Jackson respects BigDecimal.
> >
> > Almog
> >
> > On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish <ar...@gmail.com>
> > wrote:
> >
> >> hey Almog, nice work! couple of thoughts (hope I'm not late since you
> >> started the voting thread already):
> >>
> >> can you please add some examples to show the changes that you are
> >> proposing. makes me think that for a given decimal number, we will have
> >> two
> >> encodings: “asHex” and “asNumber”.
> >>
> >> should we call the default config value “HEX” instead of “BINARY”?
> >>
> >> Should we call out the fact that JS systems might be susceptible to
> double
> >> precision round offs with the new numeric format? here are some people
> >> discussing a similar problem
> >> https://github.com/EventStore/EventStore/issues/1541
> >>
> >> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <al...@confluent.io> wrote:
> >>
> >> > Hello Everyone,
> >> >
> >> > Summarizing an in-person discussion with Randall (this is copied from
> >> the
> >> > KIP):
> >> >
> >> > The original KIP suggested supporting an additional representation -
> >> base10
> >> > encoded text (e.g. `{"asText":"10.2345"}`). This causes issues because
> >> it
> >> > is impossible to disambiguate between TEXT and BINARY without an
> >> additional
> >> > config - furthermore, this makes the migration from one to the other
> >> nearly
> >> > impossible because it would require that all consumers stop consuming
> >> and
> >> > producers stop producing and atomically updating the config on all of
> >> them
> >> > after deploying the new code, or waiting for the full retention period
> >> to
> >> > pass - neither option is viable. The suggestion in the KIP is strictly
> >> an
> >> > improvement over the existing behavior, even if it doesn't support all
> >> > combinations.
> >> >
> >> > It seems that since most real-world use cases actually use the numeric
> >> > representation (not string) we can consider this an improvement. With
> >> the
> >> > new suggestion, we don't need a deserialization configuration (only a
> >> > serialization option) and the consumers will be able to always
> >> > automatically determine the serialization format.
> >> >
> >> > Based on this, I'll be opening up the simplified version of the KIP
> to a
> >> > vote.
> >> >
> >> > Almog
> >> >
> >> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <al...@confluent.io>
> wrote:
> >> >
> >> > > I'm mostly happy with your current suggestion (two configs, one for
> >> > > serialization and one for deserialization) and your implementation
> >> > > suggestion. One thing to note:
> >> > >
> >> > > > We should _always_ be able to deserialize a standard JSON
> >> > > > number as a decimal
> >> > >
> >> > > I was doing some research into decimals and JSON, and I can imagine
> a
> >> > > compelling reason to require string representations to avoid losing
> >> > > precision and to be certain that whomever is sending the data isn't
> >> > losing
> >> > > precision (e.g. https://stackoverflow.com/a/38357877/2258040).
> >> > >
> >> > > I'm okay with always allowing numerics, but thought it's worth
> raising
> >> > the
> >> > > thought.
> >> > >
> >> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <an...@confluent.io>
> >> wrote:
> >> > >
> >> > >> The way I see it, we need to control two seperate things:
> >> > >>
> >> > >> 1. How do we _deserialize_ a decimal type if we encounter a text
> >> node in
> >> > >> the JSON?    (We should _always_ be able to deserialize a standard
> >> JSON
> >> > >> number as a decimal).
> >> > >> 2. How do we chose how we want decimals to be _serialized_.
> >> > >>
> >> > >> This looks to fits well with your second suggestion of slightly
> >> > different
> >> > >> configs names for serialization vs deserialization.
> >> > >> a, For deserialization we only care about how to handle text
> nodes: `
> >> > >> deserialization.decimal.*text*.format`, which should only have two
> >> valid
> >> > >> values BINARY | TEXT.
> >> > >> b. For serialization we need all three:
> >> `serialization.decimal.format`,
> >> > >> which should support all three options: BINARY | TEXT | NUMERIC.
> >> > >>
> >> > >> Implementation wise, I think these should be two separate enums,
> >> rather
> >> > >> than one shared enum and throwing an error if the deserializer is
> >> set to
> >> > >> NUMERIC.  Mainly as this means the enums reflect the options
> >> available,
> >> > >> rather than this being hidden in config checking code.  But that's
> a
> >> > minor
> >> > >> implementation detail.
> >> > >>
> >> > >> Personally, I'd be tempted to have the BINARY value named something
> >> like
> >> > >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging users to move
> >> away
> >> > >> from
> >> > >> it.
> >> > >>
> >> > >> It's a real shame that both of these settings require a default of
> >> > BINARY
> >> > >> for backwards compatibility, but I agree that discussions / plans
> >> around
> >> > >> switching the defaults should not block this KIP.
> >> > >>
> >> > >> Andy
> >> > >>
> >> > >>
> >> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <al...@confluent.io>
> >> wrote:
> >> > >>
> >> > >> > Thanks for the replies Andy and Andrew (2x Andy?)!
> >> > >> >
> >> > >> > > Is the text decimal a base16 encoded number, or is it base16
> >> encoded
> >> > >> > binary
> >> > >> > > form of the number?
> >> > >> >
> >> > >> > The conversion happens as decimal.unscaledValue().toByteArray()
> and
> >> > then
> >> > >> > the byte array is converted to a hex string, so it's definitely
> the
> >> > >> binary
> >> > >> > form of the number converted to base16. Whether or not that's the
> >> same
> >> > >> as
> >> > >> > the base16 encoded number is a good question (toByteArray
> returns a
> >> > byte
> >> > >> > array containing a signed, big-endian, two's complement
> >> representation
> >> > >> of
> >> > >> > the big integer).
> >> > >> >
> >> > >> > > One suggestion I have is to change the proposed new config to
> >> only
> >> > >> affect
> >> > >> > > decimals stored as text, i.e. to switch between the current
> >> base16
> >> > and
> >> > >> > the
> >> > >> > > more common base10.   Then add another config to the serializer
> >> only
> >> > >> that
> >> > >> > > controls if decimals should be serialized as text or numeric.
> >> > >> >
> >> > >> > I think we need to be able to handle all mappings from
> >> serialization
> >> > >> format
> >> > >> > to deserialization format (e.g. read in BINARY and output TEXT),
> >> > which I
> >> > >> > think would be impossible with the alternative suggestion. I
> agree
> >> > that
> >> > >> > automatically deserializing numerics is valuable. I see two other
> >> ways
> >> > >> to
> >> > >> > get this, both keeping the serialization.format config the same:
> >> > >> >
> >> > >> > - have json.decimal.deserialization.format accept all three
> >> formats.
> >> > if
> >> > >> set
> >> > >> > to BINARY/TEXT, numerics would be automatically supported. If set
> >> to
> >> > >> > NUMERIC, then any string coming in would result in
> deserialization
> >> > error
> >> > >> > (defaults to BINARY for backwards compatibility)
> >> > >> > - change json.decimal.deserialization.format to
> >> > >> > json.decimal.deserialization.string.format which accepts only
> >> > >> BINARY/TEXT
> >> > >> > (defaults to BINARY for backwards compatibility)
> >> > >> >
> >> > >> > > would be a breaking change in that things that previously
> failed
> >> > would
> >> > >> > > suddenly start deserializing.  This is a price I'm willing to
> >> pay.
> >> > >> >
> >> > >> > I agree. I'm willing to pay this price too.
> >> > >> >
> >> > >> > > IMHO, we should then plan to switch the default of decimal
> >> > >> serialization
> >> > >> > to
> >> > >> > > numeric, and text serialization to base 10 in the next major
> >> > release.
> >> > >> >
> >> > >> > I think that can be a separate discussion, I don't want to block
> >> this
> >> > >> KIP
> >> > >> > on it.
> >> > >> >
> >> > >> > Thoughts?
> >> > >> >
> >> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <ot...@wikimedia.org>
> >> > wrote:
> >> > >> >
> >> > >> > > This is a bit orthogonal, but in JsonSchemaConverter I use
> >> > >> JSONSchemas to
> >> > >> > > indicate whether a JSON number should be deserialized as an
> >> integer
> >> > >> or a
> >> > >> > > decimal
> >> > >> > > <
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
> >> > >> > > >.
> >> > >> > > Not everyone is going to have JSONSchemas available when
> >> converting,
> >> > >> but
> >> > >> > if
> >> > >> > > you do, it is an easy way to support JSON numbers as decimals.
> >> > >> > >
> >> > >> > > Carry on! :)
> >> > >> > >
> >> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <andy@confluent.io
> >
> >> > >> wrote:
> >> > >> > >
> >> > >> > > > Hi Almog,
> >> > >> > > >
> >> > >> > > > Like the KIP - I think being able to support decimals in JSON
> >> in
> >> > the
> >> > >> > same
> >> > >> > > > way most other systems do is a great improvement.
> >> > >> > > >
> >> > >> > > > It's not 100% clear to me from the KIP what the current
> format
> >> is.
> >> > >> Is
> >> > >> > > the
> >> > >> > > > text decimal a base16 encoded number, or is it base16 encoded
> >> > binary
> >> > >> > form
> >> > >> > > > of the number? (I've not tried to get my head around if these
> >> two
> >> > >> are
> >> > >> > > even
> >> > >> > > > different!)
> >> > >> > > >
> >> > >> > > > One suggestion I have is to change the proposed new config to
> >> only
> >> > >> > affect
> >> > >> > > > decimals stored as text, i.e. to switch between the current
> >> base16
> >> > >> and
> >> > >> > > the
> >> > >> > > > more common base10.   Then add another config to the
> serialzier
> >> > only
> >> > >> > that
> >> > >> > > > controls if decimals should be serialized as text or numeric.
> >> The
> >> > >> > > benefit
> >> > >> > > > of this approach is it allows us to enhance the deserializer
> to
> >> > >> > > > automatically handle numeric decimals even without any config
> >> > >> having to
> >> > >> > > be
> >> > >> > > > set, i.e. default config in the deserializer would be able to
> >> > handle
> >> > >> > > > numeric decimals.  Of course, this is a two edged sword: this
> >> > would
> >> > >> > make
> >> > >> > > > the deserializer work out of the box with numeric decimals,
> >> > (yay!),
> >> > >> but
> >> > >> > > > would be a breaking change in that things that previously
> >> failed
> >> > >> would
> >> > >> > > > suddenly start deserializing.  This is a price I'm willing to
> >> pay.
> >> > >> > > >
> >> > >> > > > IMHO, we should then plan to switch the default of decimal
> >> > >> > serialization
> >> > >> > > to
> >> > >> > > > numeric, and text serialization to base 10 in the next major
> >> > >> release.
> >> > >> > > > (With upgrade notes to match). Though I know this is more
> >> > >> contentious,
> >> > >> > I
> >> > >> > > > think it moves us forward in a much more standard way that
> the
> >> > >> current
> >> > >> > > > encoding of decimals.
> >> > >> > > >
> >> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <
> almog@confluent.io>
> >> > >> wrote:
> >> > >> > > >
> >> > >> > > > > Hi Everyone!
> >> > >> > > > >
> >> > >> > > > > Kicking off discussion for a new KIP:
> >> > >> > > > >
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
> >> > >> > > > >
> >> > >> > > > > For those who are interested, I have a prototype
> >> implementation
> >> > >> that
> >> > >> > > > helped
> >> > >> > > > > guide my design: https://github.com/agavra/kafka/pull/1
> >> > >> > > > >
> >> > >> > > > > Cheers,
> >> > >> > > > > Almog
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Posted by Almog Gavra <al...@confluent.io>.
EDIT: everywhere I've been using "HEX" I meant to be using "BASE64". I will
update the KIP to reflect this.

On Wed, Aug 7, 2019 at 9:44 AM Almog Gavra <al...@confluent.io> wrote:

> Thanks for the feedback Arjun! I'm happy changing the default config to
> HEX instead of BINARY, no strong feelings there.
>
> I'll also clarify the example in the KIP to be clearer:
>
> - serialize the decimal field "foo" with value "10.2345" with the HEX
> setting: {"foo": "D3J5"}
> - serialize the decimal field "foo" with value "10.2345" with the NUMERIC
> setting: {"foo": 10.2345}
>
> With regards to the precision issue, that was my original concern as well
> (and why I originally suggested a TEXT format). Many JSON deserializers
> (e.g. Jackson with DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS),
> however, have the ability to deserialize decimals correctly so I will
> configure that as the default for Connect's JsonDeserializer. It's probably
> a good idea to call out that using other deserializers must be done with
> care - I will add that documentation to the serialization config.
>
> Note that there would not be an issue on the _serialization_ side of
> things as Jackson respects BigDecimal.
>
> Almog
>
> On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish <ar...@gmail.com>
> wrote:
>
>> hey Almog, nice work! couple of thoughts (hope I'm not late since you
>> started the voting thread already):
>>
>> can you please add some examples to show the changes that you are
>> proposing. makes me think that for a given decimal number, we will have
>> two
>> encodings: “asHex” and “asNumber”.
>>
>> should we call the default config value “HEX” instead of “BINARY”?
>>
>> Should we call out the fact that JS systems might be susceptible to double
>> precision round offs with the new numeric format? here are some people
>> discussing a similar problem
>> https://github.com/EventStore/EventStore/issues/1541
>>
>> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <al...@confluent.io> wrote:
>>
>> > Hello Everyone,
>> >
>> > Summarizing an in-person discussion with Randall (this is copied from
>> the
>> > KIP):
>> >
>> > The original KIP suggested supporting an additional representation -
>> base10
>> > encoded text (e.g. `{"asText":"10.2345"}`). This causes issues because
>> it
>> > is impossible to disambiguate between TEXT and BINARY without an
>> additional
>> > config - furthermore, this makes the migration from one to the other
>> nearly
>> > impossible because it would require that all consumers stop consuming
>> and
>> > producers stop producing and atomically updating the config on all of
>> them
>> > after deploying the new code, or waiting for the full retention period
>> to
>> > pass - neither option is viable. The suggestion in the KIP is strictly
>> an
>> > improvement over the existing behavior, even if it doesn't support all
>> > combinations.
>> >
>> > It seems that since most real-world use cases actually use the numeric
>> > representation (not string) we can consider this an improvement. With
>> the
>> > new suggestion, we don't need a deserialization configuration (only a
>> > serialization option) and the consumers will be able to always
>> > automatically determine the serialization format.
>> >
>> > Based on this, I'll be opening up the simplified version of the KIP to a
>> > vote.
>> >
>> > Almog
>> >
>> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <al...@confluent.io> wrote:
>> >
>> > > I'm mostly happy with your current suggestion (two configs, one for
>> > > serialization and one for deserialization) and your implementation
>> > > suggestion. One thing to note:
>> > >
>> > > > We should _always_ be able to deserialize a standard JSON
>> > > > number as a decimal
>> > >
>> > > I was doing some research into decimals and JSON, and I can imagine a
>> > > compelling reason to require string representations to avoid losing
>> > > precision and to be certain that whomever is sending the data isn't
>> > losing
>> > > precision (e.g. https://stackoverflow.com/a/38357877/2258040).
>> > >
>> > > I'm okay with always allowing numerics, but thought it's worth raising
>> > the
>> > > thought.
>> > >
>> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <an...@confluent.io>
>> wrote:
>> > >
>> > >> The way I see it, we need to control two seperate things:
>> > >>
>> > >> 1. How do we _deserialize_ a decimal type if we encounter a text
>> node in
>> > >> the JSON?    (We should _always_ be able to deserialize a standard
>> JSON
>> > >> number as a decimal).
>> > >> 2. How do we chose how we want decimals to be _serialized_.
>> > >>
>> > >> This looks to fits well with your second suggestion of slightly
>> > different
>> > >> configs names for serialization vs deserialization.
>> > >> a, For deserialization we only care about how to handle text nodes: `
>> > >> deserialization.decimal.*text*.format`, which should only have two
>> valid
>> > >> values BINARY | TEXT.
>> > >> b. For serialization we need all three:
>> `serialization.decimal.format`,
>> > >> which should support all three options: BINARY | TEXT | NUMERIC.
>> > >>
>> > >> Implementation wise, I think these should be two separate enums,
>> rather
>> > >> than one shared enum and throwing an error if the deserializer is
>> set to
>> > >> NUMERIC.  Mainly as this means the enums reflect the options
>> available,
>> > >> rather than this being hidden in config checking code.  But that's a
>> > minor
>> > >> implementation detail.
>> > >>
>> > >> Personally, I'd be tempted to have the BINARY value named something
>> like
>> > >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging users to move
>> away
>> > >> from
>> > >> it.
>> > >>
>> > >> It's a real shame that both of these settings require a default of
>> > BINARY
>> > >> for backwards compatibility, but I agree that discussions / plans
>> around
>> > >> switching the defaults should not block this KIP.
>> > >>
>> > >> Andy
>> > >>
>> > >>
>> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <al...@confluent.io>
>> wrote:
>> > >>
>> > >> > Thanks for the replies Andy and Andrew (2x Andy?)!
>> > >> >
>> > >> > > Is the text decimal a base16 encoded number, or is it base16
>> encoded
>> > >> > binary
>> > >> > > form of the number?
>> > >> >
>> > >> > The conversion happens as decimal.unscaledValue().toByteArray() and
>> > then
>> > >> > the byte array is converted to a hex string, so it's definitely the
>> > >> binary
>> > >> > form of the number converted to base16. Whether or not that's the
>> same
>> > >> as
>> > >> > the base16 encoded number is a good question (toByteArray returns a
>> > byte
>> > >> > array containing a signed, big-endian, two's complement
>> representation
>> > >> of
>> > >> > the big integer).
>> > >> >
>> > >> > > One suggestion I have is to change the proposed new config to
>> only
>> > >> affect
>> > >> > > decimals stored as text, i.e. to switch between the current
>> base16
>> > and
>> > >> > the
>> > >> > > more common base10.   Then add another config to the serializer
>> only
>> > >> that
>> > >> > > controls if decimals should be serialized as text or numeric.
>> > >> >
>> > >> > I think we need to be able to handle all mappings from
>> serialization
>> > >> format
>> > >> > to deserialization format (e.g. read in BINARY and output TEXT),
>> > which I
>> > >> > think would be impossible with the alternative suggestion. I agree
>> > that
>> > >> > automatically deserializing numerics is valuable. I see two other
>> ways
>> > >> to
>> > >> > get this, both keeping the serialization.format config the same:
>> > >> >
>> > >> > - have json.decimal.deserialization.format accept all three
>> formats.
>> > if
>> > >> set
>> > >> > to BINARY/TEXT, numerics would be automatically supported. If set
>> to
>> > >> > NUMERIC, then any string coming in would result in deserialization
>> > error
>> > >> > (defaults to BINARY for backwards compatibility)
>> > >> > - change json.decimal.deserialization.format to
>> > >> > json.decimal.deserialization.string.format which accepts only
>> > >> BINARY/TEXT
>> > >> > (defaults to BINARY for backwards compatibility)
>> > >> >
>> > >> > > would be a breaking change in that things that previously failed
>> > would
>> > >> > > suddenly start deserializing.  This is a price I'm willing to
>> pay.
>> > >> >
>> > >> > I agree. I'm willing to pay this price too.
>> > >> >
>> > >> > > IMHO, we should then plan to switch the default of decimal
>> > >> serialization
>> > >> > to
>> > >> > > numeric, and text serialization to base 10 in the next major
>> > release.
>> > >> >
>> > >> > I think that can be a separate discussion, I don't want to block
>> this
>> > >> KIP
>> > >> > on it.
>> > >> >
>> > >> > Thoughts?
>> > >> >
>> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <ot...@wikimedia.org>
>> > wrote:
>> > >> >
>> > >> > > This is a bit orthogonal, but in JsonSchemaConverter I use
>> > >> JSONSchemas to
>> > >> > > indicate whether a JSON number should be deserialized as an
>> integer
>> > >> or a
>> > >> > > decimal
>> > >> > > <
>> > >> > >
>> > >> >
>> > >>
>> >
>> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
>> > >> > > >.
>> > >> > > Not everyone is going to have JSONSchemas available when
>> converting,
>> > >> but
>> > >> > if
>> > >> > > you do, it is an easy way to support JSON numbers as decimals.
>> > >> > >
>> > >> > > Carry on! :)
>> > >> > >
>> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <an...@confluent.io>
>> > >> wrote:
>> > >> > >
>> > >> > > > Hi Almog,
>> > >> > > >
>> > >> > > > Like the KIP - I think being able to support decimals in JSON
>> in
>> > the
>> > >> > same
>> > >> > > > way most other systems do is a great improvement.
>> > >> > > >
>> > >> > > > It's not 100% clear to me from the KIP what the current format
>> is.
>> > >> Is
>> > >> > > the
>> > >> > > > text decimal a base16 encoded number, or is it base16 encoded
>> > binary
>> > >> > form
>> > >> > > > of the number? (I've not tried to get my head around if these
>> two
>> > >> are
>> > >> > > even
>> > >> > > > different!)
>> > >> > > >
>> > >> > > > One suggestion I have is to change the proposed new config to
>> only
>> > >> > affect
>> > >> > > > decimals stored as text, i.e. to switch between the current
>> base16
>> > >> and
>> > >> > > the
>> > >> > > > more common base10.   Then add another config to the serialzier
>> > only
>> > >> > that
>> > >> > > > controls if decimals should be serialized as text or numeric.
>> The
>> > >> > > benefit
>> > >> > > > of this approach is it allows us to enhance the deserializer to
>> > >> > > > automatically handle numeric decimals even without any config
>> > >> having to
>> > >> > > be
>> > >> > > > set, i.e. default config in the deserializer would be able to
>> > handle
>> > >> > > > numeric decimals.  Of course, this is a two edged sword: this
>> > would
>> > >> > make
>> > >> > > > the deserializer work out of the box with numeric decimals,
>> > (yay!),
>> > >> but
>> > >> > > > would be a breaking change in that things that previously
>> failed
>> > >> would
>> > >> > > > suddenly start deserializing.  This is a price I'm willing to
>> pay.
>> > >> > > >
>> > >> > > > IMHO, we should then plan to switch the default of decimal
>> > >> > serialization
>> > >> > > to
>> > >> > > > numeric, and text serialization to base 10 in the next major
>> > >> release.
>> > >> > > > (With upgrade notes to match). Though I know this is more
>> > >> contentious,
>> > >> > I
>> > >> > > > think it moves us forward in a much more standard way that the
>> > >> current
>> > >> > > > encoding of decimals.
>> > >> > > >
>> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <al...@confluent.io>
>> > >> wrote:
>> > >> > > >
>> > >> > > > > Hi Everyone!
>> > >> > > > >
>> > >> > > > > Kicking off discussion for a new KIP:
>> > >> > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
>> > >> > > > >
>> > >> > > > > For those who are interested, I have a prototype
>> implementation
>> > >> that
>> > >> > > > helped
>> > >> > > > > guide my design: https://github.com/agavra/kafka/pull/1
>> > >> > > > >
>> > >> > > > > Cheers,
>> > >> > > > > Almog
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Posted by Almog Gavra <al...@confluent.io>.
Thanks for the feedback Arjun! I'm happy changing the default config to HEX
instead of BINARY, no strong feelings there.

I'll also clarify the example in the KIP to be clearer:

- serialize the decimal field "foo" with value "10.2345" with the HEX
setting: {"foo": "D3J5"}
- serialize the decimal field "foo" with value "10.2345" with the NUMERIC
setting: {"foo": 10.2345}

With regards to the precision issue, that was my original concern as well
(and why I originally suggested a TEXT format). Many JSON deserializers
(e.g. Jackson with DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS),
however, have the ability to deserialize decimals correctly so I will
configure that as the default for Connect's JsonDeserializer. It's probably
a good idea to call out that using other deserializers must be done with
care - I will add that documentation to the serialization config.

Note that there would not be an issue on the _serialization_ side of things
as Jackson respects BigDecimal.

Almog

On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish <ar...@gmail.com> wrote:

> hey Almog, nice work! couple of thoughts (hope I'm not late since you
> started the voting thread already):
>
> can you please add some examples to show the changes that you are
> proposing. makes me think that for a given decimal number, we will have two
> encodings: “asHex” and “asNumber”.
>
> should we call the default config value “HEX” instead of “BINARY”?
>
> Should we call out the fact that JS systems might be susceptible to double
> precision round offs with the new numeric format? here are some people
> discussing a similar problem
> https://github.com/EventStore/EventStore/issues/1541
>
> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <al...@confluent.io> wrote:
>
> > Hello Everyone,
> >
> > Summarizing an in-person discussion with Randall (this is copied from the
> > KIP):
> >
> > The original KIP suggested supporting an additional representation -
> base10
> > encoded text (e.g. `{"asText":"10.2345"}`). This causes issues because it
> > is impossible to disambiguate between TEXT and BINARY without an
> additional
> > config - furthermore, this makes the migration from one to the other
> nearly
> > impossible because it would require that all consumers stop consuming and
> > producers stop producing and atomically updating the config on all of
> them
> > after deploying the new code, or waiting for the full retention period to
> > pass - neither option is viable. The suggestion in the KIP is strictly an
> > improvement over the existing behavior, even if it doesn't support all
> > combinations.
> >
> > It seems that since most real-world use cases actually use the numeric
> > representation (not string) we can consider this an improvement. With the
> > new suggestion, we don't need a deserialization configuration (only a
> > serialization option) and the consumers will be able to always
> > automatically determine the serialization format.
> >
> > Based on this, I'll be opening up the simplified version of the KIP to a
> > vote.
> >
> > Almog
> >
> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <al...@confluent.io> wrote:
> >
> > > I'm mostly happy with your current suggestion (two configs, one for
> > > serialization and one for deserialization) and your implementation
> > > suggestion. One thing to note:
> > >
> > > > We should _always_ be able to deserialize a standard JSON
> > > > number as a decimal
> > >
> > > I was doing some research into decimals and JSON, and I can imagine a
> > > compelling reason to require string representations to avoid losing
> > > precision and to be certain that whomever is sending the data isn't
> > losing
> > > precision (e.g. https://stackoverflow.com/a/38357877/2258040).
> > >
> > > I'm okay with always allowing numerics, but thought it's worth raising
> > the
> > > thought.
> > >
> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <an...@confluent.io> wrote:
> > >
> > >> The way I see it, we need to control two seperate things:
> > >>
> > >> 1. How do we _deserialize_ a decimal type if we encounter a text node
> in
> > >> the JSON?    (We should _always_ be able to deserialize a standard
> JSON
> > >> number as a decimal).
> > >> 2. How do we chose how we want decimals to be _serialized_.
> > >>
> > >> This looks to fits well with your second suggestion of slightly
> > different
> > >> configs names for serialization vs deserialization.
> > >> a, For deserialization we only care about how to handle text nodes: `
> > >> deserialization.decimal.*text*.format`, which should only have two
> valid
> > >> values BINARY | TEXT.
> > >> b. For serialization we need all three:
> `serialization.decimal.format`,
> > >> which should support all three options: BINARY | TEXT | NUMERIC.
> > >>
> > >> Implementation wise, I think these should be two separate enums,
> rather
> > >> than one shared enum and throwing an error if the deserializer is set
> to
> > >> NUMERIC.  Mainly as this means the enums reflect the options
> available,
> > >> rather than this being hidden in config checking code.  But that's a
> > minor
> > >> implementation detail.
> > >>
> > >> Personally, I'd be tempted to have the BINARY value named something
> like
> > >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging users to move away
> > >> from
> > >> it.
> > >>
> > >> It's a real shame that both of these settings require a default of
> > BINARY
> > >> for backwards compatibility, but I agree that discussions / plans
> around
> > >> switching the defaults should not block this KIP.
> > >>
> > >> Andy
> > >>
> > >>
> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <al...@confluent.io> wrote:
> > >>
> > >> > Thanks for the replies Andy and Andrew (2x Andy?)!
> > >> >
> > >> > > Is the text decimal a base16 encoded number, or is it base16
> encoded
> > >> > binary
> > >> > > form of the number?
> > >> >
> > >> > The conversion happens as decimal.unscaledValue().toByteArray() and
> > then
> > >> > the byte array is converted to a hex string, so it's definitely the
> > >> binary
> > >> > form of the number converted to base16. Whether or not that's the
> same
> > >> as
> > >> > the base16 encoded number is a good question (toByteArray returns a
> > byte
> > >> > array containing a signed, big-endian, two's complement
> representation
> > >> of
> > >> > the big integer).
> > >> >
> > >> > > One suggestion I have is to change the proposed new config to only
> > >> affect
> > >> > > decimals stored as text, i.e. to switch between the current base16
> > and
> > >> > the
> > >> > > more common base10.   Then add another config to the serializer
> only
> > >> that
> > >> > > controls if decimals should be serialized as text or numeric.
> > >> >
> > >> > I think we need to be able to handle all mappings from serialization
> > >> format
> > >> > to deserialization format (e.g. read in BINARY and output TEXT),
> > which I
> > >> > think would be impossible with the alternative suggestion. I agree
> > that
> > >> > automatically deserializing numerics is valuable. I see two other
> ways
> > >> to
> > >> > get this, both keeping the serialization.format config the same:
> > >> >
> > >> > - have json.decimal.deserialization.format accept all three formats.
> > if
> > >> set
> > >> > to BINARY/TEXT, numerics would be automatically supported. If set to
> > >> > NUMERIC, then any string coming in would result in deserialization
> > error
> > >> > (defaults to BINARY for backwards compatibility)
> > >> > - change json.decimal.deserialization.format to
> > >> > json.decimal.deserialization.string.format which accepts only
> > >> BINARY/TEXT
> > >> > (defaults to BINARY for backwards compatibility)
> > >> >
> > >> > > would be a breaking change in that things that previously failed
> > would
> > >> > > suddenly start deserializing.  This is a price I'm willing to pay.
> > >> >
> > >> > I agree. I'm willing to pay this price too.
> > >> >
> > >> > > IMHO, we should then plan to switch the default of decimal
> > >> serialization
> > >> > to
> > >> > > numeric, and text serialization to base 10 in the next major
> > release.
> > >> >
> > >> > I think that can be a separate discussion, I don't want to block
> this
> > >> KIP
> > >> > on it.
> > >> >
> > >> > Thoughts?
> > >> >
> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <ot...@wikimedia.org>
> > wrote:
> > >> >
> > >> > > This is a bit orthogonal, but in JsonSchemaConverter I use
> > >> JSONSchemas to
> > >> > > indicate whether a JSON number should be deserialized as an
> integer
> > >> or a
> > >> > > decimal
> > >> > > <
> > >> > >
> > >> >
> > >>
> >
> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
> > >> > > >.
> > >> > > Not everyone is going to have JSONSchemas available when
> converting,
> > >> but
> > >> > if
> > >> > > you do, it is an easy way to support JSON numbers as decimals.
> > >> > >
> > >> > > Carry on! :)
> > >> > >
> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <an...@confluent.io>
> > >> wrote:
> > >> > >
> > >> > > > Hi Almog,
> > >> > > >
> > >> > > > Like the KIP - I think being able to support decimals in JSON in
> > the
> > >> > same
> > >> > > > way most other systems do is a great improvement.
> > >> > > >
> > >> > > > It's not 100% clear to me from the KIP what the current format
> is.
> > >> Is
> > >> > > the
> > >> > > > text decimal a base16 encoded number, or is it base16 encoded
> > binary
> > >> > form
> > >> > > > of the number? (I've not tried to get my head around if these
> two
> > >> are
> > >> > > even
> > >> > > > different!)
> > >> > > >
> > >> > > > One suggestion I have is to change the proposed new config to
> only
> > >> > affect
> > >> > > > decimals stored as text, i.e. to switch between the current
> base16
> > >> and
> > >> > > the
> > >> > > > more common base10.   Then add another config to the serialzier
> > only
> > >> > that
> > >> > > > controls if decimals should be serialized as text or numeric.
> The
> > >> > > benefit
> > >> > > > of this approach is it allows us to enhance the deserializer to
> > >> > > > automatically handle numeric decimals even without any config
> > >> having to
> > >> > > be
> > >> > > > set, i.e. default config in the deserializer would be able to
> > handle
> > >> > > > numeric decimals.  Of course, this is a two edged sword: this
> > would
> > >> > make
> > >> > > > the deserializer work out of the box with numeric decimals,
> > (yay!),
> > >> but
> > >> > > > would be a breaking change in that things that previously failed
> > >> would
> > >> > > > suddenly start deserializing.  This is a price I'm willing to
> pay.
> > >> > > >
> > >> > > > IMHO, we should then plan to switch the default of decimal
> > >> > serialization
> > >> > > to
> > >> > > > numeric, and text serialization to base 10 in the next major
> > >> release.
> > >> > > > (With upgrade notes to match). Though I know this is more
> > >> contentious,
> > >> > I
> > >> > > > think it moves us forward in a much more standard way that the
> > >> current
> > >> > > > encoding of decimals.
> > >> > > >
> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <al...@confluent.io>
> > >> wrote:
> > >> > > >
> > >> > > > > Hi Everyone!
> > >> > > > >
> > >> > > > > Kicking off discussion for a new KIP:
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
> > >> > > > >
> > >> > > > > For those who are interested, I have a prototype
> implementation
> > >> that
> > >> > > > helped
> > >> > > > > guide my design: https://github.com/agavra/kafka/pull/1
> > >> > > > >
> > >> > > > > Cheers,
> > >> > > > > Almog
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-481: SerDe Improvements for Connect Decimal type in JSON

Posted by Arjun Satish <ar...@gmail.com>.
hey Almog, nice work! couple of thoughts (hope I'm not late since you
started the voting thread already):

can you please add some examples to show the changes that you are
proposing. makes me think that for a given decimal number, we will have two
encodings: “asHex” and “asNumber”.

should we call the default config value “HEX” instead of “BINARY”?

Should we call out the fact that JS systems might be susceptible to double
precision round offs with the new numeric format? here are some people
discussing a similar problem
https://github.com/EventStore/EventStore/issues/1541

On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <al...@confluent.io> wrote:

> Hello Everyone,
>
> Summarizing an in-person discussion with Randall (this is copied from the
> KIP):
>
> The original KIP suggested supporting an additional representation - base10
> encoded text (e.g. `{"asText":"10.2345"}`). This causes issues because it
> is impossible to disambiguate between TEXT and BINARY without an additional
> config - furthermore, this makes the migration from one to the other nearly
> impossible because it would require that all consumers stop consuming and
> producers stop producing and atomically updating the config on all of them
> after deploying the new code, or waiting for the full retention period to
> pass - neither option is viable. The suggestion in the KIP is strictly an
> improvement over the existing behavior, even if it doesn't support all
> combinations.
>
> It seems that since most real-world use cases actually use the numeric
> representation (not string) we can consider this an improvement. With the
> new suggestion, we don't need a deserialization configuration (only a
> serialization option) and the consumers will be able to always
> automatically determine the serialization format.
>
> Based on this, I'll be opening up the simplified version of the KIP to a
> vote.
>
> Almog
>
> On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <al...@confluent.io> wrote:
>
> > I'm mostly happy with your current suggestion (two configs, one for
> > serialization and one for deserialization) and your implementation
> > suggestion. One thing to note:
> >
> > > We should _always_ be able to deserialize a standard JSON
> > > number as a decimal
> >
> > I was doing some research into decimals and JSON, and I can imagine a
> > compelling reason to require string representations to avoid losing
> > precision and to be certain that whomever is sending the data isn't
> losing
> > precision (e.g. https://stackoverflow.com/a/38357877/2258040).
> >
> > I'm okay with always allowing numerics, but thought it's worth raising
> the
> > thought.
> >
> > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <an...@confluent.io> wrote:
> >
> >> The way I see it, we need to control two seperate things:
> >>
> >> 1. How do we _deserialize_ a decimal type if we encounter a text node in
> >> the JSON?    (We should _always_ be able to deserialize a standard JSON
> >> number as a decimal).
> >> 2. How do we chose how we want decimals to be _serialized_.
> >>
> >> This looks to fits well with your second suggestion of slightly
> different
> >> configs names for serialization vs deserialization.
> >> a, For deserialization we only care about how to handle text nodes: `
> >> deserialization.decimal.*text*.format`, which should only have two valid
> >> values BINARY | TEXT.
> >> b. For serialization we need all three: `serialization.decimal.format`,
> >> which should support all three options: BINARY | TEXT | NUMERIC.
> >>
> >> Implementation wise, I think these should be two separate enums, rather
> >> than one shared enum and throwing an error if the deserializer is set to
> >> NUMERIC.  Mainly as this means the enums reflect the options available,
> >> rather than this being hidden in config checking code.  But that's a
> minor
> >> implementation detail.
> >>
> >> Personally, I'd be tempted to have the BINARY value named something like
> >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging users to move away
> >> from
> >> it.
> >>
> >> It's a real shame that both of these settings require a default of
> BINARY
> >> for backwards compatibility, but I agree that discussions / plans around
> >> switching the defaults should not block this KIP.
> >>
> >> Andy
> >>
> >>
> >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <al...@confluent.io> wrote:
> >>
> >> > Thanks for the replies Andy and Andrew (2x Andy?)!
> >> >
> >> > > Is the text decimal a base16 encoded number, or is it base16 encoded
> >> > binary
> >> > > form of the number?
> >> >
> >> > The conversion happens as decimal.unscaledValue().toByteArray() and
> then
> >> > the byte array is converted to a hex string, so it's definitely the
> >> binary
> >> > form of the number converted to base16. Whether or not that's the same
> >> as
> >> > the base16 encoded number is a good question (toByteArray returns a
> byte
> >> > array containing a signed, big-endian, two's complement representation
> >> of
> >> > the big integer).
> >> >
> >> > > One suggestion I have is to change the proposed new config to only
> >> affect
> >> > > decimals stored as text, i.e. to switch between the current base16
> and
> >> > the
> >> > > more common base10.   Then add another config to the serializer only
> >> that
> >> > > controls if decimals should be serialized as text or numeric.
> >> >
> >> > I think we need to be able to handle all mappings from serialization
> >> format
> >> > to deserialization format (e.g. read in BINARY and output TEXT),
> which I
> >> > think would be impossible with the alternative suggestion. I agree
> that
> >> > automatically deserializing numerics is valuable. I see two other ways
> >> to
> >> > get this, both keeping the serialization.format config the same:
> >> >
> >> > - have json.decimal.deserialization.format accept all three formats.
> if
> >> set
> >> > to BINARY/TEXT, numerics would be automatically supported. If set to
> >> > NUMERIC, then any string coming in would result in deserialization
> error
> >> > (defaults to BINARY for backwards compatibility)
> >> > - change json.decimal.deserialization.format to
> >> > json.decimal.deserialization.string.format which accepts only
> >> BINARY/TEXT
> >> > (defaults to BINARY for backwards compatibility)
> >> >
> >> > > would be a breaking change in that things that previously failed
> would
> >> > > suddenly start deserializing.  This is a price I'm willing to pay.
> >> >
> >> > I agree. I'm willing to pay this price too.
> >> >
> >> > > IMHO, we should then plan to switch the default of decimal
> >> serialization
> >> > to
> >> > > numeric, and text serialization to base 10 in the next major
> release.
> >> >
> >> > I think that can be a separate discussion, I don't want to block this
> >> KIP
> >> > on it.
> >> >
> >> > Thoughts?
> >> >
> >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <ot...@wikimedia.org>
> wrote:
> >> >
> >> > > This is a bit orthogonal, but in JsonSchemaConverter I use
> >> JSONSchemas to
> >> > > indicate whether a JSON number should be deserialized as an integer
> >> or a
> >> > > decimal
> >> > > <
> >> > >
> >> >
> >>
> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
> >> > > >.
> >> > > Not everyone is going to have JSONSchemas available when converting,
> >> but
> >> > if
> >> > > you do, it is an easy way to support JSON numbers as decimals.
> >> > >
> >> > > Carry on! :)
> >> > >
> >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <an...@confluent.io>
> >> wrote:
> >> > >
> >> > > > Hi Almog,
> >> > > >
> >> > > > Like the KIP - I think being able to support decimals in JSON in
> the
> >> > same
> >> > > > way most other systems do is a great improvement.
> >> > > >
> >> > > > It's not 100% clear to me from the KIP what the current format is.
> >> Is
> >> > > the
> >> > > > text decimal a base16 encoded number, or is it base16 encoded
> binary
> >> > form
> >> > > > of the number? (I've not tried to get my head around if these two
> >> are
> >> > > even
> >> > > > different!)
> >> > > >
> >> > > > One suggestion I have is to change the proposed new config to only
> >> > affect
> >> > > > decimals stored as text, i.e. to switch between the current base16
> >> and
> >> > > the
> >> > > > more common base10.   Then add another config to the serialzier
> only
> >> > that
> >> > > > controls if decimals should be serialized as text or numeric.  The
> >> > > benefit
> >> > > > of this approach is it allows us to enhance the deserializer to
> >> > > > automatically handle numeric decimals even without any config
> >> having to
> >> > > be
> >> > > > set, i.e. default config in the deserializer would be able to
> handle
> >> > > > numeric decimals.  Of course, this is a two edged sword: this
> would
> >> > make
> >> > > > the deserializer work out of the box with numeric decimals,
> (yay!),
> >> but
> >> > > > would be a breaking change in that things that previously failed
> >> would
> >> > > > suddenly start deserializing.  This is a price I'm willing to pay.
> >> > > >
> >> > > > IMHO, we should then plan to switch the default of decimal
> >> > serialization
> >> > > to
> >> > > > numeric, and text serialization to base 10 in the next major
> >> release.
> >> > > > (With upgrade notes to match). Though I know this is more
> >> contentious,
> >> > I
> >> > > > think it moves us forward in a much more standard way that the
> >> current
> >> > > > encoding of decimals.
> >> > > >
> >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <al...@confluent.io>
> >> wrote:
> >> > > >
> >> > > > > Hi Everyone!
> >> > > > >
> >> > > > > Kicking off discussion for a new KIP:
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
> >> > > > >
> >> > > > > For those who are interested, I have a prototype implementation
> >> that
> >> > > > helped
> >> > > > > guide my design: https://github.com/agavra/kafka/pull/1
> >> > > > >
> >> > > > > Cheers,
> >> > > > > Almog
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>