Posted to user@avro.apache.org by Christopher Egerton <ch...@confluent.io> on 2020/03/02 19:34:10 UTC

Decimal type, limitation on scale

Hi all,

I've been trying to do some research on the logical decimal type and why
the scale of a decimal type must be between zero and the precision of the
type, inclusive. The ticket https://issues.apache.org/jira/browse/AVRO-1402 has
a lot of discussion around the design of the type, but I haven't been able
to find any rationale for the limitations on the scale of the type.

These don't appear to align with existing conventions for precision and
scale in the context of SQL numeric types, the JDBC API, and the Java
standard library's BigDecimal class. In these contexts, the precision must
be a positive number, but the scale can be any value--positive
(representing the number of digits of precision that are available after
the decimal point), negative (representing the number of trailing zeroes at
the end of the number before an implicit decimal point), or zero. It is not
bounded by the precision of the type.
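
To illustrate with Java's BigDecimal (which is what the Avro Java library
maps decimal values to), all of the following are legal there, even though
only the first two would be expressible under the Avro spec's constraints:

    import java.math.BigDecimal;
    import java.math.BigInteger;

    public class ScaleExamples {
        public static void main(String[] args) {
            BigInteger unscaled = BigInteger.valueOf(12345);
            // Positive scale: digits after the decimal point.
            System.out.println(new BigDecimal(unscaled, 2));   // 123.45
            // Zero scale: a plain integer value.
            System.out.println(new BigDecimal(unscaled, 0));   // 12345
            // Negative scale: trailing zeroes before an implicit decimal point.
            System.out.println(new BigDecimal(unscaled, -2));  // 1.2345E+6 (i.e. 1234500)
            // A scale greater than the precision (5 digits here) is also fine.
            System.out.println(new BigDecimal(unscaled, 7));   // 0.0012345
        }
    }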

The definitions for scale and precision appear to align across these
contexts, including the Avro spec, so I'm curious as to why the Avro
spec--seemingly an anomaly--is the only one to declare these limitations on
what the scale of a decimal type can be.

Does anyone know why these exist, and if not, would it be okay to file a
ticket to remove them from the spec and begin work on it?

Cheers,

Chris

Re: Decimal type, limitation on scale

Posted by Christopher Egerton <ch...@confluent.io>.
Hi Zoltan, Ryan,

Thanks for the quick replies! Responses inline below.

> My biggest issue with the current decimal spec is that it does not encode
the scale (it uses the scale defined in the schema); as such, it cannot
accommodate an Oracle or Postgres NUMBER without scale coercion.

This is exactly the issue I'm running into. We have a Kafka source
connector that reads data from Oracle and can serialize it as Avro before
writing to Kafka. We currently handle these variable-scale types by
defaulting to a scale of 127, but after switching to use the official
logical type support for the decimal type provided by the Avro Java library
(see here:
https://github.com/confluentinc/schema-registry/blob/4113c3c3e05c9afe631c4d9d399c3a13ec8e3f5b/avro-data/src/main/java/io/confluent/connect/avro/AvroData.java#L910),
that approach no longer works, as it produces schemas whose scale nearly
always exceeds their precision. Granted, that approach isn't fantastic
anyway, since it can produce values whose precision exceeds the precision
of their schemas, but it'd still be nice to support some of the other use
cases (such as Oracle DB tables whose columns have a negative scale or a
scale that exceeds the column's precision) as well.
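
For anyone who hasn't hit this, here's roughly what it looks like with the
Java library (recent Avro versions; exact validation behavior and messages
may vary by release):

    import org.apache.avro.LogicalTypes;
    import org.apache.avro.Schema;

    public class DecimalScaleLimit {
        public static void main(String[] args) {
            // The scale lives in the schema, so every value in this field shares it.
            Schema ok = LogicalTypes.decimal(38, 10)
                .addToSchema(Schema.create(Schema.Type.BYTES));
            System.out.println(ok);

            // Our old default of scale 127 is rejected as soon as the logical
            // type is attached, because the spec requires 0 <= scale <= precision.
            try {
                LogicalTypes.decimal(38, 127)
                    .addToSchema(Schema.create(Schema.Type.BYTES));
            } catch (IllegalArgumentException e) {
                System.out.println("Rejected: " + e.getMessage());
            }
        }
    }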

It'd be great if we could get a decimal type with the ability to specify
scale on a per-value level instead of the current per-schema level, but
that alone wouldn't address the issue that some external systems have a
more lax approach to scale than others (including the Avro spec).
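
Purely as a hypothetical sketch of what I mean by per-value scale (none of
this is in the spec, and the record and field names here are made up),
something along these lines would round-trip any BigDecimal:

    import java.math.BigDecimal;
    import java.math.BigInteger;
    import java.nio.ByteBuffer;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class PerValueScaleSketch {
        // Hypothetical "decimal2"-style shape: the scale travels with each value.
        static final Schema DECIMAL2 = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"Decimal2\", \"fields\": ["
            + "{\"name\": \"scale\", \"type\": \"int\"},"
            + "{\"name\": \"unscaled\", \"type\": \"bytes\"}]}");

        static GenericRecord encode(BigDecimal value) {
            GenericRecord r = new GenericData.Record(DECIMAL2);
            r.put("scale", value.scale());
            r.put("unscaled", ByteBuffer.wrap(value.unscaledValue().toByteArray()));
            return r;
        }

        static BigDecimal decode(GenericRecord r) {
            ByteBuffer buf = (ByteBuffer) r.get("unscaled");
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);
            return new BigDecimal(new BigInteger(bytes), (Integer) r.get("scale"));
        }
    }

Of course, that gives up the fixed, schema-level scale that the original
design apparently wanted so that readers can pre-allocate memory.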

> I believe that if we were to write a file (for example) with a negative
scale using Avro 1.10, a reader with an older version _should_ just fall
back to bytes, which seems fair enough.  I would consider it a bug if the
reader just failed on an "out-of-bounds" scale!

Agreed, and agreed! I'll admit, my investment in the handling of these
values is limited as we have our own custom way of serializing and
deserializing logical types (
https://github.com/apache/kafka/blob/dcfb641add650073f46fe8d8370c72f0284e62d7/connect/api/src/main/java/org/apache/kafka/connect/data/Decimal.java#L61-L74)
but for what it's worth, I definitely concur with the concerns about
preserving backwards compatibility and allowing for this potential change
to take place without negatively impacting downstream consumers that aren't
necessarily up-to-date with the version of Avro that upstream writers are
using.
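
For reference, here's roughly how the Connect Decimal class linked above is
used: the scale is carried as a parameter on the Connect schema and only the
unscaled bytes are serialized, so negative or very large scales round-trip
without any precision check.

    import java.math.BigDecimal;
    import org.apache.kafka.connect.data.Decimal;
    import org.apache.kafka.connect.data.Schema;

    public class ConnectDecimalRoundTrip {
        public static void main(String[] args) {
            // Scale is stored in the Connect schema's parameters and is not
            // validated against any precision.
            Schema schema = Decimal.schema(127);

            BigDecimal value = new BigDecimal("0.5").setScale(127);
            byte[] serialized = Decimal.fromLogical(schema, value);   // unscaled bytes only
            BigDecimal roundTripped = Decimal.toLogical(schema, serialized);
            System.out.println(value.equals(roundTripped));           // true
        }
    }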

> Any thoughts on what Hive (as an example) would require if we were to
relax this constraint in the spec?

Unfortunately, I'm not very familiar with Hive. Do they perform their own
schema validation? If so, would it be fair to let them handle these new
instances of the decimal logical type on their own (probably by either
breaking or downgrading the logical type to the underlying bytes/fixed
type)? Any difficulty this causes for some downstream consumers should be
weighed against the flexibility it would give to others that can handle
these new logical types and that currently need workarounds to use Avro
as-is.

Cheers, and thanks to both of you for your thoughts,

Chris

On Tue, Mar 3, 2020 at 6:16 AM Ryan Skraba <ry...@skraba.com> wrote:

> It looks like the "scale must be less than precision" rule comes from
> Hive requirements[1] (although while searching, this is called into
> question elsewhere in Hive[2]). From the design document, the
> requirement was specifically to avoid variable (per-row scale):
>
> > For instance, applications (particularly native applications) such as
> SAS which need to
> > pre-allocate memory require fixed types to do so efficiently.
>
> I believe that if we were to write a file (for example) with a
> negative scale using Avro 1.10, a reader with an older version
> _should_ just fall back to bytes, which seems fair enough.  I would
> consider it a bug if the reader just failed on an "out-of-bounds"
> scale!
>
> Any thoughts on what Hive (as an example) would require if we were to
> relax this constraint in the spec?
>
> Ryan
>
> [1]: https://issues.apache.org/jira/browse/HIVE-3976
> [2]:
> https://github.com/apache/hive/blob/94dca16e4eb3caf7dcaa43ae92807e5750e1ff04/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFRound.java#L54
>
> On Mon, Mar 2, 2020 at 9:53 PM Zoltan Farkas
> <zo...@yahoo.com.invalid> wrote:
> >
> > +dev adding the dev mailing list, maybe somebody there can answer the
> reasoning.
> >
> > When comparing SQL Server with Oracle and Postgres:
> >
> > https://docs.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-ver15
> > https://docs.oracle.com/cd/A84870_01/doc/server.816/a76965/c10datyp.htm#743
> > https://www.postgresql.org/docs/9.1/datatype-numeric.html
> >
> > One allows a negative scale, the others don't.
> > My biggest issue with the current decimal spec is that it does not
> encode the scale (it uses the scale defined in the schema); as such, it
> cannot accommodate an Oracle or Postgres NUMBER without scale coercion.
> >
> > There are other differences (like NaN, …)
> >
> > But there is no reason why the decimal2 logical type should not be
> created to address the above…
> >
> > Or, even better, promote decimal to a first-class type:
> https://issues.apache.org/jira/browse/AVRO-2164
> >
> >
> > —Z
> >
> > > On Mar 2, 2020, at 2:34 PM, Christopher Egerton <ch...@confluent.io>
> wrote:
> > >
> > > Hi all,
> > >
> > > I've been trying to do some research on the logical decimal type and
> why the scale of a decimal type must be between zero and the precision of
> the type, inclusive. The ticket
> https://issues.apache.org/jira/browse/AVRO-1402 has a lot of discussion
> around the design of the type, but I haven't been able to find any
> rationale for the limitations on the scale of the type.
> > >
> > > These don't appear to align with existing conventions for precision
> and scale in the context of SQL numeric types, the JDBC API, and the Java
> standard library's BigDecimal class. In these contexts, the precision must
> be a positive number, but the scale can be any value--positive
> (representing the number of digits of precision that are available after
> the decimal point), negative (representing the number of trailing zeroes at
> the end of the number before an implicit decimal point), or zero. It is not
> bounded by the precision of the type.
> > >
> > > The definitions for scale and precision appear to align across these
> contexts, including the Avro spec, so I'm curious as to why the Avro
> spec--seemingly an anomaly--is the only one to declare these limitations on
> what the scale of a decimal type can be.
> > >
> > > Does anyone know why these exist, and if not, would it be okay to file
> a ticket to remove them from the spec and begin work on it?
> > >
> > > Cheers,
> > >
> > > Chris
> >
>

Re: Decimal type, limitation on scale

Posted by Ryan Skraba <ry...@skraba.com>.
It looks like the "scale must be less than precision" rule comes from
Hive requirements[1] (although while searching, this is called into
question elsewhere in Hive[2]). From the design document, the
requirement was specifically to avoid variable (per-row scale):

> For instance, applications (particularly native applications) such as SAS which need to
> pre-allocate memory require fixed types to do so efficiently.
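
As a concrete reminder, it's the combination of a schema-level scale and a
fixed backing type that gives that predictable, pre-allocatable layout; a
quick sketch with the Java library (the name and namespace below are just
placeholders):

    import org.apache.avro.LogicalTypes;
    import org.apache.avro.Schema;

    public class FixedBackedDecimal {
        public static void main(String[] args) {
            // A decimal backed by fixed(8) always occupies exactly 8 bytes on
            // the wire, which is the kind of layout a native application can
            // pre-allocate for.
            Schema fixed = Schema.createFixed("Amount", null, "example", 8);
            Schema decimal = LogicalTypes.decimal(18, 2).addToSchema(fixed);
            System.out.println(decimal.toString(true));
        }
    }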

I believe that if we were to write a file (for example) with a
negative scale using Avro 1.10, a reader with an older version
_should_ just fall back to bytes, which seems fair enough.  I would
consider it a bug if the reader just failed on an "out-of-bounds"
scale!
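
A quick way to see what that fallback would look like with the Java library
(I believe the ignore-invalid parsing path behaves roughly like this, but it
is worth verifying against the actual release):

    import org.apache.avro.Schema;

    public class OutOfBoundsScaleFallback {
        public static void main(String[] args) {
            // A decimal annotation whose scale exceeds its precision. A reader
            // that treats this as invalid should ignore the logical type and
            // expose the underlying bytes rather than fail.
            String json = "{\"type\": \"bytes\", \"logicalType\": \"decimal\","
                + " \"precision\": 5, \"scale\": 127}";
            Schema schema = new Schema.Parser().parse(json);
            // Expected: null, i.e. the field falls back to plain bytes.
            System.out.println(schema.getLogicalType());
        }
    }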

Any thoughts on what Hive (as an example) would require if we were to
relax this constraint in the spec?

Ryan

[1]: https://issues.apache.org/jira/browse/HIVE-3976
[2]: https://github.com/apache/hive/blob/94dca16e4eb3caf7dcaa43ae92807e5750e1ff04/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFRound.java#L54

On Mon, Mar 2, 2020 at 9:53 PM Zoltan Farkas
<zo...@yahoo.com.invalid> wrote:
>
> +dev adding the dev mailing list, maybe somebody there can answer the reasoning.
>
> When comparing SQL Server with Oracle and Postgres:
>
> https://docs.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-ver15
>
> https://docs.oracle.com/cd/A84870_01/doc/server.816/a76965/c10datyp.htm#743
> https://www.postgresql.org/docs/9.1/datatype-numeric.html
>
>
> One allows a negative scale, the others don't.
> My biggest issue with the current decimal spec is that it does not encode the scale (it uses the scale defined in the schema); as such, it cannot accommodate an Oracle or Postgres NUMBER without scale coercion.
>
> There are other differences (like NaN, …)
>
> But there is no reason why the decimal2 logical type should not be created to address the above…
>
> Or, even better, promote decimal to a first-class type: https://issues.apache.org/jira/browse/AVRO-2164
>
>
> —Z
>
> > On Mar 2, 2020, at 2:34 PM, Christopher Egerton <ch...@confluent.io> wrote:
> >
> > Hi all,
> >
> > I've been trying to do some research on the logical decimal type and why the scale of a decimal type must be between zero and the precision of the type, inclusive. The ticket https://issues.apache.org/jira/browse/AVRO-1402 has a lot of discussion around the design of the type, but I haven't been able to find any rationale for the limitations on the scale of the type.
> >
> > These don't appear to align with existing conventions for precision and scale in the context of SQL numeric types, the JDBC API, and the Java standard library's BigDecimal class. In these contexts, the precision must be a positive number, but the scale can be any value--positive (representing the number of digits of precision that are available after the decimal point), negative (representing the number of trailing zeroes at the end of the number before an implicit decimal point), or zero. It is not bounded by the precision of the type.
> >
> > The definitions for scale and precision appear to align across these contexts, including the Avro spec, so I'm curious as to why the Avro spec--seemingly an anomaly--is the only one to declare these limitations on what the scale of a decimal type can be.
> >
> > Does anyone know why these exist, and if not, would it be okay to file a ticket to remove them from the spec and begin work on it?
> >
> > Cheers,
> >
> > Chris
>

Re: Decimal type, limitation on scale

Posted by Zoltan Farkas <zo...@yahoo.com>.
+dev adding the dev mailing list, maybe somebody there can answer the reasoning.

When comparing SQL Server with Oracle and Postgres:

https://docs.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-ver15

https://docs.oracle.com/cd/A84870_01/doc/server.816/a76965/c10datyp.htm#743
https://www.postgresql.org/docs/9.1/datatype-numeric.html


One allows a negative scale, the others don't.
My biggest issue with the current decimal spec is that it does not encode the scale (it uses the scale defined in the schema); as such, it cannot accommodate an Oracle or Postgres NUMBER without scale coercion.
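
To make "scale coercion" concrete: with the scale pinned in the schema,
every value read from a variable-scale NUMBER column has to be rescaled
before it can be written, something like this (the scale and value below
are made-up examples):

    import java.math.BigDecimal;
    import java.math.RoundingMode;

    public class ScaleCoercion {
        public static void main(String[] args) {
            int schemaScale = 2;  // whatever single scale the Avro schema declares
            BigDecimal fromDb = new BigDecimal("1234.5678");
            BigDecimal coerced = fromDb.setScale(schemaScale, RoundingMode.HALF_UP);
            System.out.println(coerced);  // 1234.57, digits beyond the schema scale are lost
        }
    }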

There are other differences (like NaN, …)

But there is no reason why a decimal2 logical type should not be created to address the above…

Or, even better, promote decimal to a first-class type: https://issues.apache.org/jira/browse/AVRO-2164


—Z

> On Mar 2, 2020, at 2:34 PM, Christopher Egerton <ch...@confluent.io> wrote:
> 
> Hi all,
> 
> I've been trying to do some research on the logical decimal type and why the scale of a decimal type must be between zero and the precision of the type, inclusive. The ticket https://issues.apache.org/jira/browse/AVRO-1402 has a lot of discussion around the design of the type, but I haven't been able to find any rationale for the limitations on the scale of the type.
> 
> These don't appear to align with existing conventions for precision and scale in the context of SQL numeric types, the JDBC API, and the Java standard library's BigDecimal class. In these contexts, the precision must be a positive number, but the scale can be any value--positive (representing the number of digits of precision that are available after the decimal point), negative (representing the number of trailing zeroes at the end of the number before an implicit decimal point), or zero. It is not bounded by the precision of the type.
> 
> The definitions for scale and precision appear to align across these contexts, including the Avro spec, so I'm curious as to why the Avro spec--seemingly an anomaly--is the only one to declare these limitations on what the scale of a decimal type can be.
> 
> Does anyone know why these exist, and if not, would it be okay to file a ticket to remove them from the spec and begin work on it?
> 
> Cheers,
> 
> Chris

