Posted to dev@avro.apache.org by "Werner Daehn (Jira)" <ji...@apache.org> on 2020/11/02 20:16:00 UTC

[jira] [Updated] (AVRO-2952) Logical Types and Conversions enhancements

     [ https://issues.apache.org/jira/browse/AVRO-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Werner Daehn updated AVRO-2952:
-------------------------------
    Description: 
*Summary*:
 * AvroDatatype Field.getDataType(): returns an object with common data-type-related methods.
 * Add more trivial LogicalTypes to allow better database integration, e.g. the LogicalType VARCHAR(10) is a STRING that carries the information that the payload contains only ASCII characters and at most 10 of them.

For example, to set an ENUM value for a field f, the call

{{testRecord.put(f.name(), f.getDataType().convertToRawType(myEnum.male.name()));}}

performs all the conversion. I considered adding that conversion to the put() method itself but decided against it for fear of side effects.
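The AvroDatatype object and its convertToRawType() method do not exist in Avro today; they are what this issue proposes. A minimal pure-Java sketch of the intended contract, where the enum validation logic is my own illustrative assumption:

```java
import java.util.List;

// Sketch of the proposed AvroDatatype contract; not part of the current Avro API.
interface AvroDatatype {
    // Convert an arbitrary input into the raw type the schema expects,
    // or throw if no sensible conversion exists.
    Object convertToRawType(Object value);
}

// Hypothetical implementation for an ENUM field: accepts the symbol name
// as a String and validates it against the schema's symbol list.
class EnumDatatype implements AvroDatatype {
    private final List<String> symbols;

    EnumDatatype(List<String> symbols) {
        this.symbols = symbols;
    }

    @Override
    public Object convertToRawType(Object value) {
        String symbol = value.toString();
        if (!symbols.contains(symbol)) {
            throw new IllegalArgumentException("Unknown enum symbol: " + symbol);
        }
        return symbol; // the raw representation handed to the record
    }
}
```

With such an object behind Field.getDataType(), the put() call above stays a single conversion step instead of a per-type switch in every project.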

*Reasoning*:

I have been working with Avro (via Kafka) for two years now and have implemented improvements around Logical Types. I merged these into the Avro code with zero side effects: pure additions, no breaking changes for other Avro users, but a great help for them.

Imagine you connect two databases via Kafka using Avro as the message payload.
 # The first problem you will face is that RawTypes and LogicalTypes are handled differently. For LogicalTypes there are conversion functions that provide metadata (e.g. getConvertedType returns that a Java Instant is the best data type for a timestamp-millis) plus conversion logic. For raw types there is no such thing: a Boolean can be provided as true, "TRUE", 1, ...
 # The second problem is the lack of getObject()/setObject() methods similar to JDBC. The result is endless switch-case lists to call the correct methods, in every single project, for every user.
 # Number three is the usage of the Converters as such. The intended usage is to add converters to the GenericData so that the reader/writer picks the best-suited converter. What I have seen most people do, however, is use the converters manually and assign the raw value directly. Adding converters is still possible, but with this change the conversion at GenericRecord.put() and GenericRecord.get() becomes easy.
 # For a data exchange format like Avro, it is important to carry as much metadata as possible. Purely from Avro's point of view, a STRING data type is just fine, but 99% of the string data types in a database are VARCHAR(length) and NVARCHAR(length). While putting an ASCII string of length 10 into a STRING is no problem, on the consumer side the only matching data type is an NCLOB, the worst choice for a database. The LogicalTypes mechanism is well suited to carry such metadata, e.g. a LogicalType VARCHAR(10) backed by a String. These Logical Types do not have any conversion logic; they exist only for the metadata. Avro has such a thing already with the UUID LogicalType.
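Such a metadata-only logical type could be annotated in the schema the same way the built-in uuid logical type is. A sketch of one record field (the varchar name and maxLength property are this proposal's, not part of the Avro specification):

```json
{
  "name": "city",
  "type": {
    "type": "string",
    "logicalType": "varchar",
    "maxLength": 10
  }
}
```

A reader that does not know the logical type simply falls back to the underlying string, which is the compatibility behavior Avro already defines for unrecognized logical types.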

 

*Changes*:
 * A new package, logicaltypes, was created. It includes all new LogicalTypes and the AvroDataType implementations for the various raw data types.
 * The existing LogicalTypes are unchanged; the corresponding classes in the logicaltypes package just extend them.
 * For that, some LogicalType fields had to be made public.
 * The LogicalTypes class returns the more detailed logicaltypes.* classes.
 * A test class was created.
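To make the extension pattern concrete, here is a pure-Java sketch of a metadata-only logical type. Base stands in for the real org.apache.avro.LogicalType superclass, and the varchar name and maxLength accessor are this proposal's, not the current Avro API:

```java
// Illustrative only: Base stands in for org.apache.avro.LogicalType,
// which the real patch would extend instead.
class Base {
    private final String name;
    Base(String name) { this.name = name; }
    String getName() { return name; }
}

// A metadata-only logical type: no conversion logic, it merely records
// the maximum length so a consuming database can pick VARCHAR(n)
// instead of the worst-case NCLOB mapping.
class Varchar extends Base {
    private final int maxLength;

    Varchar(int maxLength) {
        super("varchar");
        this.maxLength = maxLength;
    }

    int getMaxLength() { return maxLength; }
}
```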

 


> Logical Types and Conversions enhancements
> ------------------------------------------
>
>                 Key: AVRO-2952
>                 URL: https://issues.apache.org/jira/browse/AVRO-2952
>             Project: Apache Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.10.1
>            Reporter: Werner Daehn
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)