You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@avro.apache.org by "Ryan Skraba (Jira)" <ji...@apache.org> on 2021/03/02 17:37:00 UTC

[jira] [Commented] (AVRO-2952) Logical Types and Conversions enhancements

    [ https://issues.apache.org/jira/browse/AVRO-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293867#comment-17293867 ] 

Ryan Skraba commented on AVRO-2952:
-----------------------------------

Reviewing this feature got pushed back in my planning! There's probably going to be another 1.10.x bugfix release before a feature release, but I used my time between cherry-picks to read your PR (more) thoroughly.

I don't know if I'm the right person to be reviewing this – I have a pretty major difference in opinion around the statement: "*For a data exchange format like Avro, it is important to carry as much metadata as possible*". On the contrary, I think "as little metadata as you can get away with" is more maintainable and more understandable.

That being said, that's all a matter of opinion!  So here's a list of what **I** liked about the proposed changes and what I was less enthusiastic about.  (Note that this doesn't mean good/bad or pros/cons!)

I liked:
 * Having the conversions right there at the schema when you're looking at a record.
 * Some of the conversion rules being more flexible.

I'm not a huge fan of:
 * The name {{AvroDatatype}} for the hierarchy – alongside {{LogicalType}} and {{AvroType}} and {{Schema.Type}}, it's worthwhile finding a better way to name things.
 * Building in new logical types that don't have a different in-memory representation in Java (an {{AvroNCLOB}} is a String with an underlying String)
 * Building in new logical types that imply "validation constraints" (such as length) that aren't ever actually taken into account.
 * Some of the conversion rules were surprising to me, and inconsistent with the two other ways we have of converting to/from alternative in-memory representations of data ("Stringable" classes and "Conversions")

I just wanted to retrigger the discussion – I would love to hear other points of view!

There was a lot packed into the PR and proposal, I want to make sure that you know I appreciate the work and engineering effort you've put in!  This would probably had a bit more traction if it were split up functionally into separate PRs and JIRAs:
 # Tying converters to/from raw/logical types closer to the schema where they are being used.
 # Adding conversion rules to/from classes that currently aren't supported.
 # Adding new logical types.

Something I would *definitely* support would be having some sort of extension point to the Avro core that allowed someone to add this functionality into the Java SDK as an option – if that were the case, you could even fix the get() and put() methods of the records to automatically take these conversions into account, and only those developers "opting-in" would be affected.

> Logical Types and Conversions enhancements
> ------------------------------------------
>
>                 Key: AVRO-2952
>                 URL: https://issues.apache.org/jira/browse/AVRO-2952
>             Project: Apache Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.10.1
>            Reporter: Werner Daehn
>            Priority: Critical
>             Fix For: 1.10.2
>
>
> *Summary*:
>  * Added a method *Field.getDataType()* which returns an object with common data type related methods. Most important are methods to call the converters.
>  * Added trivial LogicalTypes to allow better database integration, e.g. the LogicalType VARCHAR(10) is a STRING that carries the information that only ASCII chars are in the payload and up to 10 chars max.
> *Example*:
> The user has a record with one ENUM field and he has a Java enum class. Instead of manually converting the Java string into a GenericEnumSymbol he can use the convertToRawType of the AvroDataType class.
> f...Field to be set
> {{testRecord.put(f.name(), f.getDataType().convertToRawType(myEnum.male.name()));}}
> Using the {{f.getDataType().convertToRawType() }}does all the conversion. I considered adding that conversion into the put() method itself but feared side effects. So the user has to invoke the convertToRawType().
> *Reasoning*:
> I am working with Avro (Kafka) for two years now and have implemented improvements around Logical Types. These I merged into the Avro code with zero side effects - pure additions. No breaking changes for other Avro users but a great help for them.
> Imagine you connect two databases via Kafka using Avro as the message payload.
>  # The first problem you will be facing is that RawTypes and LogicalTypes are handled differently. For LogicalTypes there are conversion functions that provide metadata (e.g. getConvertedType returns that a Java Instant is the best data type for a timestamp-millis plus conversion logic. For raw types there is no such thing. A Boolean can be provided as true, "TRUE", 1,...
>  # Second problem will be the lack of getObject()/setObject() methods similar to JDBC. The result are endless switch-case lists to call the correct methods. In every single project for every user.
>  # Number three is the usage of the Converters as such. The intended usage is to add converters to the GenericData and the reader/writer uses the best suited converter. What I have seen most people do however is to use the converters manually and assign the raw value directly. While adding converters is possible still, the conversion at GenericRecord.put() and GenericRecord.get() is easy now.
>  # For a data exchange format like Avro, it is important to carry as much metadata as possible. For example, purely seen from Avro a STRING data type is just fine. 99% of the string data types in a database are VARCHAR(length) and NVARCHAR(length). While putting an ASCII String of length 10 into a STRING is no problem, on the consumer side the only matching data type is a NCLOB - the worst for a data base. The LogicalTypes provide such nice methods to carry such metadata, e.g. a LogicalType VARCHAR(10) backed by a String. These Logical Types do not have any conversion, they just exist for the metadata. You have such a thing already with the UUID LogicalType.
>  
> *Changes*:
>  * A new package logicaltypes created. It includes all new LogicalTypes and the AvroDataType implementations for the various raw data types.
>  * The existing LogicalTypes are unchanged. The corresponding classes in the logicaltype package just extend them.
>  * For that some LogicalType fields needed to be made public.
>  * The LogicalTypes return the more detailed logicaltype.* classes.
>  * A test class created.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)