You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@avro.apache.org by GitBox <gi...@apache.org> on 2021/02/06 20:51:22 UTC

[GitHub] [avro] tjwp opened a new pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

tjwp opened a new pull request #1082:
URL: https://github.com/apache/avro/pull/1082


   In https://github.com/apache/avro/pull/918 it was identified that matching during schema resolution is missing for bytes decimal logical types in the Ruby implementation. (There is no support yet in the Ruby implementation for the fixed decimal.)
   
   Instead of embedding the logic for this specific check in the `SchemaCompatibility` module, I've made a small, initial step in the direction of making this checking more object-oriented. The matching logic needed is added to the `BytesSchema` class.
   
   ### Jira
   
   - [x] My PR addresses the following https://issues.apache.org/jira/browse/AVRO-3036
   
   ### Tests
   
   - [x] My PR adds the following unit tests: `test_bytes_decimal` in `test_schema_compatibility.rb`
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] ziggythehamster commented on a change in pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
ziggythehamster commented on a change in pull request #1082:
URL: https://github.com/apache/avro/pull/1082#discussion_r581379971



##########
File path: lang/ruby/lib/avro/schema.rb
##########
@@ -485,6 +490,19 @@ def to_avro(names=nil)
         avro['scale'] = scale if scale
         avro
       end
+
+      def match_schema?(schema)
+        if type_sym == schema.type_sym
+          return false if logical_type != schema.logical_type
+
+          if logical_type == 'decimal'.freeze

Review comment:
       You should define this as a class-level (frozen) constant and then use `DECIMAL` here




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] tjwp merged pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
tjwp merged pull request #1082:
URL: https://github.com/apache/avro/pull/1082


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] tjwp commented on a change in pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
tjwp commented on a change in pull request #1082:
URL: https://github.com/apache/avro/pull/1082#discussion_r584296817



##########
File path: lang/ruby/lib/avro/schema.rb
##########
@@ -485,6 +490,19 @@ def to_avro(names=nil)
         avro['scale'] = scale if scale
         avro
       end
+
+      def match_schema?(schema)
+        if type_sym == schema.type_sym
+          return false if logical_type != schema.logical_type
+
+          if logical_type == 'decimal'.freeze

Review comment:
       Sure, I can make a change like that, but do you know if actually makes a difference since the Ruby VM will treat this frozen string like a constant, and will not allocate a new one on each invocation:
   
   ```
   irb(main):001:0> def frozen_string_alloc
   irb(main):002:1>   "foo".freeze.object_id
   irb(main):003:1> end
   => :frozen_string_alloc
   irb(main):004:0> frozen_string_alloc
   => 70328047113620
   irb(main):005:0> frozen_string_alloc
   => 70328047113620 # <---------------------- same object
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] ziggythehamster commented on a change in pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
ziggythehamster commented on a change in pull request #1082:
URL: https://github.com/apache/avro/pull/1082#discussion_r586844960



##########
File path: lang/ruby/lib/avro/schema.rb
##########
@@ -485,6 +490,19 @@ def to_avro(names=nil)
         avro['scale'] = scale if scale
         avro
       end
+
+      def match_schema?(schema)
+        if type_sym == schema.type_sym
+          return false if logical_type != schema.logical_type
+
+          if logical_type == 'decimal'.freeze

Review comment:
       The above behavior depends on a few things:
   
   - Strings shorter than a certain length are allocated differently than strings over that length. I think it's 19 bytes, but don't quote me.
   - Newer Ruby versions (and irb) may freeze strings by default. Depends on your distro's config and whatnot. The `# frozen_string_literal: true` pragma in a .rb file will force this.
   - `"foo"` might allocate a string, but `"foo".freeze` looks it up in the string table, resulting in the same object ID as you would expect. But you still allocated `"foo"`. You'll have to use something like `memory_profiler` to know for sure (and this will work differently in say Ruby 2.4 than it does in Ruby 2.7).
   - Ruby applications which are compiled into `iseq` files before executing (e.g., with Bootsnap) might result in different behavior in terms of whether the original non-frozen string is allocated each time the method is run or not.
   
   One thing to note as well is that Ruby constants don't necessarily work like you would expect. `FOO = String("bar")` is legal, and then you can `FOO.concat("baz")`. In Ruby versions prior to them implementing immutable strings (so, IIRC, 2.5 and earlier), `"bar"` and `String("bar")` are the same. You can freeze the string, which makes `.concat` illegal, but constant values aren't frozen by default.
   
   Basically what I want to point out is that you might be doing something that is interpreted like this:
   
   ```
   every call:
     allocate string "foo"
     lookup this string in the string table
     return frozen reference
     deallocate string "foo"
   ```
   
   but what you actually want is:
   
   ```
   on load:
     allocate string "foo"
     lookup and store string in the string table
   
   every call:
     get frozen reference
   ```
   
   The latter should be the case with `# frozen_string_literal: true`, but this library targets older Ruby versions than that and so you probably want to ensure that happens.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] tjwp commented on pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
tjwp commented on pull request #1082:
URL: https://github.com/apache/avro/pull/1082#issuecomment-792281668


   @jturkel @ziggythehamster Good catch on fixed and bytes decimal logical types being compatible! From looking at the Java implementation my understanding is that they would match.
   
   Right now the Ruby implementation doesn't even parse decimal logical types for fixed schemas. I'll look at adding that here in addition to adjusting the schema matching. Support for the fixed decimal encoding/decoding can be added separately, similar to the separate addition of the bytes support here: https://github.com/apache/avro/pull/1102


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] tjwp commented on a change in pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
tjwp commented on a change in pull request #1082:
URL: https://github.com/apache/avro/pull/1082#discussion_r624707189



##########
File path: lang/ruby/lib/avro/schema.rb
##########
@@ -467,6 +475,11 @@ def to_avro(names=nil)
         hsh = super
         hsh.size == 1 ? type : hsh
       end
+
+      def match_schema?(schema)
+        return type_sym == schema.type_sym

Review comment:
       This is actually just for the primitive, non-named types to ensure that they are the same type.
   
   Primarily it is used from this modified line https://github.com/apache/avro/pull/1082/files/7ea99640fd96204dcb217133da6c3c6394ef043c#diff-5fa6692791f1794cd9064f4288aff1c2447925b993f77196d09b9bb6a05c910fR49, since that call only applies to primitive types only. 
   
   It's somewhat redundant right now since the call is nested within a `w_type == r_type` condition. But I was trying to move the compatibility checked into a more object-oriented direction.
   
   It is also called from as `super` from the `BytesSchema#match_schema?`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] tjwp commented on a change in pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
tjwp commented on a change in pull request #1082:
URL: https://github.com/apache/avro/pull/1082#discussion_r624707508



##########
File path: lang/ruby/lib/avro/schema.rb
##########
@@ -38,6 +38,8 @@ class Schema
 
     DEFAULT_VALIDATE_OPTIONS = { recursive: true, encoded: false }.freeze
 
+    DECIMAL_LOGICAL_TYPE = 'decimal'.freeze

Review comment:
       It is used in both the FixedSchema and the BytesSchema:
   - BytesSchema: https://github.com/apache/avro/pull/1082/files/7ea99640fd96204dcb217133da6c3c6394ef043c#diff-410be18ca388364fc161738379922d6c2df9b7f392a58203cc051a8038f962caR505
   - FixedSchema: https://github.com/apache/avro/pull/1082/files/7ea99640fd96204dcb217133da6c3c6394ef043c#diff-410be18ca388364fc161738379922d6c2df9b7f392a58203cc051a8038f962caR539
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] andrewthauer commented on a change in pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
andrewthauer commented on a change in pull request #1082:
URL: https://github.com/apache/avro/pull/1082#discussion_r627870618



##########
File path: lang/ruby/lib/avro/schema.rb
##########
@@ -38,6 +38,8 @@ class Schema
 
     DEFAULT_VALIDATE_OPTIONS = { recursive: true, encoded: false }.freeze
 
+    DECIMAL_LOGICAL_TYPE = 'decimal'.freeze

Review comment:
       👍 Sorry, missed the other usage




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] ziggythehamster commented on pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
ziggythehamster commented on pull request #1082:
URL: https://github.com/apache/avro/pull/1082#issuecomment-790125184


   My interpretation of "a schema" in this situation means "a type of thing", so a `bytes` and a `fixed` are different schemas. I don't think there's actually any reason why it matters since this is basically `binary` vs `varbinary` and so it would be more flexible to not care if they're different physical types but otherwise are identical logical types.
   
   Thinking about non-decimal, if say a UUID could be stored as a string or a 16 byte fixed or a bytes, I'd consider all UUID logical types to be compatible (e.g., if a schema evolves from string to fixed, then a UUID logical type would have different conversion procedures but still end up with compatible conversions)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] tjwp commented on a change in pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
tjwp commented on a change in pull request #1082:
URL: https://github.com/apache/avro/pull/1082#discussion_r589029373



##########
File path: lang/ruby/lib/avro/schema.rb
##########
@@ -485,6 +490,19 @@ def to_avro(names=nil)
         avro['scale'] = scale if scale
         avro
       end
+
+      def match_schema?(schema)
+        if type_sym == schema.type_sym
+          return false if logical_type != schema.logical_type
+
+          if logical_type == 'decimal'.freeze

Review comment:
       Thanks @ziggythehamster, I had not considered that a string may be allocated before the frozen reference is looked up! 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] andrewthauer commented on a change in pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
andrewthauer commented on a change in pull request #1082:
URL: https://github.com/apache/avro/pull/1082#discussion_r622604010



##########
File path: lang/ruby/lib/avro/schema.rb
##########
@@ -467,6 +475,11 @@ def to_avro(names=nil)
         hsh = super
         hsh.size == 1 ? type : hsh
       end
+
+      def match_schema?(schema)
+        return type_sym == schema.type_sym

Review comment:
       Is this essentially just making sure the writer schema is a record as well?

##########
File path: lang/ruby/lib/avro/schema.rb
##########
@@ -38,6 +38,8 @@ class Schema
 
     DEFAULT_VALIDATE_OPTIONS = { recursive: true, encoded: false }.freeze
 
+    DECIMAL_LOGICAL_TYPE = 'decimal'.freeze

Review comment:
       Perhaps not a big deal, but this does not seem to be used anywhere except in the derived FixedSchema. Would it make sense to move it there instead of the base class?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] tjwp edited a comment on pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
tjwp edited a comment on pull request #1082:
URL: https://github.com/apache/avro/pull/1082#issuecomment-817199150


   @jturkel @ziggythehamster @andrewthauer I've gone around and around a couple of times on how I think this matching should work. 
   
   I'd appreciate it if you could take another look at this code, but I'll outline what I settled on and why:
   
   1. Decimal logical types (whether bytes of fixed) now match if they have the same precision and scale. This was the issue caught in the previous review (thank you!).
   2. Bytes schemas will always match, and the decimal logical type is ignored for matching. I made this decision to ensure that encoded data can always be read. I believe that this is consistent with the specification that "If a logical type is invalid ..., then implementations should ignore the logical type and use the underlying Avro type". For example, the Ruby implementation has included support for parsing decimal logical types before support encoding/decoding. Making this choice allows encoded data written when the decimal logical type was ignored to still be read by a future release.
   3. Similarly, fixed schemas will always match based on the usual (name & size) constraints for fixed schemas, and the decimal logical type is ignored for matching. Again, I think this is consistent with allowing data to be decoded even when decimal logical type was not implemented previously. This may be the situation again for the Ruby implementation where there is not yet fixed decimal logical type encoding/decoding support.
   
   The main downside that I see to the choices above is that decoding could return either a Decimal or the underlying Avro type value if decimal precision and scale do not match or the "decimal" value is invalid. This places more burden on applications if they are expecting a Decimal, but the upside is that data can be read without raising an error.
   
   The other implication of these choices is that the (bytes) decimal logical type implementation may need additional error handling for cases where the "decimal" is invalid and the underlying Avro type value should be returned.
   
   I'm open to other opinions, so please challenge this if you think the approach above is incorrect. If there are still questions I can take this to the mailing list to see if other maintainers have opinions on how this is supposed to work. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] tjwp commented on pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
tjwp commented on pull request #1082:
URL: https://github.com/apache/avro/pull/1082#issuecomment-830822442


   @andrewthauer Thanks for taking a look! I've added responses to your questions/comments. I'm going to leave this open for a few more days to see if other still have comments on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] tjwp commented on a change in pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
tjwp commented on a change in pull request #1082:
URL: https://github.com/apache/avro/pull/1082#discussion_r624707189



##########
File path: lang/ruby/lib/avro/schema.rb
##########
@@ -467,6 +475,11 @@ def to_avro(names=nil)
         hsh = super
         hsh.size == 1 ? type : hsh
       end
+
+      def match_schema?(schema)
+        return type_sym == schema.type_sym

Review comment:
       This is actually just for the primitive, non-named types to ensure that they are the same type.
   
   Primarily it is used from this modified line https://github.com/apache/avro/pull/1082/files/7ea99640fd96204dcb217133da6c3c6394ef043c#diff-5fa6692791f1794cd9064f4288aff1c2447925b993f77196d09b9bb6a05c910fR49, since that call only applies to primitive types only. 
   
   It's somewhat redundant right now since the call is nested within a `w_type == r_type` condition. But I was trying to move the compatibility checks into a more object-oriented direction.
   
   It is also called from as `super` from the `BytesSchema#match_schema?`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] tjwp commented on pull request #1082: AVRO-3036: add schema matching for bytes decimal logical type in Ruby

Posted by GitBox <gi...@apache.org>.
tjwp commented on pull request #1082:
URL: https://github.com/apache/avro/pull/1082#issuecomment-817199150


   @jturkel @ziggythehamster @andrewthauer I've gone around and around a couple of times on how I think this matching should work. 
   
   I'd appreciate it if you could take another look at this code, but I'll outline what I settled on and why:
   
   1. Decimal logical types (whether bytes of fixed) now match if they have the same precision and scale. This is was issue caught in the previous review (thank you!).
   2. Bytes schemas will always match, and the decimal logical type is ignored for matching. I made this decision to ensure that encoded data can always be read. I believe that this is consistent with the specification that "If a logical type is invalid ..., then implementations should ignore the logical type and use the underlying Avro type". For example, the Ruby implementation has included support for parsing decimal logical types before support encoding/decoding. Making this choice allows encoded data written when the decimal logical type was ignored to still be read by a future release.
   3. Similarly, fixed schemas will always match based on the usual (name & size) constraints for fixed schemas, and the decimal logical type is ignored for matching. Again, I think this is consistent with allowing data to be decoded even when decimal logical type was not implemented previously. This may be the situation again for the Ruby implementation where there is not yet fixed decimal logical type encoding/decoding support.
   
   The main downside that I see to the choices above is that decoding could return either a Decimal or the underlying Avro type value if decimal precision and scale do not match or the "decimal" value is invalid. This places more burden on applications if they are expecting a Decimal, but the upside is that data can be read without raising an error.
   
   The other implication of these choices is that the (bytes) decimal logical type implementation may need additional error handling for cases where the "decimal" is invalid and the underlying Avro type value should be returned.
   
   I'm open to other opinions, so please challenge this if you think the approach above is incorrect. If there are still questions I can take this to the mailing list to see if other maintainers have opinions on how this is supposed to work. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org