You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@avro.apache.org by GitBox <gi...@apache.org> on 2020/07/24 20:31:16 UTC

[GitHub] [avro] cewing opened a new pull request #936: Speculative: Traversal validation

cewing opened a new pull request #936:
URL: https://github.com/apache/avro/pull/936


   This PR replaces the existing `validate` function in `avro.io` with a new version that uses a traversal-based approach rather than a recursive approach.  It also establishes the concept of a "validator" function that handles validation of various schema types and an "iterator" function, which powers the traversal of specific schema types.
   
   The point of the work is two-fold.  First, by traversing rather than recursing, exceptions raised by validation can be raised immediately where the problem happened, allowing for error messages that are much more localized.  This is an advantage when working with very large schemas.  Second, by using traversal instead of recursion, this approach is more conservative of system resources, especially for deeply nested schemas.
   
   The goal of this pr specifically is to spur discussion of the approach we've taken and to seek approval from the community for the change.  I anticipate that it will not be acceptable entirely as-is, and will be happy to make any requested changes should the approach be approved.
   
   ### Jira
   
   - [ x] My PR is speculative in nature and is meant to spur discussion and approval of the approach.  Once that has happened, I will close this PR and open another with an attached JIRA issue.
   
   ### Tests
   
   - [ x] Validation testing is pretty good already, so this PR does not add any tests.  If more are required in order to verify the process works, I will take responsibility for adding them.
   
   ### Commits
   
   - [ -] My commits are well formed, but as yet there is no JIRA issue so they do not reference one.  I can fix that in any final PR to be submitted.
   
   
   
   ### Documentation
   
   - [ x] This PR does not add any new functionality.  It does however alter existing functionality, at least in terms of how it is output.  I am very open to discussion of the best way to document those changes should they be accepted.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-675129637


   @kojiromike the rebase is complete, only one commit now, and it's got the avro issue number in the first line.  
   
   I held off on the question of abstract method for the base `Schema.validate`, see my comment and question above.  I am still happy to do that work once the question is resolved.  


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] kojiromike commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
kojiromike commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-668742346


   @cewing if you have time to add additional tests that demonstrate the memory properties or performance characteristics that would be excellent. I planned to take a deeper look at this on the weekend, but life has got in the way recently. Apologies. I'll try to get to it in the next couple weeks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on a change in pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on a change in pull request #936:
URL: https://github.com/apache/avro/pull/936#discussion_r472464953



##########
File path: lang/py/avro/schema.py
##########
@@ -221,6 +232,18 @@ def to_json(self, names):
         """
         raise Exception("Must be implemented by subclasses.")
 
+    def validate(self, datum):

Review comment:
       in other comments we've decided to leave this work for another PR.  Resolving.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-674445733


   Iā€™m on the road today, but I do intend to rebase and add issue numbers. I will also update that abstract method. 
   
   Typed painstakingly with my thumbs and the active hindrance of autocorrect. 
   
   > On Aug 15, 2020, at 8:42 AM, Michael A. Smith <no...@github.com> wrote:
   > 
   > ļ»æ
   > @cewing LMK if you plan to do any other changes. When you're ready, LMK and I can merge this.
   > 
   > ā€”
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub, or unsubscribe.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on a change in pull request #936: Speculative: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on a change in pull request #936:
URL: https://github.com/apache/avro/pull/936#discussion_r462527900



##########
File path: lang/py/avro/schema.py
##########
@@ -919,6 +984,10 @@ def to_json(self, names=None):
         to_dump['fields'] = [f.to_json(names) for f in self.fields]
         return to_dump
 
+    def validate(self, datum):
+        """Return self if datum is a valid representation of this schema, else None"""
+        return self if isinstance(datum, dict) and {f.name for f in self.fields}.issuperset(datum.keys()) else None

Review comment:
       it is, but for record the same thing happens.  Children of containers  validate their data types for themselves.  




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] RyanSkraba commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
RyanSkraba commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-667893775


   Hello!  I've been either relaunching the failing job until it works, or ignoring the failures on the Windows container...  We have a JIRA AVRO-2847 tracking this, but I haven't been able to figure it out (or reproduce it locally).  
   
   You *should* be able to see the error logs in the container though.
   
   My apologies for this unfortunate state, I relaunched your PR until it's green!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-675603246


   @kojiromike, I have not yet rebased to master.  I was waiting to do that until we resolved the question of the `abstractmethod` approach.  Given that the base `Schema` class is not an `ABC`, writing it as an `abstractmethod` is not allowed.  Do we want to update the `Schema` class to be an `ABC` in this PR, or save that work for another time?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-676578942


   running now, @kojiromike 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] kojiromike commented on a change in pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
kojiromike commented on a change in pull request #936:
URL: https://github.com/apache/avro/pull/936#discussion_r471003846



##########
File path: lang/py/avro/schema.py
##########
@@ -493,6 +516,17 @@ def __eq__(self, that):
 class PrimitiveSchema(Schema):
     """Valid primitive types are in PRIMITIVE_TYPES."""
 
+    _validators = {

Review comment:
       While I started this approach, I realize now that these lambdas don't show up clearly in test coverage reports. We can leave it this way for now, but maybe in the future I should move these to named functions.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] kojiromike commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
kojiromike commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-674118096


   Oh, by the way, you do not need to support Python 2 anymore in this PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-668696661


   @kojiromike so all the tests pass now (thanks to @RyanSkraba!).  It's unclear to me what the next steps are.  The contributing docs in the avro wiki appear to suggest that one can *either* open a PR here or submit a patch via Jira.  Should I create a patch and submit it via Jira?  
   
   In addition, one of the claims of this PR is that it will be more memory efficient because of moving away from recursive processing.  Do I need to add a test for a more deeply nested schema that demonstrates this savings? Should I do some performance analysis that shows that this approach hasn't slowed down the parsing process for large schemas?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-666604481


   yeah, I was able to see that much, but it didn't mean much to me.  It did look like a failed setup, didn't appear even to get so far as running tests, but I figured I'd ask in case I was missing something.  
   
   What's the protocol for proceeding when there are semi-expected test failures like this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on a change in pull request #936: Speculative: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on a change in pull request #936:
URL: https://github.com/apache/avro/pull/936#discussion_r460323512



##########
File path: lang/py/avro/io.py
##########
@@ -112,84 +164,218 @@ def __init__(self, fail_msg, writers_schema=None, readers_schema=None):
             fail_msg += "\nReader's Schema: %s" % pretty_readers
         schema.AvroException.__init__(self, fail_msg)
 
+
 #
 # Validate
 #
 
-
 def _is_timezone_aware_datetime(dt):
     return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None
 
 
-_valid = {
-    'null': lambda s, d: d is None,
-    'boolean': lambda s, d: isinstance(d, bool),
-    'string': lambda s, d: isinstance(d, unicode),
-    'bytes': lambda s, d: ((isinstance(d, bytes)) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'int': lambda s, d: ((isinstance(d, (int, long))) and (INT_MIN_VALUE <= d <= INT_MAX_VALUE) or
-                         (isinstance(d, datetime.date) and
-                          getattr(s, 'logical_type', None) == constants.DATE) or
-                         (isinstance(d, datetime.time) and
-                          getattr(s, 'logical_type', None) == constants.TIME_MILLIS)),
-    'long': lambda s, d: ((isinstance(d, (int, long))) and (LONG_MIN_VALUE <= d <= LONG_MAX_VALUE) or
-                          (isinstance(d, datetime.time) and
-                           getattr(s, 'logical_type', None) == constants.TIME_MICROS) or
-                          (isinstance(d, datetime.date) and
-                           _is_timezone_aware_datetime(d) and
-                           getattr(s, 'logical_type', None) in (constants.TIMESTAMP_MILLIS,
-                                                                constants.TIMESTAMP_MICROS))),
-    'float': lambda s, d: isinstance(d, (int, long, float)),
-    'fixed': lambda s, d: ((isinstance(d, bytes) and len(d) == s.size) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'enum': lambda s, d: d in s.symbols,
-
-    'array': lambda s, d: isinstance(d, list) and all(validate(s.items, item) for item in d),
-    'map': lambda s, d: (isinstance(d, dict) and all(isinstance(key, unicode) for key in d) and
-                         all(validate(s.values, value) for value in d.values())),
-    'union': lambda s, d: any(validate(branch, d) for branch in s.schemas),
-    'record': lambda s, d: (isinstance(d, dict) and
-                            all(validate(f.type, d.get(f.name)) for f in s.fields) and
-                            {f.name for f in s.fields}.issuperset(d.keys())),
+def validate(expected_schema, datum, raise_on_error=False):
+    """Return True if the provided datum is valid for the excpted schema
+
+    If raise_on_error is passed and True, then raise a validation error
+    with specific information about the error encountered in validation.
+
+    :param expected_schema: An avro schema type object representing the schema against
+                            which the datum will be validated.
+    :param datum: The datum to be validated, A python dictionary or some supported type
+    :param raise_on_error: True if a AvroTypeException should be raised immediately when a
+                           validation problem is encountered.
+    :raises: AvroTyeException if datum is invalid and raise_on_error is True
+    :returns: True if datum is valid for expected_schema, False if not.
+    """
+    # use a FIFO queue to process schema nodes breadth first.
+    nodes = deque()
+    nodes.append(ValidationNode(expected_schema, datum, getattr(expected_schema, "name", None)))
+
+    while nodes:
+        current_node = nodes.popleft()
+
+        # _validate_node returns the node for iteration if it is valid. Or it returns None
+        valid_node = _validate_node(current_node)
+
+        if valid_node is not None:
+            # if there are children of this node to append, do so.
+            for child_node in _iterate_node(valid_node):
+                nodes.append(child_node)
+        else:
+            # the current node was not valid.
+            if raise_on_error:
+                raise AvroTypeException(current_node.schema, current_node.datum)
+            else:
+                # preserve the prior validation behavior of returning false when there are problems.
+                return False
+
+    return True
+
+
+def _validate_node(node):
+    """Return the result of applying the appropriate validator function to the provided node"""
+    # breakpoint()
+    key = (node.schema.type, getattr(node.schema, 'logical_type', None))
+    return _VALIDATORS.get(key, _default_validator)(node)
+
+
+def _iterate_node(node):
+    for item in _ITERATORS.get(node.schema.type, _default_iterator)(node):
+        yield ValidationNode(*item)
+
+
+#############
+# Iteration #
+#############
+
+def _default_iterator(_):
+    """Immediately raise StopIteration.
+
+    This exists to prevent problems with iteration over unsupported container types.
+    """
+    for blank in []:

Review comment:
       šŸ‘   (I completely forgot about that) :)
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on a change in pull request #936: Speculative: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on a change in pull request #936:
URL: https://github.com/apache/avro/pull/936#discussion_r460330066



##########
File path: lang/py/avro/io.py
##########
@@ -112,84 +164,218 @@ def __init__(self, fail_msg, writers_schema=None, readers_schema=None):
             fail_msg += "\nReader's Schema: %s" % pretty_readers
         schema.AvroException.__init__(self, fail_msg)
 
+
 #
 # Validate
 #
 
-
 def _is_timezone_aware_datetime(dt):
     return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None
 
 
-_valid = {
-    'null': lambda s, d: d is None,
-    'boolean': lambda s, d: isinstance(d, bool),
-    'string': lambda s, d: isinstance(d, unicode),
-    'bytes': lambda s, d: ((isinstance(d, bytes)) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'int': lambda s, d: ((isinstance(d, (int, long))) and (INT_MIN_VALUE <= d <= INT_MAX_VALUE) or
-                         (isinstance(d, datetime.date) and
-                          getattr(s, 'logical_type', None) == constants.DATE) or
-                         (isinstance(d, datetime.time) and
-                          getattr(s, 'logical_type', None) == constants.TIME_MILLIS)),
-    'long': lambda s, d: ((isinstance(d, (int, long))) and (LONG_MIN_VALUE <= d <= LONG_MAX_VALUE) or
-                          (isinstance(d, datetime.time) and
-                           getattr(s, 'logical_type', None) == constants.TIME_MICROS) or
-                          (isinstance(d, datetime.date) and
-                           _is_timezone_aware_datetime(d) and
-                           getattr(s, 'logical_type', None) in (constants.TIMESTAMP_MILLIS,
-                                                                constants.TIMESTAMP_MICROS))),
-    'float': lambda s, d: isinstance(d, (int, long, float)),
-    'fixed': lambda s, d: ((isinstance(d, bytes) and len(d) == s.size) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'enum': lambda s, d: d in s.symbols,
-
-    'array': lambda s, d: isinstance(d, list) and all(validate(s.items, item) for item in d),
-    'map': lambda s, d: (isinstance(d, dict) and all(isinstance(key, unicode) for key in d) and
-                         all(validate(s.values, value) for value in d.values())),
-    'union': lambda s, d: any(validate(branch, d) for branch in s.schemas),
-    'record': lambda s, d: (isinstance(d, dict) and
-                            all(validate(f.type, d.get(f.name)) for f in s.fields) and
-                            {f.name for f in s.fields}.issuperset(d.keys())),
+def validate(expected_schema, datum, raise_on_error=False):
+    """Return True if the provided datum is valid for the excpted schema
+
+    If raise_on_error is passed and True, then raise a validation error
+    with specific information about the error encountered in validation.
+
+    :param expected_schema: An avro schema type object representing the schema against
+                            which the datum will be validated.
+    :param datum: The datum to be validated, A python dictionary or some supported type
+    :param raise_on_error: True if a AvroTypeException should be raised immediately when a
+                           validation problem is encountered.
+    :raises: AvroTyeException if datum is invalid and raise_on_error is True
+    :returns: True if datum is valid for expected_schema, False if not.
+    """
+    # use a FIFO queue to process schema nodes breadth first.
+    nodes = deque()
+    nodes.append(ValidationNode(expected_schema, datum, getattr(expected_schema, "name", None)))
+
+    while nodes:
+        current_node = nodes.popleft()
+
+        # _validate_node returns the node for iteration if it is valid. Or it returns None
+        valid_node = _validate_node(current_node)
+
+        if valid_node is not None:
+            # if there are children of this node to append, do so.
+            for child_node in _iterate_node(valid_node):
+                nodes.append(child_node)
+        else:
+            # the current node was not valid.
+            if raise_on_error:
+                raise AvroTypeException(current_node.schema, current_node.datum)
+            else:
+                # preserve the prior validation behavior of returning false when there are problems.
+                return False
+
+    return True
+
+
+def _validate_node(node):
+    """Return the result of applying the appropriate validator function to the provided node"""
+    # breakpoint()
+    key = (node.schema.type, getattr(node.schema, 'logical_type', None))
+    return _VALIDATORS.get(key, _default_validator)(node)
+
+
+def _iterate_node(node):
+    for item in _ITERATORS.get(node.schema.type, _default_iterator)(node):
+        yield ValidationNode(*item)
+
+
+#############
+# Iteration #
+#############
+
+def _default_iterator(_):
+    """Immediately raise StopIteration.
+
+    This exists to prevent problems with iteration over unsupported container types.
+    """
+    for blank in []:

Review comment:
       Oh, wait.  Now I remember.  Mypy gave me problems with that as it wasn't really generator function.  Now that I've removed type hinting I can probably go back to that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] kojiromike merged pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
kojiromike merged pull request #936:
URL: https://github.com/apache/avro/pull/936


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] kojiromike commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
kojiromike commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-675783175


   Argh, a different, but still unrelated test failure. Can you push an empty commit to retrigger build? `git commit --allow-empty --fixup HEAD` is how I have done this kind of thing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] kojiromike commented on a change in pull request #936: Speculative: Traversal validation

Posted by GitBox <gi...@apache.org>.
kojiromike commented on a change in pull request #936:
URL: https://github.com/apache/avro/pull/936#discussion_r461544142



##########
File path: lang/py/avro/schema.py
##########
@@ -521,6 +541,21 @@ def to_json(self, names=None):
         else:
             return self.props
 
+    def validate(self, datum):
+        """Return self if datum is a valid representation of this type of primitive schema, else None"""
+        is_valid = {
+            'null': lambda x: x is None,
+            'boolean': lambda x: isinstance(x, bool),
+            'string': lambda x: isinstance(x, unicode),
+            'bytes': lambda x: isinstance(x, bytes),
+            'int': lambda x: isinstance(x, int) and INT_MIN_VALUE <= x <= INT_MAX_VALUE,
+            'long': lambda x: isinstance(x, int) and LONG_MIN_VALUE <= x <= LONG_MAX_VALUE,
+            'float': lambda x: isinstance(x, (int, float)),
+            'double': lambda x: isinstance(x, (int, float)),
+        }.get(self.type, lambda x: False)(datum)

Review comment:
       Just having the primitive schema validation isolated from _all the validators_ is nice.

##########
File path: lang/py/avro/schema.py
##########
@@ -919,6 +984,10 @@ def to_json(self, names=None):
         to_dump['fields'] = [f.to_json(names) for f in self.fields]
         return to_dump
 
+    def validate(self, datum):
+        """Return self if datum is a valid representation of this schema, else None"""
+        return self if isinstance(datum, dict) and {f.name for f in self.fields}.issuperset(datum.keys()) else None

Review comment:
       Should this not validate the field values?

##########
File path: lang/py/avro/schema.py
##########
@@ -521,6 +541,21 @@ def to_json(self, names=None):
         else:
             return self.props
 
+    def validate(self, datum):
+        """Return self if datum is a valid representation of this type of primitive schema, else None"""
+        is_valid = {
+            'null': lambda x: x is None,
+            'boolean': lambda x: isinstance(x, bool),
+            'string': lambda x: isinstance(x, unicode),
+            'bytes': lambda x: isinstance(x, bytes),
+            'int': lambda x: isinstance(x, int) and INT_MIN_VALUE <= x <= INT_MAX_VALUE,
+            'long': lambda x: isinstance(x, int) and LONG_MIN_VALUE <= x <= LONG_MAX_VALUE,
+            'float': lambda x: isinstance(x, (int, float)),
+            'double': lambda x: isinstance(x, (int, float)),
+        }.get(self.type, lambda x: False)(datum)

Review comment:
       I agree that this is not ideal. I'd prefer the types to exist outright, instead of `PrimitiveSchema`, just `IntSchema`, etc. Then mapping validators to types is natural. I've tried to do this several times, but got hung up in complexities. (There are a lot of Liskov violations in here.) IMO it's always worth another shot, but I wouldn't suggest taking it on in this PR, since this is already a big improvement.
   
   I'd suggest making this dictionary a class attribute and then deciding the validator at init time. But other than that I think it's about as good as it can be without going too far off track.

##########
File path: lang/py/avro/io.py
##########
@@ -112,84 +164,218 @@ def __init__(self, fail_msg, writers_schema=None, readers_schema=None):
             fail_msg += "\nReader's Schema: %s" % pretty_readers
         schema.AvroException.__init__(self, fail_msg)
 
+
 #
 # Validate
 #
 
-
 def _is_timezone_aware_datetime(dt):
     return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None
 
 
-_valid = {
-    'null': lambda s, d: d is None,
-    'boolean': lambda s, d: isinstance(d, bool),
-    'string': lambda s, d: isinstance(d, unicode),
-    'bytes': lambda s, d: ((isinstance(d, bytes)) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'int': lambda s, d: ((isinstance(d, (int, long))) and (INT_MIN_VALUE <= d <= INT_MAX_VALUE) or
-                         (isinstance(d, datetime.date) and
-                          getattr(s, 'logical_type', None) == constants.DATE) or
-                         (isinstance(d, datetime.time) and
-                          getattr(s, 'logical_type', None) == constants.TIME_MILLIS)),
-    'long': lambda s, d: ((isinstance(d, (int, long))) and (LONG_MIN_VALUE <= d <= LONG_MAX_VALUE) or
-                          (isinstance(d, datetime.time) and
-                           getattr(s, 'logical_type', None) == constants.TIME_MICROS) or
-                          (isinstance(d, datetime.date) and
-                           _is_timezone_aware_datetime(d) and
-                           getattr(s, 'logical_type', None) in (constants.TIMESTAMP_MILLIS,
-                                                                constants.TIMESTAMP_MICROS))),
-    'float': lambda s, d: isinstance(d, (int, long, float)),
-    'fixed': lambda s, d: ((isinstance(d, bytes) and len(d) == s.size) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'enum': lambda s, d: d in s.symbols,
-
-    'array': lambda s, d: isinstance(d, list) and all(validate(s.items, item) for item in d),
-    'map': lambda s, d: (isinstance(d, dict) and all(isinstance(key, unicode) for key in d) and
-                         all(validate(s.values, value) for value in d.values())),
-    'union': lambda s, d: any(validate(branch, d) for branch in s.schemas),
-    'record': lambda s, d: (isinstance(d, dict) and
-                            all(validate(f.type, d.get(f.name)) for f in s.fields) and
-                            {f.name for f in s.fields}.issuperset(d.keys())),
+def validate(expected_schema, datum, raise_on_error=False):
+    """Return True if the provided datum is valid for the excpted schema
+
+    If raise_on_error is passed and True, then raise a validation error
+    with specific information about the error encountered in validation.
+
+    :param expected_schema: An avro schema type object representing the schema against
+                            which the datum will be validated.
+    :param datum: The datum to be validated, A python dictionary or some supported type
+    :param raise_on_error: True if a AvroTypeException should be raised immediately when a
+                           validation problem is encountered.
+    :raises: AvroTyeException if datum is invalid and raise_on_error is True
+    :returns: True if datum is valid for expected_schema, False if not.
+    """
+    # use a FIFO queue to process schema nodes breadth first.
+    nodes = deque()
+    nodes.append(ValidationNode(expected_schema, datum, getattr(expected_schema, "name", None)))
+
+    while nodes:
+        current_node = nodes.popleft()
+
+        # _validate_node returns the node for iteration if it is valid. Or it returns None
+        valid_node = _validate_node(current_node)
+
+        if valid_node is not None:
+            # if there are children of this node to append, do so.
+            for child_node in _iterate_node(valid_node):
+                nodes.append(child_node)
+        else:
+            # the current node was not valid.
+            if raise_on_error:
+                raise AvroTypeException(current_node.schema, current_node.datum)
+            else:
+                # preserve the prior validation behavior of returning false when there are problems.
+                return False
+
+    return True
+
+
+def _validate_node(node):
+    """Return the result of applying the appropriate validator function to the provided node"""
+    # breakpoint()
+    key = (node.schema.type, getattr(node.schema, 'logical_type', None))
+    return _VALIDATORS.get(key, _default_validator)(node)
+
+
+def _iterate_node(node):
+    for item in _ITERATORS.get(node.schema.type, _default_iterator)(node):
+        yield ValidationNode(*item)
+
+
+#############
+# Iteration #
+#############
+
+def _default_iterator(_):
+    """Immediately raise StopIteration.
+
+    This exists to prevent problems with iteration over unsupported container types.
+    """
+    for blank in []:

Review comment:
       Sigh. Yeah, OK, that's one for the mailing list, isn't it?

##########
File path: lang/py/avro/schema.py
##########
@@ -919,6 +984,10 @@ def to_json(self, names=None):
         to_dump['fields'] = [f.to_json(names) for f in self.fields]
         return to_dump
 
+    def validate(self, datum):
+        """Return self if datum is a valid representation of this schema, else None"""
+        return self if isinstance(datum, dict) and {f.name for f in self.fields}.issuperset(datum.keys()) else None

Review comment:
       Is this map schema? I thought it was record. Sorry! Very hard to tell in github mobile.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on pull request #936: Speculative: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-663722556


   Attn: @kojiromike, this PR is in relation to [this email thread on the dev list](https://lists.apache.org/thread.html/r645b831ce0054257fd1d4118d424bcec594be53753824077c417d22f%40%3Cdev.avro.apache.org%3E)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on pull request #936: Speculative: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-663771238


   Absolutely this is a disguised form of singledispatch.  My first approach here was really designed to be a drop-in replacement for the existing approach, which is why it keeps things like dict lookup of validator by schema type (and now by logical_type as well).  It was originally meant to be monkey patched into place so that we could solve our problem with error messages while the longer process of this conversation took place.
   
   I'm pretty sure I can make another pass at this that uses `validate` as a method of the schema class.  That would be pretty clean.  
   
   Any other thoughts while we look at this first pass?
       
           


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] kojiromike commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
kojiromike commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-675643123


   Save that work for another time. Maybe I'll do it as part of getting rid of all the Python 2 polyfills.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] kojiromike commented on a change in pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
kojiromike commented on a change in pull request #936:
URL: https://github.com/apache/avro/pull/936#discussion_r471003460



##########
File path: lang/py/avro/schema.py
##########
@@ -221,6 +232,18 @@ def to_json(self, names):
         """
         raise Exception("Must be implemented by subclasses.")
 
+    def validate(self, datum):

Review comment:
       Can this be an abstractmethod? If you don't want to do it in this PR I can make all these de-facto abstract methods official in a later batch change.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on a change in pull request #936: Speculative: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on a change in pull request #936:
URL: https://github.com/apache/avro/pull/936#discussion_r461225097



##########
File path: lang/py/avro/io.py
##########
@@ -112,84 +164,218 @@ def __init__(self, fail_msg, writers_schema=None, readers_schema=None):
             fail_msg += "\nReader's Schema: %s" % pretty_readers
         schema.AvroException.__init__(self, fail_msg)
 
+
 #
 # Validate
 #
 
-
 def _is_timezone_aware_datetime(dt):
     return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None
 
 
-_valid = {
-    'null': lambda s, d: d is None,
-    'boolean': lambda s, d: isinstance(d, bool),
-    'string': lambda s, d: isinstance(d, unicode),
-    'bytes': lambda s, d: ((isinstance(d, bytes)) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'int': lambda s, d: ((isinstance(d, (int, long))) and (INT_MIN_VALUE <= d <= INT_MAX_VALUE) or
-                         (isinstance(d, datetime.date) and
-                          getattr(s, 'logical_type', None) == constants.DATE) or
-                         (isinstance(d, datetime.time) and
-                          getattr(s, 'logical_type', None) == constants.TIME_MILLIS)),
-    'long': lambda s, d: ((isinstance(d, (int, long))) and (LONG_MIN_VALUE <= d <= LONG_MAX_VALUE) or
-                          (isinstance(d, datetime.time) and
-                           getattr(s, 'logical_type', None) == constants.TIME_MICROS) or
-                          (isinstance(d, datetime.date) and
-                           _is_timezone_aware_datetime(d) and
-                           getattr(s, 'logical_type', None) in (constants.TIMESTAMP_MILLIS,
-                                                                constants.TIMESTAMP_MICROS))),
-    'float': lambda s, d: isinstance(d, (int, long, float)),
-    'fixed': lambda s, d: ((isinstance(d, bytes) and len(d) == s.size) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'enum': lambda s, d: d in s.symbols,
-
-    'array': lambda s, d: isinstance(d, list) and all(validate(s.items, item) for item in d),
-    'map': lambda s, d: (isinstance(d, dict) and all(isinstance(key, unicode) for key in d) and
-                         all(validate(s.values, value) for value in d.values())),
-    'union': lambda s, d: any(validate(branch, d) for branch in s.schemas),
-    'record': lambda s, d: (isinstance(d, dict) and
-                            all(validate(f.type, d.get(f.name)) for f in s.fields) and
-                            {f.name for f in s.fields}.issuperset(d.keys())),
+def validate(expected_schema, datum, raise_on_error=False):
+    """Return True if the provided datum is valid for the excpted schema
+
+    If raise_on_error is passed and True, then raise a validation error
+    with specific information about the error encountered in validation.
+
+    :param expected_schema: An avro schema type object representing the schema against
+                            which the datum will be validated.
+    :param datum: The datum to be validated, A python dictionary or some supported type
+    :param raise_on_error: True if a AvroTypeException should be raised immediately when a
+                           validation problem is encountered.
+    :raises: AvroTyeException if datum is invalid and raise_on_error is True
+    :returns: True if datum is valid for expected_schema, False if not.
+    """
+    # use a FIFO queue to process schema nodes breadth first.
+    nodes = deque()
+    nodes.append(ValidationNode(expected_schema, datum, getattr(expected_schema, "name", None)))
+
+    while nodes:
+        current_node = nodes.popleft()
+
+        # _validate_node returns the node for iteration if it is valid. Or it returns None
+        valid_node = _validate_node(current_node)
+
+        if valid_node is not None:
+            # if there are children of this node to append, do so.
+            for child_node in _iterate_node(valid_node):
+                nodes.append(child_node)
+        else:
+            # the current node was not valid.
+            if raise_on_error:
+                raise AvroTypeException(current_node.schema, current_node.datum)
+            else:
+                # preserve the prior validation behavior of returning false when there are problems.
+                return False
+
+    return True
+
+
+def _validate_node(node):
+    """Return the result of applying the appropriate validator function to the provided node"""
+    # breakpoint()
+    key = (node.schema.type, getattr(node.schema, 'logical_type', None))
+    return _VALIDATORS.get(key, _default_validator)(node)
+
+
+def _iterate_node(node):
+    for item in _ITERATORS.get(node.schema.type, _default_iterator)(node):
+        yield ValidationNode(*item)
+
+
+#############
+# Iteration #
+#############
+
+def _default_iterator(_):
+    """Immediately raise StopIteration.
+
+    This exists to prevent problems with iteration over unsupported container types.
+    """
+    for blank in []:

Review comment:
       Unfortunately, that syntax is illegal in Python 2.7 so as long as supporting it is required, we can't use that.  I went back to the first implementation in the long run.

##########
File path: lang/py/avro/schema.py
##########
@@ -919,6 +984,10 @@ def to_json(self, names=None):
         to_dump['fields'] = [f.to_json(names) for f in self.fields]
         return to_dump
 
+    def validate(self, datum):
+        """Return self if datum is a valid representation of this schema, else None"""
+        return self if isinstance(datum, dict) and {f.name for f in self.fields}.issuperset(datum.keys()) else None

Review comment:
       Values are validated as children of the map schema.  Check the `_map_iterator` function in `io.py`.  That way, you get an exception that is specific to the field within the map that is wrong, not just an indication that there's a problem inside the map.  




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-675693798


   @kojiromike, great.  Fixed.  All set to merge when you're ready.  Thank you for helping to get me through this first contribution!  


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on pull request #936: Speculative: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-664688533






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-665958500


   There appears to be a failure in testing on a windows container.  The other two runs of the tests are passing, however.  Not sure what to do about that.  Would someone with access look at the test results and let me know if there's something I can fix?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] kojiromike commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
kojiromike commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-674114490


   @cewing I'm going to look at this today. Trying to come up with some good manual test cases. Also, do you want to rebase -i and edit your commits to include the ticket number?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] kojiromike commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
kojiromike commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-666370103


   > There appears to be a failure in testing on a windows container. The other two runs of the tests are passing, however. Not sure what to do about that. Would someone with access look at the test results and let me know if there's something I can fix?
   
   This failure is not caused by this PR. The Windows build is perennial flaky and to be honest I don't know exactly what it's based on that causes this kind of behavior.
   
   I am surprised that at least readonly access to Travis isn't universal. (That said, you can set up TravisCI on your fork pretty easily if you want to have full access to what it does.)
   
   ```
   C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1603: Apache.Avro depends on Newtonsoft.Json (>= 10.0.3) but Newtonsoft.Json 10.0.3 was not found. An approximate best match of Newtonsoft.Json 11.0.2 was resolved. [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]
   
   C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package nunit. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]
   
   C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package nunit3testadapter. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]
   
   C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package NUnit.ConsoleRunner. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]
   
   C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package Microsoft.NET.Test.Sdk. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]
   
   C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package System.CodeDom. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]
   
     Restore failed in 95.21 ms for C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj.
   
     Restore completed in 1.23 sec for C:\Users\travis\build\apache\avro\lang\csharp\src\apache\perf\Avro.perf.csproj.
   
   Build FAILED.
   
   C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1603: Apache.Avro depends on Newtonsoft.Json (>= 10.0.3) but Newtonsoft.Json 10.0.3 was not found. An approximate best match of Newtonsoft.Json 11.0.2 was resolved. [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]
   
   C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package nunit. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]
   
   C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package nunit3testadapter. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]
   
   C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package NUnit.ConsoleRunner. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]
   
   C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package Microsoft.NET.Test.Sdk. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]
   
   C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package System.CodeDom. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on a change in pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on a change in pull request #936:
URL: https://github.com/apache/avro/pull/936#discussion_r471638086



##########
File path: lang/py/avro/schema.py
##########
@@ -221,6 +232,18 @@ def to_json(self, names):
         """
         raise Exception("Must be implemented by subclasses.")
 
+    def validate(self, datum):

Review comment:
       @kojiromike, I was following the pattern of the `to_json` method above this, which i guess might also want to be an abstract method.  Is it possible to add `abstractmethod`s to non-ABC classes?

##########
File path: lang/py/avro/schema.py
##########
@@ -493,6 +516,17 @@ def __eq__(self, that):
 class PrimitiveSchema(Schema):
     """Valid primitive types are in PRIMITIVE_TYPES."""
 
+    _validators = {

Review comment:
       yeah, I'd like that.  




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] cewing commented on a change in pull request #936: Speculative: Traversal validation

Posted by GitBox <gi...@apache.org>.
cewing commented on a change in pull request #936:
URL: https://github.com/apache/avro/pull/936#discussion_r460323512



##########
File path: lang/py/avro/io.py
##########
@@ -112,84 +164,218 @@ def __init__(self, fail_msg, writers_schema=None, readers_schema=None):
             fail_msg += "\nReader's Schema: %s" % pretty_readers
         schema.AvroException.__init__(self, fail_msg)
 
+
 #
 # Validate
 #
 
-
 def _is_timezone_aware_datetime(dt):
     return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None
 
 
-_valid = {
-    'null': lambda s, d: d is None,
-    'boolean': lambda s, d: isinstance(d, bool),
-    'string': lambda s, d: isinstance(d, unicode),
-    'bytes': lambda s, d: ((isinstance(d, bytes)) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'int': lambda s, d: ((isinstance(d, (int, long))) and (INT_MIN_VALUE <= d <= INT_MAX_VALUE) or
-                         (isinstance(d, datetime.date) and
-                          getattr(s, 'logical_type', None) == constants.DATE) or
-                         (isinstance(d, datetime.time) and
-                          getattr(s, 'logical_type', None) == constants.TIME_MILLIS)),
-    'long': lambda s, d: ((isinstance(d, (int, long))) and (LONG_MIN_VALUE <= d <= LONG_MAX_VALUE) or
-                          (isinstance(d, datetime.time) and
-                           getattr(s, 'logical_type', None) == constants.TIME_MICROS) or
-                          (isinstance(d, datetime.date) and
-                           _is_timezone_aware_datetime(d) and
-                           getattr(s, 'logical_type', None) in (constants.TIMESTAMP_MILLIS,
-                                                                constants.TIMESTAMP_MICROS))),
-    'float': lambda s, d: isinstance(d, (int, long, float)),
-    'fixed': lambda s, d: ((isinstance(d, bytes) and len(d) == s.size) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'enum': lambda s, d: d in s.symbols,
-
-    'array': lambda s, d: isinstance(d, list) and all(validate(s.items, item) for item in d),
-    'map': lambda s, d: (isinstance(d, dict) and all(isinstance(key, unicode) for key in d) and
-                         all(validate(s.values, value) for value in d.values())),
-    'union': lambda s, d: any(validate(branch, d) for branch in s.schemas),
-    'record': lambda s, d: (isinstance(d, dict) and
-                            all(validate(f.type, d.get(f.name)) for f in s.fields) and
-                            {f.name for f in s.fields}.issuperset(d.keys())),
+def validate(expected_schema, datum, raise_on_error=False):
+    """Return True if the provided datum is valid for the excpted schema
+
+    If raise_on_error is passed and True, then raise a validation error
+    with specific information about the error encountered in validation.
+
+    :param expected_schema: An avro schema type object representing the schema against
+                            which the datum will be validated.
+    :param datum: The datum to be validated, A python dictionary or some supported type
+    :param raise_on_error: True if a AvroTypeException should be raised immediately when a
+                           validation problem is encountered.
+    :raises: AvroTyeException if datum is invalid and raise_on_error is True
+    :returns: True if datum is valid for expected_schema, False if not.
+    """
+    # use a FIFO queue to process schema nodes breadth first.
+    nodes = deque()
+    nodes.append(ValidationNode(expected_schema, datum, getattr(expected_schema, "name", None)))
+
+    while nodes:
+        current_node = nodes.popleft()
+
+        # _validate_node returns the node for iteration if it is valid. Or it returns None
+        valid_node = _validate_node(current_node)
+
+        if valid_node is not None:
+            # if there are children of this node to append, do so.
+            for child_node in _iterate_node(valid_node):
+                nodes.append(child_node)
+        else:
+            # the current node was not valid.
+            if raise_on_error:
+                raise AvroTypeException(current_node.schema, current_node.datum)
+            else:
+                # preserve the prior validation behavior of returning false when there are problems.
+                return False
+
+    return True
+
+
+def _validate_node(node):
+    """Return the result of applying the appropriate validator function to the provided node"""
+    # breakpoint()
+    key = (node.schema.type, getattr(node.schema, 'logical_type', None))
+    return _VALIDATORS.get(key, _default_validator)(node)
+
+
+def _iterate_node(node):
+    for item in _ITERATORS.get(node.schema.type, _default_iterator)(node):
+        yield ValidationNode(*item)
+
+
+#############
+# Iteration #
+#############
+
+def _default_iterator(_):
+    """Immediately raise StopIteration.
+
+    This exists to prevent problems with iteration over unsupported container types.
+    """
+    for blank in []:

Review comment:
       šŸ‘ 
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] kojiromike commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
kojiromike commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-675481774


   @cewing hmm, github is still reporting conflicts. To be sure, did you rebase against `apache/master`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] kojiromike commented on pull request #936: AVRO-2906: Traversal validation

Posted by GitBox <gi...@apache.org>.
kojiromike commented on pull request #936:
URL: https://github.com/apache/avro/pull/936#issuecomment-674413793


   @cewing LMK if you plan to do any other changes. When you're ready, LMK and I can merge this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [avro] kojiromike commented on a change in pull request #936: Speculative: Traversal validation

Posted by GitBox <gi...@apache.org>.
kojiromike commented on a change in pull request #936:
URL: https://github.com/apache/avro/pull/936#discussion_r460289855



##########
File path: lang/py/avro/io.py
##########
@@ -112,84 +164,218 @@ def __init__(self, fail_msg, writers_schema=None, readers_schema=None):
             fail_msg += "\nReader's Schema: %s" % pretty_readers
         schema.AvroException.__init__(self, fail_msg)
 
+
 #
 # Validate
 #
 
-
 def _is_timezone_aware_datetime(dt):
     return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None
 
 
-_valid = {
-    'null': lambda s, d: d is None,
-    'boolean': lambda s, d: isinstance(d, bool),
-    'string': lambda s, d: isinstance(d, unicode),
-    'bytes': lambda s, d: ((isinstance(d, bytes)) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'int': lambda s, d: ((isinstance(d, (int, long))) and (INT_MIN_VALUE <= d <= INT_MAX_VALUE) or
-                         (isinstance(d, datetime.date) and
-                          getattr(s, 'logical_type', None) == constants.DATE) or
-                         (isinstance(d, datetime.time) and
-                          getattr(s, 'logical_type', None) == constants.TIME_MILLIS)),
-    'long': lambda s, d: ((isinstance(d, (int, long))) and (LONG_MIN_VALUE <= d <= LONG_MAX_VALUE) or
-                          (isinstance(d, datetime.time) and
-                           getattr(s, 'logical_type', None) == constants.TIME_MICROS) or
-                          (isinstance(d, datetime.date) and
-                           _is_timezone_aware_datetime(d) and
-                           getattr(s, 'logical_type', None) in (constants.TIMESTAMP_MILLIS,
-                                                                constants.TIMESTAMP_MICROS))),
-    'float': lambda s, d: isinstance(d, (int, long, float)),
-    'fixed': lambda s, d: ((isinstance(d, bytes) and len(d) == s.size) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'enum': lambda s, d: d in s.symbols,
-
-    'array': lambda s, d: isinstance(d, list) and all(validate(s.items, item) for item in d),
-    'map': lambda s, d: (isinstance(d, dict) and all(isinstance(key, unicode) for key in d) and
-                         all(validate(s.values, value) for value in d.values())),
-    'union': lambda s, d: any(validate(branch, d) for branch in s.schemas),
-    'record': lambda s, d: (isinstance(d, dict) and
-                            all(validate(f.type, d.get(f.name)) for f in s.fields) and
-                            {f.name for f in s.fields}.issuperset(d.keys())),
+def validate(expected_schema, datum, raise_on_error=False):
+    """Return True if the provided datum is valid for the excpted schema

Review comment:
       expected*

##########
File path: lang/py/avro/io.py
##########
@@ -112,84 +164,218 @@ def __init__(self, fail_msg, writers_schema=None, readers_schema=None):
             fail_msg += "\nReader's Schema: %s" % pretty_readers
         schema.AvroException.__init__(self, fail_msg)
 
+
 #
 # Validate
 #
 
-
 def _is_timezone_aware_datetime(dt):
     return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None
 
 
-_valid = {
-    'null': lambda s, d: d is None,
-    'boolean': lambda s, d: isinstance(d, bool),
-    'string': lambda s, d: isinstance(d, unicode),
-    'bytes': lambda s, d: ((isinstance(d, bytes)) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'int': lambda s, d: ((isinstance(d, (int, long))) and (INT_MIN_VALUE <= d <= INT_MAX_VALUE) or
-                         (isinstance(d, datetime.date) and
-                          getattr(s, 'logical_type', None) == constants.DATE) or
-                         (isinstance(d, datetime.time) and
-                          getattr(s, 'logical_type', None) == constants.TIME_MILLIS)),
-    'long': lambda s, d: ((isinstance(d, (int, long))) and (LONG_MIN_VALUE <= d <= LONG_MAX_VALUE) or
-                          (isinstance(d, datetime.time) and
-                           getattr(s, 'logical_type', None) == constants.TIME_MICROS) or
-                          (isinstance(d, datetime.date) and
-                           _is_timezone_aware_datetime(d) and
-                           getattr(s, 'logical_type', None) in (constants.TIMESTAMP_MILLIS,
-                                                                constants.TIMESTAMP_MICROS))),
-    'float': lambda s, d: isinstance(d, (int, long, float)),
-    'fixed': lambda s, d: ((isinstance(d, bytes) and len(d) == s.size) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'enum': lambda s, d: d in s.symbols,
-
-    'array': lambda s, d: isinstance(d, list) and all(validate(s.items, item) for item in d),
-    'map': lambda s, d: (isinstance(d, dict) and all(isinstance(key, unicode) for key in d) and
-                         all(validate(s.values, value) for value in d.values())),
-    'union': lambda s, d: any(validate(branch, d) for branch in s.schemas),
-    'record': lambda s, d: (isinstance(d, dict) and
-                            all(validate(f.type, d.get(f.name)) for f in s.fields) and
-                            {f.name for f in s.fields}.issuperset(d.keys())),
+def validate(expected_schema, datum, raise_on_error=False):
+    """Return True if the provided datum is valid for the excpted schema
+
+    If raise_on_error is passed and True, then raise a validation error
+    with specific information about the error encountered in validation.
+
+    :param expected_schema: An avro schema type object representing the schema against
+                            which the datum will be validated.
+    :param datum: The datum to be validated, A python dictionary or some supported type
+    :param raise_on_error: True if a AvroTypeException should be raised immediately when a
+                           validation problem is encountered.
+    :raises: AvroTyeException if datum is invalid and raise_on_error is True
+    :returns: True if datum is valid for expected_schema, False if not.
+    """
+    # use a FIFO queue to process schema nodes breadth first.
+    nodes = deque()
+    nodes.append(ValidationNode(expected_schema, datum, getattr(expected_schema, "name", None)))
+
+    while nodes:
+        current_node = nodes.popleft()
+
+        # _validate_node returns the node for iteration if it is valid. Or it returns None
+        valid_node = _validate_node(current_node)
+
+        if valid_node is not None:
+            # if there are children of this node to append, do so.
+            for child_node in _iterate_node(valid_node):
+                nodes.append(child_node)
+        else:
+            # the current node was not valid.
+            if raise_on_error:
+                raise AvroTypeException(current_node.schema, current_node.datum)
+            else:
+                # preserve the prior validation behavior of returning false when there are problems.
+                return False
+
+    return True
+
+
+def _validate_node(node):
+    """Return the result of applying the appropriate validator function to the provided node"""
+    # breakpoint()
+    key = (node.schema.type, getattr(node.schema, 'logical_type', None))
+    return _VALIDATORS.get(key, _default_validator)(node)
+
+
+def _iterate_node(node):
+    for item in _ITERATORS.get(node.schema.type, _default_iterator)(node):
+        yield ValidationNode(*item)
+
+
+#############
+# Iteration #
+#############
+
+def _default_iterator(_):
+    """Immediately raise StopIteration.
+
+    This exists to prevent problems with iteration over unsupported container types.
+    """
+    for blank in []:

Review comment:
       Why not `raise StopIteration` explicitly?

##########
File path: lang/py/avro/io.py
##########
@@ -112,84 +164,218 @@ def __init__(self, fail_msg, writers_schema=None, readers_schema=None):
             fail_msg += "\nReader's Schema: %s" % pretty_readers
         schema.AvroException.__init__(self, fail_msg)
 
+
 #
 # Validate
 #
 
-
 def _is_timezone_aware_datetime(dt):
     return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None
 
 
-_valid = {
-    'null': lambda s, d: d is None,
-    'boolean': lambda s, d: isinstance(d, bool),
-    'string': lambda s, d: isinstance(d, unicode),
-    'bytes': lambda s, d: ((isinstance(d, bytes)) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'int': lambda s, d: ((isinstance(d, (int, long))) and (INT_MIN_VALUE <= d <= INT_MAX_VALUE) or
-                         (isinstance(d, datetime.date) and
-                          getattr(s, 'logical_type', None) == constants.DATE) or
-                         (isinstance(d, datetime.time) and
-                          getattr(s, 'logical_type', None) == constants.TIME_MILLIS)),
-    'long': lambda s, d: ((isinstance(d, (int, long))) and (LONG_MIN_VALUE <= d <= LONG_MAX_VALUE) or
-                          (isinstance(d, datetime.time) and
-                           getattr(s, 'logical_type', None) == constants.TIME_MICROS) or
-                          (isinstance(d, datetime.date) and
-                           _is_timezone_aware_datetime(d) and
-                           getattr(s, 'logical_type', None) in (constants.TIMESTAMP_MILLIS,
-                                                                constants.TIMESTAMP_MICROS))),
-    'float': lambda s, d: isinstance(d, (int, long, float)),
-    'fixed': lambda s, d: ((isinstance(d, bytes) and len(d) == s.size) or
-                           (isinstance(d, Decimal) and
-                            getattr(s, 'logical_type', None) == constants.DECIMAL)),
-    'enum': lambda s, d: d in s.symbols,
-
-    'array': lambda s, d: isinstance(d, list) and all(validate(s.items, item) for item in d),
-    'map': lambda s, d: (isinstance(d, dict) and all(isinstance(key, unicode) for key in d) and
-                         all(validate(s.values, value) for value in d.values())),
-    'union': lambda s, d: any(validate(branch, d) for branch in s.schemas),
-    'record': lambda s, d: (isinstance(d, dict) and
-                            all(validate(f.type, d.get(f.name)) for f in s.fields) and
-                            {f.name for f in s.fields}.issuperset(d.keys())),
+def validate(expected_schema, datum, raise_on_error=False):
+    """Return True if the provided datum is valid for the excpted schema
+
+    If raise_on_error is passed and True, then raise a validation error
+    with specific information about the error encountered in validation.
+
+    :param expected_schema: An avro schema type object representing the schema against
+                            which the datum will be validated.
+    :param datum: The datum to be validated, A python dictionary or some supported type
+    :param raise_on_error: True if a AvroTypeException should be raised immediately when a
+                           validation problem is encountered.
+    :raises: AvroTyeException if datum is invalid and raise_on_error is True

Review comment:
       Type*




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org