Posted to reviews@spark.apache.org by blrnw3 <gi...@git.apache.org> on 2017/03/08 17:56:19 UTC

[GitHub] spark pull request #17213: [SPARK-19871] [PySpark][SQL] Improve error messag...

GitHub user blrnw3 opened a pull request:

    https://github.com/apache/spark/pull/17213

    [SPARK-19871] [PySpark][SQL] Improve error message in verify_type to indicate the field

    https://issues.apache.org/jira/browse/SPARK-19871
    
    ## What changes were proposed in this pull request?
    Improve the error message in verify_type to indicate the field responsible for the verification error. This is incredibly useful for tracking down type/nullability errors.
    
    ## How was this patch tested?
    Unit tests pass
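    For concreteness, this is the shape of the change (a sketch rebuilt from the patch's format strings, not the patch itself; the field name `age` and its type are hypothetical):

    ```python
    # Sketch only: the before/after message, reproduced with the same
    # format strings the diff uses. Field name "age" is hypothetical.
    name, dataType = "age", "IntegerType"

    before = "This field is not nullable, but got None"
    after = "This field ({}, of type {}) is not nullable, but got None".format(name, dataType)

    print(before)  # no hint of which field failed
    print(after)   # names the offending field and its type
    ```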
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/blrnw3/spark fix/verify_type_err_msg

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17213.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17213
    
----
commit 308b4c685580848f5688704b1d73f8a5fabe85af
Author: Ben Lee Rodgers <br...@quartethealth.com>
Date:   2017-03-08T17:43:50Z

    Improve error message in verify_type to indicate which field the error refers to

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17213: [SPARK-19871] [PySpark][SQL] Improve error message in ve...

Posted by blrnw3 <gi...@git.apache.org>.
Github user blrnw3 commented on the issue:

    https://github.com/apache/spark/pull/17213
  
    Will add a doctest and check linting.
    Have added before/after error messages to the PR description.




[GitHub] spark pull request #17213: [SPARK-19871] [PySpark][SQL] Improve error messag...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17213#discussion_r105063798
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
     }
     
     
    -def _verify_type(obj, dataType, nullable=True):
    +def _verify_type(obj, dataType, nullable=True, name=None):
    --- End diff --
    
    I think we need a doctest below.
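    A doctest along those lines might look like this (a sketch using a simplified stand-in, since the real `_verify_type` lives in `pyspark.sql.types`; the exact wording depends on the final message format):

    ```python
    import doctest

    def _verify_type(obj, dataType, nullable=True, name=None):
        """Simplified stand-in for the patched check (sketch, not PySpark code).

        >>> _verify_type(None, "StringType", nullable=False, name="f")
        Traceback (most recent call last):
            ...
        ValueError: This field (f, of type StringType) is not nullable, but got None
        """
        if obj is None and not nullable:
            raise ValueError("This field ({}, of type {}) is not nullable, but got None"
                             .format(name, dataType))

    # Run just this docstring; prints nothing when the example passes.
    doctest.run_docstring_examples(_verify_type, {"_verify_type": _verify_type})
    ```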




[GitHub] spark pull request #17213: [SPARK-19871] [PySpark][SQL] Improve error messag...

Posted by blrnw3 <gi...@git.apache.org>.
Github user blrnw3 closed the pull request at:

    https://github.com/apache/spark/pull/17213




[GitHub] spark pull request #17213: [SPARK-19871] [PySpark][SQL] Improve error messag...

Posted by dgingrich <gi...@git.apache.org>.
Github user dgingrich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17213#discussion_r105246855
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1300,70 +1300,71 @@ def _verify_type(obj, dataType, nullable=True):
             if nullable:
                 return
             else:
    -            raise ValueError("This field is not nullable, but got None")
    +            raise ValueError("This field ({}, of type {}) is not nullable, but got None".format(
    +                name, dataType))
     
         # StringType can work with any types
         if isinstance(dataType, StringType):
             return
     
         if isinstance(dataType, UserDefinedType):
             if not (hasattr(obj, '__UDT__') and obj.__UDT__ == dataType):
    -            raise ValueError("%r is not an instance of type %r" % (obj, dataType))
    +            raise ValueError("%r is not an instance of type %r for field %s" % (obj, dataType, name))
             _verify_type(dataType.toInternal(obj), dataType.sqlType())
             return
     
         _type = type(dataType)
    -    assert _type in _acceptable_types, "unknown datatype: %s for object %r" % (dataType, obj)
    +    assert _type in _acceptable_types, "unknown datatype: %s for object %r for field %s" % (dataType, obj, name)
     
         if _type is StructType:
             # check the type and fields later
             pass
         else:
             # subclass of them can not be fromInternal in JVM
             if type(obj) not in _acceptable_types[_type]:
    -            raise TypeError("%s can not accept object %r in type %s" % (dataType, obj, type(obj)))
    +            raise TypeError("%s can not accept object %r in type %s for field %s" % (dataType, obj, type(obj), name))
     
         if isinstance(dataType, ByteType):
             if obj < -128 or obj > 127:
    -            raise ValueError("object of ByteType out of range, got: %s" % obj)
    +            raise ValueError("object of ByteType out of range, got: %s for field %s" % (obj, name))
     
         elif isinstance(dataType, ShortType):
             if obj < -32768 or obj > 32767:
    -            raise ValueError("object of ShortType out of range, got: %s" % obj)
    +            raise ValueError("object of ShortType out of range, got: %s for field %s" % (obj, name))
     
         elif isinstance(dataType, IntegerType):
             if obj < -2147483648 or obj > 2147483647:
    -            raise ValueError("object of IntegerType out of range, got: %s" % obj)
    +            raise ValueError("object of IntegerType out of range, got: %s for field %s" % (obj, name))
     
         elif isinstance(dataType, ArrayType):
             for i in obj:
    -            _verify_type(i, dataType.elementType, dataType.containsNull)
    +            _verify_type(i, dataType.elementType, dataType.containsNull, name)
     
         elif isinstance(dataType, MapType):
             for k, v in obj.items():
    -            _verify_type(k, dataType.keyType, False)
    -            _verify_type(v, dataType.valueType, dataType.valueContainsNull)
    +            _verify_type(k, dataType.keyType, False, name)
    +            _verify_type(v, dataType.valueType, dataType.valueContainsNull, name)
     
         elif isinstance(dataType, StructType):
             if isinstance(obj, dict):
                 for f in dataType.fields:
    -                _verify_type(obj.get(f.name), f.dataType, f.nullable)
    +                _verify_type(obj.get(f.name), f.dataType, f.nullable, f.name)
    --- End diff --
    
    This doesn't work that well for nested structs:
    
    ```python
    MySubType = StructType([StructField('value', StringType(), nullable=False)])
    MyType = StructType([
        StructField('one', MySubType),
        StructField('two', MySubType)])
    
    _verify_type({'one': {'value': 'good'}, 'two': {'value': None}}, MyType)
    # "This field (value, of type StringType) is not nullable, but got None"
    # But is it one.value or two.value?
    ```
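    
    One way to resolve that ambiguity (a sketch, not code from this PR: a hypothetical `verify` helper over a simplified schema representation) is to build a dotted path as verification recurses into structs:
    
    ```python
    # Sketch: compose a dotted path while recursing, so the error reads
    # "two.value" instead of the ambiguous "value". The schema here is a
    # simplified list of (field_name, nullable, subfields_or_None) tuples.
    def verify(obj, fields, name=""):
        for fname, nullable, sub in fields:
            path = "{}.{}".format(name, fname) if name else fname
            value = obj.get(fname)
            if value is None and not nullable:
                raise ValueError("This field (%s) is not nullable, but got None" % path)
            if sub is not None and value is not None:
                verify(value, sub, name=path)

    sub = [("value", False, None)]
    schema = [("one", True, sub), ("two", True, sub)]
    try:
        verify({"one": {"value": "good"}, "two": {"value": None}}, schema)
    except ValueError as e:
        print(e)  # This field (two.value) is not nullable, but got None
    ```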




[GitHub] spark issue #17213: [SPARK-19871] [PySpark][SQL] Improve error message in ve...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17213
  
    Can one of the admins verify this patch?




[GitHub] spark issue #17213: [SPARK-19871] [PySpark][SQL] Improve error message in ve...

Posted by blrnw3 <gi...@git.apache.org>.
Github user blrnw3 commented on the issue:

    https://github.com/apache/spark/pull/17213
  
    Yours looks more complete, so I'm willing to close this if you finish it.




[GitHub] spark pull request #17213: [SPARK-19871] [PySpark][SQL] Improve error messag...

Posted by dgingrich <gi...@git.apache.org>.
Github user dgingrich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17213#discussion_r105247143
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1300,70 +1300,71 @@ def _verify_type(obj, dataType, nullable=True):
     [...]
         elif isinstance(dataType, MapType):
             for k, v in obj.items():
    -            _verify_type(k, dataType.keyType, False)
    -            _verify_type(v, dataType.valueType, dataType.valueContainsNull)
    +            _verify_type(k, dataType.keyType, False, name)
    +            _verify_type(v, dataType.valueType, dataType.valueContainsNull, name)
    --- End diff --
    
    Might also want to flag individual array/map elements.
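    
    For illustration, the element-level naming being suggested could be derived with a small helper (a sketch with a hypothetical `element_names` function, not code from the PR): append the index or key to the name before each recursive call.
    
    ```python
    # Sketch: derive per-element names so a recursive _verify_type call
    # could report "tags[1]" or "scores['math']" rather than just the
    # containing field's name.
    def element_names(name, data):
        if isinstance(data, list):
            return ["{}[{}]".format(name, i) for i in range(len(data))]
        if isinstance(data, dict):
            return ["{}[{!r}]".format(name, k) for k in data]
        return [name]

    print(element_names("tags", ["a", None, "c"]))  # ['tags[0]', 'tags[1]', 'tags[2]']
    print(element_names("scores", {"math": None}))  # ["scores['math']"]
    ```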




[GitHub] spark issue #17213: [SPARK-19871] [PySpark][SQL] Improve error message in ve...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17213
  
    Let me then resolve this JIRA as a duplicate and reopen @dgingrich's one.




[GitHub] spark issue #17213: [SPARK-19871] [PySpark][SQL] Improve error message in ve...

Posted by blrnw3 <gi...@git.apache.org>.
Github user blrnw3 commented on the issue:

    https://github.com/apache/spark/pull/17213
  
    Ah, yes, it seems `SPARK-19507` is a duplicate. My mistake. In any case, I'm sure this PR satisfies that ticket.




[GitHub] spark issue #17213: [SPARK-19871] [PySpark][SQL] Improve error message in ve...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17213
  
    cc @dgingrich, who I guess is the reporter of SPARK-19507 - what do you think about this?




[GitHub] spark issue #17213: [SPARK-19871] [PySpark][SQL] Improve error message in ve...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17213
  
    Could we deal with SPARK-19507 here as well, if it looks easy to fix together? Also, I think we should run `./dev/lint-python`; it seems some lines do not comply with PEP 8. As a bonus, maybe we could add before/after error messages to the PR description.

