You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by zero323 <gi...@git.apache.org> on 2017/01/10 22:09:23 UTC

[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

GitHub user zero323 opened a pull request:

    https://github.com/apache/spark/pull/16537

    [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__call__ should validate input types

    ## What changes were proposed in this pull request?
    
    Adds basic input validation for `UserDefinedFunction.__call__` to avoid failing with cryptic `Py4J` errors.
    
    ## How was this patch tested?
    
    Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-19165

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16537.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16537
    
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78307/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    I see people run into this kind of things quite a bit.
    sounds like this is important to have. how about reviving some forms of this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    And there is of course a matter of user experience.  Even if failure is cheap, something like this:
    
    ```python
    In [4]: from pyspark.sql.functions import udf
    
    In [5]: udf(lambda x: x)(1)
    ---------------------------------------------------------------------------
    Py4JError                                 Traceback (most recent call last)
    <ipython-input-5-729166e23ad0> in <module>()
    ----> 1 udf(lambda x: x)(1)
    
    ...
    
    Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
    py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
    	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
    	at py4j.Gateway.invoke(Gateway.java:274)
    	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    	at py4j.commands.CallCommand.execute(CallCommand.java:79)
    	at py4j.GatewayConnection.run(GatewayConnection.java:214)
    	at java.lang.Thread.run(Thread.java:745)
    ```
    is not an useful exception for anyone who is not familiar with Spark internals.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16537#discussion_r98955007
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
             row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
             self.assertTrue(row[0].find("people1.json") != -1)
     
    +    def test_udf_should_validate_input_args(self):
    +        from pyspark.sql.functions import udf
    +
    +        self.assertRaises(TypeError, udf(lambda x: x), None)
    --- End diff --
    
    If this is covered by existing tests, then that's fine. Good point.
    
    To validate number of args, I think it is a good idea, as long as we know that it won't fail C extensions (but may be inconclusive).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    I think the overhead of doing this piecemeal removes review time available for more important changes (like places where users are actively encountering confusing error messages, incorrect behaviour, or missing functionality for Scala parity).
    
    As illustrated by your confusion about the purpose of the PR, there are other outstanding PRs adding similar checks in other places and @zero323 is already familiar with the code base hence my suggestion that they look at a more scalable solution (from the point of view of review time).
    
    Of course this is just a personal request that we look at solving this in a less piecemeal way to reduce review overhead (and a biased one at that), but I'll probably triage these issues as less important unless there is a clear link to a user issue - (except from new contributors getting familiar with the code base which is valuable in other ways too).
    
    That being said this discussion has gotten pretty off topic from the point of view of this individual PR and we should maybe move it to the JIRA or lists if we want to continue it (but personally I think we are at an agree to disagree about priorities and no one is obliged to listen to my priorities :)).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #71676 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71676/testReport)** for PR 16537 at commit [`9e0ab69`](https://github.com/apache/spark/commit/9e0ab690ec0b066ec902ff7fcc4d20597a550174).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    I explore an alternative approach, with adding type hints (https://github.com/zero323/pyspark-stubs), but I doubt it'll become particularly popular, and I won't even try to push it to the main repository :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 closed the pull request at:

    https://github.com/apache/spark/pull/16537


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Let me take over this one and credit to @zero323.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Maybe we're at an agree to disagree situation, but I think we may be talking about different things. If you're saying that we should try to keep these together to make reviews easier, I'd agree. I was under the impression that this change may be rejected because it isn't important enough of a problem, which I think isn't a good way of looking at it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16537#discussion_r95681106
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
             row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
             self.assertTrue(row[0].find("people1.json") != -1)
     
    +    def test_udf_should_validate_input_args(self):
    +        from pyspark.sql.functions import udf
    +
    +        self.assertRaises(TypeError, udf(lambda x: x), None)
    --- End diff --
    
    It is pretty well covered by existing `udf` tests. The more the merrier but I am not sure what can be added with duplicating other test cases.
    
    Do you think we should try to some type validation of the number of arguments?
    
    Pros:
    - It is easy to implement with `inspect` or `func.__code__` for plain Python objects. 
    - It is nice to fail without starting a complex job.
    
    Cons:
    - It most likely won't work well for C extensions and such.
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    So I'm curious what the motivation is for adding these checks - looking in the mailing list archives this doesn't seem like a common error (and the only stack overflow post about this I saw was using UDFs inside of a parallelize so I don't think this would have helped). If this is an error people have run into and found confusing though, the change looks pretty reasonable I'm just a bit hesitant to add start doing a lot of type checking PRs for things people aren't running into/having difficulties with.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #78330 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78330/testReport)** for PR 16537 at commit [`d476faf`](https://github.com/apache/spark/commit/d476faf7a9912e4ff93fcb9c567ffc91f21c0512).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Sorry, my example was for validating the object passed to `udf` was callable, not for the use of the UDF. I still think it's a good idea not to make assumptions about how a user makes a mistake. Error checking should be proportional to how difficult it is to find what happened, not how likely it is to happen.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #78307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78307/testReport)** for PR 16537 at commit [`0222c44`](https://github.com/apache/spark/commit/0222c44abe679ba188091eeea871d58870708f71).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    For me it is all about the bigger picture. I've been working with Python for quite a while right now (probably to long for my own good) and I am used to two things:
    
    - Language is relatively forgiving when it comes to types. I am more used to thinking about abstract base classes than concrete types.
    - Language which communicates failures in a clear way. If there is problem with incompatible types or interfaces I expect to receive clear feedback. And of course interactive debugger on top of that if something goes particularly wrong.
    
    PySpark is not there yet. We get Py4j exceptions (albeit it improved a lot in 2.x), we get runtime exceptions with huge JVM tracebacks, when it is possible to fail fast (on the driver), and finally we get silent errors (like returning `int` from and UDF with declared type `float`). 
    
    It is not always possible or practical to avoid these failures but I believe that in cases where:
    
    - We have very strict requirements regarding types.
    - Cost of checking is low (O(1) not for example O(N)).
    - We fail early and prevent an expensive failure in the middle of a pipeline.
    
    it is a good idea to be proactive.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #78330 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78330/testReport)** for PR 16537 at commit [`d476faf`](https://github.com/apache/spark/commit/d476faf7a9912e4ff93fcb9c567ffc91f21c0512).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16537#discussion_r124004237
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1949,6 +1949,14 @@ def _create_judf(self):
             return judf
     
         def __call__(self, *cols):
    +        for c in cols:
    +            if not isinstance(c, (Column, str)):
    --- End diff --
    
    Doesn't this break `unicode` support in Python 2?
    
    ```python
    from pyspark.sql.functions import udf
    udf(lambda x: x)(u"a")
    ```
    
    before
    
    ```
    Column<<lambda>(a)>
    ```
    
    after
    
    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/sql/functions.py", line 1970, in wrapper
        return self(*args)
      File ".../spark/python/pyspark/sql/functions.py", line 1958, in __call__
        "lit, array, struct or create_map.".format(c, type(c)))
    TypeError: Invalid UDF argument, not a str or Column: a of type <type 'unicode'>. For Column literals use sql.functions lit, array, struct or create_map.
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Thanks @HyukjinKwon :D 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16537#discussion_r98978603
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
             row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
             self.assertTrue(row[0].find("people1.json") != -1)
     
    +    def test_udf_should_validate_input_args(self):
    +        from pyspark.sql.functions import udf
    +
    +        self.assertRaises(TypeError, udf(lambda x: x), None)
    --- End diff --
    
    MTE I removed [WIP] and hopefully it will get merged :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    I don't think it is a good idea to think that this has little use because it is a dumb mistake to pass something that isn't callable. In this case, it's easy to accidentally reuse a name for a function and a variable (e.g., `format`), especially as scripts change over time and pass from one maintainer to another.
    
    Spark should have reasonable behavior for any error, as opposed to being harder to work with because we thought the user wasn't likely to hit a particular problem. This is very few lines of code that will make a user's experience much better because it can catch exactly what the problem is, without running a job.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #71164 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71164/testReport)** for PR 16537 at commit [`dd79069`](https://github.com/apache/spark/commit/dd79069ead6e74aebbd148a9b57cb5ae3ae4121d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #72884 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72884/testReport)** for PR 16537 at commit [`c2add7e`](https://github.com/apache/spark/commit/c2add7efd38c47e404d8fbef3798953b58ed1ca9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78313/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16537#discussion_r125755849
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1949,6 +1949,14 @@ def _create_judf(self):
             return judf
     
         def __call__(self, *cols):
    +        for c in cols:
    +            if not isinstance(c, (Column, str)):
    --- End diff --
    
    @HyukjinKwon Sorry for a delayed response, I am seldom online these days. You're right, it looks like an issue. I'll take a look at this, when I have more time


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #71687 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71687/testReport)** for PR 16537 at commit [`7aa1607`](https://github.com/apache/spark/commit/7aa1607d42b45ec52dd7ec7741f043b7c1f18ec7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16537#discussion_r98964355
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
             row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
             self.assertTrue(row[0].find("people1.json") != -1)
     
    +    def test_udf_should_validate_input_args(self):
    +        from pyspark.sql.functions import udf
    +
    +        self.assertRaises(TypeError, udf(lambda x: x), None)
    --- End diff --
    
    Yeah. I am afraid it can actually cause more troubles than its worth:
    - If we throw an exception there is a chance we hit some border cases.
    - Issuing a warning doesn't prevent task failure so it doesn't provide the same advantages as failing early.
    
    Maybe it is better to leave it as is. Right now users get a clear feedback, if there is an incorrect type, and for additional safety one can always use annotations and type checker.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Yeah, I thought this was the other PR that validates the function is callable. Still, I don't agree that it's okay for python to be less friendly as long as we don't think people will hit the problem too much or because they solve the problem before asking a list. There are reasonable ways to hit this and Spark should give a good explanation about what went wrong.
    
    I'm not saying we have to go fix all of the cases like this, but where there's a PR ready to go I think it's better to include it than not.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    @holdenk I'll try to reproduce this problem but it looks a bit awkward:
    
    > AttributeError: 'function' object has no attribute '__closure__'
    
    Doesn't look like something related to this PR at all 😕


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    If you have the time to update/fix this @zero323 I'm happy to merge it pending jenkins, otherwise I'll just close the issue at the end of the month.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71676/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78315/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Thanks @HyukjinKwon 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    @zero323 Hi, are you still working on this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71687/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #73936 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73936/testReport)** for PR 16537 at commit [`0222c44`](https://github.com/apache/spark/commit/0222c44abe679ba188091eeea871d58870708f71).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78330/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71164/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    @rdblue i think we're maybe understanding different type checks. My understanding is in this case the error is already thrown right away. It's also not that the user needs to pass a callable here, were checking that the UDF is called with column or strings as arguments.
    
    I agree the current error message is a bit obtuse to new users, but I also don't want us to go adding type checking to every individual function wrapper to Py4J in another PR - that just isn't scalable given the level of committer bandwidth available in Python. I'm think we should prioritize issues that users have run into or figure out a more scalable way to solve this (either bigger batches of cleanups, improved explanations of Py4J errors possible causes, or type hints - although as zero323 points out that's a larger discussion). (And for this type check I didn't see any posts related to it).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #71207 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71207/testReport)** for PR 16537 at commit [`a94f755`](https://github.com/apache/spark/commit/a94f75585bbe60b520a9d94c403aeae44d7dbda4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    @holdenk If you don't see this merged could your resolve the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-19165) and I'll just close the PR? No reason to keep this open ad infinitum :) TIA


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    I cannot reproduce this locally, but do we really use `pypy-2.0.2`? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    So it seems there isn't a solid reason not to merge this provided we aren't going to go down the rabbit whole we've been talking about. Lets make sure everything is still ok with Jenkins still (Jenkins Retest This Please).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16537#discussion_r95486774
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1848,6 +1848,12 @@ def __del__(self):
                 self._broadcast = None
     
         def __call__(self, *cols):
    +        for c in cols:
    +            if not isinstance(c, (Column, str)):
    +                raise TypeError(
    +                    "All arguments should be Columns or strings representing column names. "
    --- End diff --
    
    A literal ends up being a `Column`, aren't the others as well?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...

Posted by zero323 <gi...@git.apache.org>.
GitHub user zero323 reopened a pull request:

    https://github.com/apache/spark/pull/16537

    [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ should validate input types

    ## What changes were proposed in this pull request?
    
    Adds basic input validation for `UserDefinedFunction.__call__` to avoid failing with cryptic `Py4J` errors.
    
    ## How was this patch tested?
    
    Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-19165

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16537.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16537
    
----
commit d476faf7a9912e4ff93fcb9c567ffc91f21c0512
Author: zero323 <ze...@users.noreply.github.com>
Date:   2017-06-20T20:42:57Z

    Validate types in UserDefinedFunction.__call__

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #72826 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72826/testReport)** for PR 16537 at commit [`60bde4e`](https://github.com/apache/spark/commit/60bde4ee30e554d2effd66730a5d1b2cd1b8b084).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #71207 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71207/testReport)** for PR 16537 at commit [`a94f755`](https://github.com/apache/spark/commit/a94f75585bbe60b520a9d94c403aeae44d7dbda4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #72884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72884/testReport)** for PR 16537 at commit [`c2add7e`](https://github.com/apache/spark/commit/c2add7efd38c47e404d8fbef3798953b58ed1ca9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71207/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #78315 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78315/testReport)** for PR 16537 at commit [`d476faf`](https://github.com/apache/spark/commit/d476faf7a9912e4ff93fcb9c567ffc91f21c0512).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16537#discussion_r95486026
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1848,6 +1848,12 @@ def __del__(self):
                 self._broadcast = None
     
         def __call__(self, *cols):
    +        for c in cols:
    +            if not isinstance(c, (Column, str)):
    +                raise TypeError(
    +                    "All arguments should be Columns or strings representing column names. "
    --- End diff --
    
    Sounds good. I think it should also provide a suggestion that one can use literals (`lit`, `array`, `struct` and `create_map`).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 closed the pull request at:

    https://github.com/apache/spark/pull/16537


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #78307 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78307/testReport)** for PR 16537 at commit [`0222c44`](https://github.com/apache/spark/commit/0222c44abe679ba188091eeea871d58870708f71).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16537#discussion_r95487347
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1848,6 +1848,12 @@ def __del__(self):
                 self._broadcast = None
     
         def __call__(self, *cols):
    +        for c in cols:
    +            if not isinstance(c, (Column, str)):
    +                raise TypeError(
    +                    "All arguments should be Columns or strings representing column names. "
    --- End diff --
    
    Yes, it is. My point is it is not always obvious for new users. Giving a hint that there is such thing as literals could be a good idea.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73936/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Putting this particular PR and the scalability of the improvement process aside, Spark is heavily  underdocumented. This is something that hits Python and R users way more than everyone else. In the worst case scenario when working with Scala you can just follow the types. It wouldn't be a problem if used consistent conventions, idiomatic Python and didn't make hidden assumptions once in a while :) Take things like `DataFrame.replace` or some parts of the `DStream` API (can you point out places when expect `function` not a `Callable`) for example. I am not really a stakeholder here but i really believe that small things like this are crucial to create decent user experience.
    
    In hindsight, I overdid with the number of tasks but on my defense I am pretty sure that at least some of these won't get merged. Moreover problem is not imaginary. For many users it is not obvious how to use udfs (http://stackoverflow.com/q/35546576, http://stackoverflow.com/q/35375255, http://stackoverflow.com/q/39254503) and docstring of `udf` is actually the only documented example I found.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #72953 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72953/testReport)** for PR 16537 at commit [`85e4a7c`](https://github.com/apache/spark/commit/85e4a7c284ab8c4cd7b5e6093b6fee8027a05781).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Ah perhaps then we are simply argeeing with each-other. I'm fine with adding these types of fixes - but doing it one function at a time is just going to be too time consuming and distracting from other more useful changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #78315 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78315/testReport)** for PR 16537 at commit [`d476faf`](https://github.com/apache/spark/commit/d476faf7a9912e4ff93fcb9c567ffc91f21c0512).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16537#discussion_r95484684
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
             row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
             self.assertTrue(row[0].find("people1.json") != -1)
     
    +    def test_udf_should_validate_input_args(self):
    +        from pyspark.sql.functions import udf
    +
    +        self.assertRaises(TypeError, udf(lambda x: x), None)
    --- End diff --
    
    I think this should have positive tests for a column and a string as well as a negative test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #71164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71164/testReport)** for PR 16537 at commit [`dd79069`](https://github.com/apache/spark/commit/dd79069ead6e74aebbd148a9b57cb5ae3ae4121d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #71687 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71687/testReport)** for PR 16537 at commit [`7aa1607`](https://github.com/apache/spark/commit/7aa1607d42b45ec52dd7ec7741f043b7c1f18ec7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #71676 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71676/testReport)** for PR 16537 at commit [`9e0ab69`](https://github.com/apache/spark/commit/9e0ab690ec0b066ec902ff7fcc4d20597a550174).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72953/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #72244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72244/testReport)** for PR 16537 at commit [`5a25ee2`](https://github.com/apache/spark/commit/5a25ee2ca9705455e226a26df8031db9f93abe91).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #78313 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78313/testReport)** for PR 16537 at commit [`0086723`](https://github.com/apache/spark/commit/0086723af03871819b60360b734b0539bc038b03).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16537#discussion_r95484969
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1848,6 +1848,12 @@ def __del__(self):
                 self._broadcast = None
     
         def __call__(self, *cols):
    +        for c in cols:
    +            if not isinstance(c, (Column, str)):
    +                raise TypeError(
    +                    "All arguments should be Columns or strings representing column names. "
    --- End diff --
    
    "All arguments" is a little vague, since this is going to be called later in code than the UDF definition. What about an error message like "Invalid UDF argument, not a String or Column: {0} of type {1}"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72244/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72884/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #72953 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72953/testReport)** for PR 16537 at commit [`85e4a7c`](https://github.com/apache/spark/commit/85e4a7c284ab8c4cd7b5e6093b6fee8027a05781).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #72244 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72244/testReport)** for PR 16537 at commit [`5a25ee2`](https://github.com/apache/spark/commit/5a25ee2ca9705455e226a26df8031db9f93abe91).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #78313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78313/testReport)** for PR 16537 at commit [`0086723`](https://github.com/apache/spark/commit/0086723af03871819b60360b734b0539bc038b03).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFuncti...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16537#discussion_r98970775
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -429,6 +429,11 @@ def test_udf_with_input_file_name(self):
             row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
             self.assertTrue(row[0].find("people1.json") != -1)
     
    +    def test_udf_should_validate_input_args(self):
    +        from pyspark.sql.functions import udf
    +
    +        self.assertRaises(TypeError, udf(lambda x: x), None)
    --- End diff --
    
    Sounds good. It's probably worth exploring eventually, but there's no need to hold up this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #73936 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73936/testReport)** for PR 16537 at commit [`0222c44`](https://github.com/apache/spark/commit/0222c44abe679ba188091eeea871d58870708f71).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    **[Test build #72826 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72826/testReport)** for PR 16537 at commit [`60bde4e`](https://github.com/apache/spark/commit/60bde4ee30e554d2effd66730a5d1b2cd1b8b084).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    I definitely think moving errors earlier is important (nothing is worse than a 9 hour job that fails in the middle because of the wrong type). That being said in this case the error isn't caught any earlier though. I think doing this selectively in the cases where it makes sense would be good, I just don't want to spend a lot of time adding type checks in places where they don't give us a lot.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16537
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72826/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org