You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by zero323 <gi...@git.apache.org> on 2017/01/10 21:57:15 UTC

[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL][WIP] Delay _judf initi...

GitHub user zero323 opened a pull request:

    https://github.com/apache/spark/pull/16536

    [SPARK-19163][PYTHON][SQL][WIP] Delay _judf initialization to the __call__

    ## What changes were proposed in this pull request?
    
    Defer `UserDefinedFunction._judf` initialization to the first call. This prevents unintended `SparkSession` initialization. 
    
    ## How was this patch tested?
    
    TODO

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-19163

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16536.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16536
    
----
commit 2ca2557f0a31f3007640c12fa3001cc9a868cbb4
Author: zero323 <ze...@users.noreply.github.com>
Date:   2017-01-07T17:31:36Z

    Delay _judf initialization to the __call__

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72034 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72034/testReport)** for PR 16536 at commit [`a5567b2`](https://github.com/apache/spark/commit/a5567b283b6e80a42eaff92b0143969c5cbbc87f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98542899
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
         def __init__(self, func, returnType, name=None):
             self.func = func
             self.returnType = returnType
    -        self._judf = self._create_judf(name)
    -
    -    def _create_judf(self, name):
    +        self._judf_placeholder = None
    +        self._name = name or (
    +            func.__name__ if hasattr(func, '__name__')
    +            else func.__class__.__name__)
    +
    +    @property
    +    def _judf(self):
    +        if self._judf_placeholder is None:
    --- End diff --
    
    @holdenk Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    @zero323 that sounds like a good improvement.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72170 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72170/testReport)** for PR 16536 at commit [`ee953a9`](https://github.com/apache/spark/commit/ee953a90bc3872e25b0e06a01ae44709d98c520a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class UDFInitializationTests(unittest.TestCase):`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL][WIP] Delay _judf initializati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #71254 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71254/testReport)** for PR 16536 at commit [`c936813`](https://github.com/apache/spark/commit/c936813960dc554941ae93bb196b78b85275e18e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL][WIP] Delay _judf initializati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #71163 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71163/testReport)** for PR 16536 at commit [`2ca2557`](https://github.com/apache/spark/commit/2ca2557f0a31f3007640c12fa3001cc9a868cbb4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    @holdenk Done :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r97951062
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
         def __init__(self, func, returnType, name=None):
             self.func = func
             self.returnType = returnType
    -        self._judf = self._create_judf(name)
    -
    -    def _create_judf(self, name):
    +        self._judf_placeholder = None
    +        self._name = name or (
    +            func.__name__ if hasattr(func, '__name__')
    +            else func.__class__.__name__)
    +
    +    @property
    +    def _judf(self):
    +        if self._judf_placeholder is None:
    --- End diff --
    
    Could you elaborate a bit? I am not sure if I understand the issue. 
    
    Assignment is atomic (so we don't have to worry about corruption), for any practical purpose operation is idempotent (we can return expressions using different Java objects but as far as I am concerned this is just a detail of implementation), access to Py4J is thread safe and as far as I remember  function registries are synchronized. I there any issue i missed here?
    
    Thanks for looking into this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Going to go ahead and merge. Still need to sort out the JIRA permissions so will take a bit for me to get that updated for you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    +1
    
    Looks good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98090756
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -476,6 +476,19 @@ def test_udf_defers_judf_initalization(self):
                 "judf should be initialized after UDF has been called."
             )
     
    +    @unittest.skipIf(sys.version_info < (3, 3), "Unittest < 3.3 doesn't support mocking")
    --- End diff --
    
    You can create a testcase without the Spark base and verify that creating a UDF doesn't create a SparkContext. This does not require making a SparkContext.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98804548
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1826,23 +1826,35 @@ class UserDefinedFunction(object):
         def __init__(self, func, returnType, name=None):
             self.func = func
             self.returnType = returnType
    -        self._judf = self._create_judf(name)
    -
    -    def _create_judf(self, name):
    +        self._judf_placeholder = None
    +        self._name = name or (
    +            func.__name__ if hasattr(func, '__name__')
    +            else func.__class__.__name__)
    +
    +    @property
    +    def _judf(self):
    +        # It is possible that concurent access, to newly created UDF,
    +        # will initalize multiple UserDefinedPythonFunctions.
    +        # This is unlikely, doesn't affect correctness,
    +        # and should have a minimal performance impact.
    +        if self._judf_placeholder is None:
    +            self._judf_placeholder = self._create_judf()
    +        return self._judf_placeholder
    +
    +    def _create_judf(self):
             from pyspark.sql import SparkSession
    +
             sc = SparkContext.getOrCreate()
    -        wrapped_func = _wrap_function(sc, self.func, self.returnType)
             spark = SparkSession.builder.getOrCreate()
    +
    +        wrapped_func = _wrap_function(sc, self.func, self.returnType)
             jdt = spark._jsparkSession.parseDataType(self.returnType.json())
    -        if name is None:
    -            f = self.func
    -            name = f.__name__ if hasattr(f, '__name__') else f.__class__.__name__
             judf = sc._jvm.org.apache.spark.sql.execution.python.UserDefinedPythonFunction(
    -            name, wrapped_func, jdt)
    +            self._name, wrapped_func, jdt)
             return judf
     
         def __call__(self, *cols):
    -        sc = SparkContext._active_spark_context
    +        sc = SparkContext.getOrCreate()
    --- End diff --
    
    Though I am not sure if it really matters here. If `_instantiatedContext` is not None we'll do the same thing, otherwise we fall back to initialization in `_judf`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL][WIP] Delay _judf initializati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71254/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98055966
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
         def __init__(self, func, returnType, name=None):
             self.func = func
             self.returnType = returnType
    -        self._judf = self._create_judf(name)
    -
    -    def _create_judf(self, name):
    +        self._judf_placeholder = None
    +        self._name = name or (
    +            func.__name__ if hasattr(func, '__name__')
    +            else func.__class__.__name__)
    +
    +    @property
    +    def _judf(self):
    +        if self._judf_placeholder is None:
    --- End diff --
    
    I stress tested this a bit and I haven't found any abnormalities but I found a small problem with `__call__` on the way.  Fixed now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r97910321
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
         def __init__(self, func, returnType, name=None):
             self.func = func
             self.returnType = returnType
    -        self._judf = self._create_judf(name)
    -
    -    def _create_judf(self, name):
    +        self._judf_placeholder = None
    +        self._name = name or (
    +            func.__name__ if hasattr(func, '__name__')
    +            else func.__class__.__name__)
    +
    +    @property
    +    def _judf(self):
    +        if self._judf_placeholder is None:
    --- End diff --
    
    So there isn't any lock around this - I suspect we aren't too likely to have concurrent calls to this - but just to be safe we should maybe think through what would happen if this is does happen (and then leave a comment about it)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72039/testReport)** for PR 16536 at commit [`29ffc57`](https://github.com/apache/spark/commit/29ffc57a4436ef6a7ca391bd5f463cf4c4ab950f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98093311
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -476,6 +476,19 @@ def test_udf_defers_judf_initalization(self):
                 "judf should be initialized after UDF has been called."
             )
     
    +    @unittest.skipIf(sys.version_info < (3, 3), "Unittest < 3.3 doesn't support mocking")
    --- End diff --
    
    ```python
    class UDFInitializationTestCase(unittest.TestCase):
        def tearDown(self):
            if SparkSession._instantiatedSession is not None:
                SparkSession._instantiatedSession.stop()
    
            if SparkContext._active_spark_context is not None:
                SparkContext._active_spark_contex.stop()
    
        def test_udf_context_access(self):
            from pyspark.sql.functions import UserDefinedFunction
            f = UserDefinedFunction(lambda x: x, StringType())
            self.assertIsNone(SparkContext._active_spark_context)
            self.assertIsNone(SparkSession._instantiatedSession)
                
    ```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98802783
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1826,23 +1826,35 @@ class UserDefinedFunction(object):
         def __init__(self, func, returnType, name=None):
             self.func = func
             self.returnType = returnType
    -        self._judf = self._create_judf(name)
    -
    -    def _create_judf(self, name):
    +        self._judf_placeholder = None
    +        self._name = name or (
    +            func.__name__ if hasattr(func, '__name__')
    +            else func.__class__.__name__)
    +
    +    @property
    +    def _judf(self):
    +        # It is possible that concurent access, to newly created UDF,
    +        # will initalize multiple UserDefinedPythonFunctions.
    +        # This is unlikely, doesn't affect correctness,
    +        # and should have a minimal performance impact.
    +        if self._judf_placeholder is None:
    +            self._judf_placeholder = self._create_judf()
    +        return self._judf_placeholder
    +
    +    def _create_judf(self):
             from pyspark.sql import SparkSession
    +
             sc = SparkContext.getOrCreate()
    -        wrapped_func = _wrap_function(sc, self.func, self.returnType)
             spark = SparkSession.builder.getOrCreate()
    +
    +        wrapped_func = _wrap_function(sc, self.func, self.returnType)
             jdt = spark._jsparkSession.parseDataType(self.returnType.json())
    -        if name is None:
    -            f = self.func
    -            name = f.__name__ if hasattr(f, '__name__') else f.__class__.__name__
             judf = sc._jvm.org.apache.spark.sql.execution.python.UserDefinedPythonFunction(
    -            name, wrapped_func, jdt)
    +            self._name, wrapped_func, jdt)
             return judf
     
         def __call__(self, *cols):
    -        sc = SparkContext._active_spark_context
    +        sc = SparkContext.getOrCreate()
    --- End diff --
    
    @holdenk Sounds reasonable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    @holdenk I have one more suggestion. Shouldn't we replace
    
    ```python
        def _create_judf(self):
            from pyspark.sql import SparkSession
    
            sc = SparkContext.getOrCreate()
            spark = SparkSession.builder.getOrCreate()
    ```
    
    with 
    
    ```python
        def _create_judf(self):
            from pyspark.sql import SparkSession
    
            spark = SparkSession.builder.getOrCreate()
            sc = spark.sparkContext
    ```
    
    I left it as is but I think it could be cleaner. 
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r97908047
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -459,6 +459,23 @@ def filename(path):
             row2 = df2.select(sameText(df2['file'])).first()
             self.assertTrue(row2[0].find("people.json") != -1)
     
    +    def test_udf_defers_judf_initalization(self):
    +        from pyspark.sql.functions import UserDefinedFunction
    +
    +        f = UserDefinedFunction(lambda x: x, StringType())
    +
    +        self.assertIsNone(
    +            f._judf_placeholder,
    +            "judf should not be initialized before the first call."
    +        )
    +
    +        self.assertTrue(isinstance(f("foo"), Column), "UDF call should return a Column.")
    --- End diff --
    
    there is a `assertIsInstance` function that could simplify this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98801781
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1826,23 +1826,35 @@ class UserDefinedFunction(object):
         def __init__(self, func, returnType, name=None):
             self.func = func
             self.returnType = returnType
    -        self._judf = self._create_judf(name)
    -
    -    def _create_judf(self, name):
    +        self._judf_placeholder = None
    +        self._name = name or (
    +            func.__name__ if hasattr(func, '__name__')
    +            else func.__class__.__name__)
    +
    +    @property
    +    def _judf(self):
    +        # It is possible that concurent access, to newly created UDF,
    +        # will initalize multiple UserDefinedPythonFunctions.
    +        # This is unlikely, doesn't affect correctness,
    +        # and should have a minimal performance impact.
    +        if self._judf_placeholder is None:
    +            self._judf_placeholder = self._create_judf()
    +        return self._judf_placeholder
    +
    +    def _create_judf(self):
             from pyspark.sql import SparkSession
    +
             sc = SparkContext.getOrCreate()
    -        wrapped_func = _wrap_function(sc, self.func, self.returnType)
             spark = SparkSession.builder.getOrCreate()
    +
    +        wrapped_func = _wrap_function(sc, self.func, self.returnType)
             jdt = spark._jsparkSession.parseDataType(self.returnType.json())
    -        if name is None:
    -            f = self.func
    -            name = f.__name__ if hasattr(f, '__name__') else f.__class__.__name__
             judf = sc._jvm.org.apache.spark.sql.execution.python.UserDefinedPythonFunction(
    -            name, wrapped_func, jdt)
    +            self._name, wrapped_func, jdt)
             return judf
     
         def __call__(self, *cols):
    -        sc = SparkContext._active_spark_context
    +        sc = SparkContext.getOrCreate()
    --- End diff --
    
    So by switching this to `getOrCreate` we put a lock acquisition in the path of `__call__` which is maybe not ideal. We could maybe fix this by getting `_judf` first (e.g. `judf = self._judf`)? (Although it should be a mostly uncontended lock so it shouldn't be that bad, but if we ended up having a multi-threaded PySpark DataFrame UDF application this could maybe degrade a little bit).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #71679 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71679/testReport)** for PR 16536 at commit [`57735b2`](https://github.com/apache/spark/commit/57735b2a554235cbcc47261660795f65f4f2d238).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72034 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72034/testReport)** for PR 16536 at commit [`a5567b2`](https://github.com/apache/spark/commit/a5567b283b6e80a42eaff92b0143969c5cbbc87f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72170 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72170/testReport)** for PR 16536 at commit [`ee953a9`](https://github.com/apache/spark/commit/ee953a90bc3872e25b0e06a01ae44709d98c520a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL][WIP] Delay _judf initializati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71163/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72173/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72173 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72173/testReport)** for PR 16536 at commit [`1a8280d`](https://github.com/apache/spark/commit/1a8280d7ff18388969aa25c26ea6e4d54ddb7495).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r97907507
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
         def __init__(self, func, returnType, name=None):
             self.func = func
             self.returnType = returnType
    -        self._judf = self._create_judf(name)
    -
    -    def _create_judf(self, name):
    +        self._judf_placeholder = None
    --- End diff --
    
    Maybe add a comment explaining the purposes of this, just for future readers of the code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Thanks a bunch @holdenk


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72218 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72218/testReport)** for PR 16536 at commit [`cb496f3`](https://github.com/apache/spark/commit/cb496f3e5384f822b7e254a808af3ddae2dd6ff6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98048118
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
         def __init__(self, func, returnType, name=None):
             self.func = func
             self.returnType = returnType
    -        self._judf = self._create_judf(name)
    -
    -    def _create_judf(self, name):
    +        self._judf_placeholder = None
    +        self._name = name or (
    +            func.__name__ if hasattr(func, '__name__')
    +            else func.__class__.__name__)
    +
    +    @property
    +    def _judf(self):
    +        if self._judf_placeholder is None:
    --- End diff --
    
    @rdblue I get this part, and this is a possible scenario. Question is if this justifies preventive lock. As far as I am aware there should be no correctness issues here. `SparkSession` already locks during initialization so we are safe there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72021 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72021/testReport)** for PR 16536 at commit [`16245e4`](https://github.com/apache/spark/commit/16245e4042242ee05f5e51ca6d7a02db74cde022).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #71351 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71351/testReport)** for PR 16536 at commit [`641ec1d`](https://github.com/apache/spark/commit/641ec1d2ea7f5cf4b401423ba5a690f94eee1bb1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72219 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72219/testReport)** for PR 16536 at commit [`9332da3`](https://github.com/apache/spark/commit/9332da36c3ddff988531e408b3b2d194150df6db).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98049505
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
         def __init__(self, func, returnType, name=None):
             self.func = func
             self.returnType = returnType
    -        self._judf = self._create_judf(name)
    -
    -    def _create_judf(self, name):
    +        self._judf_placeholder = None
    +        self._name = name or (
    +            func.__name__ if hasattr(func, '__name__')
    +            else func.__class__.__name__)
    +
    +    @property
    +    def _judf(self):
    +        if self._judf_placeholder is None:
    --- End diff --
    
    Yeah, I don't think it should require a lock. I think concurrent calls are very unlikely and safe.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72222 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72222/testReport)** for PR 16536 at commit [`923b88d`](https://github.com/apache/spark/commit/923b88d03036a8ba799325b62da241c17083f649).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #71688 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71688/testReport)** for PR 16536 at commit [`abb3726`](https://github.com/apache/spark/commit/abb37269b3bbde6dd6e9ae4fdaa211a9bcf46ca9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL][WIP] Delay _judf initializati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #71163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71163/testReport)** for PR 16536 at commit [`2ca2557`](https://github.com/apache/spark/commit/2ca2557f0a31f3007640c12fa3001cc9a868cbb4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98077102
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
         def __init__(self, func, returnType, name=None):
             self.func = func
             self.returnType = returnType
    -        self._judf = self._create_judf(name)
    -
    -    def _create_judf(self, name):
    +        self._judf_placeholder = None
    +        self._name = name or (
    +            func.__name__ if hasattr(func, '__name__')
    +            else func.__class__.__name__)
    +
    +    @property
    +    def _judf(self):
    +        if self._judf_placeholder is None:
    --- End diff --
    
    Ok, I'd maybe just leave a comment saying that we've left out the lock since double creation is both unlikely and OK.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72219/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72034/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72172 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72172/testReport)** for PR 16536 at commit [`489ef54`](https://github.com/apache/spark/commit/489ef5437d89b795903565e9be56d73af49adcfa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class UDFInitializationTests(unittest.TestCase):`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71679/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #71688 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71688/testReport)** for PR 16536 at commit [`abb3726`](https://github.com/apache/spark/commit/abb37269b3bbde6dd6e9ae4fdaa211a9bcf46ca9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL][WIP] Delay _judf initializati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72222 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72222/testReport)** for PR 16536 at commit [`923b88d`](https://github.com/apache/spark/commit/923b88d03036a8ba799325b62da241c17083f649).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98043331
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1826,19 +1826,27 @@ class UserDefinedFunction(object):
         def __init__(self, func, returnType, name=None):
             self.func = func
             self.returnType = returnType
    -        self._judf = self._create_judf(name)
    -
    -    def _create_judf(self, name):
    +        self._judf_placeholder = None
    +        self._name = name or (
    +            func.__name__ if hasattr(func, '__name__')
    +            else func.__class__.__name__)
    +
    +    @property
    +    def _judf(self):
    +        if self._judf_placeholder is None:
    --- End diff --
    
    I think @holdenk's concern is that this would allow concurrent calls to `_create_udf`. That would create two `UserDefinedPythonFunction` objects, but I don't see anything on the Scala side that is concerning about that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL][WIP] Delay _judf initializati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71688/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL][WIP] Delay _judf initializati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #71254 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71254/testReport)** for PR 16536 at commit [`c936813`](https://github.com/apache/spark/commit/c936813960dc554941ae93bb196b78b85275e18e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72172/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71350/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #71679 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71679/testReport)** for PR 16536 at commit [`57735b2`](https://github.com/apache/spark/commit/57735b2a554235cbcc47261660795f65f4f2d238).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72218/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72039/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r97951125
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -459,6 +459,23 @@ def filename(path):
             row2 = df2.select(sameText(df2['file'])).first()
             self.assertTrue(row2[0].find("people.json") != -1)
     
    +    def test_udf_defers_judf_initalization(self):
    --- End diff --
    
    I thought about it but I have this impression, maybe incorrect, that we avoid creating new contexts to keep total execution time manageable. If you think this justifies a separate  `TestCase` I am more than fine with that ([SPARK-19224](https://issues.apache.org/jira/browse/SPARK-19224) and [[PYSPARK] Python tests organization ](http://apache-spark-developers-list.1001551.n3.nabble.com/PYSPARK-Python-tests-organization-td20542.html), right?).
    
    If not, we could mock this, and put assert on the number of calls.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71351/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #71350 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71350/testReport)** for PR 16536 at commit [`b124bb4`](https://github.com/apache/spark/commit/b124bb4c7639e2cf7a7cb9efbea2afbb79bab321).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98084776
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -476,6 +476,19 @@ def test_udf_defers_judf_initalization(self):
                 "judf should be initialized after UDF has been called."
             )
     
    +    @unittest.skipIf(sys.version_info < (3, 3), "Unittest < 3.3 doesn't support mocking")
    --- End diff --
    
    @holdenk I believe that for a full test `udf` would have to create `SparkContext`. But mock is cheap.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72219 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72219/testReport)** for PR 16536 at commit [`9332da3`](https://github.com/apache/spark/commit/9332da36c3ddff988531e408b3b2d194150df6db).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72218 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72218/testReport)** for PR 16536 at commit [`cb496f3`](https://github.com/apache/spark/commit/cb496f3e5384f822b7e254a808af3ddae2dd6ff6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72170/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    The changes look good to me, I'll take a quick pass at the formatting to make sure - but otherwise I'll try and merge this tomorrow :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Great, I'll wait for jenkins then :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98543788
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -1947,6 +1968,29 @@ def test_sparksession_with_stopped_sparkcontext(self):
             df.collect()
     
     
    +class UDFInitializationTests(unittest.TestCase):
    --- End diff --
    
    And added a separate test case checking `SparkContext` and `SparkSession` state.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98543154
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -459,6 +459,23 @@ def filename(path):
             row2 = df2.select(sameText(df2['file'])).first()
             self.assertTrue(row2[0].find("people.json") != -1)
     
    +    def test_udf_defers_judf_initalization(self):
    --- End diff --
    
    @holdenk Separate case it is. As long as implementation is correct an overhead is negligible. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98543649
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -459,6 +459,23 @@ def filename(path):
             row2 = df2.select(sameText(df2['file'])).first()
             self.assertTrue(row2[0].find("people.json") != -1)
     
    +    def test_udf_defers_judf_initalization(self):
    --- End diff --
    
    I'll these tests, to make sure that `_judf` is initialized when necessary.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72172 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72172/testReport)** for PR 16536 at commit [`489ef54`](https://github.com/apache/spark/commit/489ef5437d89b795903565e9be56d73af49adcfa).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72021 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72021/testReport)** for PR 16536 at commit [`16245e4`](https://github.com/apache/spark/commit/16245e4042242ee05f5e51ca6d7a02db74cde022).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72021/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72222/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98091656
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -476,6 +476,19 @@ def test_udf_defers_judf_initalization(self):
                 "judf should be initialized after UDF has been called."
             )
     
    +    @unittest.skipIf(sys.version_info < (3, 3), "Unittest < 3.3 doesn't support mocking")
    --- End diff --
    
    @holdenk  Do you mean something like `SparkContext._active_spark_context is None`? Then we need to make sure it is tear down if it was initialized after all, right? Isn't mocking cleaner?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r98077397
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -459,6 +459,23 @@ def filename(path):
             row2 = df2.select(sameText(df2['file'])).first()
             self.assertTrue(row2[0].find("people.json") != -1)
     
    +    def test_udf_defers_judf_initalization(self):
    --- End diff --
    
    I think a seperate test case and would able to be pretty light weight since it doesn't need to create a SparkContext or anything which traditionally takes longer to set up. What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72173 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72173/testReport)** for PR 16536 at commit [`1a8280d`](https://github.com/apache/spark/commit/1a8280d7ff18388969aa25c26ea6e4d54ddb7495).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #72039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72039/testReport)** for PR 16536 at commit [`29ffc57`](https://github.com/apache/spark/commit/29ffc57a4436ef6a7ca391bd5f463cf4c4ab950f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16536#discussion_r97907785
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -459,6 +459,23 @@ def filename(path):
             row2 = df2.select(sameText(df2['file'])).first()
             self.assertTrue(row2[0].find("people.json") != -1)
     
    +    def test_udf_defers_judf_initalization(self):
    --- End diff --
    
    This seems like a good test but maybe a bit too focused on testing the implementation specifics?
    
    Maybe it might more sense to also have a test which verifies creating a UDF doesn't create a SparkSession since that is the intended purposes (we don't really care about delaying the initialization of _judfy that much per-se but we do care about verifying that we don't eagerly create the SparkSession on import). What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #71350 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71350/testReport)** for PR 16536 at commit [`b124bb4`](https://github.com/apache/spark/commit/b124bb4c7639e2cf7a7cb9efbea2afbb79bab321).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initialization to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16536
  
    **[Test build #71351 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71351/testReport)** for PR 16536 at commit [`641ec1d`](https://github.com/apache/spark/commit/641ec1d2ea7f5cf4b401423ba5a690f94eee1bb1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16536: [SPARK-19163][PYTHON][SQL] Delay _judf initializa...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16536


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org