You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2017/10/14 14:10:59 UTC

[GitHub] spark pull request #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to av...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/19498

    [SPARK-17756][PYTHON][STREAMING] Workaround to avoid return type mismatch in PythonTransformFunction

    ## What changes were proposed in this pull request?
    
    This PR proposes to wrap the transformed rdd within `TransformFunction`. `PythonTransformFunction` looks requiring to return `JavaRDD` in `_jrdd`.
    
    https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/python/pyspark/streaming/util.py#L67
    
    https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/streaming/src/main/scala/org/apache/spark/streaming/api/python/PythonDStream.scala#L43
    
    However, this could be `JavaPairRDD` by some APIs, for example, `zip` in PySpark's RDD API.
    `_jrdd` could be checked as below:
    
    ```python
    >>> rdd.zip(rdd)._jrdd.getClass().toString()
    u'class org.apache.spark.api.java.JavaPairRDD'
    ```
    
    So, here, I wrapped it with `map` so that it ensures returning `JavaRDD`.
    
    ```python
    >>> rdd.zip(rdd).map(lambda x: x)._jrdd.getClass().toString()
    u'class org.apache.spark.api.java.JavaRDD'
    ```
    
    I tried to elaborate some failure cases as below:
    
    ```python
    from pyspark.streaming import StreamingContext
    ssc = StreamingContext(spark.sparkContext, 10)
    ssc.queueStream([sc.range(10)]) \
        .transform(lambda rdd: rdd.cartesian(rdd)) \
        .pprint()
    ssc.start()
    ```
    
    ```python
    from pyspark.streaming import StreamingContext
    ssc = StreamingContext(spark.sparkContext, 10)
    ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.cartesian(rdd))
    ssc.start()
    ```
    
    ```python
    from pyspark.streaming import StreamingContext
    ssc = StreamingContext(spark.sparkContext, 10)
    ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd))
    ssc.start()
    ```
    
    ```python
    from pyspark.streaming import StreamingContext
    ssc = StreamingContext(spark.sparkContext, 10)
    ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd).union(rdd.zip(rdd)))
    ssc.start()
    ```
    
    ```python
    from pyspark.streaming import StreamingContext
    ssc = StreamingContext(spark.sparkContext, 10)
    ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd).coalesce(1))
    ssc.start()
    ```
    
    ## How was this patch tested?
    
    Unit tests were added in `python/pyspark/streaming/tests.py` and manually tested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-17756

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19498.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19498
    
----
commit f5a2a884d860e9c8b3f98fc4ae5f10eaf3c1a0a4
Author: hyukjinkwon <gu...@gmail.com>
Date:   2017-10-14T13:50:49Z

    Workaround to avoid return type mispatch in PythonTransformFunction

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Would it maybe make sense to fix this in the Scala side and accept `AbstractJavaRDDLike` instead which both JavaPairRDD and JavaRDD extend?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90371/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #83997 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83997/testReport)** for PR 19498 at commit [`174ec21`](https://github.com/apache/spark/commit/174ec2139a7e0af049e2954494525fd3fff145e2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #83997 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83997/testReport)** for PR 19498 at commit [`174ec21`](https://github.com/apache/spark/commit/174ec2139a7e0af049e2954494525fd3fff145e2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83995/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Thanks @rxin, @tdas, @holdenk and @felixcheung.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85184/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    gentle ping @zsxwing, @rxin, @tdas and @holdenk


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Mind if I go ahead, @rxin, @tdas and @holdenk?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #83996 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83996/testReport)** for PR 19498 at commit [`dc8446a`](https://github.com/apache/spark/commit/dc8446ad99b2ad315ee93f854d98e3c25aa42ccf).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #91555 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91555/testReport)** for PR 19498 at commit [`174ec21`](https://github.com/apache/spark/commit/174ec2139a7e0af049e2954494525fd3fff145e2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #85581 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85581/testReport)** for PR 19498 at commit [`174ec21`](https://github.com/apache/spark/commit/174ec2139a7e0af049e2954494525fd3fff145e2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #90371 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90371/testReport)** for PR 19498 at commit [`174ec21`](https://github.com/apache/spark/commit/174ec2139a7e0af049e2954494525fd3fff145e2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to av...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19498#discussion_r146531880
  
    --- Diff: python/pyspark/streaming/util.py ---
    @@ -64,7 +64,11 @@ def call(self, milliseconds, jrdds):
                 t = datetime.fromtimestamp(milliseconds / 1000.0)
                 r = self.func(t, *rdds)
                 if r:
    -                return r._jrdd
    +                # Here, we work around to ensure `_jrdd` is `JavaRDD` by wrapping it by `map`.
    +                # org.apache.spark.streaming.api.python.PythonTransformFunction requires to return
    +                # `JavaRDD`; however, this could be `JavaPairRDD` by some APIs, for example, `zip`.
    +                # See SPARK-17756.
    +                return r.map(lambda x: x)._jrdd
    --- End diff --
    
    Yup, it guess it would but I could not think of a better and simpler way (and I saw this workaround is already used above in `python/pyspark/streaming/context.py`). If a little bit more uglier (possibly flaky imho) way is okay for this concern, I could do this, for example, as below:
    
    ```python
    if r._jrdd.getClass().toString() == "JavaRDD":
        return r._jrdd
    else:
        return r.map(lambda x: x)._jrdd
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83997/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    @rxin, I just noticed you were watching this JIRA. Could you take a look when you have some time?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to av...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19498#discussion_r151855253
  
    --- Diff: python/pyspark/streaming/util.py ---
    @@ -64,7 +64,11 @@ def call(self, milliseconds, jrdds):
                 t = datetime.fromtimestamp(milliseconds / 1000.0)
                 r = self.func(t, *rdds)
                 if r:
    -                return r._jrdd
    +                # Here, we work around to ensure `_jrdd` is `JavaRDD` by wrapping it by `map`.
    +                # org.apache.spark.streaming.api.python.PythonTransformFunction requires to return
    +                # `JavaRDD`; however, this could be `JavaPairRDD` by some APIs, for example, `zip`.
    +                # See SPARK-17756.
    +                return r.map(lambda x: x)._jrdd
    --- End diff --
    
    Thanks for review @holdenk. Let me push the change.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to av...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19498#discussion_r146514175
  
    --- Diff: python/pyspark/streaming/util.py ---
    @@ -64,7 +64,11 @@ def call(self, milliseconds, jrdds):
                 t = datetime.fromtimestamp(milliseconds / 1000.0)
                 r = self.func(t, *rdds)
                 if r:
    -                return r._jrdd
    +                # Here, we work around to ensure `_jrdd` is `JavaRDD` by wrapping it by `map`.
    +                # org.apache.spark.streaming.api.python.PythonTransformFunction requires to return
    +                # `JavaRDD`; however, this could be `JavaPairRDD` by some APIs, for example, `zip`.
    +                # See SPARK-17756.
    +                return r.map(lambda x: x)._jrdd
    --- End diff --
    
    wouldnt introducing this additional map cause a slight perf regression?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Hi @zsxwing, I happened to look into this one. Could you take a look and see if it makes sense please?
    
    cc @zero323 (reporter) too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #83995 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83995/testReport)** for PR 19498 at commit [`c45b701`](https://github.com/apache/spark/commit/c45b7016d8c446d023a3ca415c15d26298e61c5a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Merged to master.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #82761 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82761/testReport)** for PR 19498 at commit [`f5a2a88`](https://github.com/apache/spark/commit/f5a2a884d860e9c8b3f98fc4ae5f10eaf3c1a0a4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #91555 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91555/testReport)** for PR 19498 at commit [`174ec21`](https://github.com/apache/spark/commit/174ec2139a7e0af049e2954494525fd3fff145e2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #82761 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82761/testReport)** for PR 19498 at commit [`f5a2a88`](https://github.com/apache/spark/commit/f5a2a884d860e9c8b3f98fc4ae5f10eaf3c1a0a4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #83996 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83996/testReport)** for PR 19498 at commit [`dc8446a`](https://github.com/apache/spark/commit/dc8446ad99b2ad315ee93f854d98e3c25aa42ccf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to av...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19498


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Thanks for reviewing this @rxin. Will merge this only to master given the concerns here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Will check and see if I could make a min fix soon.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #85184 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85184/testReport)** for PR 19498 at commit [`174ec21`](https://github.com/apache/spark/commit/174ec2139a7e0af049e2954494525fd3fff145e2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    cc @tdas 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #90371 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90371/testReport)** for PR 19498 at commit [`174ec21`](https://github.com/apache/spark/commit/174ec2139a7e0af049e2954494525fd3fff145e2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91555/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    I just took a look. This looks actually going to bigger change because we will need manually deal with the different de/serializers cases in Python side as well, for example:
    
    ```
    >>> rdd = sc.parallelize(range(10)).cartesian(sc.parallelize(range(10)))
    >>> rdd._jrdd_deserializer = sc.serializer
    >>> rdd.collect()
    [0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0, 8, 0, 9, 1, 0, 1, 1, 1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1, 9, 2, 0, 2, 1, 2, 2, 2, 3, 2, 4, 2, 5, 2, 6, 2, 7, 2, 8, 2, 9, 3, 0, 4, 0, 3, 1, 4, 1, 3, 2, 4, 2, 3, 3, 3, 4, 4, 3, 4, 4, 3, 5, 4, 5, 3, 6, 4, 6, 3, 7, 4, 7, 3, 8, 3, 9, 4, 8, 4, 9, 5, 0, 5, 1, 5, 2, 5, 3, 5, 4, 5, 5, 5, 6, 5, 7, 5, 8, 5, 9, 6, 0, 6, 1, 6, 2, 6, 3, 6, 4, 6, 5, 6, 6, 6, 7, 6, 8, 6, 9, 7, 0, 7, 1, 7, 2, 7, 3, 7, 4, 7, 5, 7, 6, 7, 7, 7, 8, 7, 9, 8, 0, 9, 0, 8, 1, 9, 1, 8, 2, 9, 2, 8, 3, 8, 4, 9, 3, 9, 4, 8, 5, 9, 5, 8, 6, 9, 6, 8, 7, 9, 7, 8, 8, 8, 9, 9, 8, 9, 9]
    ```
    
    Seems we better just stick to the original workaround.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83996/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    gentle ping ... 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85581/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3849/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to av...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19498#discussion_r147551966
  
    --- Diff: python/pyspark/streaming/util.py ---
    @@ -64,7 +64,11 @@ def call(self, milliseconds, jrdds):
                 t = datetime.fromtimestamp(milliseconds / 1000.0)
                 r = self.func(t, *rdds)
                 if r:
    -                return r._jrdd
    +                # Here, we work around to ensure `_jrdd` is `JavaRDD` by wrapping it by `map`.
    +                # org.apache.spark.streaming.api.python.PythonTransformFunction requires to return
    +                # `JavaRDD`; however, this could be `JavaPairRDD` by some APIs, for example, `zip`.
    +                # See SPARK-17756.
    +                return r.map(lambda x: x)._jrdd
    --- End diff --
    
    @tdas, I believe this one is something we should fix ... would you like me to keep this way, try the way above or find another way to avoid a map even if it is complex? 
    
    Otherwise, I can run a benchmark on RDD with extra map too see the slight perf diff (although I am not sure yet if I can show the meaningful diff).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    To be honest, I think more correct fix might be just to `_reserialize` but just want to avoid any potential regression by avoiding class check and using existing workaround used.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Hi @brkyvz, could you take a look please?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to av...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19498#discussion_r149649246
  
    --- Diff: python/pyspark/streaming/util.py ---
    @@ -64,7 +64,11 @@ def call(self, milliseconds, jrdds):
                 t = datetime.fromtimestamp(milliseconds / 1000.0)
                 r = self.func(t, *rdds)
                 if r:
    -                return r._jrdd
    +                # Here, we work around to ensure `_jrdd` is `JavaRDD` by wrapping it by `map`.
    +                # org.apache.spark.streaming.api.python.PythonTransformFunction requires to return
    +                # `JavaRDD`; however, this could be `JavaPairRDD` by some APIs, for example, `zip`.
    +                # See SPARK-17756.
    +                return r.map(lambda x: x)._jrdd
    --- End diff --
    
    @tdas, it's fine if you feel we should find another way or make a small benchmark. I can try. But just wonder which one you'd prefer before I go too far in a different way you'd prefer.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to av...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19498#discussion_r151839513
  
    --- Diff: python/pyspark/streaming/util.py ---
    @@ -64,7 +64,11 @@ def call(self, milliseconds, jrdds):
                 t = datetime.fromtimestamp(milliseconds / 1000.0)
                 r = self.func(t, *rdds)
                 if r:
    -                return r._jrdd
    +                # Here, we work around to ensure `_jrdd` is `JavaRDD` by wrapping it by `map`.
    +                # org.apache.spark.streaming.api.python.PythonTransformFunction requires to return
    +                # `JavaRDD`; however, this could be `JavaPairRDD` by some APIs, for example, `zip`.
    +                # See SPARK-17756.
    +                return r.map(lambda x: x)._jrdd
    --- End diff --
    
    Personally, I think the only applying the `map` when the result is not JavaRDD is a good incremental improvement (since otherwise the code path fails right?).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #83995 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83995/testReport)** for PR 19498 at commit [`c45b701`](https://github.com/apache/spark/commit/c45b7016d8c446d023a3ca415c15d26298e61c5a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #85184 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85184/testReport)** for PR 19498 at commit [`174ec21`](https://github.com/apache/spark/commit/174ec2139a7e0af049e2954494525fd3fff145e2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    **[Test build #85581 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85581/testReport)** for PR 19498 at commit [`174ec21`](https://github.com/apache/spark/commit/174ec2139a7e0af049e2954494525fd3fff145e2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82761/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19498: [SPARK-17756][PYTHON][STREAMING] Workaround to avoid ret...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19498
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org