You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2014/07/23 21:54:00 UTC

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/1551

    [SPARK-1630] Turn Null of Java/Scala into None of Python

    During serializing PythonRDD, it will cause an NPE if there null in it. This patch will handle it as None of Python.
    
    This PR is based on #554, thanks to @kalpit. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark null

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1551.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1551
    
----
commit ff036d31c6adbc2cd5f2c9347c267073b673167b
Author: Kalpit Shah <sh...@gmail.com>
Date:   2014-04-25T17:44:30Z

    SPARK-1630: Make PythonRDD handle Null elements and strings gracefully

commit 8a4a0f94d34b76b44b590ca741b438393b803106
Author: Kalpit Shah <sh...@gmail.com>
Date:   2014-04-25T18:24:41Z

    SPARK-1630: Incorporated code-review feedback

commit dddda9e2858d518c916b60972a2ba0a025b38855
Author: Kalpit Shah <sh...@gmail.com>
Date:   2014-04-25T18:59:50Z

    SPARK-1630: Fixed indentation

commit 55a077a900244bb0d286927bfa2b51a049cea94b
Author: Davies Liu <da...@gmail.com>
Date:   2014-07-23T19:17:50Z

    Merge branch 'pyspark/handleNullData' of github.com:kalpit/spark into null

commit 3af8b4d4e7152bf5857201febd6e223aab4587fe
Author: Davies Liu <da...@gmail.com>
Date:   2014-07-23T19:47:33Z

    turn Null of Java into None of Python

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-50550822
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by davies <gi...@git.apache.org>.
Github user davies closed the pull request at:

    https://github.com/apache/spark/pull/1551


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-50682428
  
    Close this PR now, will reopen if needed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15432269
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    --- End diff --
    
    It's header of var-length field, it's better to keep this header has fixed length, or you will need to deal with special var-length encoding.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15562646
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    +            writeIteratorToStream(iter, dataOut)
    --- End diff --
    
    But this is what I didn't understand about the whole PR: user code is not meant to call PythonRDD directly. Note that the whole PythonRDD object is `private[spark]`. So where in the codebase today can we get nulls there?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-69256005
  
    @matei We hit this issue when working on Python API for Kafka, it's a DStream[Array[Byte], Array[byte]], but the key in dstream is null, I will fix this in #3715  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-49926580
  
    QA tests have started for PR 1551. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17056/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-50242563
  
    QA tests have started for PR 1551. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17225/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-50551104
  
    QA tests have started for PR 1551. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17388/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-50543696
  
    QA results for PR 1551:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17380/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-50546276
  
    The failed tests cases is not related to this PR, how to retest it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15569391
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    +            writeIteratorToStream(iter, dataOut)
    --- End diff --
    
    BTW apart from the stability issue above with catching our own bugs, the reason I'm commenting is that this change also adds some moderately tricky code in a fairly important code path, increasing the chance of adding new bugs. That doesn't seem worth it to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-50244477
  
    QA results for PR 1551:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17225/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by matei <gi...@git.apache.org>.
Github user matei commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-50230241
  
    Hi guys,
    Could you please use the full username (e.g. @mateixx instead if @matei) when referring to someone ? I keep getting subscribed to various conversations under this project :) thanks a lot!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15568273
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    +            writeIteratorToStream(iter, dataOut)
    --- End diff --
    
    If users want to call UDF in Java/Scala from PySpark, they have to use this private API to do it, so it's possible to have null in RDD[string] or RDD[Array[Byte]].
    
    BTW, it will be helpful if we can skip some BAD rows during map/reduce, which was mentioned in MapReduce paper. This is not MUST have feature, but it really improve the robustness of whole framework, very useful for large scale jobs.
    
    This PR try to improve the stability of PySpark, let users feel safer and happier to hack in PySpark.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-50442362
  
    We aren't passing completely arbitrary iterators of Java objects to writeIteratorToStream; instead, we only handle iterators of strings and byte arrays. Nulls in data read from Hadoop input formats should already be converted to None by the Java pickling code. Do you have an example where PythonRDD receives a null element and it's not due to a bug? I'm worried that this patch will mask the presence of bugs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-50242375
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15567208
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    +            writeIteratorToStream(iter, dataOut)
    --- End diff --
    
    Right, but that's a private API, it doesn't matter. Does our own code do it?
    
    Basically I'm worried that this significantly complicates our code for something that shouldn't happen. I'd rather have an NPE if our own code later passes nulls here (cause it really shouldn't be doing that since we control everything we pass in).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-50538401
  
    QA tests have started for PR 1551. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17380/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15432123
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    --- End diff --
    
    maybe it doesn't matter much here, but would it make sense to write a byte instead of an int?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15316298
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -524,9 +530,13 @@ private[spark] object PythonRDD extends Logging {
       }
     
       def writeUTF(str: String, dataOut: DataOutputStream) {
    -    val bytes = str.getBytes(UTF8)
    -    dataOut.writeInt(bytes.length)
    -    dataOut.write(bytes)
    +    if (str == null) {
    +      dataOut.writeInt(SpecialLengths.NULL)
    +    } else {
    +        val bytes = str.getBytes(UTF8)
    --- End diff --
    
    alignment is off here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-50554987
  
    QA results for PR 1551:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17388/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15535564
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    +            writeIteratorToStream(iter, dataOut)
    --- End diff --
    
    It looks like we only have to worry about nulls when writing iterators from user-defined RDDs of strings.  So, if we see an iterator that begins with null, we can assume that the remainder of the iterator contains only nulls or strings.  Therefore, I think you can write out the first null followed by
    
    ```
    iter.asInstanceOf[Iterator[String]].foreach { str =>
      writeUTF(str, dataOut)
    }
    ```
    
    to process the remainder of the stream.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15555903
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -312,42 +314,51 @@ private[spark] object PythonRDD extends Logging {
         JavaRDD.fromRDD(sc.sc.parallelize(objs, parallelism))
       }
     
    +  @tailrec
       def writeIteratorToStream[T](iter: Iterator[T], dataOut: DataOutputStream) {
         // The right way to implement this would be to use TypeTags to get the full
         // type of T.  Since I don't want to introduce breaking changes throughout the
         // entire Spark API, I have to use this hacky approach:
    +    def writeBytes(bytes: Array[Byte]) {
    --- End diff --
    
    Array[Byte] is similar to String, null can be generated by user's functions or RDDs, just like
    
    RDD[String].map(x => if (x != null) x.toArray else x)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15562712
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    +            writeIteratorToStream(iter, dataOut)
    --- End diff --
    
    Someone opened the original PR and I had the same question there, it seemed unnecessary. The PR wasn't open to fix a bug as far as I know, or at least the person didn't report a particular case where it crashes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-49940062
  
    QA results for PR 1551:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17056/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15569680
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    +            writeIteratorToStream(iter, dataOut)
    --- End diff --
    
    OK, let's hold it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15555734
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    +            writeIteratorToStream(iter, dataOut)
    --- End diff --
    
    I think it's better to handle NPE as much as possible, until you can prove that NPE will not happen.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15569151
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    +            writeIteratorToStream(iter, dataOut)
    --- End diff --
    
    Again, sorry, I don't think this improves stability:
    1) Users are not supposed to call private APIs. In fact even *Scala* code can't call PythonRDD because that is private[spark] -- it's just an artifact of the way Scala implements package-private that the class becomes public in Java. If you'd like support for UDFs we need to add that as a separate, top-level feature.
    2) This change would mask bugs in the current way we write Python converters. Our current converters only pass in Strings and arrays of bytes, which shouldn't be null. (For datasets that contain null they convert it to a picked form of None already). This means that if someone introduces a bug in one of our existing code paths, that bug will be harder to fix because instead of being an NPE, it will be some weird value coming out in Python.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-49943132
  
    QA tests have started for PR 1551. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17065/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1551#issuecomment-50183135
  
    @rxin @matei, could you take a look at this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15565797
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    +            writeIteratorToStream(iter, dataOut)
    --- End diff --
    
    It's easy to call private API in Python, log4j can easily to do this via reflection, just like we do in PySpark.
    
    For example, 
    {{{
    java_func = sc._jvm.com.xxx.XClass.XFunc
    scala_rdd = py_rdd._jrdd.rdd().map(java_func)
    java_rdd = sc._jvm.PythonRDD(scala_rdd, '', env, includes, ...).asJavaRDD()
    rdd = RDD(java_rdd, sc, UTF8Deserializer())
    }}}
    
    If java_func can return null, then it will have NPE.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15554364
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    +            writeIteratorToStream(iter, dataOut)
    --- End diff --
    
    I was wrong; this _is_ tail-recursive.  If we only expect nulls to occur in iterators of strings, then I think we should be able to remove the null checking here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15554294
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -312,42 +314,51 @@ private[spark] object PythonRDD extends Logging {
         JavaRDD.fromRDD(sc.sc.parallelize(objs, parallelism))
       }
     
    +  @tailrec
       def writeIteratorToStream[T](iter: Iterator[T], dataOut: DataOutputStream) {
         // The right way to implement this would be to use TypeTags to get the full
         // type of T.  Since I don't want to introduce breaking changes throughout the
         // entire Spark API, I have to use this hacky approach:
    +    def writeBytes(bytes: Array[Byte]) {
    --- End diff --
    
    Is there a legitimate case where a Iterator[Array[Byte]] will contain a null?  I was hoping we'd only have to worry about nulls in Iterator[String].


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15527067
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    +            writeIteratorToStream(iter, dataOut)
    --- End diff --
    
    This method isn't tail-recursive, so this will cause a StackOverflow if you try to write an iterator with thousands of consecutive nulls.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---