Posted to reviews@spark.apache.org by kalpit <gi...@git.apache.org> on 2014/04/25 19:48:12 UTC

[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

GitHub user kalpit opened a pull request:

    https://github.com/apache/spark/pull/554

    SPARK-1630: Make PythonRDD handle NULL elements and strings gracefully

    Added a unit test that validates the fix. We no longer hit an NPE.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kalpit/spark pyspark/handleNullData

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/554.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #554
    
----
commit ff036d31c6adbc2cd5f2c9347c267073b673167b
Author: Kalpit Shah <sh...@gmail.com>
Date:   2014-04-25T17:44:30Z

    SPARK-1630: Make PythonRDD handle Null elements and strings gracefully

----


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41426593
  
    Thanks - just one more tiny thing about indent ...


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-54694733
  
    Can one of the admins verify this patch?


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-51286613
  
    Can one of the admins verify this patch?


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-50440910
  
    Hi @kalpit, 
    
    Since this PR has been superseded by #644, do you mind closing it?  Thanks!


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/554#discussion_r12014899
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -301,15 +301,23 @@ private[spark] object PythonRDD {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            logDebug("Encountered NULL element from iterator. We skip writing NULL to stream.")
    +          } else {
    +            throw new SparkException("Unexpected element type " + first.getClass)
    +          }
           }
         }
       }
     
       def writeUTF(str: String, dataOut: DataOutputStream) {
    -    val bytes = str.getBytes(UTF8)
    -    dataOut.writeInt(bytes.length)
    -    dataOut.write(bytes)
    +    if (str == null) {
    +      logDebug("Encountered NULL string. We skip writing NULL to stream.")
    +    } else {
    +        val bytes = str.getBytes(UTF8)
    --- End diff --
    
    Can you update this one also?


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/554#discussion_r12007675
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -301,15 +301,25 @@ private[spark] object PythonRDD {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          Option(other) match {
    +            case None =>
    +              logDebug("Encountered NULL element from iterator. We skip writing NULL to stream.")
    +            case Some(x) =>
    +              throw new SparkException("Unexpected element type " + first.getClass)
    +          }
           }
         }
       }
     
       def writeUTF(str: String, dataOut: DataOutputStream) {
    -    val bytes = str.getBytes(UTF8)
    -    dataOut.writeInt(bytes.length)
    -    dataOut.write(bytes)
    +    Option(str) match {
    --- End diff --
    
    here also


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by kalpit <gi...@git.apache.org>.
Github user kalpit commented on a diff in the pull request:

    https://github.com/apache/spark/pull/554#discussion_r12010821
  
    --- Diff: core/src/test/scala/org/apache/spark/api/python/PythonRDDSuite.scala ---
    @@ -29,5 +29,10 @@ class PythonRDDSuite extends FunSuite {
             PythonRDD.writeIteratorToStream(input.iterator, buffer)
         }
     
    +    test("Handle nulls gracefully") {
    +        val input: List[String] = List("a", null)
    --- End diff --
    
    I used a 4-space indent since the other test in this suite used that. I will re-indent to 2 spaces.


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/554#discussion_r12007684
  
    --- Diff: core/src/test/scala/org/apache/spark/api/python/PythonRDDSuite.scala ---
    @@ -29,5 +29,10 @@ class PythonRDDSuite extends FunSuite {
             PythonRDD.writeIteratorToStream(input.iterator, buffer)
         }
     
    +    test("Handle nulls gracefully") {
    +        val input: List[String] = List("a",null)
    --- End diff --
    
    add a space after the comma


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41445555
  
    Merged build started. 


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/554#discussion_r12007480
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -301,15 +301,25 @@ private[spark] object PythonRDD {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          Option(other) match {
    --- End diff --
    
    It's more obvious to just do 
    ```scala
    if (other == null) {
    } else {
    }
    ```
    than pattern matching.


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-54565701
  
    I've closed this since it was fixed separately. Thanks for sending a patch here.


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by kalpit <gi...@git.apache.org>.
Github user kalpit commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41487520
  
    I suspect the NPE will happen for any PySpark user whose RDD returns null for some input "x", depending on the lambda/transform. Check out the test case I added to "PythonRDDSuite.scala" to reproduce the NPE.
    
    I considered using a negative length (-4) to pass "None" to Python (PythonRDD.SpecialLengths -1 to -3 are taken). The tricky part, however, is that the read() method returns an array of bytes based on the length, and the existing code treats an empty array as the end of the data/stream. So I am not sure how we would communicate "None" to Python. Thoughts?
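
    To make this concrete, here is a rough sketch of the framing issue; the names and the -4 value are only illustrative, not the actual PythonRDD protocol. The reader sees a length prefix, an empty payload already means end of stream, so a null element would need a sentinel length of its own:
    
    ```scala
    import java.io.DataInputStream
    
    object FramingSketch {
      // Assumed existing convention: a zero-length/empty payload means end of stream.
      val EndOfStreamLength = 0
      // Hypothetical sentinel for a null element (-1 to -3 are already taken).
      val NullSentinelLength = -4
    
      sealed trait Frame
      case object EndOfStream extends Frame
      case object NullElement extends Frame    // would surface as None on the Python side
      final case class Element(bytes: Array[Byte]) extends Frame
    
      def readFrame(in: DataInputStream): Frame = in.readInt() match {
        case EndOfStreamLength  => EndOfStream
        case NullSentinelLength => NullElement
        case n =>
          val bytes = new Array[Byte](n)
          in.readFully(bytes)
          Element(bytes)
      }
    }
    ```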


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-42604738
  
    > I considered the idea of using negative length (-4) to pass "None" to python (PythonRDD.SpecialLengths -1 to -3 are taken). The tricky part however is that the read() method returns an array of bytes based on the length. Existing code treats empty array as end of data/stream. So I am not sure how we would communicate "None" to python. Thoughts ?
    
    @kalpit please take a look at #644, where I propose using ```null``` to signal end of stream instead of an empty array.


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41445556
  
    I'm curious, when did you get nulls in practice? Wouldn't it be better to pass a null to Python and have it display as None?


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41445487
  
    Jenkins, test this please


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41420838
  
    Can one of the admins verify this patch?


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41609335
  
    Yeah, but in that case I think we have to figure out a way with the lengths. I haven't had time to look into it, but basically the UTF decoder in Python needs to deal with negative lengths sent from Scala.


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41448674
  
    Merged build finished. All automated tests passed.


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41421044
  
    Thanks, @kalpit. This looks pretty good. I left a couple comments on style.


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/554#discussion_r12010283
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -301,15 +301,23 @@ private[spark] object PythonRDD {
                   throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + first.getClass)
    +          if (other == null) {
    +            logDebug("Encountered NULL element from iterator. We skip writing NULL to stream.")
    +          } else {
    +            throw new SparkException("Unexpected element type " + first.getClass)
    +          }
           }
         }
       }
     
       def writeUTF(str: String, dataOut: DataOutputStream) {
    -    val bytes = str.getBytes(UTF8)
    -    dataOut.writeInt(bytes.length)
    -    dataOut.write(bytes)
    +    if (str == null) {
    +      logDebug("Encountered NULL string. We skip writing NULL to stream.")
    +    } else {
    +        val bytes = str.getBytes(UTF8)
    --- End diff --
    
    the indent is off here (2-space indent)


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/554#discussion_r12010287
  
    --- Diff: core/src/test/scala/org/apache/spark/api/python/PythonRDDSuite.scala ---
    @@ -29,5 +29,10 @@ class PythonRDDSuite extends FunSuite {
             PythonRDD.writeIteratorToStream(input.iterator, buffer)
         }
     
    +    test("Handle nulls gracefully") {
    +        val input: List[String] = List("a", null)
    --- End diff --
    
    the indent is off here (2-space indent)


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by kalpit <gi...@git.apache.org>.
Github user kalpit commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41595891
  
    I see your point. I don't have a Python-only use-case that can trigger the NPE.
    
    My custom RDD implementation had a corner case in which the RDD's compute() method returned a null in the iterator stream. I have since fixed my custom RDD so it no longer does that, so I don't run into this NPE anymore. However, anyone else who implements a custom RDD of a similar nature (nulls for some elements in a partition's iterator stream) and accesses such an RDD from PySpark would run into the NPE, so I thought it would be nicer if we handled nulls in the stream gracefully.
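
    For context, a stripped-down sketch of the kind of custom RDD I mean; the class and the data are made up purely for illustration:
    
    ```scala
    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD
    
    // Illustrative only: a custom RDD whose compute() yields null for some elements,
    // which is the corner case that trips the NPE when consumed through PythonRDD.
    class NullProducingRDD(sc: SparkContext) extends RDD[String](sc, Nil) {
      override def getPartitions: Array[Partition] =
        Array(new Partition { override def index: Int = 0 })
    
      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        Iterator("a", null, "b")
    }
    ```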


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41448676
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14500/


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41588930
  
    Lambdas in Python that return None will work fine because we use pickling for all data after that. The only way this problem can happen is if a Java RDD has null in it. Do you have an example in Python only (with the current PySpark) where this happens?


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41484107
  
    But that means that the NPEs are only happening with your custom RDD, right? They won't happen for regular Spark users.
    
    I think we should pass None here. One way to do it is to select a negative length (e.g. -3) to represent null, and pass that to Python. We already use other negative lengths for other special flags.
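
    Roughly, the Scala writer side might look something like this; the constant name and value are placeholders, and the Python reader would need the matching change:
    
    ```scala
    import java.io.DataOutputStream
    import java.nio.charset.StandardCharsets
    
    object WriteUTFSketch {
      // Placeholder sentinel; the real value would have to avoid the negative
      // lengths already reserved in PythonRDD.SpecialLengths.
      val NullStringLength = -4
    
      def writeUTF(str: String, dataOut: DataOutputStream): Unit = {
        if (str == null) {
          dataOut.writeInt(NullStringLength)  // the Python side would map this length to None
        } else {
          val bytes = str.getBytes(StandardCharsets.UTF_8)
          dataOut.writeInt(bytes.length)
          dataOut.write(bytes)
        }
      }
    }
    ```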


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by kalpit <gi...@git.apache.org>.
Github user kalpit commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41474348
  
    @mateiz I ran into this when my custom RDD<String> produced nulls for some elements within a partition/split (during compute()).
    
    It would indeed be better to pass a null to Python and have it displayed as None. One solution is to pick a TOKEN that we write into the tmp file and then translate to "None" during read. This, however, is not failsafe because there is a remote possibility of string data being identical to the TOKEN. Perhaps we could address that by fencing regular data with a special character and treating data lacking that fence as tokens.
    
    In any case, the above solution (or an alternative) would be a relatively larger change, and I preferred at least fixing the NPEs in PythonRDD in the short term (the stack trace is in the JIRA ticket).
    
    What do you think?


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/554#issuecomment-41445548
  
     Merged build triggered. 


[GitHub] spark pull request: SPARK-1630: Make PythonRDD handle NULL element...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/554

