Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/11/09 13:51:37 UTC

[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level p...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19707

    [SPARK-22472][SQL] add null check for top-level primitive values

    ## What changes were proposed in this pull request?
    
    One powerful feature of `Dataset` is that we can easily map SQL rows to Scala/Java objects, with automatic runtime null checks.
    
    For example, say we have a parquet file with schema `<a: int, b: string>` and a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw an NPE if column `a` contains null values.
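
    A minimal sketch of this usage (the parquet path is a placeholder and `spark` is an existing `SparkSession`, as in `spark-shell`):

    ```scala
    import spark.implicits._

    case class Data(a: Int, b: String)

    // Map SQL rows to typed objects.
    val ds = spark.read.parquet("/path/to/file").as[Data]

    // If column `a` contains a null, decoding the row into the non-nullable
    // field `a: Int` fails at runtime with a NullPointerException.
    ds.collect()
    ```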
    
    However, this null check is missing for top-level primitive values. For example, say we have a parquet file with schema `<a: int>` and we read it as Scala `Int`. If column `a` contains null values, we get some surprising results:
    ```
    scala> val ds = spark.read.parquet(...).as[Int]
    
    scala> ds.show()
    +----+
    |v   |
    +----+
    |null|
    |1   |
    +----+
    
    scala> ds.collect
    res0: Array[Long] = Array(0, 1)
    
    scala> ds.map(_ * 2).show
    +-----+
    |value|
    +-----+
    |-2   |
    |2    |
    +-----+
    ```
    
    This is because Spark internally uses special default values for primitive types, and never expects users to see or operate on these default values directly.
    
    This PR adds a null check for top-level primitive values.
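
    With this patch, the earlier example fails fast instead of returning default values. A minimal sketch of the new behavior (it mirrors the new `DatasetSuite` test; `intercept` comes from ScalaTest and `spark.implicits._` is assumed to be in scope):

    ```scala
    import org.apache.spark.SparkException

    val ds = Seq(Some(1), None).toDS().as[Int]

    // Decoding a null into a Scala Int now throws a NullPointerException...
    intercept[NullPointerException](ds.collect())

    // ...and when the failure happens inside a task, the NPE surfaces as the
    // cause of a SparkException.
    val e = intercept[SparkException](ds.map(_ * 2).collect())
    assert(e.getCause.isInstanceOf[NullPointerException])
    ```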
    
    ## How was this patch tested?
    
    A new test in `DatasetSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19707.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19707
    
----
commit dad50806b27a40ed1112d8ee29b3bd5c60164170
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-11-09T13:39:10Z

    add null check for top-level primitive values

----


---



[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level p...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19707#discussion_r150015018
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           checkDataset(ds, SpecialCharClass("1", "2"))
         }
       }
    +
    +  test("SPARK-22472: add null check for top-level primitive values") {
    +    // If the primitive values are from Option, we need to do runtime null check.
    +    val ds = Seq(Some(1), None).toDS().as[Int]
    +    intercept[NullPointerException](ds.collect())
    +    val e = intercept[SparkException](ds.map(_ * 2).collect())
    +    assert(e.getCause.isInstanceOf[NullPointerException])
    +
    +    withTempPath { path =>
    +      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
    --- End diff --
    
    nit: `toDF()` also works.


---



[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19707
  
    LGTM


---



[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19707
  
    LGTM except one minor comment


---



[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level p...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19707#discussion_r150026840
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala ---
    @@ -134,7 +134,13 @@ object ScalaReflection extends ScalaReflection {
         val tpe = localTypeOf[T]
         val clsName = getClassNameFromType(tpe)
         val walkedTypePath = s"""- root class: "$clsName"""" :: Nil
    -    deserializerFor(tpe, None, walkedTypePath)
    +    val expr = deserializerFor(tpe, None, walkedTypePath)
    +    val Schema(_, nullable) = schemaFor(tpe)
    +    if (nullable) {
    +      expr
    +    } else {
    +      AssertNotNull(expr, walkedTypePath)
    +    }
    --- End diff --
    
    Hi, @cloud-fan.
    It looks great. Can we add a test case in `ScalaReflectionSuite`, too?
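
    One possible shape for such a check, sketched from the diff above (only a guess at what the suite test could look like, not necessarily the test that gets added):

    ```scala
    import org.apache.spark.sql.catalyst.ScalaReflection
    import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull

    test("SPARK-22472: top-level primitive deserializer is null-checked") {
      // A non-nullable top-level type is wrapped in AssertNotNull...
      assert(ScalaReflection.deserializerFor[Int].isInstanceOf[AssertNotNull])
      // ...while a nullable top-level type is left unwrapped.
      assert(!ScalaReflection.deserializerFor[String].isInstanceOf[AssertNotNull])
    }
    ```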


---



[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level p...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19707#discussion_r150069049
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           checkDataset(ds, SpecialCharClass("1", "2"))
         }
       }
    +
    +  test("SPARK-22472: add null check for top-level primitive values") {
    +    // If the primitive values are from Option, we need to do runtime null check.
    +    val ds = Seq(Some(1), None).toDS().as[Int]
    +    intercept[NullPointerException](ds.collect())
    +    val e = intercept[SparkException](ds.map(_ * 2).collect())
    +    assert(e.getCause.isInstanceOf[NullPointerException])
    +
    +    withTempPath { path =>
    +      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
    --- End diff --
    
    Thanks!


---



[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19707
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19707
  
    **[Test build #83644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83644/testReport)** for PR 19707 at commit [`dad5080`](https://github.com/apache/spark/commit/dad50806b27a40ed1112d8ee29b3bd5c60164170).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19707
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83644/
    Test PASSed.


---



[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19707
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83654/
    Test PASSed.


---



[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19707
  
    **[Test build #83654 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83654/testReport)** for PR 19707 at commit [`a59c399`](https://github.com/apache/spark/commit/a59c3997426f92f7721ca3d633cf3356cc3d4130).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19707
  
    Thanks! Merged to master/2.2


---



[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level p...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19707#discussion_r150064768
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           checkDataset(ds, SpecialCharClass("1", "2"))
         }
       }
    +
    +  test("SPARK-22472: add null check for top-level primitive values") {
    +    // If the primitive values are from Option, we need to do runtime null check.
    +    val ds = Seq(Some(1), None).toDS().as[Int]
    +    intercept[NullPointerException](ds.collect())
    +    val e = intercept[SparkException](ds.map(_ * 2).collect())
    +    assert(e.getCause.isInstanceOf[NullPointerException])
    +
    +    withTempPath { path =>
    +      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
    --- End diff --
    
    It doesn't matter; I just need a DataFrame. I could even use `Seq(Some(1), None).toDF`, but using parquet is more convincing.


---



[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19707
  
    **[Test build #83654 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83654/testReport)** for PR 19707 at commit [`a59c399`](https://github.com/apache/spark/commit/a59c3997426f92f7721ca3d633cf3356cc3d4130).


---



[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19707
  
    **[Test build #83644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83644/testReport)** for PR 19707 at commit [`dad5080`](https://github.com/apache/spark/commit/dad50806b27a40ed1112d8ee29b3bd5c60164170).


---



[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level p...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19707#discussion_r150027756
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           checkDataset(ds, SpecialCharClass("1", "2"))
         }
       }
    +
    +  test("SPARK-22472: add null check for top-level primitive values") {
    +    // If the primitive values are from Option, we need to do runtime null check.
    +    val ds = Seq(Some(1), None).toDS().as[Int]
    +    intercept[NullPointerException](ds.collect())
    +    val e = intercept[SparkException](ds.map(_ * 2).collect())
    +    assert(e.getCause.isInstanceOf[NullPointerException])
    +
    +    withTempPath { path =>
    +      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
    --- End diff --
    
    Is this PR orthogonal to the data source format?
    Could you test more data sources, like `JSON`, here?


---



[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19707
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level p...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19707


---



[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level p...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19707#discussion_r150024246
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           checkDataset(ds, SpecialCharClass("1", "2"))
         }
       }
    +
    +  test("SPARK-22472: add null check for top-level primitive values") {
    +    // If the primitive values are from Option, we need to do runtime null check.
    +    val ds = Seq(Some(1), None).toDS().as[Int]
    +    intercept[NullPointerException](ds.collect())
    +    val e = intercept[SparkException](ds.map(_ * 2).collect())
    +    assert(e.getCause.isInstanceOf[NullPointerException])
    +
    +    withTempPath { path =>
    +      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
    --- End diff --
    
    Not a big deal, but `toDF("i")` is more explicit about the column name.


---



[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19707
  
    cc @gatorsmile @kiszk @srowen 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org