You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mswit-databricks <gi...@git.apache.org> on 2018/02/28 12:50:40 UTC

[GitHub] spark pull request #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet...

GitHub user mswit-databricks opened a pull request:

    https://github.com/apache/spark/pull/20694

    [SPARK-23173][SQL] Avoid creating corrupt parquet files when loading data from JSON

    ## What changes were proposed in this pull request?
    
    The from_json() function accepts an additional parameter, where the user might specify the schema. The issue is that the specified schema might not be compatible with data. In particular, the JSON data might be missing data for fields declared as non-nullable in the schema. The from_json() function does not verify the data against such errors. When data with missing fields is sent to the parquet encoder, there is no verification either. The end results is a corrupt parquet file.
    
    To avoid corruptions, make sure that all fields in the user-specified schema are set to be nullable.
    Since this changes the behavior of a public function, we need to include it in release notes.
    The behavior can be reverted by setting `spark.sql.fromJsonForceNullableSchema=false`
    
    ## How was this patch tested?
    
    Added two new tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mswit-databricks/apache-spark SPARK-23173

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20694.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20694
    
----
commit 1cd19196cf46e15aaf1636240d053996e623370b
Author: Michał Świtakowski <mi...@...>
Date:   2018-02-28T12:43:42Z

    Fix

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88121/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    **[Test build #88121 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88121/testReport)** for PR 20694 at commit [`5aec68d`](https://github.com/apache/spark/commit/5aec68d1d0d97bdeaec4fbaca27d4db8618654b5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    **[Test build #88126 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88126/testReport)** for PR 20694 at commit [`5aec68d`](https://github.com/apache/spark/commit/5aec68d1d0d97bdeaec4fbaca27d4db8618654b5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with PlanTestBase `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    **[Test build #88126 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88126/testReport)** for PR 20694 at commit [`5aec68d`](https://github.com/apache/spark/commit/5aec68d1d0d97bdeaec4fbaca27d4db8618654b5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    LGTM except one minor comment in the test case. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20694#discussion_r171375731
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---
    @@ -680,4 +681,31 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
           )
         }
       }
    +
    +  test("from_json missing fields") {
    +    val conf = SQLConf.get
    +    for (forceJsonNullableSchema <- Seq(false, true)) {
    +      conf.setConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA, forceJsonNullableSchema)
    --- End diff --
    
    We have to revert the conflicts to the original value. 
    
    Thus, move this test to `org.apache.spark.sql.execution.datasources.json.JsonSuite`. Then, we can use `SQLConf`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88126/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    Thanks! Merged to master/2.3
    
    I resolved the style issues when I merged the code


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87776/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    **[Test build #87776 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87776/testReport)** for PR 20694 at commit [`1cd1919`](https://github.com/apache/spark/commit/1cd19196cf46e15aaf1636240d053996e623370b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    **[Test build #87776 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87776/testReport)** for PR 20694 at commit [`1cd1919`](https://github.com/apache/spark/commit/1cd19196cf46e15aaf1636240d053996e623370b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20694#discussion_r173584456
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala ---
    @@ -780,6 +781,25 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext {
           assert(option.compressionCodecClassName == "UNCOMPRESSED")
         }
       }
    +
    +  test("SPARK-23173 Writing a file with data converted from JSON with and incorrect user schema") {
    +    withTempPath { file =>
    +      val jsonData =
    +        """{
    +        |  "a": 1,
    +        |  "c": "foo"
    +        |}
    +        |"""
    --- End diff --
    
    The same here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20694#discussion_r173584348
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---
    @@ -680,4 +682,31 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
           )
         }
       }
    +
    +  test("from_json missing fields") {
    +    for (forceJsonNullableSchema <- Seq(false, true)) {
    +      withSQLConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA.key -> forceJsonNullableSchema.toString) {
    +        val input =
    +          """{
    +            |  "a": 1,
    +            |  "c": "foo"
    +            |}
    +            |"""
    +            .stripMargin
    --- End diff --
    
    Nit: the style. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    **[Test build #88121 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88121/testReport)** for PR 20694 at commit [`5aec68d`](https://github.com/apache/spark/commit/5aec68d1d0d97bdeaec4fbaca27d4db8618654b5).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with PlanTestBase `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20694#discussion_r173584330
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---
    @@ -680,4 +682,31 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
           )
         }
       }
    +
    +  test("from_json missing fields") {
    +    for (forceJsonNullableSchema <- Seq(false, true)) {
    +      withSQLConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA.key -> forceJsonNullableSchema.toString) {
    +        val input =
    +          """{
    +            |  "a": 1,
    +            |  "c": "foo"
    +            |}
    +            |"""
    +            .stripMargin
    +        val jsonSchema = new StructType()
    +          .add("a", LongType, nullable = false)
    +          .add("b", StringType, nullable = false)
    +          .add("c", StringType, nullable = false)
    +        val output = InternalRow(1L, null, UTF8String.fromString("foo"))
    +        checkEvaluation(
    +          JsonToStructs(jsonSchema, Map.empty, Literal.create(input, StringType), gmtId),
    +          output
    +        )
    +        val schema = JsonToStructs(jsonSchema, Map.empty, Literal.create(input, StringType), gmtId)
    +          .dataType
    +        val schemaToCompare = if (forceJsonNullableSchema) jsonSchema.asNullable else jsonSchema
    +        assert(schemaToCompare == schema);
    --- End diff --
    
    Nit: `;` is useless. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet files ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20694
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet...

Posted by mswit-databricks <gi...@git.apache.org>.
Github user mswit-databricks commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20694#discussion_r172142857
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---
    @@ -680,4 +681,31 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
           )
         }
       }
    +
    +  test("from_json missing fields") {
    +    val conf = SQLConf.get
    +    for (forceJsonNullableSchema <- Seq(false, true)) {
    +      conf.setConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA, forceJsonNullableSchema)
    --- End diff --
    
    I tried moving the test, but it is not as easy as it may seem. `org.apache.spark.sql.execution.datasources.json.JsonSuite` lacks `checkEvaluation` that I'm using here. In general, `org.apache.spark.sql.execution.datasources.json.JsonSuite` seems to be meant for things such as `spark.read.json()` and not the `from_json()` function, so the test would be misplaced there. I think that it's better to keep the test in `JsonExpressionsSuite` and revert the config.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20694: [SPARK-23173][SQL] Avoid creating corrupt parquet...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20694


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org