You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by MaxGekk <gi...@git.apache.org> on 2018/08/26 16:10:44 UTC

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/22237

    [SPARK-25243][SQL] Use FailureSafeParser in from_json

    ## What changes were proposed in this pull request?
    
    In the PR, I propose to switch `from_json` on `FailureSafeParser`, and to make the function compatible to `PERMISSIVE` mode by default, and to support the `FAILFAST` and `DROPMALFORMED` modes as a consequence.
    
    ## How was this patch tested?
    
    It was tested by existing `JsonSuite`/`CSVSuite`, `JsonFunctionsSuite` and `JsonExpressionsSuite` as well as new tests for `from_json` which checks different modes.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 from_json-failuresafe

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22237
    
----
commit d639cb353b8575b66700544898148e6d65595fae
Author: Maxim Gekk <ma...@...>
Date:   2018-08-26T13:24:51Z

    Moving FailureSafeParser to catalyst

commit e20615dbfafd03e44435257c57d5930e5aaef027
Author: Maxim Gekk <ma...@...>
Date:   2018-08-26T13:47:55Z

    Support any type by FailureSafeParser

commit 5574c61fe7a369c3b549342c84b601dd98794014
Author: Maxim Gekk <ma...@...>
Date:   2018-08-26T14:56:18Z

    Use FailSafeParser in from_json

commit 3fd9ba00a213851185b28bfa88634c43f9ed00bb
Author: Maxim Gekk <ma...@...>
Date:   2018-08-26T15:02:37Z

    Fix tests

commit 0d30d0c51fb1c09cdcede1111e57e6979e00d6c8
Author: Maxim Gekk <ma...@...>
Date:   2018-08-26T15:05:16Z

    Removing unused bind variable

commit 5bd5eb39b7b6964be025aee744e948027b15a062
Author: Maxim Gekk <ma...@...>
Date:   2018-08-26T15:28:23Z

    Adding tests for different modes

commit 0452a2ff96a6690d4e99cc0463974cc8b98c56b6
Author: Maxim Gekk <ma...@...>
Date:   2018-08-26T15:43:25Z

    Improve a test

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95760/testReport)** for PR 22237 at commit [`a433388`](https://github.com/apache/spark/commit/a433388ba21c0d9657a235040adc9261d8319125).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223912044
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---
    @@ -450,21 +456,24 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with
       test("from_json - input=array of single object, schema=struct, output=single row") {
         val input = """[{"a": 1}]"""
         val schema = StructType(StructField("a", IntegerType) :: Nil)
    -    val output = InternalRow(1)
    +    val output = InternalRow(null)
         checkEvaluation(JsonToStructs(schema, Map.empty, Literal(input), gmtId), output)
       }
     
    -  test("from_json - input=array, schema=struct, output=null") {
    +  test("from_json - input=array, schema=struct, output=single row") {
         val input = """[{"a": 1}, {"a": 2}]"""
    -    val schema = StructType(StructField("a", IntegerType) :: Nil)
    -    val output = null
    -    checkEvaluation(JsonToStructs(schema, Map.empty, Literal(input), gmtId), output)
    +    val corrupted = "corrupted"
    +    val schema = new StructType().add("a", IntegerType).add(corrupted, StringType)
    +    StructType(StructField("a", IntegerType) :: Nil)
    +    val output = InternalRow(null, UTF8String.fromString(input))
    +    val options = Map("columnNameOfCorruptRecord" -> corrupted)
    +    checkEvaluation(JsonToStructs(schema, options, Literal(input), gmtId), output)
       }
     
       test("from_json - input=empty array, schema=struct, output=null") {
    --- End diff --
    
    `output=single row`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22237#discussion_r222554954

--- Diff: docs/sql-programming-guide.md ---
@@ -1877,6 +1877,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see

# Migration Guide

+## Upgrading From Spark SQL 2.4 to 3.0
+
+ - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`. In version 2.4 and earlier, arrays of JSON objects are considered as invalid and converted to `null` if specified schema is `StructType`. Since Spark 3.0, the input is considered as a valid JSON array and only its first element is parsed if it conforms to the specified `StructType`.
+
--- End diff --

> Do we have a clear definition of the current behavior? It's important to let user know how the behavior changes.

I added new mode in which behavior is the same as current one.

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96200 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96200/testReport)** for PR 22237 at commit [`5884173`](https://github.com/apache/spark/commit/58841735e14fe334c779b83eb915bff3a4f900e6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r216444450
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
    @@ -469,4 +470,26 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
     
         checkAnswer(sql("""select json[0] from jsonTable"""), Seq(Row(null)))
       }
    +
    +  test("from_json invalid json - check modes") {
    +    val df = Seq("""{"a" 1}""", """{"a": 2}""").toDS()
    +    val schema = new StructType().add("a", IntegerType)
    +
    +    checkAnswer(
    +      df.select(from_json($"value", schema, Map("mode" -> "PERMISSIVE"))),
    +      Row(Row(null)) :: Row(Row(2)) :: Nil)
    +
    +    val exception1 = intercept[SparkException] {
    +      df.select(from_json($"value", schema, Map("mode" -> "FAILFAST"))).collect()
    +    }.getMessage
    +    assert(exception1.contains(
    +      "Malformed records are detected in record parsing. Parse Mode: FAILFAST."))
    +
    +    val exception2 = intercept[AnalysisException] {
    +      df.select(from_json($"value", schema, Map("mode" -> "DROPMALFORMED"))).collect()
    --- End diff --
    
    I replaced it by `AnalysisException` but I think it is wrong decision. Throwing of `AnalysisException` at run-time looks ugly:
    ```
    Caused by: org.apache.spark.sql.AnalysisException: from_json() doesn't support the DROPMALFORMED mode. Acceptable modes are PERMISSIVE and FAILFAST.;
    	at org.apache.spark.sql.catalyst.expressions.JsonToStructs.parser$lzycompute(jsonExpressions.scala:568)
    	at org.apache.spark.sql.catalyst.expressions.JsonToStructs.parser(jsonExpressions.scala:564)
    ...
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    ```
    I am going to replace it by something else or revert back to `IllegalArgumentException`. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r213885328
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,28 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
         case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
         case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
       }
     
    -  @transient
    -  lazy val parser =
    -    new JacksonParser(
    -      nullableSchema,
    -      new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
    +  @transient lazy val parser = {
    +    val parsedOptions = new JSONOptions(options, timeZoneId.get)
    +    val mode = parsedOptions.parseMode
    +    require(mode == PermissiveMode || mode == FailFastMode,
    --- End diff --
    
    I think we should move this verification into the constructor.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    The problem here is the difference between JSON functionality and JSON datasource parsing. The JSON function was first added in a way of mimicking PERMISSIVE mode but not exactly same.
    
    This PR proposes to match both. I haven't tested it by myself but isn't JSON datasource's same as the above?
    
    I actually was thinking about targeting 3.0 but looks now we are going ahead for 2.5. This bit is what I am less sure of for now actually.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95464/testReport)** for PR 22237 at commit [`fee61dc`](https://github.com/apache/spark/commit/fee61dc39d94278bcedab1fc36d538b54c91267a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95896/testReport)** for PR 22237 at commit [`26287a5`](https://github.com/apache/spark/commit/26287a5e428a7ed8319f8a08442e558c1c5ed513).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    > @MaxGekk is it difficult to add a conf?
    
    @HyukjinKwon I will try.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r224042803
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,30 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
    --- End diff --
    
    For now the case is more concrete, we return null if `Jackson` parser doesn't find any token in the input. Not sure, this detailed info about underlying problem can help users much more.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22237#discussion_r223254210

--- Diff: docs/sql-programming-guide.md ---
@@ -1890,6 +1890,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see

# Migration Guide

> Since Spark 3.0, the input is considered as a valid JSON array and only its first element is parsed if it conforms to the specified `StructType`.

Is this behavior same with reading json files?

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223708500
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,30 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
    --- End diff --
    
    so we may still return null in some cases, can you list them?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97958 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97958/testReport)** for PR 22237 at commit [`b2988c7`](https://github.com/apache/spark/commit/b2988c76456627e245bd8e157c76197fe4cc0ade).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97244 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97244/testReport)** for PR 22237 at commit [`d91f34f`](https://github.com/apache/spark/commit/d91f34fa3f227cef7d2c0d6e753854466b1da4ff).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223700226
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1890,6 +1890,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     
     # Migration Guide
     
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    --- End diff --
    
    shall we update the guide too?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Oh wait you left a sign-off. Let me rebase it within tomorrow - wouldn't be a big job.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    jenkins, retest this, please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223263164
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1890,6 +1890,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     
     # Migration Guide
     
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    +  - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`. In version 2.4 and earlier, arrays of JSON objects are considered as invalid and converted to `null` if specified schema is `StructType`. Since Spark 3.0, the input is considered as a valid JSON array and only its first element is parsed if it conforms to the specified `StructType`.
    --- End diff --
    
    I will check that.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96684/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Hi @MaxGekk ,
    I just reviewed this PR. I noticed that there is one behavior change. The column value of `from_json(corrupt_record...)` become `Row(null, nulll, ...)`, instead of `null`. 
    
    ``` 
    val df = Seq("""{"a" 1, "b": 2}""").toDS()
    val schema = new StructType().add("a", IntegerType).add("b", IntegerType)
    ```
    
    Before the code change:
    ```
    scala> df.select(from_json($"value", schema).as("col")).where("col is null").show()
    +----+
    | col|
    +----+
    |null|
    +----+
    
    scala> df.select(from_json($"value", schema).as("col")).where("col.a is null").show()
    +----+
    | col|
    +----+
    |null|
    +----+ 
    ```
    
    After the code change:
    ```
    scala> df.select(from_json($"value", schema).as("col")).where("col is null").show()
    +---+
    |col|
    +---+
    +---+
    
    
    scala> df.select(from_json($"value", schema).as("col")).where("col.a is null").show()
    +---+
    |col|
    +---+
    |[,]|
    +---+
    ```
    
    The main difference is that we can't filter the null `col` in the result column. Is there any reason for changing this?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r220788994
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1879,6 +1879,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     
     # Migration Guide
     
    +## Upgrading From Spark SQL 2.4 to 2.5
    +
    +  - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`. In version 2.4 and earlier, arrays of JSON objects are considered as invalid and converted to `null` if specified schema is `StructType`. Since Spark 3.0, the input is considered as a valid JSON array and only its first element is parsed if it conforms to the specified `StructType`.
    --- End diff --
    
    `Since Spark 3.0` -> `Since Spark 2.5` :-)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r218584877
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1877,6 +1877,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     
     # Migration Guide
     
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    +  - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records.
    --- End diff --
    
    Sure, I added a couple sentences. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95266 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95266/testReport)** for PR 22237 at commit [`0452a2f`](https://github.com/apache/spark/commit/0452a2ff96a6690d4e99cc0463974cc8b98c56b6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r224069683
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala ---
    @@ -15,50 +15,57 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.datasources
    +package org.apache.spark.sql.catalyst.util
     
     import org.apache.spark.SparkException
     import org.apache.spark.sql.catalyst.InternalRow
     import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    -import org.apache.spark.sql.catalyst.util._
    -import org.apache.spark.sql.internal.SQLConf
    -import org.apache.spark.sql.types.StructType
    +import org.apache.spark.sql.types.{DataType, StructType}
     import org.apache.spark.unsafe.types.UTF8String
     
     class FailureSafeParser[IN](
    --- End diff --
    
    It's not about the real business logic, but an idea to make the code simpler.
    ```
    val parserSchema = nullableSchema match {
      case s: StructType => s
      case other => new StructType("value", other)
    }
    new FailureSafeParser[UTF8String](
          input => rawParser.parse(input, createParser, identity[UTF8String]),
          mode,
          parserSchema,
          parsedOptions.columnNameOfCorruptRecord,
          parsedOptions.multiLine)
    ```
    
    Then we don't need to change `FailureSafeParser`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r220908825
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala ---
    @@ -15,50 +15,51 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.datasources
    +package org.apache.spark.sql.catalyst.util
     
     import org.apache.spark.SparkException
     import org.apache.spark.sql.catalyst.InternalRow
     import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    -import org.apache.spark.sql.catalyst.util._
    -import org.apache.spark.sql.internal.SQLConf
    -import org.apache.spark.sql.types.StructType
    +import org.apache.spark.sql.types.{DataType, StructType}
     import org.apache.spark.unsafe.types.UTF8String
     
     class FailureSafeParser[IN](
         rawParser: IN => Seq[InternalRow],
         mode: ParseMode,
    -    schema: StructType,
    +    dataType: DataType,
         columnNameOfCorruptRecord: String,
         isMultiLine: Boolean) {
    -
    -  private val corruptFieldIndex = schema.getFieldIndex(columnNameOfCorruptRecord)
    -  private val actualSchema = StructType(schema.filterNot(_.name == columnNameOfCorruptRecord))
    -  private val resultRow = new GenericInternalRow(schema.length)
    -  private val nullResult = new GenericInternalRow(schema.length)
    -
       // This function takes 2 parameters: an optional partial result, and the bad record. If the given
       // schema doesn't contain a field for corrupted record, we just return the partial result or a
       // row with all fields null. If the given schema contains a field for corrupted record, we will
       // set the bad record to this field, and set other fields according to the partial result or null.
    -  private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = {
    --- End diff --
    
    Just to make the review easier, backporting easier, and keep the original author of the codes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r220401206
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,30 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
         case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
         case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
       }
     
    -  @transient
    -  lazy val parser =
    -    new JacksonParser(
    -      nullableSchema,
    -      new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
    +  @transient lazy val parser = {
    +    val parsedOptions = new JSONOptions(options, timeZoneId.get)
    +    val mode = parsedOptions.parseMode
    +    if (mode != PermissiveMode && mode != FailFastMode) {
    --- End diff --
    
    since we only support 2 modes, how about we use a boolean option? e.g. `failFast`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223836030
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,30 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
    --- End diff --
    
    We shouldn't return `null`. I will replace the `null` by exceptions like I did for `from_csv`: https://github.com/apache/spark/pull/22379/files#diff-5321c01e95bffc4413c5f3457696213eR83


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22237#discussion_r214302343

--- Diff: docs/sql-programming-guide.md ---
@@ -1897,6 +1897,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
- In version 2.3 and earlier, CSV rows are considered as malformed if at least one column value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered as malformed only when it contains malformed column values requested from CSV datasource, other values can be ignored. As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
- Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
- Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
+ - Since Spark 2.4, the from_json functions supports two modes - PERMISSIVE and FAILFAST. The modes can be set via the `mode` option. The default mode became PERMISSIVE. In previous versions, behavior of from_json did not conform to either PERMISSIVE nor FAILFAST, especially in processing of malformed JSON records.
--- End diff --

nit: from_json -> `` `from_json` ``.

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r222924928
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala ---
    @@ -51,6 +56,8 @@ object ParseMode extends Logging {
         case PermissiveMode.name => PermissiveMode
         case DropMalformedMode.name => DropMalformedMode
         case FailFastMode.name => FailFastMode
    +    case NullMalformedMode.name => NullMalformedMode
    --- End diff --
    
    After looking at it again, the `mode` option is already there, but `from_json` ignores it.
    
    Now this PR looks like a bug fix to me. I'm wondering if we do need to keep the previous behavior. But if we do, I think using a SQL config is more reasonable than adding a new mode.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96772 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96772/testReport)** for PR 22237 at commit [`1f052c4`](https://github.com/apache/spark/commit/1f052c47276193280bfc08028b65e32c81f0c022).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95436/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r222811744
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -550,59 +550,93 @@ case class JsonToStructs(
           s"Input schema ${nullableSchema.catalogString} must be a struct, an array or a map.")
       }
     
    -  // This converts parsed rows to the desired output by the given schema.
    -  @transient
    -  lazy val converter = nullableSchema match {
    -    case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    -    case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    -    case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +  abstract class RowParser {
    --- End diff --
    
    Sure, I just wasn't going to mix it to other classes, and sub-classes should extend only `RowParser`.  


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96905/testReport)** for PR 22237 at commit [`ee52652`](https://github.com/apache/spark/commit/ee52652e1eb921b764037f16d2b773a1089e01b8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97016 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97016/testReport)** for PR 22237 at commit [`69bbfef`](https://github.com/apache/spark/commit/69bbfef29963641f882b428d4f248da20995e79d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97958 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97958/testReport)** for PR 22237 at commit [`b2988c7`](https://github.com/apache/spark/commit/b2988c76456627e245bd8e157c76197fe4cc0ade).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97198 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97198/testReport)** for PR 22237 at commit [`910f1c5`](https://github.com/apache/spark/commit/910f1c5d14ed9e1d7de5c33433f2fe03f3a4256c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96200 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96200/testReport)** for PR 22237 at commit [`5884173`](https://github.com/apache/spark/commit/58841735e14fe334c779b83eb915bff3a4f900e6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r222814860
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala ---
    @@ -51,6 +56,8 @@ object ParseMode extends Logging {
         case PermissiveMode.name => PermissiveMode
         case DropMalformedMode.name => DropMalformedMode
         case FailFastMode.name => FailFastMode
    +    case NullMalformedMode.name => NullMalformedMode
    --- End diff --
    
    A separate boolean flag to JSONOptions or SQL config?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    @HyukjinKwon Please, take a look at the PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97244/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96905/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223459186
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1890,6 +1890,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     
     # Migration Guide
     
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    +  - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`. In version 2.4 and earlier, arrays of JSON objects are considered as invalid and converted to `null` if specified schema is `StructType`. Since Spark 3.0, the input is considered as a valid JSON array and only its first element is parsed if it conforms to the specified `StructType`.
    --- End diff --
    
    This is the case when an user provided `StructType` schema but we observe an `array` in JSON input. So, JSON datasource returns row per each struct in the array. Currently `from_json` returns `null` in the case. With this PR, `from_json` returns one row with the first element of input array.
    
    Because of we cannot return multiple rows from a functions, so we have the following options:
    - return `null` but this will be the first case when for not `null` input we return `null` (this current approach before the PR)
    - return a row with one element from the input array (this PR proposes that)
    - throw an exception which is not nice option in the `PERMISSIVE` mode
    - throw `BadRecordException` internally, and return `Row(null, null, ..., null)` in the `PERMISSIVE` mode or an exception in `FAILFAST`.
    
    It seems the last option is more attractive than other. WDYT?
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r222942666
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala ---
    @@ -51,6 +56,8 @@ object ParseMode extends Logging {
         case PermissiveMode.name => PermissiveMode
         case DropMalformedMode.name => DropMalformedMode
         case FailFastMode.name => FailFastMode
    +    case NullMalformedMode.name => NullMalformedMode
    --- End diff --
    
    ok. I will revert the last changes back.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223703462
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -90,6 +91,10 @@ class JacksonParser(
             // in such an array as a row, this case is possible.
             if (array.numElements() == 0) {
               Nil
    +        } else if (array.numElements() > 1 && !explodeArray) {
    --- End diff --
    
    shall we forbid array as struct type completely for `from_json`?
    
    BTW I think this feature is not needed now, since we can directly read top-level json arrays as array type. We may still need to keep it when reading json files, but I think it's ok to drop it in `from_json`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95464/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95532 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95532/testReport)** for PR 22237 at commit [`1cf4213`](https://github.com/apache/spark/commit/1cf421344ad0fb1a9e3e7cc88defcda72126a39d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223849217
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,30 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
    --- End diff --
    
    Actually I am wrong we return empty iterator and as the consequence of that null in the case if there is no input tokens, there https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L398
    
    We can throw `BadRecordException` instead of Nil but this will change behavior of JSON/CSV datasources.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r218000227
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1877,6 +1877,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     
     # Migration Guide
     
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    +  - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records.
    --- End diff --
    
    Can we briefly explain the difference between updated behavior and previous one?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97198 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97198/testReport)** for PR 22237 at commit [`910f1c5`](https://github.com/apache/spark/commit/910f1c5d14ed9e1d7de5c33433f2fe03f3a4256c).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97244/testReport)** for PR 22237 at commit [`d91f34f`](https://github.com/apache/spark/commit/d91f34fa3f227cef7d2c0d6e753854466b1da4ff).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95532 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95532/testReport)** for PR 22237 at commit [`1cf4213`](https://github.com/apache/spark/commit/1cf421344ad0fb1a9e3e7cc88defcda72126a39d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r213003614
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
    @@ -469,4 +470,23 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
     
         checkAnswer(sql("""select json[0] from jsonTable"""), Seq(Row(null)))
       }
    +
    +  test("from_json invalid json - check modes") {
    +    val df = Seq("""{"a" 1}""", """{"a": 2}""").toDS()
    +    val schema = new StructType().add("a", IntegerType)
    +
    +    checkAnswer(
    +      df.select(from_json($"value", schema, Map("mode" -> "PERMISSIVE"))),
    +      Row(Row(null)) :: Row(Row(2)) :: Nil)
    +
    +    val exceptionOne = intercept[SparkException] {
    +      df.select(from_json($"value", schema, Map("mode" -> "FAILFAST"))).collect()
    +    }.getMessage
    +    assert(exceptionOne.contains(
    +      "Malformed records are detected in record parsing. Parse Mode: FAILFAST."))
    +
    +    checkAnswer(
    +      df.select(from_json($"value", schema, Map("mode" -> "DROPMALFORMED"))),
    +      Row(null) :: Row(Row(2)) :: Nil)
    --- End diff --
    
    Nope, only possibility I raised was to make it generator expression. I haven't proposed a parse mode for this reason so far.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96693 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96693/testReport)** for PR 22237 at commit [`939e220`](https://github.com/apache/spark/commit/939e220475b2c33ab63df117ee6aec1ed4afdd2b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97958/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95436 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95436/testReport)** for PR 22237 at commit [`b172e8e`](https://github.com/apache/spark/commit/b172e8ec707a049b89185edca37ce30b1b69ecc7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97198/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Thanks all!!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95378 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95378/testReport)** for PR 22237 at commit [`c81e9ae`](https://github.com/apache/spark/commit/c81e9ae3fc763b0361a1d99ef3b600302a68adea).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Seems okay.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97174 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97174/testReport)** for PR 22237 at commit [`134cbae`](https://github.com/apache/spark/commit/134cbaea885cf913dc732e96a494d5f8e6b0d5f6).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96205/testReport)** for PR 22237 at commit [`5884173`](https://github.com/apache/spark/commit/58841735e14fe334c779b83eb915bff3a4f900e6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223545341
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1890,6 +1890,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     
     # Migration Guide
     
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    +  - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`. In version 2.4 and earlier, arrays of JSON objects are considered as invalid and converted to `null` if specified schema is `StructType`. Since Spark 3.0, the input is considered as a valid JSON array and only its first element is parsed if it conforms to the specified `StructType`.
    --- End diff --
    
    the last option LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95867/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96992 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96992/testReport)** for PR 22237 at commit [`9afd45c`](https://github.com/apache/spark/commit/9afd45caf0368fff0cd4a43a1fe731d99a3f093e).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r234792965
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -1694,7 +1694,7 @@ test_that("column functions", {
       df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
       schema2 <- structType(structField("date", "date"))
       s <- collect(select(df, from_json(df$col, schema2)))
    -  expect_equal(s[[1]][[1]], NA)
    +  expect_equal(s[[1]][[1]]$date, NA)
    --- End diff --
    
    Do you mean this particular line or in general?
    
    This line was changed because in the `PERMISSIVE` mode we usually return a `Row` with null fields that we wasn't able to parse instead of just `null` for whole row.
    
    In general, to fully support the `PERMISSIVE` mode without any excuses when uniVocity parser cannot detect any JSON tokens on root level. We switched to `FailureSafeParser` in `from_json` and `PERMISSIVE` as the default mode, recently there #22237. Previously `from_json` didn't support any modes comparing to JSON datasource. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r220789449
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala ---
    @@ -15,50 +15,51 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.datasources
    +package org.apache.spark.sql.catalyst.util
     
     import org.apache.spark.SparkException
     import org.apache.spark.sql.catalyst.InternalRow
     import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    -import org.apache.spark.sql.catalyst.util._
    -import org.apache.spark.sql.internal.SQLConf
    -import org.apache.spark.sql.types.StructType
    +import org.apache.spark.sql.types.{DataType, StructType}
     import org.apache.spark.unsafe.types.UTF8String
     
     class FailureSafeParser[IN](
         rawParser: IN => Seq[InternalRow],
         mode: ParseMode,
    -    schema: StructType,
    +    dataType: DataType,
         columnNameOfCorruptRecord: String,
         isMultiLine: Boolean) {
    -
    -  private val corruptFieldIndex = schema.getFieldIndex(columnNameOfCorruptRecord)
    -  private val actualSchema = StructType(schema.filterNot(_.name == columnNameOfCorruptRecord))
    -  private val resultRow = new GenericInternalRow(schema.length)
    -  private val nullResult = new GenericInternalRow(schema.length)
    -
       // This function takes 2 parameters: an optional partial result, and the bad record. If the given
       // schema doesn't contain a field for corrupted record, we just return the partial result or a
       // row with all fields null. If the given schema contains a field for corrupted record, we will
       // set the bad record to this field, and set other fields according to the partial result or null.
    -  private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = {
    --- End diff --
    
    Looks we keep the original code and add one case for non struct type from a cursory look. Can we make the diff small?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95267/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r212925256
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
    @@ -469,4 +470,23 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
     
         checkAnswer(sql("""select json[0] from jsonTable"""), Seq(Row(null)))
       }
    +
    +  test("from_json invalid json - check modes") {
    +    val df = Seq("""{"a" 1}""", """{"a": 2}""").toDS()
    +    val schema = new StructType().add("a", IntegerType)
    +
    +    checkAnswer(
    +      df.select(from_json($"value", schema, Map("mode" -> "PERMISSIVE"))),
    +      Row(Row(null)) :: Row(Row(2)) :: Nil)
    +
    +    val exceptionOne = intercept[SparkException] {
    +      df.select(from_json($"value", schema, Map("mode" -> "FAILFAST"))).collect()
    --- End diff --
    
    Behavior of `JsonToStructs` is pretty close to `PERMISSIVE` actually. I have to make just a few small changes in tests that checks processing malformed inputs. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96205/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96772 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96772/testReport)** for PR 22237 at commit [`1f052c4`](https://github.com/apache/spark/commit/1f052c47276193280bfc08028b65e32c81f0c022).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22237#discussion_r220401132

--- Diff: docs/sql-programming-guide.md ---
@@ -1877,6 +1877,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see

# Migration Guide

> In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records.

Do we have a clear definition of the current behavior? It's important to let user know how the behavior changes.

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95267 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95267/testReport)** for PR 22237 at commit [`0452a2f`](https://github.com/apache/spark/commit/0452a2ff96a6690d4e99cc0463974cc8b98c56b6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96205 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96205/testReport)** for PR 22237 at commit [`5884173`](https://github.com/apache/spark/commit/58841735e14fe334c779b83eb915bff3a4f900e6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r218003046
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---
    @@ -450,7 +450,7 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with
       test("from_json - input=array, schema=struct, output=null") {
         val input = """[{"a": 1}, {"a": 2}]"""
         val schema = StructType(StructField("a", IntegerType) :: Nil)
    -    val output = null
    +    val output = InternalRow(1)
         checkEvaluation(JsonToStructs(schema, Map.empty, Literal(input), gmtId), output)
    --- End diff --
    
    This looks a behavior change too. Shall we also note this in migration guide?
    
    The test name is not correct for now too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r212928747
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,22 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
         case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
         case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
       }
     
    -  @transient
    -  lazy val parser =
    -    new JacksonParser(
    -      nullableSchema,
    -      new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
    +  @transient lazy val parsedOptions = new JSONOptions(options, timeZoneId.get)
    +  @transient lazy val rawParser = new JacksonParser(nullableSchema, parsedOptions)
    +  @transient lazy val createParser = CreateJacksonParser.utf8String _
    +  @transient lazy val parser = new FailureSafeParser[UTF8String](
    +    input => rawParser.parse(input, createParser, identity[UTF8String]),
    +    parsedOptions.parseMode,
    --- End diff --
    
    It is not handled by `JacksonParser`, and the behavior is somehow similar to `PermissiveMode` as @HyukjinKwon pointed out at https://github.com/apache/spark/pull/22237/files#r212850156, but not exactly the same. 
    
    Seems now the `PermissiveMode` on `FailureSafeParser` has different result on corrupted records. I noticed that some existing tests maybe changed due to that.
    
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95867/testReport)** for PR 22237 at commit [`a433388`](https://github.com/apache/spark/commit/a433388ba21c0d9657a235040adc9261d8319125).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    @HyukjinKwon Any chance the PR will be merged in the near future or I should close it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96629/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95541/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    If we can finish it before the code freeze, it will be 2.4; otherwise it is 3.0


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95541 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95541/testReport)** for PR 22237 at commit [`9ad834f`](https://github.com/apache/spark/commit/9ad834f2579315d8ca1bad79db26fa61216b9281).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95896 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95896/testReport)** for PR 22237 at commit [`26287a5`](https://github.com/apache/spark/commit/26287a5e428a7ed8319f8a08442e558c1c5ed513).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95436 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95436/testReport)** for PR 22237 at commit [`b172e8e`](https://github.com/apache/spark/commit/b172e8ec707a049b89185edca37ce30b1b69ecc7).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96992/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    LGTM, pending jenkins.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r218574370
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---
    @@ -402,13 +402,13 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with
         val schema = StructType(StructField("a", IntegerType) :: Nil)
         checkEvaluation(
           JsonToStructs(schema, Map.empty, Literal(jsonData), gmtId),
    -      null
    +      InternalRow(null)
         )
     
         // Other modes should still return `null`.
         checkEvaluation(
           JsonToStructs(schema, Map("mode" -> PermissiveMode.name), Literal(jsonData), gmtId),
    --- End diff --
    
    Changed to `FailFastMode.name`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96693 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96693/testReport)** for PR 22237 at commit [`939e220`](https://github.com/apache/spark/commit/939e220475b2c33ab63df117ee6aec1ed4afdd2b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r213885777
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,28 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
         case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
         case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
       }
     
    -  @transient
    -  lazy val parser =
    -    new JacksonParser(
    -      nullableSchema,
    -      new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
    +  @transient lazy val parser = {
    +    val parsedOptions = new JSONOptions(options, timeZoneId.get)
    +    val mode = parsedOptions.parseMode
    +    require(mode == PermissiveMode || mode == FailFastMode,
    --- End diff --
    
    Also, can we use `AnalysisException` instead of `require`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96684 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96684/testReport)** for PR 22237 at commit [`4e196f6`](https://github.com/apache/spark/commit/4e196f654b3b921787ca5a936f0a5635d26d75d4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r212849989
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
    @@ -469,4 +470,23 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
     
         checkAnswer(sql("""select json[0] from jsonTable"""), Seq(Row(null)))
       }
    +
    +  test("from_json invalid json - check modes") {
    +    val df = Seq("""{"a" 1}""", """{"a": 2}""").toDS()
    +    val schema = new StructType().add("a", IntegerType)
    +
    +    checkAnswer(
    +      df.select(from_json($"value", schema, Map("mode" -> "PERMISSIVE"))),
    +      Row(Row(null)) :: Row(Row(2)) :: Nil)
    +
    +    val exceptionOne = intercept[SparkException] {
    +      df.select(from_json($"value", schema, Map("mode" -> "FAILFAST"))).collect()
    +    }.getMessage
    +    assert(exceptionOne.contains(
    +      "Malformed records are detected in record parsing. Parse Mode: FAILFAST."))
    +
    +    checkAnswer(
    +      df.select(from_json($"value", schema, Map("mode" -> "DROPMALFORMED"))),
    +      Row(null) :: Row(Row(2)) :: Nil)
    --- End diff --
    
    How does it work for DROPMALFORMED mode? This doesn't actually drop the record like JSON datasource.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r213980110
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,28 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
         case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
         case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
       }
     
    -  @transient
    -  lazy val parser =
    -    new JacksonParser(
    -      nullableSchema,
    -      new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
    +  @transient lazy val parser = {
    +    val parsedOptions = new JSONOptions(options, timeZoneId.get)
    +    val mode = parsedOptions.parseMode
    +    require(mode == PermissiveMode || mode == FailFastMode,
    --- End diff --
    
    ok, thansk!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r220793666
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,30 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
         case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
         case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
       }
     
    -  @transient
    -  lazy val parser =
    -    new JacksonParser(
    -      nullableSchema,
    -      new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
    +  @transient lazy val parser = {
    +    val parsedOptions = new JSONOptions(options, timeZoneId.get)
    +    val mode = parsedOptions.parseMode
    +    if (mode != PermissiveMode && mode != FailFastMode) {
    --- End diff --
    
    ah i see. If the `mode` option already exist, let's keep it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95760/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97157 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97157/testReport)** for PR 22237 at commit [`c7ea45f`](https://github.com/apache/spark/commit/c7ea45fd780381c6c81f464398d80e9b6f686309).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223703894
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -90,6 +91,10 @@ class JacksonParser(
             // in such an array as a row, this case is possible.
             if (array.numElements() == 0) {
               Nil
    +        } else if (array.numElements() > 1 && !explodeArray) {
    --- End diff --
    
    the new parameter can be named as `allowArrayAsStructs`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223911576
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,30 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
    --- End diff --
    
    we don't have to do this PR, but it would be great to document when this expression will return null, in the class doc of `JsonToStructs`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r222653808
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala ---
    @@ -51,6 +56,8 @@ object ParseMode extends Logging {
         case PermissiveMode.name => PermissiveMode
         case DropMalformedMode.name => DropMalformedMode
         case FailFastMode.name => FailFastMode
    +    case NullMalformedMode.name => NullMalformedMode
    --- End diff --
    
    I think we should only add the new mode for `from_json`, not the entire json data source?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r218585584
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---
    @@ -450,7 +450,7 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with
       test("from_json - input=array, schema=struct, output=null") {
         val input = """[{"a": 1}, {"a": 2}]"""
         val schema = StructType(StructField("a", IntegerType) :: Nil)
    -    val output = null
    +    val output = InternalRow(1)
         checkEvaluation(JsonToStructs(schema, Map.empty, Literal(input), gmtId), output)
    --- End diff --
    
    I updated the migration guide and test's title


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r212887790
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,22 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
         case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
         case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
       }
     
    -  @transient
    -  lazy val parser =
    -    new JacksonParser(
    -      nullableSchema,
    -      new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
    +  @transient lazy val parsedOptions = new JSONOptions(options, timeZoneId.get)
    +  @transient lazy val rawParser = new JacksonParser(nullableSchema, parsedOptions)
    --- End diff --
    
    How about this?
    ```
    
      @transient lazy val parser = {
        val parsedOptions = new JSONOptions(options, timeZoneId.get)
        val rawParser = new JacksonParser(nullableSchema, parsedOptions)
        val createParser = CreateJacksonParser.utf8String _
        new FailureSafeParser[UTF8String](
          input => rawParser.parse(input, createParser, identity[UTF8String]),
          parsedOptions.parseMode,
          schema,
          parsedOptions.columnNameOfCorruptRecord,
          parsedOptions.multiLine)
      }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96200/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95532/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97037 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97037/testReport)** for PR 22237 at commit [`69bbfef`](https://github.com/apache/spark/commit/69bbfef29963641f882b428d4f248da20995e79d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95760 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95760/testReport)** for PR 22237 at commit [`a433388`](https://github.com/apache/spark/commit/a433388ba21c0d9657a235040adc9261d8319125).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r222651484
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -550,59 +550,93 @@ case class JsonToStructs(
           s"Input schema ${nullableSchema.catalogString} must be a struct, an array or a map.")
       }
     
    -  // This converts parsed rows to the desired output by the given schema.
    -  @transient
    -  lazy val converter = nullableSchema match {
    -    case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    -    case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    -    case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +  abstract class RowParser {
    --- End diff --
    
    can it be `trait`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95378/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    @HyukjinKwon are you working on it? @gengliangwang do you want to take over?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223696096
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
    @@ -547,4 +548,27 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
               Map("pretty" -> "true"))),
           Seq(Row(expected)))
       }
    +
    +  test("from_json invalid json - check modes") {
    +    val df = Seq("""{"a" 1}""", """{"a": 2}""").toDS()
    +    val schema = new StructType().add("a", IntegerType)
    +
    +    checkAnswer(
    +      df.select(from_json($"value", schema, Map("mode" -> "PERMISSIVE"))),
    --- End diff --
    
    @MaxGekk, can we add a test with manual malformed column in the schema? For instance, 
    
    https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L1125-L1133


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95269/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    @HyukjinKwon I re-targeted the changes for Spark 3.0. Please, take a look at it one more time.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r222812531
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala ---
    @@ -51,6 +56,8 @@ object ParseMode extends Logging {
         case PermissiveMode.name => PermissiveMode
         case DropMalformedMode.name => DropMalformedMode
         case FailFastMode.name => FailFastMode
    +    case NullMalformedMode.name => NullMalformedMode
    --- End diff --
    
    I would add it to `from_json` but the function shares the same common classes as JSON data source, and the mode is used in the common classes. Any ideas how to restrict the mode value to `from_json` only?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r212922787
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,22 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
         case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
         case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
       }
     
    -  @transient
    -  lazy val parser =
    -    new JacksonParser(
    -      nullableSchema,
    -      new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
    +  @transient lazy val parsedOptions = new JSONOptions(options, timeZoneId.get)
    +  @transient lazy val rawParser = new JacksonParser(nullableSchema, parsedOptions)
    +  @transient lazy val createParser = CreateJacksonParser.utf8String _
    +  @transient lazy val parser = new FailureSafeParser[UTF8String](
    +    input => rawParser.parse(input, createParser, identity[UTF8String]),
    +    parsedOptions.parseMode,
    --- End diff --
    
    Previous settings of `FailFastMode` didn't impact on the behavior because the `mode` option wasn't handled at all. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r224042251
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,30 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
    --- End diff --
    
    > we don't have to do it in this PR, but it would be great to document when this expression will return null ...
    
    We already state in the docs for `from_json()`:  `Returns null, in the case of an unparseable string.`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97037/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    If @gengliangwang find some time to work on this, yea please go ahead.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96993 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96993/testReport)** for PR 22237 at commit [`d3a36d5`](https://github.com/apache/spark/commit/d3a36d5625355d9df0d14cd98aeaea25751b24fe).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95541 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95541/testReport)** for PR 22237 at commit [`9ad834f`](https://github.com/apache/spark/commit/9ad834f2579315d8ca1bad79db26fa61216b9281).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97157/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    I think one thing we could do this for now is, only to support both FAILFAST and PERMISSIVE mode (and throws an exception otherwise, to match the current behaviour to PERMISSIVE mode, explain that in the release note.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r212924389
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
    @@ -469,4 +470,23 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
     
         checkAnswer(sql("""select json[0] from jsonTable"""), Seq(Row(null)))
       }
    +
    +  test("from_json invalid json - check modes") {
    +    val df = Seq("""{"a" 1}""", """{"a": 2}""").toDS()
    +    val schema = new StructType().add("a", IntegerType)
    +
    +    checkAnswer(
    +      df.select(from_json($"value", schema, Map("mode" -> "PERMISSIVE"))),
    +      Row(Row(null)) :: Row(Row(2)) :: Nil)
    +
    +    val exceptionOne = intercept[SparkException] {
    +      df.select(from_json($"value", schema, Map("mode" -> "FAILFAST"))).collect()
    +    }.getMessage
    +    assert(exceptionOne.contains(
    +      "Malformed records are detected in record parsing. Parse Mode: FAILFAST."))
    +
    +    checkAnswer(
    +      df.select(from_json($"value", schema, Map("mode" -> "DROPMALFORMED"))),
    +      Row(null) :: Row(Row(2)) :: Nil)
    --- End diff --
    
    The `DROPMALFORMED` mode returns `null` for malformed JSON lines. User can filter them out later. @HyukjinKwon Do you know how to drop rows in `UnaryExpression`s?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    @MaxGekk is it difficult to add a conf?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97174/testReport)** for PR 22237 at commit [`134cbae`](https://github.com/apache/spark/commit/134cbaea885cf913dc732e96a494d5f8e6b0d5f6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22237
  
     https://github.com/apache/spark/pull/22237/files#r223707899 makes sense to me. Addressed. LGTM from my side as well


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97966/testReport)** for PR 22237 at commit [`b2988c7`](https://github.com/apache/spark/commit/b2988c76456627e245bd8e157c76197fe4cc0ade).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r220400637
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1877,6 +1877,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     
     # Migration Guide
     
    +## Upgrading From Spark SQL 2.4 to 3.0
    --- End diff --
    
    2.5


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r220668461
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,30 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
         case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
         case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
       }
     
    -  @transient
    -  lazy val parser =
    -    new JacksonParser(
    -      nullableSchema,
    -      new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
    +  @transient lazy val parser = {
    +    val parsedOptions = new JSONOptions(options, timeZoneId.get)
    +    val mode = parsedOptions.parseMode
    +    if (mode != PermissiveMode && mode != FailFastMode) {
    --- End diff --
    
    The `JSONOptions` is shared among build-in json functions like `from_json` and JSON datasource. And the formal one use 3 modes - `FAILFAST`, `DROPMALFORMED` and `PERMISSIVE`. I am not sure how the `mode` mode can be replaced. The approach that I could image is to inherit from `JSONOptions` and add new `val`. The `mode` itself cannot be removed because it is used in `FailureSafeParser` for example, in particular `DropMalformedMode` is handled explicitly.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    > to match the current behaviour to PERMISSIVE mode, explain that in the migration guide.
    
    @HyukjinKwon Should I target to Spark 3.0 or 2.4?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95867/testReport)** for PR 22237 at commit [`a433388`](https://github.com/apache/spark/commit/a433388ba21c0d9657a235040adc9261d8319125).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22237#discussion_r220400809

--- Diff: docs/sql-programming-guide.md ---
@@ -1877,6 +1877,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see

# Migration Guide

> In version 2.4 and earlier, arrays of JSON objects are considered as invalid and converted to `null` if specified schema is `StructType`. Since Spark 3.0, the input is considered as a valid JSON array and only its first element is parsed if it conforms to the specified `StructType`.

Why this change?

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96629/testReport)** for PR 22237 at commit [`4e196f6`](https://github.com/apache/spark/commit/4e196f654b3b921787ca5a936f0a5635d26d75d4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223707899
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala ---
    @@ -15,50 +15,57 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.datasources
    +package org.apache.spark.sql.catalyst.util
     
     import org.apache.spark.SparkException
     import org.apache.spark.sql.catalyst.InternalRow
     import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    -import org.apache.spark.sql.catalyst.util._
    -import org.apache.spark.sql.internal.SQLConf
    -import org.apache.spark.sql.types.StructType
    +import org.apache.spark.sql.types.{DataType, StructType}
     import org.apache.spark.unsafe.types.UTF8String
     
     class FailureSafeParser[IN](
    --- End diff --
    
    instead of changing this file, can we update `JsonToStructs.parser`, to wrap array or map with a struct before creating `FailureSafeParser`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    jenkins, retest this, please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95269 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95269/testReport)** for PR 22237 at commit [`47dbe6b`](https://github.com/apache/spark/commit/47dbe6bc903aa26672d47a4ac2c84d6f84458a98).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95266 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95266/testReport)** for PR 22237 at commit [`0452a2f`](https://github.com/apache/spark/commit/0452a2ff96a6690d4e99cc0463974cc8b98c56b6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223144685
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1879,6 +1879,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     
     # Migration Guide
     
    +## Upgrading From Spark SQL 2.4 to 2.5
    +
    +  - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`. In version 2.4 and earlier, arrays of JSON objects are considered as invalid and converted to `null` if specified schema is `StructType`. Since Spark 3.0, the input is considered as a valid JSON array and only its first element is parsed if it conforms to the specified `StructType`.
    --- End diff --
    
    Reverted back to 3.0


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96873/testReport)** for PR 22237 at commit [`7e7af88`](https://github.com/apache/spark/commit/7e7af88cba38d8f5c1dbb9c1fc8204cada9d0f69).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class YarnShuffleServiceMetrics implements MetricsSource `
      * `case class SchemaOfJson(`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r224044435
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala ---
    @@ -15,50 +15,57 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.datasources
    +package org.apache.spark.sql.catalyst.util
     
     import org.apache.spark.SparkException
     import org.apache.spark.sql.catalyst.InternalRow
     import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    -import org.apache.spark.sql.catalyst.util._
    -import org.apache.spark.sql.internal.SQLConf
    -import org.apache.spark.sql.types.StructType
    +import org.apache.spark.sql.types.{DataType, StructType}
     import org.apache.spark.unsafe.types.UTF8String
     
     class FailureSafeParser[IN](
    --- End diff --
    
    Frankly speaking I don't fully understand the idea. Let's look at an example. We should parser JSON arrays (one array per row) like:
    ```
    [1, 2, 3]
    [4, 5]
    ```
    and an user provided the schema `ArrayType(IntegerType, true)`. So, you propose to wrap the array by `StructType(Seq(StructField(ArrayType(IntegerType, ...))))`, right? And use the code inside of `JacksonParser` which we disabled by `allowArrayAsStructs` for now? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96905/testReport)** for PR 22237 at commit [`ee52652`](https://github.com/apache/spark/commit/ee52652e1eb921b764037f16d2b773a1089e01b8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22237#discussion_r220486164

--- Diff: docs/sql-programming-guide.md ---
@@ -1877,6 +1877,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see

# Migration Guide

We have only particular cases when existing tests are failing in the `PERMISSIVE` mode. I tried to describe the cases here in the migration guide.

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97016 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97016/testReport)** for PR 22237 at commit [`69bbfef`](https://github.com/apache/spark/commit/69bbfef29963641f882b428d4f248da20995e79d).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    @HyukjinKwon Thank you for following up work on the PR.  @cloud-fan @viirya @maropu Thanks for your reviews.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r234789275
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -1694,7 +1694,7 @@ test_that("column functions", {
       df <- as.DataFrame(list(list("col" = "{\"date\":\"21/10/2014\"}")))
       schema2 <- structType(structField("date", "date"))
       s <- collect(select(df, from_json(df$col, schema2)))
    -  expect_equal(s[[1]][[1]], NA)
    +  expect_equal(s[[1]][[1]]$date, NA)
    --- End diff --
    
    What is the reason we made this change?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96993/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223911795
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,30 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
    --- End diff --
    
    ... and with some tests to verify it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r218577348
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -595,10 +607,7 @@ case class JsonToStructs(
         if (json.toString.trim.isEmpty) return null
     
         try {
    -      converter(parser.parse(
    -        json.asInstanceOf[UTF8String],
    -        CreateJacksonParser.utf8String,
    -        identity[UTF8String]))
    +      converter(parser.parse(json.asInstanceOf[UTF8String]))
         } catch {
           case _: BadRecordException => null
    --- End diff --
    
    `BadRecordException` shouldn't propagate here any more. I will remove the exception handler and re-run tests.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r222971397
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala ---
    @@ -15,50 +15,51 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.datasources
    +package org.apache.spark.sql.catalyst.util
     
     import org.apache.spark.SparkException
     import org.apache.spark.sql.catalyst.InternalRow
     import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    -import org.apache.spark.sql.catalyst.util._
    -import org.apache.spark.sql.internal.SQLConf
    -import org.apache.spark.sql.types.StructType
    +import org.apache.spark.sql.types.{DataType, StructType}
     import org.apache.spark.unsafe.types.UTF8String
     
     class FailureSafeParser[IN](
         rawParser: IN => Seq[InternalRow],
         mode: ParseMode,
    -    schema: StructType,
    +    dataType: DataType,
         columnNameOfCorruptRecord: String,
         isMultiLine: Boolean) {
    -
    -  private val corruptFieldIndex = schema.getFieldIndex(columnNameOfCorruptRecord)
    -  private val actualSchema = StructType(schema.filterNot(_.name == columnNameOfCorruptRecord))
    -  private val resultRow = new GenericInternalRow(schema.length)
    -  private val nullResult = new GenericInternalRow(schema.length)
    -
       // This function takes 2 parameters: an optional partial result, and the bad record. If the given
       // schema doesn't contain a field for corrupted record, we just return the partial result or a
       // row with all fields null. If the given schema contains a field for corrupted record, we will
       // set the bad record to this field, and set other fields according to the partial result or null.
    -  private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = {
    -    if (corruptFieldIndex.isDefined) {
    -      (row, badRecord) => {
    -        var i = 0
    -        while (i < actualSchema.length) {
    -          val from = actualSchema(i)
    -          resultRow(schema.fieldIndex(from.name)) = row.map(_.get(i, from.dataType)).orNull
    -          i += 1
    +  private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = dataType match {
    +    case struct: StructType =>
    +      val corruptFieldIndex = struct.getFieldIndex(columnNameOfCorruptRecord)
    --- End diff --
    
    Do you mind if I ask to make a private function for this one (L38 to L54) and define this around here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r220533794
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,30 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
         case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
         case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
       }
     
    -  @transient
    -  lazy val parser =
    -    new JacksonParser(
    -      nullableSchema,
    -      new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
    +  @transient lazy val parser = {
    +    val parsedOptions = new JSONOptions(options, timeZoneId.get)
    +    val mode = parsedOptions.parseMode
    +    if (mode != PermissiveMode && mode != FailFastMode) {
    --- End diff --
    
    Do you mean by introducing new val `failFast` to JSONOptions? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r213886899
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala ---
    @@ -15,50 +15,51 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.datasources
    +package org.apache.spark.sql.catalyst.util
     
     import org.apache.spark.SparkException
     import org.apache.spark.sql.catalyst.InternalRow
     import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    -import org.apache.spark.sql.catalyst.util._
    -import org.apache.spark.sql.internal.SQLConf
    -import org.apache.spark.sql.types.StructType
    +import org.apache.spark.sql.types.{DataType, StructType}
     import org.apache.spark.unsafe.types.UTF8String
     
     class FailureSafeParser[IN](
         rawParser: IN => Seq[InternalRow],
         mode: ParseMode,
    -    schema: StructType,
    +    schema: DataType,
    --- End diff --
    
    `schema` -> `dataType`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r220905353
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala ---
    @@ -15,50 +15,51 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.datasources
    +package org.apache.spark.sql.catalyst.util
     
     import org.apache.spark.SparkException
     import org.apache.spark.sql.catalyst.InternalRow
     import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    -import org.apache.spark.sql.catalyst.util._
    -import org.apache.spark.sql.internal.SQLConf
    -import org.apache.spark.sql.types.StructType
    +import org.apache.spark.sql.types.{DataType, StructType}
     import org.apache.spark.unsafe.types.UTF8String
     
     class FailureSafeParser[IN](
         rawParser: IN => Seq[InternalRow],
         mode: ParseMode,
    -    schema: StructType,
    +    dataType: DataType,
         columnNameOfCorruptRecord: String,
         isMultiLine: Boolean) {
    -
    -  private val corruptFieldIndex = schema.getFieldIndex(columnNameOfCorruptRecord)
    -  private val actualSchema = StructType(schema.filterNot(_.name == columnNameOfCorruptRecord))
    -  private val resultRow = new GenericInternalRow(schema.length)
    -  private val nullResult = new GenericInternalRow(schema.length)
    -
       // This function takes 2 parameters: an optional partial result, and the bad record. If the given
       // schema doesn't contain a field for corrupted record, we just return the partial result or a
       // row with all fields null. If the given schema contains a field for corrupted record, we will
       // set the bad record to this field, and set other fields according to the partial result or null.
    -  private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = {
    --- End diff --
    
    > Can we make the diff small?
    
    just wondering what is the reason to make the diff small?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r212847031
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,22 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
         case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
         case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
       }
     
    -  @transient
    -  lazy val parser =
    -    new JacksonParser(
    -      nullableSchema,
    -      new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
    +  @transient lazy val parsedOptions = new JSONOptions(options, timeZoneId.get)
    +  @transient lazy val rawParser = new JacksonParser(nullableSchema, parsedOptions)
    +  @transient lazy val createParser = CreateJacksonParser.utf8String _
    +  @transient lazy val parser = new FailureSafeParser[UTF8String](
    +    input => rawParser.parse(input, createParser, identity[UTF8String]),
    +    parsedOptions.parseMode,
    --- End diff --
    
    I think we should keep using previous default mode `FailFastMode`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223833101
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
    @@ -90,6 +91,10 @@ class JacksonParser(
             // in such an array as a row, this case is possible.
             if (array.numElements() == 0) {
               Nil
    +        } else if (array.numElements() > 1 && !explodeArray) {
    --- End diff --
    
    > ... but I think it's ok to drop it in from_json
    > ... the new parameter can be named as allowArrayAsStructs
    
    Is it ok if I'll do:
    
    ```
    case START_ARRAY if allowArrayAsStructs =>
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r222926650
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala ---
    @@ -51,6 +56,8 @@ object ParseMode extends Logging {
         case PermissiveMode.name => PermissiveMode
         case DropMalformedMode.name => DropMalformedMode
         case FailFastMode.name => FailFastMode
    +    case NullMalformedMode.name => NullMalformedMode
    --- End diff --
    
    I suggested to keep the previous behaviour only because I thought we're in transition to 2.5. Since we are going ahead for 3.0, I think I am okay without keeping it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97174/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22237


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r218000572
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -595,10 +607,7 @@ case class JsonToStructs(
         if (json.toString.trim.isEmpty) return null
     
         try {
    -      converter(parser.parse(
    -        json.asInstanceOf[UTF8String],
    -        CreateJacksonParser.utf8String,
    -        identity[UTF8String]))
    +      converter(parser.parse(json.asInstanceOf[UTF8String]))
         } catch {
           case _: BadRecordException => null
    --- End diff --
    
    Do we still need this catching of `BadRecordException`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95378 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95378/testReport)** for PR 22237 at commit [`c81e9ae`](https://github.com/apache/spark/commit/c81e9ae3fc763b0361a1d99ef3b600302a68adea).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    LGTM except https://github.com/apache/spark/pull/22237/files#r223707899 , do you think it's reasonable?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97157 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97157/testReport)** for PR 22237 at commit [`c7ea45f`](https://github.com/apache/spark/commit/c7ea45fd780381c6c81f464398d80e9b6f686309).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96684 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96684/testReport)** for PR 22237 at commit [`4e196f6`](https://github.com/apache/spark/commit/4e196f654b3b921787ca5a936f0a5635d26d75d4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97016/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223852220
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1890,6 +1890,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     
     # Migration Guide
     
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    --- End diff --
    
    updated


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96693/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22237#discussion_r222651368

--- Diff: docs/sql-programming-guide.md ---
@@ -1890,6 +1890,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see

# Migration Guide

+## Upgrading From Spark SQL 2.4 to 3.0
+
+ - Since Spark 3.0, the `from_json` function supports three modes - `PERMISSIVE`, `FAILFAST` and `NULLMALFORMED`. The modes can be set via the `mode` option. `PERMISSIVE` became the default mode. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`. In version 2.4 and earlier, arrays of JSON objects are considered as invalid and converted to `null` if specified schema is `StructType`. Since Spark 3.0, the input is considered as a valid JSON array and only its first element is parsed if it conforms to the specified `StructType`. To restore the previous behavior, set the JSON option `mode` to `NULLMALFORMED`.
--- End diff --

how about `PERMISSIVE`, `FAILFAST` and `LEGACY`?

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95464 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95464/testReport)** for PR 22237 at commit [`fee61dc`](https://github.com/apache/spark/commit/fee61dc39d94278bcedab1fc36d538b54c91267a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    @HyukjinKwon Please, take a look at it again.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96873/testReport)** for PR 22237 at commit [`7e7af88`](https://github.com/apache/spark/commit/7e7af88cba38d8f5c1dbb9c1fc8204cada9d0f69).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97966/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Ah, yea I have a direct access to this branch. Let me just rebase/address the comment tomorrow.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    @viirya @maropu @HyukjinKwon May I ask you to look at this one more time, please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r213975610
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,28 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
         case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
         case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
       }
     
    -  @transient
    -  lazy val parser =
    -    new JacksonParser(
    -      nullableSchema,
    -      new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
    +  @transient lazy val parser = {
    +    val parsedOptions = new JSONOptions(options, timeZoneId.get)
    +    val mode = parsedOptions.parseMode
    +    require(mode == PermissiveMode || mode == FailFastMode,
    --- End diff --
    
    I didn't put `require` to the constructor body directly because of `timeZoneId`. If I move the checking up, I need to move `val parsedOptions = new JSONOptions(options, timeZoneId.get)` too (lazy or not lazy). Checking will force getting of `timeZoneId.get` which will raise an exception. I will check this today or tomorrow.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r220658815
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -554,18 +554,30 @@ case class JsonToStructs(
       @transient
       lazy val converter = nullableSchema match {
         case _: StructType =>
    -      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
         case _: ArrayType =>
    -      (rows: Seq[InternalRow]) => rows.head.getArray(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
         case _: MapType =>
    -      (rows: Seq[InternalRow]) => rows.head.getMap(0)
    +      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
       }
     
    -  @transient
    -  lazy val parser =
    -    new JacksonParser(
    -      nullableSchema,
    -      new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
    +  @transient lazy val parser = {
    +    val parsedOptions = new JSONOptions(options, timeZoneId.get)
    +    val mode = parsedOptions.parseMode
    +    if (mode != PermissiveMode && mode != FailFastMode) {
    --- End diff --
    
    yea, instead the "mode" option.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    @HyukjinKwon Got it. I won't take it since you will do it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95896/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Will take a look soon.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #97037 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97037/testReport)** for PR 22237 at commit [`69bbfef`](https://github.com/apache/spark/commit/69bbfef29963641f882b428d4f248da20995e79d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96873/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r212850156
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
    @@ -469,4 +470,23 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
     
         checkAnswer(sql("""select json[0] from jsonTable"""), Seq(Row(null)))
       }
    +
    +  test("from_json invalid json - check modes") {
    +    val df = Seq("""{"a" 1}""", """{"a": 2}""").toDS()
    +    val schema = new StructType().add("a", IntegerType)
    +
    +    checkAnswer(
    +      df.select(from_json($"value", schema, Map("mode" -> "PERMISSIVE"))),
    +      Row(Row(null)) :: Row(Row(2)) :: Nil)
    +
    +    val exceptionOne = intercept[SparkException] {
    +      df.select(from_json($"value", schema, Map("mode" -> "FAILFAST"))).collect()
    --- End diff --
    
    `JsonToStructs` resembles PERMISSIVE mode (from the first place) although their behaviours are slightly different. This is going to be different with PERMISSIVE and also FAILFAST modes. They are actually behaviour changes if we just use PERMISSIVE mode here by default (as @viirya pointed out).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r212891660
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala ---
    @@ -15,50 +15,51 @@
      * limitations under the License.
      */
     
    -package org.apache.spark.sql.execution.datasources
    +package org.apache.spark.sql.catalyst.util
     
     import org.apache.spark.SparkException
     import org.apache.spark.sql.catalyst.InternalRow
     import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    -import org.apache.spark.sql.catalyst.util._
    -import org.apache.spark.sql.internal.SQLConf
    -import org.apache.spark.sql.types.StructType
    +import org.apache.spark.sql.types.{DataType, StructType}
     import org.apache.spark.unsafe.types.UTF8String
     
     class FailureSafeParser[IN](
         rawParser: IN => Seq[InternalRow],
         mode: ParseMode,
    -    schema: StructType,
    +    schema: DataType,
         columnNameOfCorruptRecord: String,
         isMultiLine: Boolean) {
    -
    -  private val corruptFieldIndex = schema.getFieldIndex(columnNameOfCorruptRecord)
    -  private val actualSchema = StructType(schema.filterNot(_.name == columnNameOfCorruptRecord))
    -  private val resultRow = new GenericInternalRow(schema.length)
    -  private val nullResult = new GenericInternalRow(schema.length)
    -
       // This function takes 2 parameters: an optional partial result, and the bad record. If the given
       // schema doesn't contain a field for corrupted record, we just return the partial result or a
       // row with all fields null. If the given schema contains a field for corrupted record, we will
       // set the bad record to this field, and set other fields according to the partial result or null.
    -  private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = {
    -    if (corruptFieldIndex.isDefined) {
    -      (row, badRecord) => {
    -        var i = 0
    -        while (i < actualSchema.length) {
    -          val from = actualSchema(i)
    -          resultRow(schema.fieldIndex(from.name)) = row.map(_.get(i, from.dataType)).orNull
    -          i += 1
    +  private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = schema match {
    +    case struct: StructType =>
    +      val corruptFieldIndex = struct.getFieldIndex(columnNameOfCorruptRecord)
    +      val actualSchema = StructType(struct.filterNot(_.name == columnNameOfCorruptRecord))
    +      val resultRow = new GenericInternalRow(struct.length)
    +      val nullResult = new GenericInternalRow(struct.length)
    +      if (corruptFieldIndex.isDefined) {
    --- End diff --
    
    Can we move `actualSchema` and `resultRow` into `if (corruptFieldIndex.isDefined) {` inside?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by dilipbiswal <gi...@git.apache.org>.

Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95267 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95267/testReport)** for PR 22237 at commit [`0452a2f`](https://github.com/apache/spark/commit/0452a2ff96a6690d4e99cc0463974cc8b98c56b6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r214218549
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
    @@ -469,4 +470,26 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
     
         checkAnswer(sql("""select json[0] from jsonTable"""), Seq(Row(null)))
       }
    +
    +  test("from_json invalid json - check modes") {
    +    val df = Seq("""{"a" 1}""", """{"a": 2}""").toDS()
    +    val schema = new StructType().add("a", IntegerType)
    +
    +    checkAnswer(
    +      df.select(from_json($"value", schema, Map("mode" -> "PERMISSIVE"))),
    +      Row(Row(null)) :: Row(Row(2)) :: Nil)
    +
    +    val exception1 = intercept[SparkException] {
    +      df.select(from_json($"value", schema, Map("mode" -> "FAILFAST"))).collect()
    +    }.getMessage
    +    assert(exception1.contains(
    +      "Malformed records are detected in record parsing. Parse Mode: FAILFAST."))
    +
    +    val exception2 = intercept[AnalysisException] {
    +      df.select(from_json($"value", schema, Map("mode" -> "DROPMALFORMED"))).collect()
    --- End diff --
    
    Can you fix the code to throw an analysis exception in analysis phases instead of execution phases (`.collect()` called)?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96629/testReport)** for PR 22237 at commit [`4e196f6`](https://github.com/apache/spark/commit/4e196f654b3b921787ca5a936f0a5635d26d75d4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    > I actually was thinking about targeting 3.0 but looks now we are going ahead for 2.5. This bit is what I am less sure of for now actually.
    
    The PR definitely changes existing behavior of `from_json`. If there is no chance of merging it soon, let's close it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #95269 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95269/testReport)** for PR 22237 at commit [`47dbe6b`](https://github.com/apache/spark/commit/47dbe6bc903aa26672d47a4ac2c84d6f84458a98).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r217995920
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---
    @@ -402,13 +402,13 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with
         val schema = StructType(StructField("a", IntegerType) :: Nil)
         checkEvaluation(
           JsonToStructs(schema, Map.empty, Literal(jsonData), gmtId),
    -      null
    +      InternalRow(null)
         )
     
         // Other modes should still return `null`.
         checkEvaluation(
           JsonToStructs(schema, Map("mode" -> PermissiveMode.name), Literal(jsonData), gmtId),
    --- End diff --
    
    The default mode is `PermissiveMode` now, so this looks necessary to change.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    I agree with the current approach but wanna make sure if we want this in 2.4.0 or 3.0.0 since there's no way to keep the previous behaviour. I actually prefer to go ahead in 3.0.0.
    
    @gatorsmile and @cloud-fan, WDYT? I think this will likely break existing user apps.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    **[Test build #96992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96992/testReport)** for PR 22237 at commit [`9afd45c`](https://github.com/apache/spark/commit/9afd45caf0368fff0cd4a43a1fe731d99a3f093e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223550790
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1890,6 +1890,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     
     # Migration Guide
     
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    +  - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`. In version 2.4 and earlier, arrays of JSON objects are considered as invalid and converted to `null` if specified schema is `StructType`. Since Spark 3.0, the input is considered as a valid JSON array and only its first element is parsed if it conforms to the specified `StructType`.
    --- End diff --
    
    Last option sounds better to me but can we fill the corrupt row when the corrupt field name is specified in the schema? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96772/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95266/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by gengliangwang <gi...@git.apache.org>.

Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    I manually ran some queries.
    My point was about the value of the column `from_json`, `null` vs `Row(null, null, ...)`.
    
    Now I am +1 with the proposal of this PR, making the column value as `Row` makes more sense. Also this is expected behavior for the permissive mode.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223675223
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1890,6 +1890,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     
     # Migration Guide
     
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    +  - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`. In version 2.4 and earlier, arrays of JSON objects are considered as invalid and converted to `null` if specified schema is `StructType`. Since Spark 3.0, the input is considered as a valid JSON array and only its first element is parsed if it conforms to the specified `StructType`.
    --- End diff --
    
    Introduced `explodeArray` for `JacksonParser`, and modified a test to check the situation.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22237#discussion_r223911893
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---
    @@ -450,21 +456,24 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with
       test("from_json - input=array of single object, schema=struct, output=single row") {
         val input = """[{"a": 1}]"""
         val schema = StructType(StructField("a", IntegerType) :: Nil)
    -    val output = InternalRow(1)
    +    val output = InternalRow(null)
         checkEvaluation(JsonToStructs(schema, Map.empty, Literal(input), gmtId), output)
       }
     
    -  test("from_json - input=array, schema=struct, output=null") {
    +  test("from_json - input=array, schema=struct, output=single row") {
         val input = """[{"a": 1}, {"a": 2}]"""
    -    val schema = StructType(StructField("a", IntegerType) :: Nil)
    -    val output = null
    -    checkEvaluation(JsonToStructs(schema, Map.empty, Literal(input), gmtId), output)
    +    val corrupted = "corrupted"
    +    val schema = new StructType().add("a", IntegerType).add(corrupted, StringType)
    +    StructType(StructField("a", IntegerType) :: Nil)
    --- End diff --
    
    unnecessary line?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    > Will take a look soon.
    
    @HyukjinKwon Thank you. Waiting for you feedback.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22237: [SPARK-25243][SQL] Use FailureSafeParser in from_json

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22237
  
    @maropu May I ask you to look at it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org