You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/03/09 09:58:42 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

HyukjinKwon opened a new pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854
 
 
   ### What changes were proposed in this pull request?
   
   This PR proposes two things:
   
   1. Convert `null` to `string` type during schema inference of `schema_of_json` as JSON datasource does. This is a bug fix as well because `null` string is not the proper DDL formatted string and it is unable for SQL parser to recognise it as a type string. We should match it to JSON datasource and return a string type so `schema_of_json` returns a proper DDL formatted string.
   
   2. Let `schema_of_json` respect `dropFieldIfAllNull` option during schema inference.
   
   
   ### Why are the changes needed?
   
   To let `schema_of_json` return a proper DDL formatted string, and respect `dropFieldIfAllNull` option.
   
   ### Does this PR introduce any user-facing change?
   Yes, it does.
   
   ```scala
   import collection.JavaConverters._
   import org.apache.spark.sql.functions._
   
   spark.range(1).select(schema_of_json("""{"id": ""}""")).show()
   spark.range(1).select(schema_of_json(lit("""{"id": "a", "drop": {"drop": null}}"""), Map("dropFieldIfAllNull" -> "true").asJava)).show(false)
   ```
   
   **Before:**
   
   ```
   struct<id:null>
   struct<drop:struct<drop:null>,id:string> 
   ```
   
   
   **After:**
   
   ```
   struct<id:string>
   struct<id:string>
   ```
   
   
   ### How was this patch tested?
   
   Manually tested, and unittests were added.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596435791
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24294/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596435078
 
 
   **[Test build #119563 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119563/testReport)** for PR 27854 at commit [`4ca013b`](https://github.com/apache/spark/commit/4ca013b68dca9a97c24997752218736348458492).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-597024507
 
 
   Got it. No problem~

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596566995
 
 
   **[Test build #119563 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119563/testReport)** for PR 27854 at commit [`4ca013b`](https://github.com/apache/spark/commit/4ca013b68dca9a97c24997752218736348458492).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596435791
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24294/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#discussion_r389861619
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
 ##########
 @@ -674,4 +674,40 @@ class JsonFunctionsSuite extends QueryTest with SharedSparkSession {
       spark.range(1).select(schema_of_json(input)),
       Seq(Row("struct<id:bigint,price:double>")))
   }
+
+  test("SPARK-31065: schema_of_json - null and empty strings as strings") {
+    Seq("""{"id": null}""", """{"id": ""}""").foreach { input =>
+      checkAnswer(
+        spark.range(1).select(schema_of_json(input)),
+        Seq(Row("struct<id:string>")))
 
 Review comment:
   Yep. I agree.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#discussion_r389849211
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
 ##########
 @@ -674,4 +674,40 @@ class JsonFunctionsSuite extends QueryTest with SharedSparkSession {
       spark.range(1).select(schema_of_json(input)),
       Seq(Row("struct<id:bigint,price:double>")))
   }
+
+  test("SPARK-31065: schema_of_json - null and empty strings as strings") {
+    Seq("""{"id": null}""", """{"id": ""}""").foreach { input =>
+      checkAnswer(
+        spark.range(1).select(schema_of_json(input)),
+        Seq(Row("struct<id:string>")))
 
 Review comment:
   In these days, the behavior change becomes a tricky issue although it's a bug fix. I'm wondering if this PR is safe (`not a silent behavior change`) in this case because Apache Spark 2.4.x ~ 3.0.0-preview2 returns like the following as mentioned in the PR description. To make it sure,  ping @gatorsmile , @cloud-fan , @marmbrus , @srowen .
   ```scala
   scala> spark.range(1).select(schema_of_json("""{"id": null}""")).show
   +----------------------------+
   |schema_of_json({"id": null})|
   +----------------------------+
   |             struct<id:null>|
   +----------------------------+
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596568232
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596568253
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119563/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#discussion_r389560513
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ##########
 @@ -777,7 +777,18 @@ case class SchemaOfJson(
   override def eval(v: InternalRow): Any = {
     val dt = Utils.tryWithResource(CreateJacksonParser.utf8String(jsonFactory, json)) { parser =>
       parser.nextToken()
-      jsonInferSchema.inferField(parser)
+      // To match with schema inference from JSON datasource.
+      jsonInferSchema.inferField(parser) match {
+        case st: StructType =>
+          jsonInferSchema.canonicalizeType(st, jsonOptions).getOrElse(StructType(Nil))
+        case at: ArrayType if at.elementType.isInstanceOf[StructType] =>
+          jsonInferSchema
+            .canonicalizeType(at.elementType, jsonOptions)
+            .map(ArrayType(_, containsNull = at.containsNull))
+            .getOrElse(ArrayType(StructType(Nil), containsNull = at.containsNull))
+        case other: DataType =>
+          jsonInferSchema.canonicalizeType(other, jsonOptions).getOrElse(StringType)
+      }
 
 Review comment:
   Here is about the actual fix.
   
   The reason why there are differences compared to JSON datasource is:
   
   1. JSON datasource always expects `StructType`. Array of JSON objects is still inferred as `StructType`. Other types are disallowed as a root type.
   2. `schema_of_json` infers the type as is. Array of JSON objects is inferred as `ArrayType(StructType)`. Other types are allowed as a root type.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] nchammas commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
nchammas commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#discussion_r389766263
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
 ##########
 @@ -198,7 +196,8 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
    * Recursively canonicalizes inferred types, e.g., removes StructTypes with no fields,
    * drops NullTypes or converts them to StringType based on provided options.
    */
-  private def canonicalizeType(tpe: DataType, options: JSONOptions): Option[DataType] = tpe match {
+  private[catalyst] def canonicalizeType(
 
 Review comment:
   Scala noob question: What's the significance of `private[catalyst]` vs. `private`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596947302
 
 
   Could you make a backporting PR for `branch-2.4`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#discussion_r390060767
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
 ##########
 @@ -674,4 +674,40 @@ class JsonFunctionsSuite extends QueryTest with SharedSparkSession {
       spark.range(1).select(schema_of_json(input)),
       Seq(Row("struct<id:bigint,price:double>")))
   }
+
+  test("SPARK-31065: schema_of_json - null and empty strings as strings") {
+    Seq("""{"id": null}""", """{"id": ""}""").foreach { input =>
+      checkAnswer(
+        spark.range(1).select(schema_of_json(input)),
+        Seq(Row("struct<id:string>")))
 
 Review comment:
   @dongjoon-hyun, the problem here is that `struct<id:null>` can't be used as a schema. The main purpose of `schema_of_json` is an alternative for schema inference for `from_json`.
   
   Currently, the codes you mentioned don't work with it:
   
   ```scala
   val schemaExpr = schema_of_json(lit("""{"id": ""}"""))
   spark.range(1).select(from_json(lit("""{"id": ""}"""), schemaExpr)).show()
   ```
   
   **Before:**
   
   ```
   org.apache.spark.sql.catalyst.parser.ParseException:
   ...
   == SQL ==
   struct<id:null>
   ------^^^
   ```
   
   **After:**
   
   ```
   +---------------------+
   |from_json({"id": ""})|
   +---------------------+
   |                   []|
   +---------------------+
   ```
   
   `struct<id:null>` can't be used anywhere as we know `null` isn't supported as DDL formatted string.
   
   It's unlikely users depend on this behaviour so I personally don't think it's worthwhile to add a configuration and I would even doubt about the release note if this fix only lands to Spark 3.0.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596568232
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596435775
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#discussion_r389560513
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ##########
 @@ -777,7 +777,18 @@ case class SchemaOfJson(
   override def eval(v: InternalRow): Any = {
     val dt = Utils.tryWithResource(CreateJacksonParser.utf8String(jsonFactory, json)) { parser =>
       parser.nextToken()
-      jsonInferSchema.inferField(parser)
+      // To match with schema inference from JSON datasource.
+      jsonInferSchema.inferField(parser) match {
+        case st: StructType =>
+          jsonInferSchema.canonicalizeType(st, jsonOptions).getOrElse(StructType(Nil))
+        case at: ArrayType if at.elementType.isInstanceOf[StructType] =>
+          jsonInferSchema
+            .canonicalizeType(at.elementType, jsonOptions)
+            .map(ArrayType(_, containsNull = at.containsNull))
+            .getOrElse(ArrayType(StructType(Nil), containsNull = at.containsNull))
+        case other: DataType =>
+          jsonInferSchema.canonicalizeType(other, jsonOptions).getOrElse(StringType)
+      }
 
 Review comment:
   The reason why there are differences compared to JSON datasource is:
   
   1. JSON datasource always respects `StructType`. Array of JSON objects is still inferred as `StructType`. Other types are disallowed as a root type.
   2. `schema_of_json` infers the type as is. Array of JSON objects is inferred as `ArrayType(StructType)`. Other types are allowed as a root type.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596986664
 
 
   Thank you so much @dongjoon-hyun! To me, I think it's fine to don't backport but I don't mind if somebody demands.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#discussion_r389560513
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ##########
 @@ -777,7 +777,18 @@ case class SchemaOfJson(
   override def eval(v: InternalRow): Any = {
     val dt = Utils.tryWithResource(CreateJacksonParser.utf8String(jsonFactory, json)) { parser =>
       parser.nextToken()
-      jsonInferSchema.inferField(parser)
+      // To match with schema inference from JSON datasource.
+      jsonInferSchema.inferField(parser) match {
+        case st: StructType =>
+          jsonInferSchema.canonicalizeType(st, jsonOptions).getOrElse(StructType(Nil))
+        case at: ArrayType if at.elementType.isInstanceOf[StructType] =>
+          jsonInferSchema
+            .canonicalizeType(at.elementType, jsonOptions)
+            .map(ArrayType(_, containsNull = at.containsNull))
+            .getOrElse(ArrayType(StructType(Nil), containsNull = at.containsNull))
+        case other: DataType =>
+          jsonInferSchema.canonicalizeType(other, jsonOptions).getOrElse(StringType)
+      }
 
 Review comment:
   Here is about the actual fix.
   
   The reason why there are differences compared to JSON datasource is:
   
   1. JSON datasource always respects `StructType`. Array of JSON objects is still inferred as `StructType`. Other types are disallowed as a root type.
   2. `schema_of_json` infers the type as is. Array of JSON objects is inferred as `ArrayType(StructType)`. Other types are allowed as a root type.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#discussion_r390132726
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
 ##########
 @@ -674,4 +674,40 @@ class JsonFunctionsSuite extends QueryTest with SharedSparkSession {
       spark.range(1).select(schema_of_json(input)),
       Seq(Row("struct<id:bigint,price:double>")))
   }
+
+  test("SPARK-31065: schema_of_json - null and empty strings as strings") {
+    Seq("""{"id": null}""", """{"id": ""}""").foreach { input =>
+      checkAnswer(
+        spark.range(1).select(schema_of_json(input)),
+        Seq(Row("struct<id:string>")))
 
 Review comment:
   Thanks. Got it. It makes sense to me, too.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#discussion_r389889339
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
 ##########
 @@ -198,7 +196,8 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
    * Recursively canonicalizes inferred types, e.g., removes StructTypes with no fields,
    * drops NullTypes or converts them to StringType based on provided options.
    */
-  private def canonicalizeType(tpe: DataType, options: JSONOptions): Option[DataType] = tpe match {
+  private[catalyst] def canonicalizeType(
 
 Review comment:
   private = visible to class only
   private[foo] = visible to package foo and subpackages only

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596568253
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119563/
   Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596434301
 
 
   cc @cloud-fan, @nchammas, @MaxGekk 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596435775
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#issuecomment-596435078
 
 
   **[Test build #119563 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119563/testReport)** for PR 27854 at commit [`4ca013b`](https://github.com/apache/spark/commit/4ca013b68dca9a97c24997752218736348458492).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#discussion_r390060767
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
 ##########
 @@ -674,4 +674,40 @@ class JsonFunctionsSuite extends QueryTest with SharedSparkSession {
       spark.range(1).select(schema_of_json(input)),
       Seq(Row("struct<id:bigint,price:double>")))
   }
+
+  test("SPARK-31065: schema_of_json - null and empty strings as strings") {
+    Seq("""{"id": null}""", """{"id": ""}""").foreach { input =>
+      checkAnswer(
+        spark.range(1).select(schema_of_json(input)),
+        Seq(Row("struct<id:string>")))
 
 Review comment:
   @dongjoon-hyun, the problem here is that `struct<id:null>` can't be used as a schema. The main purpose of `schema_of_json` is an alternative for schema inference for `from_json`.
   
   Currently, the codes you mentioned don't work with it:
   
   ```scala
   val schemaExpr = schema_of_json(lit("""{"id": null}"""))
   spark.range(1).select(from_json(lit("""{"id": null}"""), schemaExpr)).show()
   ```
   
   **Before:**
   
   ```
   org.apache.spark.sql.catalyst.parser.ParseException:
   ...
   == SQL ==
   struct<id:null>
   ------^^^
   ```
   
   **After:**
   
   ```
   +-----------------------+
   |from_json({"id": null})|
   +-----------------------+
   |                     []|
   +-----------------------+
   ```
   
   `struct<id:null>` can't be used anywhere as `null` isn't supported as DDL formatted string.
   
   It's unlikely users depend on this behaviour so I personally don't think it's worthwhile to add a configuration and I would even doubt about the release note if this fix only lands to Spark 3.0.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#discussion_r389852012
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
 ##########
 @@ -674,4 +674,40 @@ class JsonFunctionsSuite extends QueryTest with SharedSparkSession {
       spark.range(1).select(schema_of_json(input)),
       Seq(Row("struct<id:bigint,price:double>")))
   }
+
+  test("SPARK-31065: schema_of_json - null and empty strings as strings") {
+    Seq("""{"id": null}""", """{"id": ""}""").foreach { input =>
+      checkAnswer(
+        spark.range(1).select(schema_of_json(input)),
+        Seq(Row("struct<id:string>")))
 
 Review comment:
   From what I read here, the OP issue looks like a clean bug fix. Yes a behavior change but bug fixes are. The change you highlight here is subtler, yes. I think it's reasonable to infer string type rather than null type, but would only do it at 3.0, not Spark 2.x. It would be a release notes item.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27854: [SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
URL: https://github.com/apache/spark/pull/27854#discussion_r390060767
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
 ##########
 @@ -674,4 +674,40 @@ class JsonFunctionsSuite extends QueryTest with SharedSparkSession {
       spark.range(1).select(schema_of_json(input)),
       Seq(Row("struct<id:bigint,price:double>")))
   }
+
+  test("SPARK-31065: schema_of_json - null and empty strings as strings") {
+    Seq("""{"id": null}""", """{"id": ""}""").foreach { input =>
+      checkAnswer(
+        spark.range(1).select(schema_of_json(input)),
+        Seq(Row("struct<id:string>")))
 
 Review comment:
   @dongjoon-hyun, the problem here is that `struct<id:null>` can't be used as a schema. The main purpose of `schema_of_json` is an alternative for schema inference for `from_json`.
   
   Currently, the codes you mentioned don't work with it:
   
   ```scala
   val schemaExpr = schema_of_json(lit("""{"id": null}"""))
   spark.range(1).select(from_json(lit("""{"id": null}"""), schemaExpr)).show()
   ```
   
   **Before:**
   
   ```
   org.apache.spark.sql.catalyst.parser.ParseException:
   ...
   == SQL ==
   struct<id:null>
   ------^^^
   ```
   
   **After:**
   
   ```
   +-----------------------+
   |from_json({"id": null})|
   +-----------------------+
   |                     []|
   +-----------------------+
   ```
   
   `struct<id:null>` can't be used anywhere as we know `null` isn't supported as DDL formatted string.
   
   It's unlikely users depend on this behaviour so I personally don't think it's worthwhile to add a configuration and I would even doubt about the release note if this fix only lands to Spark 3.0.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org