Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/04 20:29:02 UTC

[GitHub] [spark] viirya opened a new pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

viirya opened a new pull request #29942:
URL: https://github.com/apache/spark/pull/29942

### What changes were proposed in this pull request?

This PR proposes to simplify the named_struct + get struct field + from_json expression chain from `struct(from_json.col1, from_json.col2, from_json.col3...)` to `struct(from_json)`.
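
A minimal sketch of the rewrite on a toy expression tree, for illustration only. `Column`, `FromJson`, `GetField`, `NamedStruct` and `simplify` are hypothetical stand-ins, not Catalyst's `JsonToStructs`, `GetStructField` or `CreateNamedStruct`. The sketch assumes the rule fires only when every value reads from the same `from_json` call, no field is renamed, and no field name is duplicated, which are the conditions that come up in the review. The optimized plan in the unit test also wraps the result in a null check, which is omitted here.

```scala
// Toy stand-ins for Catalyst expressions; these are NOT Spark's classes.
object StructSimplifySketch {
  sealed trait Expr
  final case class Column(name: String) extends Expr
  // from_json(child), parsed with the given ordered field names.
  final case class FromJson(fields: Vector[String], child: Expr) extends Expr
  // Access to the field at position `ordinal` of a struct-valued child.
  final case class GetField(child: Expr, ordinal: Int) extends Expr
  // named_struct(name1, value1, name2, value2, ...)
  final case class NamedStruct(entries: Vector[(String, Expr)]) extends Expr

  def simplify(e: Expr): Expr = e match {
    case NamedStruct(entries) if entries.nonEmpty =>
      // Keep only the entries of the form from_json(...).field.
      val parsed = entries.collect {
        case (name, GetField(fj: FromJson, ord)) => (name, fj, ord)
      }
      // Every value must read from the same from_json expression.
      val allFromSameJson =
        parsed.length == entries.length && parsed.forall(_._2 == parsed.head._2)
      // The struct must not rename any field it takes from from_json.
      val namesKept = parsed.forall {
        case (name, fj, ord) => fj.fields.lift(ord).contains(name)
      }
      // Duplicated output names are not supported by JSON parsing.
      val names = entries.map(_._1)
      val noDuplicates = names.distinct.length == names.length

      if (allFromSameJson && namesKept && noDuplicates) {
        // One from_json call over just the requested fields.
        FromJson(names, parsed.head._2.child)
      } else {
        e
      }
    case other => other
  }
}
```

With fields `a, b, c, d`, a struct built from `a` and `b` collapses to a single `FromJson` over the pruned fields `a, b`, while a struct that renames `a` to `a1` is left untouched.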

### Why are the changes needed?

Simplify a complex expression tree that could be produced by query optimization or written by a user.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Unit test.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703947053







[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703841962


   **[Test build #129420 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129420/testReport)** for PR 29942 at commit [`e40118a`](https://github.com/apache/spark/commit/e40118a4487fd53ca533d86f6c865643fe5d17d5).



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703416332


   **[Test build #129402 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129402/testReport)** for PR 29942 at commit [`849fc50`](https://github.com/apache/spark/commit/849fc50f4cd02022ccb7c4db98754500e1a996a7).



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703438300


   **[Test build #129404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129404/testReport)** for PR 29942 at commit [`430d915`](https://github.com/apache/spark/commit/430d91581d611be69c70e3d0c8686f4160db1b48).



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704593806


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34076/
   



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703465388


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34012/
   



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703474393


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34012/
   



[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704663338







[GitHub] [spark] viirya commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703414487


   Thanks for the quick response. I have addressed the comments.



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703319906


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34001/
   



[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703947053







[GitHub] [spark] sunchao commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499830227



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprsSuite.scala
##########
@@ -199,4 +199,51 @@ class OptimizeJsonExprsSuite extends PlanTest with ExpressionEvalHelper {
         JsonToStructs(prunedSchema2, options, 'json), field2, 0, 1, false).as("b")).analyze
     comparePlans(optimized2, expected2)
   }
+
+  test("SPARK-33007: simplify named_struct + from_json") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c long, d string")
+
+    val query1 = testRelation2
+      .select(namedStruct(
+        "a", GetStructField(JsonToStructs(schema, options, 'json), 0),
+        "b", GetStructField(JsonToStructs(schema, options, 'json), 1)).as("struct"))
+    val optimized1 = Optimizer.execute(query1.analyze)
+
+    val prunedSchema1 = StructType.fromDDL("a int, b int")
+    val nullStruct = namedStruct("a", Literal(null, IntegerType), "b", Literal(null, IntegerType))
+    val expected1 = testRelation2
+      .select(
+        If(IsNull('json),
+          nullStruct,
+          KnownNotNull(JsonToStructs(prunedSchema1, options, 'json))).as("struct")).analyze
+    comparePlans(optimized1, expected1)
+
+    // Skip it if `namedStruct` aliases field name.
+    val field1 = StructType.fromDDL("a int")
+    val field2 = StructType.fromDDL("b int")
+    val query2 = testRelation2
+      .select(namedStruct(
+        "a1", GetStructField(JsonToStructs(schema, options, 'json), 0),
+        "b", GetStructField(JsonToStructs(schema, options, 'json), 1)).as("struct"))
+    val optimized2 = Optimizer.execute(query2.analyze)

Review comment:
       This seems a bit repetitive. Perhaps we can create a util method for the comparison? We can test evaluation in the method too.
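
A rough sketch of what such a helper could look like, using a toy expression type rather than Spark's test utilities. `Expr`, `optimize`, `eval` and `checkOptimized` are hypothetical names introduced for illustration; in the real suite the helper would presumably wrap the plan comparison and expression evaluation that the existing tests already call.

```scala
// Toy stand-ins for illustration; not Spark's PlanTest or ExpressionEvalHelper.
object CompareHelperSketch {
  sealed trait Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Hypothetical "optimizer": folds an addition of two literals into one literal.
  def optimize(e: Expr): Expr = e match {
    case Add(l, r) =>
      (optimize(l), optimize(r)) match {
        case (Lit(a), Lit(b)) => Lit(a + b)
        case (lo, ro) => Add(lo, ro)
      }
    case lit: Lit => lit
  }

  def eval(e: Expr): Int = e match {
    case Lit(v) => v
    case Add(l, r) => eval(l) + eval(r)
  }

  // One helper covers both checks, so each test case becomes a single call:
  // the optimized tree must equal the expected tree, and it must still
  // evaluate to the same value as the original query.
  def checkOptimized(query: Expr, expected: Expr): Unit = {
    val optimized = optimize(query)
    assert(optimized == expected, s"plan mismatch: $optimized != $expected")
    assert(eval(optimized) == eval(query), "optimized expression evaluates differently")
  }
}
```

Each repeated block of "build query, optimize, build expected, compare" then shrinks to one `checkOptimized(query, expected)` call.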

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+          // If we create struct from various fields of the same `JsonToStructs`.
+          if c.valExprs.forall { v =>
+            v.isInstanceOf[GetStructField] &&
+              v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs] &&
+              v.children.head.semanticEquals(c.valExprs.head.children.head)
+          } =>
+        val jsonToStructs = c.valExprs.map(_.children.head)
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case _ => false
+        }
+
+        // Although `CreateNamedStruct` allows duplicated field names, e.g. "a int, a int",
+        // `JsonToStructs` does not support parsing json with duplicated field names.
+        val duplicateFields = c.names.map(_.toString).distinct.length != c.names.length
+
+        // If we create struct from various fields of the same `JsonToStructs` and we don't
+        // alias field names and there is not duplicated fields in the struct.

Review comment:
       nit: "there is not duplicated fields" -> "there is no duplicated field"

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.

Review comment:
       For a fresh eye with no context, this is still a bit confusing: does the list `col1`, `col2`, etc. have to represent all columns in the `json` struct?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,43 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)

Review comment:
       Perhaps explain a little about what this does? Without any context, I'm assuming `col1`, `col2`, `col3`, etc. are all columns of `from_json`?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)

Review comment:
       Perhaps explain a bit more about what this does? With no context, I'm assuming `from_json` contains all columns `col1`, `col2`, `col3`, etc.?





[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703444692


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34009/
   



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703312270


   **[Test build #129394 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129394/testReport)** for PR 29942 at commit [`3eb2947`](https://github.com/apache/spark/commit/3eb29472724d769cf211daa031e1abffb1d246e5).



[GitHub] [spark] viirya commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499359176



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))
+        val semanticEqual = jsonToStructs.tail.forall(jsonToStructs.head.semanticEquals(_))
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name

Review comment:
       Oh, this is kind of a corner case. We cannot simplify this case. I added a test case.





[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703571655


   **[Test build #129405 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129405/testReport)** for PR 29942 at commit [`430d915`](https://github.com/apache/spark/commit/430d91581d611be69c70e3d0c8686f4160db1b48).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704593826


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34076/
   Test FAILed.



[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704624912







[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703444714







[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703870236


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34027/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704624912







[GitHub] [spark] SparkQA removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704606071


   **[Test build #129477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129477/testReport)** for PR 29942 at commit [`73320e8`](https://github.com/apache/spark/commit/73320e89f512b896ceee9eebe283046db3dda15a).



[GitHub] [spark] viirya commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704614306


   Thanks!



[GitHub] [spark] SparkQA removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703449446


   **[Test build #129405 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129405/testReport)** for PR 29942 at commit [`430d915`](https://github.com/apache/spark/commit/430d91581d611be69c70e3d0c8686f4160db1b48).



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703980643


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34037/
   



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r500633104



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+          // If we create struct from various fields of the same `JsonToStructs`.
+          if c.valExprs.forall { v =>
+            v.isInstanceOf[GetStructField] &&
+              v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs] &&
+              v.children.head.semanticEquals(c.valExprs.head.children.head)
+          } =>
+        val jsonToStructs = c.valExprs.map(_.children.head)
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case _ => false
+        }
+
+        // Although `CreateNamedStruct` allows duplicated field names, e.g. "a int, a int",
+        // `JsonToStructs` does not support parsing json with duplicated field names.
+        val duplicateFields = c.names.map(_.toString).distinct.length != c.names.length

Review comment:
       `c.names.map(_.toString).distinct.length` compares names case-sensitively, so it seems to miss duplicates in case-insensitive mode. Could you add a test case for columns `A` and `a` in case-insensitive mode?
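
To make the concern concrete, here is a minimal sketch in plain Scala (not Spark code). The object and method names are hypothetical, and lower-casing with `Locale.ROOT` is only an assumed stand-in for whatever resolution Spark applies when case-insensitive analysis is enabled.

```scala
import java.util.Locale

object DuplicateNameCheck {
  // Case-sensitive check, mirroring `names.distinct.length != names.length`.
  def hasDuplicates(names: Seq[String]): Boolean =
    names.distinct.length != names.length

  // Hypothetical case-insensitive variant: normalize the names before comparing.
  def hasDuplicatesIgnoreCase(names: Seq[String]): Boolean =
    hasDuplicates(names.map(_.toLowerCase(Locale.ROOT)))
}
```

With field names `A` and `a`, the first check reports no duplicates while the second one does, which is the gap a test in case-insensitive mode would expose.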





[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703870255







[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704033899







[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703946557


   **[Test build #129420 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129420/testReport)** for PR 29942 at commit [`e40118a`](https://github.com/apache/spark/commit/e40118a4487fd53ca533d86f6c865643fe5d17d5).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704606071


   **[Test build #129477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129477/testReport)** for PR 29942 at commit [`73320e8`](https://github.com/apache/spark/commit/73320e89f512b896ceee9eebe283046db3dda15a).



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703317294


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34001/
   



[GitHub] [spark] dongjoon-hyun commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703318853


   cc @sunchao 



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704617359


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34084/
   



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703433082


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34009/
   



[GitHub] [spark] maropu commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499308242



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))

Review comment:
       nit: `_.children(0)` -> `_.children.head`, as my IDE suggested.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))
+        val semanticEqual = jsonToStructs.tail.forall(jsonToStructs.head.semanticEquals(_))

Review comment:
       Can this check be merged with L39? https://github.com/apache/spark/pull/29942/files#diff-f9d27e3c9c32aaf07bb038c779309414R39

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))
+        val semanticEqual = jsonToStructs.tail.forall(jsonToStructs.head.semanticEquals(_))
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case (_, _) => false
+        }
+
+        // If we create struct from various fields of the same `JsonToStructs` and we don't
+        // alias field names.
+        if (semanticEqual && sameFieldName) {
+          val fromJson = jsonToStructs.head.asInstanceOf[JsonToStructs].copy(schema = c.dataType)
+          val nullFields = c.children.grouped(2).map {
+            case Seq(name, value) => Seq(name, Literal(null, value.dataType))
+          }.flatten.toSeq
+
+          If(IsNull(fromJson.child), c.copy(children = nullFields), KnownNotNull(fromJson))

Review comment:
       Is this related to this optimization? This looks more general to me.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))
+        val semanticEqual = jsonToStructs.tail.forall(jsonToStructs.head.semanticEquals(_))
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case (_, _) => false
+        }
+
+        // If we create struct from various fields of the same `JsonToStructs` and we don't
+        // alias field names.
+        if (semanticEqual && sameFieldName) {
+          val fromJson = jsonToStructs.head.asInstanceOf[JsonToStructs].copy(schema = c.dataType)
+          val nullFields = c.children.grouped(2).map {

Review comment:
       `map` -> `flatMap`
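
A small sketch in plain Scala of why the two forms are interchangeable here. The `children` sequence of strings is a hypothetical stand-in for `c.children`, and the `"null"` string stands in for the null literal built in the patch.

```scala
object FlatMapSketch {
  // Stand-in for `c.children`: alternating name and value entries.
  val children: Seq[String] = Seq("a", "1", "b", "2")

  // Form used in the patch: map each (name, value) pair, then flatten.
  val viaMapFlatten: Seq[String] = children.grouped(2).map {
    case Seq(name, _) => Seq(name, "null")
  }.flatten.toSeq

  // Suggested form: flatMap does both steps at once.
  val viaFlatMap: Seq[String] = children.grouped(2).flatMap {
    case Seq(name, _) => Seq(name, "null")
  }.toSeq
}
```

Both values contain the same elements in the same order, so the change is purely stylistic.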

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))
+        val semanticEqual = jsonToStructs.tail.forall(jsonToStructs.head.semanticEquals(_))
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name

Review comment:
       Does this work correctly if multiple values refer to the same ordinal?
   ```
   scala> sql("""SELECT from_json('{"a":1, "b":0.8}', 'a INT, b DOUBLE') v""").createOrReplaceTempView("t")
   scala> sql("select named_struct('a', t.v.a, 'a', t.v.a) from t").explain(true)
   == Parsed Logical Plan ==
   'Project [unresolvedalias('named_struct(a, 't.v.a, a, 't.v.a), None)]
   +- 'UnresolvedRelation [t], [], false
   
   == Analyzed Logical Plan ==
   named_struct(a, v.a AS `a`, a, v.a AS `a`): struct<a:int,a:int>
   Project [named_struct(a, v#128.a, a, v#128.a) AS named_struct(a, v.a AS `a`, a, v.a AS `a`)#133]
   +- SubqueryAlias t
      +- Project [from_json(StructField(a,IntegerType,true), StructField(b,DoubleType,true), {"a":1, "b":0.8}, Some(Asia/Tokyo)) AS v#128]
         +- OneRowRelation
   
   == Optimized Logical Plan ==
   Project [[1,1] AS named_struct(a, v.a AS `a`, a, v.a AS `a`)#133]
   +- OneRowRelation
   
   == Physical Plan ==
   *(1) Project [[1,1] AS named_struct(a, v.a AS `a`, a, v.a AS `a`)#133]
   +- *(1) Scan OneRowRelation[]
   ```
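
The same concern, modeled without Spark as a hedged sketch. The names below are hypothetical: `schemaNames` stands for the fields of the parsed JSON schema and `selections` for the (output name, ordinal) pairs of the `named_struct` call above.

```scala
object SameOrdinalSketch {
  // Field names of the parsed JSON schema, as in 'a INT, b DOUBLE'.
  val schemaNames: Seq[String] = Seq("a", "b")

  // (output field name, ordinal in the schema), as in named_struct('a', v.a, 'a', v.a).
  val selections: Seq[(String, Int)] = Seq(("a", 0), ("a", 0))

  // The name comparison alone passes, because each name matches its ordinal's field.
  val sameFieldName: Boolean =
    selections.forall { case (name, ordinal) => name == schemaNames(ordinal) }

  // A separate check is needed to notice that the output names repeat.
  val hasDuplicateNames: Boolean =
    selections.map(_._1).distinct.length != selections.length
}
```

In this model the field-name check is satisfied even though both outputs refer to ordinal 0, so it cannot by itself rule out the duplicated case.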





[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703465781


   Merged build finished. Test FAILed.



[GitHub] [spark] maropu commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499359590



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))
+        val semanticEqual = jsonToStructs.tail.forall(jsonToStructs.head.semanticEquals(_))
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case (_, _) => false
+        }
+
+        // If we create struct from various fields of the same `JsonToStructs` and we don't
+        // alias field names.
+        if (semanticEqual && sameFieldName) {
+          val fromJson = jsonToStructs.head.asInstanceOf[JsonToStructs].copy(schema = c.dataType)
+          val nullFields = c.children.grouped(2).map {
+            case Seq(name, value) => Seq(name, Literal(null, value.dataType))
+          }.flatten.toSeq
+
+          If(IsNull(fromJson.child), c.copy(children = nullFields), KnownNotNull(fromJson))

Review comment:
       Ah, ok. I see.





[GitHub] [spark] viirya commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703851082

> @viirya just to clarify, is it to avoid calling the same `from_json` multiple times? How does it relate to [SPARK-32939](https://issues.apache.org/jira/browse/SPARK-32939) and [SPARK-32943](https://issues.apache.org/jira/browse/SPARK-32943)?

This patch specifically targets a special pattern: `CreateNamedStruct` + multiple `GetStructField`s of the same `JsonToStructs`. The pattern could be produced by the optimizer or written by users manually.

Sometimes the query optimizer can optimize a query so that it has many duplicated expressions, e.g. `JsonToStructs`. This is what SPARK-32943 wants to fix. It targets a broader problem.

For SPARK-32939, because it was not reported by me, I might not get some details from its description. We don't de-duplicate expressions in whole-stage codegen overall (only in specific operators). If we disable whole-stage codegen, interpreted Project will de-duplicate expressions in some cases (`GenerateUnsafeProjection`), but not always (we could also possibly fall back to `InterpretedUnsafeProjection`). For specific expressions like `CaseWhen`, we have a chance to de-duplicate the condition expressions, if we want.
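
To illustrate the targeted pattern, here is a toy model in plain Scala. It is only a sketch under stated assumptions: the `FromJson`, `GetField` and `NamedStruct` case classes are hypothetical stand-ins for `JsonToStructs`, `GetStructField` and `CreateNamedStruct`, case-class equality stands in for `semanticEquals`, and the null handling of the actual rule is left out.

```scala
object StructSimplifySketch {
  sealed trait Expr
  // Toy stand-ins for the Catalyst expressions; a schema is just a list of field names.
  final case class FromJson(schema: Seq[String], json: String) extends Expr
  final case class GetField(child: Expr, name: String) extends Expr
  final case class NamedStruct(fields: Seq[(String, Expr)]) extends Expr

  def simplify(e: Expr): Expr = e match {
    case NamedStruct(fields) if fields.nonEmpty =>
      // Every value must be a field access on one and the same FromJson.
      val sources = fields.collect { case (_, GetField(j: FromJson, _)) => j }
      val sameSource = sources.length == fields.length && sources.distinct.length == 1
      // Output names must not alias the accessed fields.
      val notAliased = fields.forall {
        case (name, GetField(_, field)) => name == field
        case _ => false
      }
      // Output names must be unique.
      val names = fields.map(_._1)
      val noDuplicates = names.distinct.length == names.length
      // Replace the struct by a single FromJson pruned to the requested fields.
      if (sameSource && notAliased && noDuplicates) sources.head.copy(schema = names) else e
    case other => other
  }
}
```

In this model a struct built from `a` and `b` of one `FromJson` collapses into a single `FromJson` with schema `a, b`, while an aliased or duplicated struct is left untouched.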


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703319908







[GitHub] [spark] HyukjinKwon commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703979440


   No more comments from me either. I am okay with this given that we have a plan for the related tickets (https://github.com/apache/spark/pull/29942#issuecomment-703851082).



[GitHub] [spark] viirya commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r500650676



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.

Review comment:
       Ok, revised the comment here. Thanks.





[GitHub] [spark] viirya commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499945296



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+          // If we create struct from various fields of the same `JsonToStructs`.
+          if c.valExprs.forall { v =>
+            v.isInstanceOf[GetStructField] &&
+              v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs] &&
+              v.children.head.semanticEquals(c.valExprs.head.children.head)
+          } =>
+        val jsonToStructs = c.valExprs.map(_.children.head)
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case _ => false
+        }
+
+        // Although `CreateNamedStruct` allows duplicated field names, e.g. "a int, a int",
+        // `JsonToStructs` does not support parsing json with duplicated field names.
+        val duplicateFields = c.names.map(_.toString).distinct.length != c.names.length
+
+        // If we create struct from various fields of the same `JsonToStructs` and we don't
+        // alias field names and there is not duplicated fields in the struct.

Review comment:
       fixed.





[GitHub] [spark] maropu commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499363266



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))
+        val semanticEqual = jsonToStructs.tail.forall(jsonToStructs.head.semanticEquals(_))

Review comment:
       Hm, I see. I noticed that it loops over `c.valExprs` three times (`3 x len(c.valExprs)` iterations) to check the condition. It is a minor optimization, but I thought it would be nice if it could stop early when the condition is not met.
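
A hedged sketch of the early-exit idea in plain Scala. The two integer predicates are hypothetical stand-ins for the real checks on `c.valExprs`, and the visit counter exists only to make the difference observable.

```scala
object EarlyExitSketch {
  // Two separate passes: the second one still runs after the first has failed.
  def separatePasses(xs: Seq[Int]): (Boolean, Int) = {
    var visits = 0
    val nonNegative = xs.forall { x => visits += 1; x >= 0 }
    val even = xs.forall { x => visits += 1; x % 2 == 0 }
    (nonNegative && even, visits)
  }

  // One combined pass: `forall` stops at the first element that fails either check.
  def combinedPass(xs: Seq[Int]): (Boolean, Int) = {
    var visits = 0
    val ok = xs.forall { x => visits += 1; x >= 0 && x % 2 == 0 }
    (ok, visits)
  }
}
```

Both functions return the same boolean, but the combined pass inspects fewer elements when an early element fails.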





[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r500633104



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+          // If we create struct from various fields of the same `JsonToStructs`.
+          if c.valExprs.forall { v =>
+            v.isInstanceOf[GetStructField] &&
+              v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs] &&
+              v.children.head.semanticEquals(c.valExprs.head.children.head)
+          } =>
+        val jsonToStructs = c.valExprs.map(_.children.head)
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case _ => false
+        }
+
+        // Although `CreateNamedStruct` allows duplicated field names, e.g. "a int, a int",
+        // `JsonToStructs` does not support parsing json with duplicated field names.
+        val duplicateFields = c.names.map(_.toString).distinct.length != c.names.length

Review comment:
       `c.names.map(_.toString).distinct.length` seems to ignore the case-insensitive situation. Could you add a test case for columns `A` and `a` in case-insensitive mode? If we don't need to consider that case because this is `Json`, could you add more comments about that, please?
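
       To illustrate the concern, here is a small sketch of how a duplicate-name check behaves with and without case sensitivity. `DuplicateNames`, `hasDuplicates`, and the `caseSensitive` parameter are hypothetical and not part of the patch; whether the rule needs this handling at all is a separate question.

```scala
import java.util.Locale

// Hypothetical helper for illustration only; not part of the patch.
object DuplicateNames {
  // Returns true when `names` contains a duplicate. With caseSensitive = false,
  // names are lower-cased first, so "A" and "a" count as the same name.
  def hasDuplicates(names: Seq[String], caseSensitive: Boolean): Boolean = {
    val normalized =
      if (caseSensitive) names else names.map(_.toLowerCase(Locale.ROOT))
    normalized.distinct.length != normalized.length
  }
}
```

       A plain `distinct` corresponds to `caseSensitive = true`: it treats `A` and `a` as two different fields.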





[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703347993







[GitHub] [spark] SparkQA removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703964432


   **[Test build #129430 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129430/testReport)** for PR 29942 at commit [`a1b464f`](https://github.com/apache/spark/commit/a1b464f5b23589068772536e569b3f9431645c5f).



[GitHub] [spark] viirya commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499375051



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))
+        val semanticEqual = jsonToStructs.tail.forall(jsonToStructs.head.semanticEquals(_))

Review comment:
       Moved the condition to top.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703465788


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34011/
   Test FAILed.



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r500633104



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+          // If we create struct from various fields of the same `JsonToStructs`.
+          if c.valExprs.forall { v =>
+            v.isInstanceOf[GetStructField] &&
+              v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs] &&
+              v.children.head.semanticEquals(c.valExprs.head.children.head)
+          } =>
+        val jsonToStructs = c.valExprs.map(_.children.head)
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case _ => false
+        }
+
+        // Although `CreateNamedStruct` allows duplicated field names, e.g. "a int, a int",
+        // `JsonToStructs` does not support parsing json with duplicated field names.
+        val duplicateFields = c.names.map(_.toString).distinct.length != c.names.length

Review comment:
       ~`c.names.map(_.toString).distinct.length` seems to ignore case-insensitive situation. Could you add a test case for column `A` and `a` in case-insensitive mode? If we don't need to consider that case because this is `Json`, could you add more explanation about that, please?~ Never mind. I misread it.





[GitHub] [spark] viirya commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499945084



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprsSuite.scala
##########
@@ -199,4 +199,51 @@ class OptimizeJsonExprsSuite extends PlanTest with ExpressionEvalHelper {
         JsonToStructs(prunedSchema2, options, 'json), field2, 0, 1, false).as("b")).analyze
     comparePlans(optimized2, expected2)
   }
+
+  test("SPARK-33007: simplify named_struct + from_json") {
+    val options = Map.empty[String, String]
+    val schema = StructType.fromDDL("a int, b int, c long, d string")
+
+    val query1 = testRelation2
+      .select(namedStruct(
+        "a", GetStructField(JsonToStructs(schema, options, 'json), 0),
+        "b", GetStructField(JsonToStructs(schema, options, 'json), 1)).as("struct"))
+    val optimized1 = Optimizer.execute(query1.analyze)
+
+    val prunedSchema1 = StructType.fromDDL("a int, b int")
+    val nullStruct = namedStruct("a", Literal(null, IntegerType), "b", Literal(null, IntegerType))
+    val expected1 = testRelation2
+      .select(
+        If(IsNull('json),
+          nullStruct,
+          KnownNotNull(JsonToStructs(prunedSchema1, options, 'json))).as("struct")).analyze
+    comparePlans(optimized1, expected1)
+
+    // Skip it if `namedStruct` aliases field name.
+    val field1 = StructType.fromDDL("a int")
+    val field2 = StructType.fromDDL("b int")
+    val query2 = testRelation2
+      .select(namedStruct(
+        "a1", GetStructField(JsonToStructs(schema, options, 'json), 0),
+        "b", GetStructField(JsonToStructs(schema, options, 'json), 1)).as("struct"))
+    val optimized2 = Optimizer.execute(query2.analyze)

Review comment:
       Ok.





[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703442589







[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r500618515



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.

Review comment:
       Do we have test coverage for the case where all columns are accessed?





[GitHub] [spark] viirya commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499821006



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,43 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)

Review comment:
       Yeah, changed the comment.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,43 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+          // If we create struct from various fields of the same `JsonToStructs`.
+          if c.valExprs.forall { v =>
+            v.isInstanceOf[GetStructField] &&
+              v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs] &&
+              v.children.head.semanticEquals(c.valExprs.head.children.head)
+          } =>
+        val jsonToStructs = c.valExprs.map(_.children.head)
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case (_, _) => false

Review comment:
       fixed.





[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703474405







[GitHub] [spark] HyukjinKwon commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703552602


   @viirya, just to clarify: is this to avoid calling the same `from_json` multiple times? How does it relate to SPARK-32939 and SPARK-32943?



[GitHub] [spark] sunchao commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

sunchao commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703849487


   Thanks @dongjoon-hyun for the ping. I left some comments, @viirya (sorry, some of them are stale, so please ignore those).



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703862251


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34027/
   



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703964432


   **[Test build #129430 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129430/testReport)** for PR 29942 at commit [`a1b464f`](https://github.com/apache/spark/commit/a1b464f5b23589068772536e569b3f9431645c5f).



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704662659


   **[Test build #129469 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129469/testReport)** for PR 29942 at commit [`2c76a91`](https://github.com/apache/spark/commit/2c76a91047cb93576e28a33761e1e63beda7b779).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r500633104



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+          // If we create struct from various fields of the same `JsonToStructs`.
+          if c.valExprs.forall { v =>
+            v.isInstanceOf[GetStructField] &&
+              v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs] &&
+              v.children.head.semanticEquals(c.valExprs.head.children.head)
+          } =>
+        val jsonToStructs = c.valExprs.map(_.children.head)
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case _ => false
+        }
+
+        // Although `CreateNamedStruct` allows duplicated field names, e.g. "a int, a int",
+        // `JsonToStructs` does not support parsing json with duplicated field names.
+        val duplicateFields = c.names.map(_.toString).distinct.length != c.names.length

Review comment:
       `c.names.map(_.toString).distinct.length` seems to ignore the case-insensitive situation. Could you add a test case for columns `A` and `a` in case-insensitive mode? If we don't need to consider that because this is `Json`, could you add more comments about that, please?





[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703474405







[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703870255







[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703455366


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34011/
   



[GitHub] [spark] viirya commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499834347



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.

Review comment:
       No, it could be only part of the JSON struct. In that case, we will prune the unnecessary columns in `JsonToStructs`.
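
       A rough sketch of what pruning to the accessed columns means, on a simplified schema model. `Field`, `PruneSchema`, and `prune` are hypothetical names, not Spark's `StructType` API, and the real rule may work differently in detail. The sample schema mirrors the `a int, b int, c long, d string` schema used in the test for this PR.

```scala
// Hypothetical, simplified model of a struct schema; not Spark's StructType.
final case class Field(name: String, dataType: String)

object PruneSchema {
  // Keeps only the fields at the requested ordinals, in schema order, and
  // returns a map from each old ordinal to its position in the pruned schema.
  def prune(schema: Seq[Field], used: Set[Int]): (Seq[Field], Map[Int, Int]) = {
    val kept = schema.zipWithIndex.filter { case (_, i) => used.contains(i) }
    val remap = kept.map(_._2).zipWithIndex.toMap
    (kept.map(_._1), remap)
  }
}

object PruneSchemaExample {
  val schema: Seq[Field] = Seq(
    Field("a", "int"), Field("b", "int"), Field("c", "long"), Field("d", "string"))

  // Only ordinals 0 and 1 are accessed, so the pruned schema is "a int, b int".
  val (pruned, remap) = PruneSchema.prune(schema, Set(0, 1))
}
```

       When only some fields are referenced, the parser needs to produce just those fields, and the field accessors are re-pointed at the new ordinals.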





[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704692242







[GitHub] [spark] SparkQA removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703841962


   **[Test build #129420 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129420/testReport)** for PR 29942 at commit [`e40118a`](https://github.com/apache/spark/commit/e40118a4487fd53ca533d86f6c865643fe5d17d5).



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r500610743



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+          // If we create struct from various fields of the same `JsonToStructs`.
+          if c.valExprs.forall { v =>
+            v.isInstanceOf[GetStructField] &&
+              v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs] &&
+              v.children.head.semanticEquals(c.valExprs.head.children.head)
+          } =>
+        val jsonToStructs = c.valExprs.map(_.children.head)
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case _ => false
+        }
+
+        // Although `CreateNamedStruct` allows duplicated field names, e.g. "a int, a int",
+        // `JsonToStructs` does not support parsing json with duplicated field names.
+        val duplicateFields = c.names.map(_.toString).distinct.length != c.names.length
+
+        // If we create struct from various fields of the same `JsonToStructs` and we don't
+        // alias field names and there is not duplicated field in the struct.

Review comment:
       `not` -> `no`?
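
The two conditions named in the code comment under review (no aliased field names, no duplicated field names) can be sketched in isolation. This is a minimal model, assuming the struct is described by plain `(name, ordinal)` pairs; `CollapseGuardSketch` and its methods are hypothetical and only mirror the shape of the checks in the diff.

```scala
object CollapseGuardSketch {
  // fields: (output name, ordinal read from the shared parsed struct); hypothetical inputs.
  // True when every output name equals the schema name at the ordinal it reads (no aliasing).
  def sameFieldName(schema: Seq[String], fields: Seq[(String, Int)]): Boolean =
    fields.forall { case (name, ordinal) => schema.lift(ordinal).contains(name) }

  // Same shape as the check in the diff: true when any output name occurs more than once.
  def duplicateFields(fields: Seq[(String, Int)]): Boolean = {
    val names = fields.map(_._1)
    names.distinct.length != names.length
  }

  def canCollapse(schema: Seq[String], fields: Seq[(String, Int)]): Boolean =
    sameFieldName(schema, fields) && !duplicateFields(fields)
}
```

With schema `a, b`, the pairs `(a, 0), (b, 1)` pass the guard, while an alias such as `(x, 0)` or a repeated `(a, 0), (a, 0)` does not.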





[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703980651







[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704033899







[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704663338







[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703347460


   **[Test build #129394 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129394/testReport)** for PR 29942 at commit [`3eb2947`](https://github.com/apache/spark/commit/3eb29472724d769cf211daa031e1abffb1d246e5).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] viirya commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703312574


   cc @HyukjinKwon @maropu @dongjoon-hyun 



[GitHub] [spark] SparkQA removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704572569


   **[Test build #129469 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129469/testReport)** for PR 29942 at commit [`2c76a91`](https://github.com/apache/spark/commit/2c76a91047cb93576e28a33761e1e63beda7b779).



[GitHub] [spark] dongjoon-hyun closed pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun closed pull request #29942:
URL: https://github.com/apache/spark/pull/29942


   



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704624899


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34084/
   



[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703444714







[GitHub] [spark] viirya commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499355316



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))
+        val semanticEqual = jsonToStructs.tail.forall(jsonToStructs.head.semanticEquals(_))

Review comment:
       We can, but the condition on L39 will look ugly.
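
For comparison, the revision quoted in this message computes the shared-parent check as a separate value, while another revision quoted in this thread folds the comparison into the `forall` guard. A small sketch of both shapes follows; it uses `==` as a stand-in for `semanticEquals`, which is an assumption, and the object and method names are hypothetical.

```scala
object SharedParentSketch {
  // `==` stands in for Catalyst's semanticEquals here; that substitution is an assumption.
  // Both helpers are meant for a non-empty sequence of parent expressions.

  // Separate value, as in the revision quoted above: compare every later parent to the first.
  def separateCheck[A](parents: Seq[A]): Boolean =
    parents.tail.forall(parents.head == _)

  // Folded into a single forall: every parent (including the first) equals the first.
  def foldedCheck[A](parents: Seq[A]): Boolean =
    parents.forall(_ == parents.head)
}
```

On non-empty input the two helpers agree; the difference is only where the comparison sits in the pattern guard.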





[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704572569


   **[Test build #129469 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129469/testReport)** for PR 29942 at commit [`2c76a91`](https://github.com/apache/spark/commit/2c76a91047cb93576e28a33761e1e63beda7b779).



[GitHub] [spark] viirya commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703962881


   Thanks @maropu 



[GitHub] [spark] SparkQA removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703312270


   **[Test build #129394 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129394/testReport)** for PR 29942 at commit [`3eb2947`](https://github.com/apache/spark/commit/3eb29472724d769cf211daa031e1abffb1d246e5).



[GitHub] [spark] maropu commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703959479


   No more comments; it looks okay.



[GitHub] [spark] dongjoon-hyun commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703318830


   Thank you for pinging me, @viirya.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499497878



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,43 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+          // If we create struct from various fields of the same `JsonToStructs`.
+          if c.valExprs.forall { v =>
+            v.isInstanceOf[GetStructField] &&
+              v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs] &&
+              v.children.head.semanticEquals(c.valExprs.head.children.head)
+          } =>
+        val jsonToStructs = c.valExprs.map(_.children.head)
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case (_, _) => false

Review comment:
       nit: `case (_, _) =>` -> `case _ =>`
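
A small standalone sketch of the nit, assuming a match over a pair like the zipped `(name, valExpr)` in the diff. `WildcardSketch` and its methods are hypothetical names; for any non-null tuple the two fallback patterns take the same branch, so the shorter `case _` is enough.

```scala
object WildcardSketch {
  // Fallback written as a tuple pattern.
  def withTupleWildcard(p: (String, Any)): Boolean = p match {
    case (name, n: Int) => name.nonEmpty && n >= 0
    case (_, _)         => false
  }

  // Fallback written as a plain wildcard, as suggested in the review.
  def withPlainWildcard(p: (String, Any)): Boolean = p match {
    case (name, n: Int) => name.nonEmpty && n >= 0
    case _              => false
  }
}
```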





[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704587982


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34076/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703442619







[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704692242







[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703980651







[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703465769


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34011/
   



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499440791



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,43 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)

Review comment:
       It's a nit, but can we match the comment style? Items 1 and 2 use the expression class names, but item 3 uses a function-like name.
   
   BTW, `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst` is a private package that doesn't generate Javadoc, so it should be safe to use `[[...]]` freely here. (Using `[[...]]` in Scaladoc sometimes causes problems when it is converted into Javadoc, for example on `trait`s.)





[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r500644314



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.

Review comment:
       @viirya, it would be great to describe the `If(IsNull(json), nullStruct, KnownNotNull(JsonToStructs(prunedSchema1, options, json, ..)))` pattern here, because technically it is a new optimized form rather than a simplified one. Items (1), (2), and (3) in the PR as it stands don't cover it.
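
One way to read the pattern named above is as a null guard: a struct assembled field by field is itself never null, so a rewrite that parses the JSON once presumably has to produce a struct of null fields when the input is null. The sketch below models that reading only. `Option` plays the role of SQL NULL, `parse` is a hypothetical toy parser over `k=v` pairs rather than real JSON, and none of these names come from Spark.

```scala
object NullGuardSketch {
  // A struct is modelled as a Map from field name to optional value; None plays the role of NULL.
  type Struct = Map[String, Option[String]]

  // Hypothetical toy "parser": reads "a=1,b=2" pairs and keeps only the requested fields.
  def parse(fields: Seq[String], json: String): Struct = {
    val pairs = json.split(",").toSeq.map(_.split("=", 2)).collect {
      case Array(k, v) => k.trim -> v.trim
    }.toMap
    fields.map(f => f -> pairs.get(f)).toMap
  }

  // Original shape: each field is read separately; the resulting struct is never null itself.
  def fieldByField(fields: Seq[String], json: Option[String]): Struct =
    fields.map(f => f -> json.flatMap(j => parse(fields, j)(f))).toMap

  // Rewritten shape, following If(IsNull(json), nullStruct, KnownNotNull(parse(json))).
  def guarded(fields: Seq[String], json: Option[String]): Struct = json match {
    case None    => fields.map(f => f -> Option.empty[String]).toMap
    case Some(j) => parse(fields, j)
  }
}
```

In this toy model, `guarded` and `fieldByField` return the same struct for both null and non-null input, which is the property the guard is there to preserve.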





[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703572710







[GitHub] [spark] viirya commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r500611716



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+          // If we create struct from various fields of the same `JsonToStructs`.
+          if c.valExprs.forall { v =>
+            v.isInstanceOf[GetStructField] &&
+              v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs] &&
+              v.children.head.semanticEquals(c.valExprs.head.children.head)
+          } =>
+        val jsonToStructs = c.valExprs.map(_.children.head)
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case _ => false
+        }
+
+        // Although `CreateNamedStruct` allows duplicated field names, e.g. "a int, a int",
+        // `JsonToStructs` does not support parsing json with duplicated field names.
+        val duplicateFields = c.names.map(_.toString).distinct.length != c.names.length
+
+        // If we create struct from various fields of the same `JsonToStructs` and we don't
+        // alias field names and there is not duplicated field in the struct.

Review comment:
       oh, will fix it.





[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703449446


   **[Test build #129405 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129405/testReport)** for PR 29942 at commit [`430d915`](https://github.com/apache/spark/commit/430d91581d611be69c70e3d0c8686f4160db1b48).



[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704033308


   **[Test build #129430 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129430/testReport)** for PR 29942 at commit [`a1b464f`](https://github.com/apache/spark/commit/a1b464f5b23589068772536e569b3f9431645c5f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] viirya commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499355958



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))
+        val semanticEqual = jsonToStructs.tail.forall(jsonToStructs.head.semanticEquals(_))
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case (_, _) => false
+        }
+
+        // If we create struct from various fields of the same `JsonToStructs` and we don't
+        // alias field names.
+        if (semanticEqual && sameFieldName) {
+          val fromJson = jsonToStructs.head.asInstanceOf[JsonToStructs].copy(schema = c.dataType)
+          val nullFields = c.children.grouped(2).map {
+            case Seq(name, value) => Seq(name, Literal(null, value.dataType))
+          }.flatten.toSeq
+
+          If(IsNull(fromJson.child), c.copy(children = nullFields), KnownNotNull(fromJson))

Review comment:
       `JsonToStructs` is nullable: if the input json to `JsonToStructs` is null, we get a null output. `CreateNamedStruct`, however, is not nullable, so here we need to keep nullability unchanged by wrapping the rewritten expression in an `If(IsNull...)`.
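
The nullability point above can be illustrated with a toy model. This is only a sketch under stated assumptions: `NullabilitySketch`, `fromJson`, `original`, and `rewritten` are hypothetical names, the "a=1,b=2" input format stands in for JSON, and none of this uses the Catalyst API. `Option` models a nullable value.

```scala
object NullabilitySketch {
  // Toy struct: field name -> nullable Int value (hypothetical, not Catalyst).
  type Struct = Map[String, Option[Int]]

  // Toy stand-in for JsonToStructs: a null input (None) yields a null output (None).
  def fromJson(input: Option[String]): Option[Struct] =
    input.map { s =>
      s.split(",").map { kv =>
        val parts = kv.split("=")
        parts(0) -> Option(parts(1).toInt)
      }.toMap
    }

  // Original form: struct(from_json.a, from_json.b).
  // The struct itself is never null; only its fields can be null.
  def original(input: Option[String]): Struct = {
    val parsed = fromJson(input)
    Seq("a", "b").map(name => name -> parsed.flatMap(_.getOrElse(name, None))).toMap
  }

  // Rewritten form: If(IsNull(json), struct(null, null), KnownNotNull(from_json)).
  // The null-input branch returns a non-null struct of null fields, so the
  // result is never null, matching the original form.
  def rewritten(input: Option[String]): Struct =
    if (input.isEmpty) Map("a" -> None, "b" -> None)
    else fromJson(input).get
}
```

In this model, returning `fromJson(input)` directly would produce a null struct for a null input, whereas the original form produces a non-null struct whose fields are null; the `If` wrapper keeps the two forms equal.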





[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703976059


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34037/
   



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r500633104



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+          // If we create struct from various fields of the same `JsonToStructs`.
+          if c.valExprs.forall { v =>
+            v.isInstanceOf[GetStructField] &&
+              v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs] &&
+              v.children.head.semanticEquals(c.valExprs.head.children.head)
+          } =>
+        val jsonToStructs = c.valExprs.map(_.children.head)
+        val sameFieldName = c.names.zip(c.valExprs).forall {
+          case (name, valExpr: GetStructField) =>
+            name.toString == valExpr.childSchema(valExpr.ordinal).name
+          case _ => false
+        }
+
+        // Although `CreateNamedStruct` allows duplicated field names, e.g. "a int, a int",
+        // `JsonToStructs` does not support parsing json with duplicated field names.
+        val duplicateFields = c.names.map(_.toString).distinct.length != c.names.length

Review comment:
       `c.names.map(_.toString).distinct.length` seems to ignore the case-insensitive situation. Could you add a test case for columns `A` and `a` in case-insensitive mode? If we don't need to consider that case because this is `Json`, could you add more explanation about that, please?
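
To make the concern concrete, here is a minimal sketch of how a case-sensitive and a case-insensitive duplicate check differ. `DuplicateNameSketch` and `hasDuplicateNames` are hypothetical names, not part of the PR, and whether the rule needs the case-insensitive variant at all is exactly the open question in this comment.

```scala
import java.util.Locale

object DuplicateNameSketch {
  // Hypothetical helper (not from the PR): detects duplicated field names,
  // optionally treating names that differ only by case as the same name.
  def hasDuplicateNames(names: Seq[String], caseSensitive: Boolean): Boolean = {
    val normalized =
      if (caseSensitive) names else names.map(_.toLowerCase(Locale.ROOT))
    normalized.distinct.length != normalized.length
  }
}
```

With `Seq("A", "a")`, the case-sensitive check reports no duplicate, while the case-insensitive check does.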





[GitHub] [spark] SparkQA removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703416332







[GitHub] [spark] viirya commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703445803


   retest this please



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499440791



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,43 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)

Review comment:
       It's a nit, but can we match the comment style? 1. and 2. use the expression class names, but 3. uses a function-like name.
   
   BTW, `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst` is a private package that doesn't generate Javadoc, so it should be safe to use `[[...]]` as we want (using `[[...]]` in Scaladoc sometimes causes problems when it's converted into Javadoc, for example in `trait`s).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703572710







[GitHub] [spark] viirya commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499359235



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)

Review comment:
       fixed. thanks.





[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704593821







[GitHub] [spark] SparkQA commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704691663


   **[Test build #129477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129477/testReport)** for PR 29942 at commit [`73320e8`](https://github.com/apache/spark/commit/73320e89f512b896ceee9eebe283046db3dda15a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r500644314



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.

Review comment:
       @viirya, it would be great if we described the `If(IsNull(json), nullStruct, KnownNotNull(JsonToStructs(prunedSchema1, options, json, ..)))` pattern here, because technically it's a new optimized form rather than a simplified one. The existing items (1), (2), and (3) in this PR don't cover it.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703442688


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129402/



[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703465781







[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703319908







[GitHub] [spark] viirya commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703980833


   Thanks @HyukjinKwon 



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499308508



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)

Review comment:
       `3` -> `3.`?





[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r500618515



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,45 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) =>
+ *      CreateNamedStruct(JsonToStructs(json)) if JsonToStructs(json) is shared among all
+ *      fields of CreateNamedStruct.

Review comment:
       Do we have test coverage for the case where all columns are accessed?





[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-704593821


   Merged build finished. Test FAILed.



[GitHub] [spark] viirya commented on a change in pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #29942:
URL: https://github.com/apache/spark/pull/29942#discussion_r499369030



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
##########
@@ -28,10 +28,36 @@ import org.apache.spark.sql.types.{ArrayType, StructType}
  * The optimization includes:
  * 1. JsonToStructs(StructsToJson(child)) => child.
  * 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs.
+ * 3  struct(from_json.col1, from_json.col2, from_json.col3...) => struct(from_json)
  */
 object OptimizeJsonExprs extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case p => p.transformExpressions {
+
+      case c: CreateNamedStruct
+        if c.valExprs.forall(v => v.isInstanceOf[GetStructField] &&
+          v.asInstanceOf[GetStructField].child.isInstanceOf[JsonToStructs]) =>
+        val jsonToStructs = c.valExprs.map(_.children(0))
+        val semanticEqual = jsonToStructs.tail.forall(jsonToStructs.head.semanticEquals(_))

Review comment:
       Ok, let me change it and see how it looks.





[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703442619







[GitHub] [spark] AmplabJenkins removed a comment on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703347993







[GitHub] [spark] AmplabJenkins commented on pull request #29942: [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #29942:
URL: https://github.com/apache/spark/pull/29942#issuecomment-703442680





